JSON Changelog with SQLite (budgetwithbuckets.com)
113 points by iffycan on Aug 27, 2018 | 23 comments



I've seen this pattern a couple of times (e.g. for broadcasting changes using Postgres's LISTEN/NOTIFY).

Each time, I wonder why people are recreating the database's journal, within the database, where this meta-journal will itself be getting journaled.

Why not just consume the actual DB journal?

I'm not sure how that'd work for SQLite—it'd probably require a hack of some sort, since SQLite's journal is an ephemeral part of the DB file. But for more full-fledged RDBMSes, this would just be a matter of hijacking a streaming-replication-based backup system to act as a store of record for events. E.g., in the case of Postgres, you'd just have an ETL pipeline pointed at your WAL-E bucket, parsing through your WAL files either to load into a separate OLAP store, or for stateless map-reduction whenever you need to answer a question.


> Why not just consume the actual DB journal?

It's decidedly non-trivial, as the journal usually doesn't contain all the information you'd need. Journal writes are often a bottleneck, so DB engines try to restrict their contents to just the necessary parts. Typically it won't contain, e.g., information about table schemas.

You can do it (see e.g. PostgreSQL's logical decoding), but it's plenty of additional work.
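
For the curious, here's roughly what consuming logical decoding looks like from Python. This is a minimal sketch using psycopg2 and the built-in test_decoding output plugin; the DSN and slot name are placeholders, and the server must be configured with wal_level = logical.

    import psycopg2

    conn = psycopg2.connect('dbname=mydb')   # placeholder DSN
    conn.autocommit = True
    cur = conn.cursor()

    # create a slot backed by the built-in test_decoding plugin;
    # needs wal_level = logical and replication privileges
    cur.execute("SELECT pg_create_logical_replication_slot(%s, %s)",
                ('my_slot', 'test_decoding'))

    # ... after some writes happen elsewhere, drain the decoded changes ...
    cur.execute("SELECT lsn, xid, data "
                "FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
                ('my_slot',))
    for lsn, xid, data in cur.fetchall():
        print(data)   # e.g. "table public.foo: INSERT: id[integer]:1"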


> Journal writes are often a bottleneck

The point I was trying to make is that the alternative (writing all your DB activity into the DB) means that you're already, 100% guaranteed, creating more overhead than the heaviest possible journal-safety properties would create. You're writing unabridged journal entries into the DB, where they then get journaled again themselves.

There is no world where it makes more sense to write a copy of everything you do into an "everything I did" table, than it does to use an append-only / immutable / log-centric DBMS (where "Postgres with an ETL pipeline pointed at its WAL logs" is a rather hacky immutable DBMS.)

If you know, before you ever start coding, that you will need to know everything that ever happened, writing everything twice—when your DBMS is already writing everything twice—is a rather silly way to choose to go about solving your problems. You're already in a world where you need different guarantees than a traditional (ephemeral) RDBMS gives you; you just haven't realized it yet.


Perhaps it's a dev vs. ops problem. Only ops see the journal, understand it, and access it. If the devs want a journal, they end up building their own.


It's also a public versus private API issue. Most databases consider their journal a private implementation detail and don't recommend programming directly against it: the format can change as the developers rework internals from version to version, or, in some databases, even at runtime as the needs of the query optimizer, write buffers, etc. shift over time.

For instance, Microsoft SQL Server has a built-in public change tracking API, which you can tell is a leaky abstraction over top of the journal, but SQL Server makes an API guarantee when using that specific change tracking API, whereas (at least the last time I looked) it makes no guarantees about anything lower-level that accesses its journal. The Change Tracking API documentation even warns that it locks the journal into certain deoptimizations and that there are clear performance trade-offs, which also hints that using the journal more directly at a lower level would likely hit optimization problems and data-loss issues.
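
The usage is pleasantly boring, which is the point of a public API. A hedged sketch from Python via pyodbc, assuming change tracking has already been enabled on the database and on a dbo.Customers table (the connection string, table, and CustomerID key column are all made up here):

    import pyodbc

    conn = pyodbc.connect('DSN=mydb')   # placeholder connection string
    cur = conn.cursor()

    # remember where we are now; hand this back on the next poll
    cur.execute("SELECT CHANGE_TRACKING_CURRENT_VERSION()")
    last_sync = cur.fetchone()[0]

    # later: everything that changed in dbo.Customers since last_sync
    cur.execute("SELECT CT.SYS_CHANGE_OPERATION, CT.CustomerID "
                "FROM CHANGETABLE(CHANGES dbo.Customers, ?) AS CT",
                last_sync)
    for op, pk in cur.fetchall():
        print(op, pk)   # 'I', 'U' or 'D' plus the primary key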

As a dev, it's rarely a good idea to program against APIs your vendor intentionally deems private implementation details: they can change between versions and may be unstable in runtime operation. If the database doesn't offer a public journaling API, then yes, sadly, sometimes the best (or at least least-worst) answer is to build a sub-par journal inside the database itself, with, of course, its own performance trade-offs.


> For instance, Microsoft SQL Server has a built-in public change tracking API, which you can tell is a leaky abstraction over top of the journal,

Postgres' logical decoding [1] provides something very similar.

[1]: https://www.postgresql.org/docs/current/static/logicaldecodi...


> There is no world where it makes more sense to write a copy of everything you do into an "everything I did" table, than it does to use an append-only / immutable / log-centric DBMS (where "Postgres with an ETL pipeline pointed at its WAL logs" is a rather hacky immutable DBMS.)

There is one: the world where append-only / immutable / log-centric DBMSs do not offer the same advantages as mainstream RDBMSs (scalability, stability, performance, language support, open-source licensing, active development, ubiquitous SQL knowledge, etc.).

The downsides of bolting an event-sourcing or changelog system onto an established RDBMS like Postgres are (a) the cost of writing it, and (b) the performance and disk-space cost. Both of those may be easier to deal with than, say, Datomic's licensing and Clojure-oriented APIs, especially if you can use an existing library like Marten.

What immutable DBMS would you recommend?


> writing everything twice—when your DBMS is already writing everything twice—is a rather silly way to choose to go about solving your problems

The GP specifically said the DBMS journal does not write everything: it writes as little as it can get away with, for performance reasons. Since you quoted GP very selectively, you didn’t address what they actually said.


Because most of a db engine is a black box. I can't just say:

   select * from journal
Nor stuff like

   select * from index_customer
and many other things. This is good and bad: good because people are less likely to mess with the engine, bad when you think of your DB as more than a restricted data store.

My dream is to build a more flexible one, starting with a relational language:

https://bitbucket.org/tablam/tablam/wiki/Home

But this is still at the prototype stage.


Yes, I agree with you. If databases exposed their change logs and audit logs as queryable schemas, then a lot of the application-level event-queues-in-the-database would be unnecessary, or at least greatly simplified.

I have lots of wishes in that direction. For example, I wish that my programs could ask "has this table been written to since I last checked?" instead of having to read all the price lists into the application cache every n minutes whether there have been any changes or not.
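
SQLite actually gets you part of the way there with PRAGMA data_version, though it's per-database rather than per-table. A minimal polling sketch (the file path and the cache-refresh hook are placeholders):

    import sqlite3
    import time

    db = sqlite3.connect('prices.db')   # placeholder path

    def data_version():
        # bumps whenever *another* connection commits to the file;
        # it does not reflect this connection's own writes
        return db.execute('PRAGMA data_version').fetchone()[0]

    last = data_version()
    while True:
        time.sleep(60)
        current = data_version()
        if current != last:
            last = current
            reload_price_lists()   # hypothetical cache-refresh hook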


+1

I only want READ access to the change logs and audit logs... if databases only offered this, it would be so much better.


Isn't there usually a cleanup step where journal entries/logs get rolled up or truncated?

Though, I get what you mean... the format in question lets you query, say, the last 5 changes against a given table. With your solution, you'd have to go through the whole journal to track all the changes for a given record and present them in a history format, with, say, diffing.
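
Assuming the changelog table looks something like changelog(id, table_name, row_id, changes) (a guess at the schema, not the article's actual DDL), that per-table history query is indeed the easy part:

    import sqlite3

    db = sqlite3.connect('app.db')   # placeholder path
    # hypothetical schema: changelog(id, table_name, row_id, changes)
    last_five = db.execute(
        "SELECT id, changes FROM changelog "
        "WHERE table_name = ? AND row_id = ? "
        "ORDER BY id DESC LIMIT 5",
        ('customers', 42)).fetchall()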


Meteor.js used MongoDB's oplog, which replicates changes to other nodes, to provide real-time updates to its clients.

You can query the oplog yourself easily [0]. I also remember someone doing a version for MySQL using the same approach.

[0] http://www.briancarpio.com/2012/04/21/mongodb-the-oplog-expl...
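
A minimal tailing sketch along the lines of that post, using pymongo against a replica-set member (connection details are defaults/placeholders):

    import pymongo

    client = pymongo.MongoClient()   # must be a replica-set member
    oplog = client.local.oplog.rs    # the replication oplog

    # tail it like `tail -f`; each entry carries the op type
    # ('i'/'u'/'d'), the namespace 'ns', and the payload 'o'
    cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        print(entry['ts'], entry['op'], entry['ns'])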


Mongo now has change streams [1], which use the aggregation framework, so you can filter for specific updates or do transformations. You need v3.6, a replica set/cluster, and a flag to enable the behavior, though.

[1] https://docs.mongodb.com/manual/changeStreams/
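
In pymongo that looks roughly like this (database and collection names are placeholders):

    import pymongo

    client = pymongo.MongoClient()      # 3.6+ replica set assumed
    coll = client.mydb.mycollection     # placeholder names

    # the aggregation pipeline filters the stream server-side
    pipeline = [{'$match': {'operationType': 'update'}}]
    for change in coll.watch(pipeline):
        print(change['documentKey'], change['updateDescription'])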


You could create the log table as UNLOGGED (CREATE UNLOGGED TABLE ...) so that it doesn't go to the WAL.


This sounds complicated, which could be the reason for the pattern. And if a feature is missing from the complicated setup, what do you do?


I solved this problem by wrapping sqldiff (included w/ sqlite src) as a custom git diff driver.

https://github.com/cannadayr/git-sqlite


The sqlite sessions extension seems to be a good foundation for getting changelogs out of a sqlite database. It is sadly not too well known: https://www.sqlite.org/sessionintro.html
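
Since Python's stdlib doesn't expose it, here's a hedged C sketch following the sessionintro docs (error checking omitted; requires a SQLite built with SQLITE_ENABLE_SESSION and SQLITE_ENABLE_PREUPDATE_HOOK):

    #include <sqlite3.h>

    sqlite3_session *pSession;
    sqlite3session_create(db, "main", &pSession);  /* db: open handle */
    sqlite3session_attach(pSession, 0);            /* 0 = all tables */

    /* ... run some INSERT/UPDATE/DELETE statements on db ... */

    int nChangeset;
    void *pChangeset;
    sqlite3session_changeset(pSession, &nChangeset, &pChangeset);
    /* pChangeset now holds a binary changelog you can store, ship,
       or apply elsewhere with sqlite3changeset_apply() */
    sqlite3_free(pChangeset);
    sqlite3session_delete(pSession);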


Related article: "Automatic Undo/Redo Using SQLite" https://www.sqlite.org/undoredo.html


Thank you for posting this. A former version of my post included a reference to this because it's a great read, too.


Neat! If you wanted to extend this more (string diffs, non-json encodings), a custom function would be the next logical step. In Python it might be:

    import json
    import sqlite3

    def delta(*args):
        # args arrive as (name, old, new) triples; keep the old value
        # for every column that actually changed
        ret = {}
        for name, old, new in zip(args[::3], args[1::3], args[2::3]):
            if old != new:
                ret[name] = old
        if ret:
            return json.dumps(ret)  # SQLite functions can't return dicts

    db = sqlite3.connect(':memory:')
    # create_function takes (name, num_args, func)
    db.create_function('delta1', 3, delta)
    db.create_function('delta2', 6, delta)
    db.create_function('delta3', 9, delta)

    print db.execute('select delta2("b", "20", "20", "c", 3.4, 3.6)').fetchone()
    >> (u'{"c": 3.4}',)


In Python 3.6:

    import sqlite3
    import json

    def delta(*args):
        ret = {}
        for name, old, new in zip(args[::3], args[1::3], args[2::3]):
            if old != new:
                ret[name] = old
        if ret:
            return json.dumps(ret)

    db = sqlite3.connect(':memory:')

    db.create_function('delta1', 3, delta)
    db.create_function('delta2', 6, delta)
    db.create_function('delta3', 9, delta)

    print(db.execute('select delta2("b", 20, 20, "c", 3.4, 3.6)').fetchone())
    # ('{"c": 3.4}',)


ArangoDB also has an early tool for listening to changes in the database and executing actions or queries. It is limited to a single node, though; the question is how scalable all of these tools are: https://www.arangodb.com/2017/03/arangochair-tool-listening-...



