The Value of Bitemporality – Whose Time Is It Anyway? (juxt.pro)
91 points by tosh on April 21, 2019 | 37 comments



There are a couple of good books on bitemporality: Richard Snodgrass' "Developing Time-Oriented Database Applications in SQL" is pretty much the bible on the subject and goes into great detail. "Managing Time in Relational Databases" by Tom Johnston and Randall Weis is another great book, a little more hands-on with the code.

I'd love to see someone translate the concepts in both books to a useful postgres package. As Snodgrass points out, "Unfortunately, due to the noncompliance of all existing DBMSs, a few of these fragments run on no existing platform", so he only discusses how they can be applied to DB2, MSSQL, Sybase, Oracle8 and UniSQL.


At my place of work, we leverage temporal and bitemporal tables in PostgreSQL extensively (I’d estimate that ⅔ of our schemas are either temporal or bitemporal).

We have a few custom tools to make them more ergonomic to create and update, but the core functionality is derived from an open source extension, I believe. Although, googling for it now, while away from my work computer, I can’t seem to find it... :/

... after looking it up on my work computer, it turns out we wrote our own implementation. It’s less than 200 lines of SQL, and uses three relatively common extensions. I wish it was open source, because it’d be super helpful to have in the future. But hey, at least it’s good to know that it can be/has been practically done?


Anyone know whether the [pg_bitemporal](https://github.com/scalegenius/pg_bitemporal) project is still being developed? There's a talk on YouTube which showed promise but the Github repo seems pretty dead.


YES! And moreover, I am actively using it in production; we've added lots of new functions and made it PG 10 compatible.


Oh great! I enjoyed your talk and look forward to trying it out. I just briefly browsed the repo and didn't see much in terms of documentation. Where do you recommend I should start?



This one requires creating historical tables for each "regular" table, like SQL:2011 does. In AVF you keep everything in one table, and that provides a unique level of flexibility for temporal queries.


Our implementation creates two views derived from each bitemporal table (_history and _full, which is a union between the base table and _history).

It's definitely "heavier" than it needs to be – we store a lot of redundant data, but it hasn't been a huge issue.

But we have tools that build the tables/views, and generate sqlalchemy code, so that it's not difficult to manage, all told.
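
For anyone curious, a minimal sketch of that shape of setup (table and column names are hypothetical, and the real implementation differs in the details):

    -- hypothetical bitemporal base table holding the current versions
    CREATE TABLE policy (
        id          bigint      NOT NULL,
        holder      text        NOT NULL,
        valid_from  timestamptz NOT NULL,
        valid_to    timestamptz NOT NULL DEFAULT 'infinity',
        sys_from    timestamptz NOT NULL DEFAULT now(),
        sys_to      timestamptz NOT NULL DEFAULT 'infinity'
    );

    -- superseded versions get copied into a sibling table, e.g. by a trigger
    CREATE TABLE policy_history (LIKE policy INCLUDING ALL);

    -- "_full" is simply current plus historical rows
    CREATE VIEW policy_full AS
        SELECT * FROM policy
        UNION ALL
        SELECT * FROM policy_history;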


In my humble experience, the performance of such views will be suboptimal. When we started our implementation of AVF (Johnston and Weis), we decided to trust the authors: people only "think" that storing all of the history in the same table as the current data will make those tables grow too fast and become inefficient; people just presume this and never try. So we decided to believe it and try it, and it worked.

Our implementation is based on GiST with exclusion constraints, and we have had it in production for over a year now. Zero performance problems, lots of gains from the business perspective.
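
Roughly speaking, the pattern is the one sketched below (a generic illustration using btree_gist, not the actual pg_bitemporal DDL):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE customer_address (
        customer_id bigint    NOT NULL,
        address     text      NOT NULL,
        valid       tstzrange NOT NULL,   -- when the fact was true in the real world
        asserted    tstzrange NOT NULL,   -- when the database believed/asserted it
        -- no two rows for the same customer may overlap in both time dimensions
        EXCLUDE USING gist (customer_id WITH =,
                            valid       WITH &&,
                            asserted    WITH &&)
    );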

Our code is here: https://github.com/scalegenius/pg_bitemporal


That’s awesome! I’m inclined to agree with you, that our solution’s suboptimal. We’re using btree_gist, plpgsql (fun), and hstore as extensions, if that tidbit is interesting to you.
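
For reference, enabling those in a database is just the following (assuming the extensions are installed on the server):

    CREATE EXTENSION IF NOT EXISTS btree_gist;
    CREATE EXTENSION IF NOT EXISTS hstore;
    -- plpgsql is already enabled by default in modern PostgreSQL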

Another totally valid complaint about our implementation is that it makes schema migrations a major pain. I wasn’t around when we made our extension, but if I had been, I probably would have recommended using yours :). I suspect we did it our way because it was some combination of (1) the most straightforward approach for migrating our existing data and (2) the easiest to hack together.

Anyways, thank you for sharing your project! I’m excited to use it at home, and maybe in the future at work!


I am shamelessly promoting it, because we have no bandwidth to turn it into an extension, and for almost three years now several people/organizations have promised to create an extension or even to incorporate it into the "next" version of Postgres (and I am not pointing fingers :)). I'm still hoping somebody will be willing to work with us to make it happen. The first time I talked about it with Bruce, exactly 3 years ago, he told me there should be a proven need. So... we have a business case :). But still...


We definitely have a big business case (>$1bn/year flows through our bitemporal data). Happy to vouch for you if I can. I work in the insurance industry, and daily face questions like:

- What was X's address 2 years ago?

- What did our database think X's address was 2 years ago, 3 months ago?

It's fundamentally impossible to answer those kinds of queries in a scalable way (parsing application logs is not a scalable way) without bi-temporal data.
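
For illustration, assuming each row carries a valid-time range and an asserted (transaction-time) range (hypothetical column names, not our actual schema), those two questions become plain range predicates:

    -- What was X's address 2 years ago, as we believe today?
    SELECT address
    FROM   customer_address
    WHERE  customer_id = 42                              -- X's id
      AND  valid    @> (now() - interval '2 years')
      AND  asserted @> now();

    -- What did the database, 3 months ago, think X's address was 2 years ago?
    SELECT address
    FROM   customer_address
    WHERE  customer_id = 42
      AND  valid    @> (now() - interval '2 years')
      AND  asserted @> (now() - interval '3 months');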

I'd readily acknowledge that there are some types of data for which the capability to answer those queries is overkill. But in the insurance industry, this kind of introspection is essential from both a compliance and a correctness point of view.

Personally, if I were an auditor/government regulator for this industry, I would raise a big red flag over any critical data that wasn't bitemporal.

I would definitely vouch for its business necessity – so please, keep up your advocacy!! :)


So more like Oracle Flashback?


A huge practical challenge when implementing bitemporality is how to deal with schema migrations efficiently whilst supporting arbitrary time-travel queries. When we were building Crux, which is referenced at the end of the post, we took the view that this is a hard problem to solve in a generic way, which is why we opted for a schemaless document-oriented design.

I really hope to see other databases make it easier to use bitemporality in the near future, but I suspect that any DBMS which mandates a schema is fighting an uphill battle.

Disclosure: working on Crux at JUXT :)


You could consider ‘partitioning’ your DB when you migrate your schema so that each DB instance only deals with a single bitemporal slice, e.g. after 5 migrations you have 5 distinct database instances. This approach avoids bloat, assuming the vast majority of queries would be served by the most recent instance, while not precluding serving verbatim responses from earlier instances.


I wonder why so few people seem to think about persistent data structures (persistent in the functional sense, that is, immutable / copy-on-write). We basically implemented a form of hash-array-based trie for https://sirix.io ... it's similar to how ZFS stores objects, adding levels of indirect blocks/pages on demand, with bitsets recording which indexes are actually set to avoid arrays full of null references. You do have to copy the path to the root, just as snapshots do in ZFS or Btrfs, but I think that overhead is negligible when you also do versioning at a per-record/per-revision level. That way we can restore each revision in the same time, and you don't need a "history table", which is inefficient to query compared to the most-recent-revision table.


But it's schema-less ;-)


I always wonder: why not go one step further? Nowadays random reads on SSDs are so much cheaper, and I don't think the endgame is just a huge B+ tree or LSM tree, which has to cluster everything and store so much data redundantly (for transaction time). I would say that restoring a specific revision isn't really efficient that way.

I'm still working on a project which was started around 2006 by Marc Kramis (his Ph.D. work) and which I began working on around 2007 :-) We borrowed quite some ideas mainly from ZFS (as well as from Git now), put them to the test on a sub-file level, and added our own stuff, for instance record-level versioning via a sliding snapshot algorithm.

There's still a lot to do, but I'm able to store and query revisions of both XML and JSON now via XQuery, and I'll look into (cost-based) query optimizations and partitioning/replication next. It was probably crazy to write a storage manager from scratch, but I think Marc's ideas are pretty good; I added my own, and Sebastian Graf, another Ph.D. student back then, also did a lot of work on the project, as did many other students. Maybe I'm just crazy to keep working on it almost daily besides my day-to-day software engineering job, but yeah... I guess you have to be a bit too convinced and too dedicated to something (even though, sadly, I don't know if anyone has tried it lately) ;-) Maybe I need to contact Marc again after all this time :-)

https://sirix.io/concepts.html

http://pubsys.mmsp-kn.de/pubsys/publishedFiles/Kramis2014.pd...


> I'd love to see someone translate the concepts in both books to a useful postgres package.

There is some work on this. [1] is an amazing patch by some researchers in Europe that has already had several rounds of review. Their first published work was against Postgres 9.1! Their solution is way better than SQL:2011, but could be used behind-the-scenes to implement the features from the standard. Essentially they implement a temporal variant for every operator in the relational algebra (very elegantly btw, by combining standard operators with just two new functions). That approach solves some of the problems that led to Snodgrass's original temporal proposal getting rejected in the late 90s (and which are still holding back SQL:2011 IMO). Whereas the standard only gives temporal INNER JOIN, their work gives every join type, as well as aggregates (including a way to scale the inputs if the group only includes part of their time interval). It also lets queries compose better through subqueries, views, CTEs, set-returning functions, etc. In my opinion their work would make Postgres the most advanced temporal database in the world. I hope it gets the attention it needs!

I also have a very modest WIP patch myself at [2], which adds temporal-aware primary and foreign keys.

This is all based on the excellent work on ranges and exclusion constraints by Jeff Davis and others. Postgres has an excellent foundation, but needs the higher-level concepts so that users can more easily set up temporal tables and use them to answer queries and record changes.

If you'd like to learn more about temporal databases, I wrote a summary of the research at [3]. I'm also giving a talk at the Postgres conference in Ottawa next month.... :-)

EDIT: Btw, if all you need is transaction time (usually included for auditing and compliance), then the temporal_tables extension [4] should be all you need. Maybe also consider the pgaudit extension [5].
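
If I remember the temporal_tables README correctly, the basic setup looks roughly like this (a sketch; check the repo for the exact API):

    CREATE EXTENSION temporal_tables;

    CREATE TABLE subscriptions (
        name       text      NOT NULL,
        state      text      NOT NULL,
        sys_period tstzrange NOT NULL   -- system/transaction time of each version
    );

    -- old versions of rows get written here by the trigger below
    CREATE TABLE subscriptions_history (LIKE subscriptions);

    CREATE TRIGGER versioning_trigger
        BEFORE INSERT OR UPDATE OR DELETE ON subscriptions
        FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'subscriptions_history', true);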

[1] https://commitfest.postgresql.org/22/2045/

[2] https://commitfest.postgresql.org/22/2048/

[3] https://illuminatedcomputing.com/posts/2017/12/temporal-data...

[4] https://github.com/arkhipov/temporal_tables

[5] https://github.com/pgaudit/pgaudit


I wrote a Ruby library[1] that builds on Active Record and implements SCD-Type 2 using updatable views and INSTEAD OF triggers.

It has its pain points, mostly related to how hackish the AR insertion is, but it works quite well and the approach ensures that any INSERT/UPDATE/DELETE triggers bookkeeping of the history tables.
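
For anyone unfamiliar with the technique, the Postgres-level mechanism is roughly the following (a generic sketch, not Chronomodel's actual generated DDL):

    -- the real table keeps every version (SCD Type 2 style)
    CREATE TABLE employees_history (
        id         bigint      NOT NULL,
        salary     numeric     NOT NULL,
        valid_from timestamptz NOT NULL DEFAULT now(),
        valid_to   timestamptz NOT NULL DEFAULT 'infinity'
    );

    -- the application reads and writes a view of the "current" rows
    CREATE VIEW employees AS
        SELECT id, salary FROM employees_history WHERE valid_to = 'infinity';

    CREATE FUNCTION employees_update() RETURNS trigger AS $$
    BEGIN
        -- close the current version, then append the new one
        UPDATE employees_history SET valid_to = now()
            WHERE id = OLD.id AND valid_to = 'infinity';
        INSERT INTO employees_history (id, salary) VALUES (NEW.id, NEW.salary);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER employees_update_trg INSTEAD OF UPDATE ON employees
        FOR EACH ROW EXECUTE PROCEDURE employees_update();

Similar triggers for INSERT and DELETE make the view fully writable, so the application never touches the history table directly.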

[1] https://github.com/ifad/chronomodel


Oh I really admire Chronomodel! Actually I link to it in my annotated bibliography ([3] above). Thanks for all your work on that! (Also: I exchanged some emails with one of the Chronomodel devs a few months ago. Perhaps that was you?)

For valid-time DML, I think the most useful operation is an upsert/merge. Tom Johnston calls this an `INSERT WHENEVER` I think. I have a not-yet-published Rails project that also uses INSTEAD OF triggers to convert AR saves to valid-time upserts. Hopefully I can get that published soon. I would love to even contribute it to Chronomodel eventually if that makes sense, so that it can be bitemporal.

Although there is tons of research in temporal relational database design, the rest of the stack really needs attention. What does temporal mean for REST and CRUD? Or your ORM? What are good UX patterns for viewing and editing a valid-time history? Can you make save-as-you-type work or do you need a Save button? I'd love to see people start working on that! My own Rails experiment landed on one way, and for me it validated that the db-level features would actually be useful, but there are other approaches possible.


Wow great summary, thank you!


A toy example of this would be keeping a diary in a text file in a git repo. The journal entry dates in the text file are independent of the commit dates.


How does Crux compare to Datomic?


Crux is quite a different beast to Datomic, in that it is schema-less, designed to work with Kafka, and bi-temporal, among other architectural differences. Similarities are that they are both built (mainly) in Clojure and support Datalog queries (albeit different dialects).

Disclosure: I work at JUXT but not directly on Crux


Thanks for the Datomic link, didn't know of that database system :-)


> Why would you stop at two time axes? Why not go for three, or four, or N many?

This question is confused, because it's taking a derived property as fundamental. It takes as fundamental the fact that there are two associated times, rather than as fundamental the fact that there are these particular associated times, of which there happen to be two. If you can't think of a particular third associated time to add, the idea of "going to three" is meaningless on its own.


My mind immediately jumps to thinking about branching valid time timelines, but having thought about this a _lot_ over the last few months I think actually all other kinds of timelines can be implemented using valid time + additional indexing.


I thought about it too, but I guess then you really dive into the field of versioning systems, and probably it won't be a database system anymore. That also means manual conflict handling...

I think branching/merging itself wouldn't be that hard to implement, at least if you have a versioned index at the very core (disclaimer: I'm also developing an Open Source temporal storage system). But then you'd have checkouts, handling conflicts...


Additional time axes allow you to separate transactions from different sources of input. E.g. in a collaborative system it could be helpful to ask questions while rewinding the actions of a particular user.

But a different question is whether the system can do something useful for you with first-class time co-ordinates, compared to just stuffing additional timestamps into your data. (something useful being clever indexing, compaction, maybe more?)


Second that :-) transactional time and valid time, what else could there be?


The point is that you might have multiple aspects of valid time:

    SELECT person AS OF t1 FROM upstream AS OF t2
Which reconstructs the view that our upstream source would have constructed for a particular time (t1) from its perspective at a different time (t2).

    JOIN event_feed_1 AS OF t1 AND event_feed_2 AS OF t2
There are various reasons you might need to offset the comparison of two event feeds, such as their clocks not matching.


We've implemented a full Asserted Versioning Framework (Johnston and Weis) for Postgres, and use it in production. We've been using it for about 18 months for both OLTP and OLAP. Works beautifully :)



My takeaway is (for applications that need this):

#1: Keep rows immutable. Don't use UPDATE or DELETE.

#2: Store the time columns you need, e.g. insertion or transaction time, and/or event generation time, etc.
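
A minimal illustration of that takeaway (hypothetical table, not from the article):

    CREATE TABLE readings (
        sensor_id   bigint      NOT NULL,
        value       numeric     NOT NULL,
        recorded_at timestamptz NOT NULL,               -- when the event happened (valid time)
        inserted_at timestamptz NOT NULL DEFAULT now()  -- when we learned about it (transaction time)
    );
    -- rows are only ever INSERTed; a correction is a new row with a later inserted_at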


Hm, and why store so much information redundantly? Why not provide a tailored index structure... basically snapshotting page trees efficiently :-)


If the db internally supports efficient temporal tracking, then yes, by all means. If it doesn't, then I don't see a way for the developer to get around the redundancy.



