We're building something similar to this at Splitgraph, at least in the sense that we have immutable data in a Postgres-compatible DB with point-in-time queries across versioned, addressable snapshots. In our case, we apply the idea of immutability to "data images" that are analogous to Docker images. You build and push them in the same way, and then you can reference any "image" (version) [0] of the data by its tag.

For example, here is a link to a live query on our Data Delivery Network (DDN) that runs a JOIN across two daily snapshots (20200809 and 20200810). [1] These images come from a script that builds and pushes a new image each day. The storage costs are minimal, because each new image only stores the rows that changed, rather than a full duplicate of the previous snapshot.
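To make the "only the changed rows" point concrete, here is a minimal sketch in Python. It is not Splitgraph's actual storage format; the `Image` class, its `upserts`/`deletes` fields, and the row values are all made up for illustration. The idea is just that each image records a delta against its parent, and a full snapshot is reconstructed by replaying the chain from oldest to newest.

```python
class Image:
    """Hypothetical delta-based image: stores only changes relative to its parent."""

    def __init__(self, tag, parent=None, upserts=None, deletes=None):
        self.tag = tag
        self.parent = parent
        self.upserts = dict(upserts or {})   # primary key -> new row value
        self.deletes = set(deletes or ())    # primary keys removed in this image

    def stored_rows(self):
        """Number of row-level changes this image itself has to store."""
        return len(self.upserts) + len(self.deletes)

    def materialize(self):
        """Rebuild the full snapshot by replaying deltas from the oldest image."""
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        rows = {}
        for image in reversed(chain):
            for key in image.deletes:
                rows.pop(key, None)
            rows.update(image.upserts)
        return rows


# Two daily images; the second stores three changes, not a full copy.
day1 = Image("20200809", upserts={1: "a", 2: "b", 3: "c"})
day2 = Image("20200810", parent=day1, upserts={2: "B", 4: "d"}, deletes={3})
```

Both tags stay queryable: `day1.materialize()` still returns the original three rows, while `day2.materialize()` reflects the update, the insert, and the delete.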

Each immutable image is composed of a set of small, content-addressable cstore fragments uploaded to object storage. We only load a fragment into the database when it is needed to satisfy a query. When a query arrives at the DDN, we intercept it at the network level by scripting PgBouncer with embedded Python, which orchestrates the infrastructure required to answer the query. The embedded code parses the AST of the query for table references and uses them to "mount" a temporary schema for serving the query. The temporary schema includes an FDW that implements a "layered querying" protocol (think AUFS) to lazily download only the fragments required to satisfy the query.
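A toy version of the layered lookup, to show the shape of the idea. This is a sketch under heavy assumptions, not our FDW: `ObjectStore`, `make_layer`, and `LayeredTable` are hypothetical names, fragments are JSON blobs in a dict standing in for object storage, and each layer carries a min/max key range so the reader can skip fragments that cannot contain the key. Layers are consulted newest first, and a fragment is only "downloaded" the first time a lookup actually needs it.

```python
import hashlib
import json


class ObjectStore:
    """In-memory stand-in for object storage, keyed by content hash."""

    def __init__(self):
        self._blobs = {}
        self.downloads = 0  # counts fetches so laziness is observable

    def put(self, rows):
        blob = json.dumps(rows, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()  # content address
        self._blobs[digest] = blob
        return digest

    def get(self, digest):
        self.downloads += 1
        return json.loads(self._blobs[digest])


def make_layer(store, rows):
    """Upload a fragment and return (digest, min_key, max_key) metadata."""
    return (store.put(rows), min(rows), max(rows))


class LayeredTable:
    """Resolves keys through a stack of fragments, newest layer first."""

    def __init__(self, store, layers):
        self.store = store
        self.layers = layers
        self.cache = {}  # fragments already fetched

    def _load(self, digest):
        if digest not in self.cache:
            self.cache[digest] = self.store.get(digest)
        return self.cache[digest]

    def lookup(self, key):
        for digest, lo, hi in self.layers:
            if not (lo <= key <= hi):
                continue  # key range excludes this fragment: no download
            rows = self._load(digest)
            if key in rows:
                return rows[key]  # newest layer containing the key wins
        return None


store = ObjectStore()
base_a = make_layer(store, {"k1": "a", "k2": "b"})
base_b = make_layer(store, {"k7": "x", "k8": "y"})
update = make_layer(store, {"k2": "B"})
table = LayeredTable(store, [update, base_a, base_b])
```

Here `table.lookup("k2")` touches only the newest fragment, and `table.lookup("k8")` fetches only `base_b`; `base_a` is never downloaded unless a query asks for a key in its range. Because fragments are addressed by the hash of their content, uploading identical rows twice yields the same object.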

(Also, we support live data. But that's for another time!)

[0] https://www.splitgraph.com/docs/concepts/images

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...



