Immudb 1.0 – open-source, immutable database with SQL and verified timetravel (codenotary.com)
296 points by vchain-dz on May 25, 2021 | hide | past | favorite | 144 comments



> Data stored in immudb is cryptographically coherent and verifiable, just like blockchains, but without all the complexity. Unlike blockchains, immudb can handle millions of transactions per second, and can be used both as a lightweight service or embedded in your application as a library.

> Companies use immudb to protect credit card transactions and to secure processes by storing digital certificates and checksums.

This explanation is available on their github repo [1]. It has been a common refrain on Hacker News that you don't need a blockchain and instead can just use a database, but this product may actually fill the gap where tamper resistance is desired.

[1] https://github.com/codenotary/immudb


I don't quite understand how something you run yourself on your own hardware can be tamper-proof (digitally, not physically). If you're running the software you can modify it, so no matter how many processes there are in place for resisting mutability, you'll always be able to find some way to mutate it.

Compared to blockchain which is running on X number of nodes that you'd have to have access to in order to modify something, immudb doesn't actually seem to replace the use case when you need something actually tamper-proof.


You can build a Merkle tree from any data that is append-only; Git does this, as do ZFS and Dat/Hypercore. That lets you make strong assertions about data integrity, even without blocking local writes.

Now add a mirror: git upstream, FS snapshot, immudb replica, etc…or even just an outside log of the merkle proofs themselves. Then, if your database ever fails a check against that proof, you know the data has been modified, not just appended to.
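A minimal sketch of that check in Python, using a linear hash chain rather than a full Merkle tree to keep it short (all names are illustrative):

```python
import hashlib

def chain_root(entries):
    # Fold an append-only log into one hash; changing any past entry
    # changes every subsequent root.
    h = b"\x00" * 32  # genesis value
    for e in entries:
        h = hashlib.sha256(h + e.encode()).digest()
    return h.hex()

log = ["alice pays bob 5", "bob pays carol 2"]
mirror_root = chain_root(log)  # proof stored on the mirror / upstream

log.append("carol pays dave 1")             # appends keep old proofs valid...
assert chain_root(log[:2]) == mirror_root   # ...as verifiable prefixes

tampered = ["alice pays bob 500", "bob pays carol 2"]
assert chain_root(tampered) != mirror_root  # edits to history fail the check
```

The mirror only needs to store the latest root it has witnessed, not the data itself.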

To use a familiar Git workflow example: you can do whatever local writes you want, but if you disallow force pushes no one can erase history on the upstream repo.

Put another way: if you have immutable backups you don’t need a blockchain to ensure data integrity. OTOH, if you can’t trust your own infrastructure even as far as a secure remote backup you have other problems that a blockchain won’t solve either.


Again though, the entire point of blockchain tech is that it's decentralized. In the git analogy, someone is "disallowing force pushes", but the person that is disallowing can modify the database.


You can use git in a secure decentralized way - you have people sign commits.


By the way, you can play with the immudb Merkle tree visually at https://play.codenotary.com


Blockchains like Bitcoin are actually not tamper-proof: they can be attacked via 51% attacks, where you can even rewrite history if you have enough hashpower. The protocol is explicitly designed to always follow the longest chain, so the only defense is to hash faster than the attackers. This may vary for other blockchains, but the biggest and most-mentioned one is particularly unsafe in that regard.


There are two things you skip over:

1. You can't rewrite everything. Given enough hash power to create a longer chain, you can create a block that a) removes any transactions from a block in the original chain and b) contains new valid transactions (which must be signed by you, so you must own the Bitcoins used in the tx). That lets you double-spend your own coins, but you can't change other people's transactions.

2. With each new block, changing a past one becomes harder, while you make it sound as if you could arbitrarily rewrite history. Merchants usually wait several blocks before accepting an on-chain payment. Exchanges wait 6 blocks, since changing a block buried under 5 other blocks is seen as infeasible for non-nation-state actors.

TL;DR: 1) other people's transactions can at most be removed, not changed, and 2) data on the Bitcoin blockchain is tamper-proof after x blocks.


This is ignoring that if you own 51% of the network, you are free to just rewind the chain back to whatever you think is suitable, and rebuild it from that point on.

Sure, someone, somewhere will know the truth, but who cares - it's all about information control, and you control 51% of the network. Now your opponents have to scramble with social countermeasures to try and discredit your chain. A politically savvy attacker will simply buy off the main social nodes.

EDIT: The problem with the "oh but people will know" argument is that it assumes credibility outside the network - the thing cryptocurrencies like to claim is unnecessary. In reality, if the network is 51%-attacked, your entire recovery strategy rests on enough people believing you about what's real or should be real. This isn't even theoretical: it's literally what happened to Ethereum.


> This is ignoring that if you own 51% of the network...

Are you talking about some kind of node sybil attack? Because the only way I can make sense of what you've said is if a system had client implementations maintaining zero state - relying completely on unsecured network consensus. But in that scenario hashing power wouldn't enter the equation. I don't know of any cryptocurrency that operates that way, and doubt that one ever has outside of a home lab.

Or are you talking about someone trying to hijack a cryptocurrency by creating brand confusion and convincing people to run modified code, like the failed bcash campaign against bitcoin?

A 51% attack on a blockchain has no rewrite ability past the point where the attacker establishes a longer chain. That is why it is a "chain"... where do you think the hashing-function inputs being fed into the adversarial mining hardware come from? An earlier block - the one where the attack started. You can't go back further than that without having to throw out and rehash your entire attack fork.


You are both mostly right, as I understand it.

If I had >50% hashing power of BTC(for exactly 1 transaction), I can make the current chain say anything I want, like give me all the BTC, and it would become valid and "permanent". To re-write actual history takes a lot more work, and wouldn't be possible with 51% for just 1 transaction. If I were able to maintain 51% control for a long time, then I could do anything I want for as long as I held that control. Though I imagine after that very first transaction, the world of BTC would blow up and everyone would stop hashing BTC, as there would be zero point: the current chain would now be effectively useless.

This is the real issue that I see: 1 transaction of 51% power is enough to permanently wipe out all of BTC's worth. As far as I'm aware, every cryptocurrency basically has this same problem.


> You are both mostly right, as I understand it.

> If I had >50% hashing power of BTC(for exactly 1 transaction), I can make the current chain say anything I want, like give me all the BTC, and it would become valid and "permanent".

You clearly don't understand it, and it is kind of amazing that you think such a design would survive for as long as bitcoin has (going on 12 years) - by relying on altruistic miners not taking advantage of such a silly flaw. The Satoshi white paper is 9 pages long and written in very plain language... do what everyone used to do back in the day: read the paper and take a peek at the source. It goes a long way in inoculating you from misinformation that you subsequently repeat, accidentally (?) misinforming others.


BTC has changed a bit since the original white paper, and I have read the white paper (though it's been a while). Back it up with sources to convince me I'm wrong; don't just say I am with nothing but YOU ARE WRONG, that doesn't accomplish anything.

To help, here is my understanding.

In a <51% (i.e. normal) transaction, if I win the BTC mining lottery and get to write the next transaction, I can make it say whatever I want, but then 51% of the mining network has to agree that what I said was sane (as defined by the current mining software).

If I control 51%, then I can make the transaction say whatever I want AND I can make everyone else accept it as sane, because a majority of the network agrees with me.

This is how BTC changes over time, > 51% of the network agree that they will accept X as the new reality, and it then becomes the new reality.


The old question about how one eats an elephant comes to mind... you've expressed such a fundamental misunderstanding that the only sensible correction is: YOU ARE WRONG, START OVER. For example, this:

> ..make the current chain say anything I want, like give me all the BTC..

You know that transactions are cryptographically secured by public and private keys, right? That would be like saying "I can h4x0r all the hotmails and rewrite every PGP armored message to say whatever I want!" Do you think that miners, upon building the next block, have the opportunity to ignore all the rules with regard to PKI?

> ..and it would become valid..

There is a very straightforward block validation sequence, your attack would impotently collapse against it for any number of reasons: https://en.bitcoin.it/wiki/Protocol_rules

> I have read the white paper.

No you didn't. Sorry to have to put it so bluntly, but you couldn't have read it and still be so comically wrong - like I said, it is written in very plain language.


I don't think you understand software like you think you do. Those protocol rules do exist, and yes, cryptography is involved, obviously, but those rules exist because the miners all agree on them, see BIP 2: https://github.com/bitcoin/bips/blob/master/bip-0002.mediawi...

I.e. These rules CHANGE, and since they change(and have changed in the past) if you can convince a majority, you can change the rules to be whatever you want.

Also see: https://en.bitcoin.it/wiki/Economic_majority

Which is exactly what I said. If I have 51% of mining power I can make BTC do whatever I want, but that doesn't mean the majority (or any) of the exchanges will accept it, which is basically what the above is saying.

Also see: https://en.bitcoin.it/wiki/Bitcoin_is_not_ruled_by_miners

Again, what they are saying is what I've said; they just put flowery language around it, saying: see, miners can't do EVERYTHING - which isn't technically true, but practically true. Miners can technically do whatever the hell they want, assuming they have the majority, but that doesn't mean exchanges like Coinbase will accept it and exchange the BTC for USD.

So what I said above is generally and technically true; see my other comment in this thread as well, where I said it would be a giant mess and likely ruin BTC forever if someone ever did execute a 51% attack. So there is little incentive (financial or otherwise) to do so.

The closest real-life example we have (that I'm aware of) is the Bitcoin Cash stuff, where it hard-forked and became its own cryptocurrency because they couldn't get a majority to agree, but enough agreed to fork themselves.


> I don't think you understand software like you think you do.

Sorry, can't hear you above the noise of your furious backpedaling. If somebody rewrites the protocol rules in their software to allow anything approaching what you've described (lol, "yes cryptography is involved"), then why are they worried about btc hashing power? I mean, you specifically said:

> If I had >50% hashing power of BTC(for exactly 1 transaction), I can make the current chain say anything I want, like give me all the BTC, and it would become valid and "permanent".

I'll tell you what it would look like to the rest of the network if anybody enacted your diabolical plot: suddenly somebody starts submitting invalid blocks to the network at regular intervals and only those who are monitoring for weird traffic like that even notice, then the block solve rate slightly sags until the difficulty automatically adjusts. Congratulations, you've turned your massive hashing advantage into an incredibly expensive joke.


> I.e. These rules CHANGE, and since they change(and have changed in the past) if you can convince a majority, you can change the rules to be whatever you want.

False, everyone running a node is enforcing whatever rules their node is written to enforce.

You could have 99% of the hashpower, if you generated a block saying that Coinbase is handing over all their Bitcoin to you (with an invalid signature because you don't have their private key), their node and the nodes of most/all exchanges and businesses would just go "lol, wtf is this shit, not a valid block because it has a transaction with an invalid signature, ignored".

A majority of hashpower is not the same as a majority of economic participants.
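A toy model of that point, with an HMAC standing in for the real public-key signature so the sketch stays stdlib-only (all names are illustrative):

```python
import hashlib, hmac

coinbase_key = b"known only to Coinbase"  # stand-in for their private key

def sign(key, tx):
    # HMAC stands in for an ECDSA signature here; the point is the same.
    return hmac.new(key, tx.encode(), hashlib.sha256).hexdigest()

def node_accepts(block):
    # Every full node checks every signature, no matter who mined the block.
    return all(hmac.compare_digest(sig, sign(coinbase_key, tx))
               for tx, sig in block)

honest = [("coinbase -> merchant: 1 BTC",
           sign(coinbase_key, "coinbase -> merchant: 1 BTC"))]
forged = [("coinbase -> attacker: all BTC",
           sign(b"attacker guess", "coinbase -> attacker: all BTC"))]

assert node_accepts(honest)
assert not node_accepts(forged)  # 99% of hashpower doesn't help here
```

However much hashpower the attacker has only affects how fast they produce blocks, not whether the signature check passes.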


> If I had >50% hashing power of BTC(for exactly 1 transaction), I can make the current chain say anything I want, like give me all the BTC, and it would become valid and "permanent".

No, this is a big misunderstanding of how it works. Having more than 50% doesn't give you superpowers, it just means that you're able to create VALID blocks faster than the rest of the network combined (creating invalid blocks is useless), which allows you to censor transactions (by not including them in your blocks, which will be the chain with more work because you have 50%+) and you can TRY to do double-spends.

A double-spend gets exponentially harder and more expensive to pull off the longer your target waits before considering the original transaction final, because you'll have to rebuild all the blocks generated since that transaction was included in the chain, so that your parallel chain that doesn't include that transaction becomes the chain with the most work that everyone follows.
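That exponential drop-off can be computed directly; this is a small sketch of the attacker-success formula from section 11 of the Bitcoin white paper:

```python
import math

def attacker_success(q, z):
    # Probability that an attacker with hashpower share q < 0.5 ever
    # overtakes the honest chain after the victim waits z confirmations
    # (Bitcoin white paper, section 11).
    p = 1.0 - q
    lam = z * (q / p)
    s = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        s -= poisson * (1 - (q / p) ** (z - k))
    return s

assert attacker_success(0.1, 0) == 1.0    # zero confirmations: trivially at risk
assert attacker_success(0.1, 5) < 0.001   # roughly 0.1% after 5 confirmations
assert attacker_success(0.3, 10) < attacker_success(0.3, 5)
```

This is why "wait z blocks" policies work: for a fixed minority share q, the success probability falls off exponentially in z.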


What I'm saying is that, if enough big players like Coinbase et al. started saying "oh there's a BTC glitch, you need to pull a new state file..." then a huge number of people would do it. Not all, but you don't need all - just enough.


Again, if they don't control 51%, their new statefile would be pointless, unless they did a hard-fork, but if I have the resources for a 51% attack on the current fork, chances are I have the same capability on the new fork.

The only way what you propose would work would be if they could cobble together enough resources to break my 51% control.

So BTC would go 100% bust if someone managed a 51% attack. The question is: can someone with 1 transaction of 51% get enough converted to USD/etc. before enough people notice? Otherwise the financial incentive isn't there to try. I'd guess no matter what, it would be a huge fricking mess, and if you did it in a country that didn't like you, it probably wouldn't end well for you years later, when whatever government you live under gets around to ruining your life, even if you managed to extract a few billion.

Because you know the exchanges like Coinbase, as soon as they noticed, would do their best to stop you (as it's in their best interest).


> This is ignoring that if you own 51% of the network, you are free to just rewind the chain back to whatever you think is suitable, and rebuild it from that point on.

Not exactly, for every block you rewind there is extra work you have to do, to the point where it may take you close to 2 years to rewind 1 year's worth of blocks (because while you're doing that the rest of the network is still creating new blocks).

It gets exponentially harder/more expensive the further back you want to go.
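A hedged back-of-envelope for the catch-up cost, under simplifying assumptions (deterministic mining rates, the whole network's block production normalized to one unit per unit time; real mining is stochastic, which is where the exponential improbability for minority attackers comes from):

```python
def years_to_rewind(q, years_back):
    # Deterministic sketch: the attacker mines a fraction q of all blocks,
    # honest miners mine p = 1 - q, and the attacker must redo years_back
    # worth of blocks *and* overtake the still-growing honest tip.
    # Net progress per unit time is q - p, so t = years_back / (q - p).
    p = 1.0 - q
    if q <= p:
        return float("inf")  # a minority is never expected to catch up
    return years_back / (q - p)

assert years_to_rewind(0.75, 1) == 2.0           # ~2 years per year rewound
assert years_to_rewind(0.51, 1) > 49             # bare majority: decades
assert years_to_rewind(0.40, 1) == float("inf")
```

Under these assumptions, "close to 2 years to rewind 1 year" corresponds to roughly 75% of total hashpower; a bare 51% majority would take decades.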


You can't rewrite the history of blocks that have already been distributed. You may fool SPV nodes but any node with a copy of the blockchain (even if pruned) will reject your version.


This is false. Please read the white paper. It clearly states in section 4:

> The majority decision is represented by the longest chain, which has the greatest proof-of-work effort invested in it. If a majority of CPU power is controlled by honest nodes, the honest chain will grow the fastest and outpace any competing chains.

If the majority of the hashpower (not nodes!) is dishonest, you can rewrite history. It's the reason why the current difficulty is part of a block.

This has been done numerous times in the past for small but highly traded altcoins.


You can still have the extra verifier nodes, but those don't have to be on the critical read/write path.

Presumably you can create a config where you have your "main" beefy server where all the activity is -- which is backed up, redundant, etc. -- and a bunch of "client" servers, which just pull and verify the data all the time. The client servers notify you of any errors using some out-of-band channel, probably the same system you use for general server health monitoring.

So you get the same security guarantees as a "private blockchain", but with drastically higher performance, and only one beefy server needed. The downside is that you won't auto-stop all operations on tampering; you'll only get an alert for it.
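A sketch of such a verifier client, assuming a toy hash-chained log (the fold shown here is illustrative, not immudb's actual proof scheme):

```python
import hashlib

def state_root(entries):
    # Illustrative fold of a log into a single hash.
    h = b"\x00" * 32
    for e in entries:
        h = hashlib.sha256(h + e.encode()).digest()
    return h.hex()

class Verifier:
    # Off-the-critical-path client: remembers the last verified state and
    # raises an alert (rather than blocking writes) when history changes.
    def __init__(self):
        self.seen_len = 0
        self.seen_root = state_root([])

    def check(self, log):
        ok = (len(log) >= self.seen_len
              and state_root(log[:self.seen_len]) == self.seen_root)
        if ok:  # advance the watermark only on success
            self.seen_len, self.seen_root = len(log), state_root(log)
        return ok  # False -> fire the out-of-band alert

v = Verifier()
assert v.check(["a", "b"])           # initial sync
assert v.check(["a", "b", "c"])      # appends are fine
assert not v.check(["a", "x", "c"])  # rewritten history -> alert
```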


> immudb doesn't actually seem to replace the use case when you need something actually tamper-proof.

I think that's an unrealistic requirement. There's tamper-evident and tamper-resistant, but AFAIK nothing is tamper-proof. The best you can do is an HSM with a tamper-resistant HMAC with keys and a running checksum in unrecoverable ROM coupled to the packaging.


> nothing is tamper proof

I beg to differ.

If I place a signed message in the Bitcoin chain, can you then modify that message?

If you can prove that you can somehow modify the message, I'll give you $1,000,000 USD tomorrow.


That's "tamper proof" in the colloquial sense. As a term of art, it means something very specific. For example, see FIPS 140-2/3 [1]

It makes no sense to say that the blockchain is tamper proof because the blockchain is just information. Tamper "proofness"/resistance is first a property of the devices storing the information - once you get into custody chains, provenance documents, etc. that's when a system becomes tamper resistant. At best the blockchain as a system is "tamper evident" in the colloquial sense because the network of all the other nodes decides which bits of information form the "real" blockchain. However, without verifying the (physical) identity and data integrity of the devices that run (at least?) 50%+1 of the nodes, you have no idea whether the system has been tampered with.

[1] https://en.wikipedia.org/wiki/FIPS_140


If you want to raise the bet to a few billion dollars I'll happily take you on.

But that's just a question of scale - if you have a rando-blockchain you use for immutability internally then how trivial would it be for me to spin up five servers to outhash you and rewrite history?


Not after you post it, but by infecting your device before you make that message, and tampering with it as you place it in the Bitcoin chain.


So you agree, once it's on the blockchain, it's tamper-proof?


The entire state of the database is captured by a hash value. Having lightweight clients (or auditors) keep track of it is how tampering is detected, regardless of where the database server is running.


This is insufficient. The strongest guarantee you can get without consensus is that the state of the DB you see on the client is/was a correct state at some point, it doesn't provide freshness/rollback attack prevention, aka that the state you see is in fact the latest one.

Keeping track of the "HEAD" hash on the clients is what consensus protocols achieve. You can also achieve it with trusted counters like the one SGX provides (depends on Intel ME so not exactly recommended, also most probably switched off in cloud environments). Alternative is an implementation of something like https://dl.acm.org/doi/10.5555/3241189.3241289.

You can of course say that it's the clients' responsibility to do this, but in practice they won't and they'll implicitly trust the server state.

Having said this, the project does look promising; we may end up using it in a confidential-compute setting where clients can verify the server code running, and we'll add rollback protection on top.
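A sketch of the client-side freshness check described above; in practice the counter would come from a trusted monotonic source (consensus, or a trusted hardware counter), and all names here are illustrative:

```python
class FreshnessClient:
    # Pins the highest (counter, head) pair verified so far; a server
    # replaying an old-but-valid state fails the monotonicity check.
    def __init__(self):
        self.counter = -1
        self.head = None

    def accept_state(self, counter, head):
        if counter < self.counter:
            return False  # rollback: older than a state already seen
        if counter == self.counter and head != self.head:
            return False  # two different states at the same height
        self.counter, self.head = counter, head
        return True

c = FreshnessClient()
assert c.accept_state(1, "h1")
assert c.accept_state(2, "h2")
assert not c.accept_state(1, "h1")   # was valid once, but stale now
assert not c.accept_state(2, "h2'")  # same counter, different head
```

This only catches rollbacks relative to what this client has already seen, which is exactly why a fresh client with no pinned state still needs consensus or a trusted counter.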


> aka that the state you see is in fact the latest one.

This is an impossible guarantee. Suppose the state that is sent to you from the server needs some time to get to you; meanwhile the state on the server could have changed. You don't even need a remote server to have this issue. Your thread (where you see the latest state) is put to sleep for a while (scheduler, OS, ...). It wakes up. Is the state it observes still the latest? That's impossible to know. The only thing you can do is refuse future updates if the state they were built upon is not the current state of the database.
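The "refuse updates built on a stale state" rule is ordinary optimistic concurrency; a minimal sketch with illustrative names:

```python
class Database:
    # Optimistic concurrency: a write must name the state hash it was
    # built on, and is refused if the head moved in the meantime.
    def __init__(self):
        self.head = "genesis"

    def commit(self, based_on, new_head):
        if based_on != self.head:
            return False  # built on a stale state: re-read and retry
        self.head = new_head
        return True

db = Database()
snapshot = db.head                     # two writers read the same state
assert db.commit(snapshot, "h1")       # first writer wins
assert not db.commit(snapshot, "h1b")  # the racer is rejected
assert db.commit("h1", "h2")           # retry on top of the new head works
```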


That's what I meant. If you have many transactions building on a state hash X racing to be committed, only one of them will succeed.

With crypto protocols in general "guarantees" are always prefixed with "if the protocol completed successfully, then ...". For example authenticated DH + e2e encryption guarantees that you will send data to the intended participant only. But an attacker can still disrupt the network packets, so the true guarantee is "if the protocol completed successfully, then you have sent the data to the intended participant only".

Same thing here, you cannot of course guarantee the "latest state", if we want to go into the extreme, one could even argue that actually time doesn't work like that because of relativity/speed of light limitations:D. What you can guarantee is that if your commit protocol succeeded, then it updated the latest state at the time of consensus/monotonic counter update.


> That's what I meant.

Good. Sorry to be picky, but wording is important here and you don't wanna know how often I failed to convince people of the impossibility of exactly that guarantee.

> time doesn't work like that because of relativity/speed of light.

You're right.


I see. It's a blockchain without calling it a blockchain, so people who hate blockchain can use it without having to realize they use a blockchain.


Blockchains are just a special case of Merkle trees; there isn't anything original about them. Bitcoin simply served as a marketing engine for the term "blockchain" because some people made a ton of money with it.

https://en.wikipedia.org/wiki/Merkle_tree


No, blockchains are entirely different from Merkle trees. Blockchains include the previous block's hash in each block, whereas Merkle trees, as the name implies, are trees of hashes. In a Merkle tree, blocks do not include the previous block's hash.


In git each "block" includes the previous "block"(s)' hash. Is it a blockchain or a hash tree?

I would say that in practice what differentiates a blockchain from other applications of hash trees is a mechanism for consensus, not whether the blocks being formed into a tree conceptually represent time or not.


Git is a blockchain. Blockchains require the previous block's hash to denote sequence. The hash of any block of data can be appended to a Merkle tree, even duplicate blocks.

In a blockchain it is easy to find history because the link to it is included. Merkle trees require n-1 additional hashes.


Disagree. There are no formal definitions for these terms, but I don't think your definition here is what most people would think of as a blockchain.


Looking at the wikipedia article I can see where one might be confused.

    A blockchain is a growing list of records,
    called blocks, that are linked together using 
    cryptography. Each block contains a cryptographic
    hash of the previous block, a timestamp, and
    transaction data (generally represented as a
    Merkle tree).
The Merkle tree referenced here is with respect to the organization of the transaction data contained within a block, not the blockchain itself.

Merkle trees are used in various ways within cryptography in general and cryptocurrencies specifically, but blockchains and Merkle trees are distinct data structures with different uses. The colloquial use of "blockchain" has perhaps made the word somewhat ambiguous in some contexts but not in the context of cryptographic data structures, and Merkle trees are in fact formally defined.
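A small sketch of the two distinct structures (simplified; Bitcoin's real header layout differs):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def chain_tip(blocks):
    # Blockchain: each block commits to the previous block's hash, so the
    # tip hash pins the entire ordered history.
    prev = b"\x00" * 32
    for b in blocks:
        prev = h(prev + b)
    return prev

def merkle_root(leaves):
    # Merkle tree: leaves are hashed pairwise up to a single root.
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0]

txs = [b"tx1", b"tx2", b"tx3", b"tx4"]
# A Bitcoin block header carries both: the previous block's hash (the chain)
# and the Merkle root of its own transactions (the tree).
assert chain_tip(txs) != chain_tip([b"tx1", b"txX", b"tx3", b"tx4"])
assert merkle_root(txs) != merkle_root([b"tx1", b"txX", b"tx3", b"tx4"])
```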


You can visualize how immudb Merkle tree grows as you insert data on https://play.codenotary.com


> there isn't anything original about them

Cool, so let's say we're using this to track financial transactions.

Server A has an Immudb at some state x and server B has an Immudb at some state y. Which one is correct, how do I decide?


Rekor is just that: a Merkle tree implementation (with extras such as timestamping).

https://github.com/sigstore/rekor


And with git being the most superior blockchain of them all.


Except it has no way to achieve consensus automatically. That's left as an exercise to the reader.


It's only the actually-useful bits of a "blockchain" without the planet-cooking proof-of-waste consensus algorithm brute-forcing sha256 over and over again.


And also without the useful "automatically achieve consensus between untrusted parties" bit.


You can do something quite simple, like posting a tweet or inserting something into a public chain like Ethereum, then follow that back to the private immudb hash.


Was there a gap in the first place? We could build tamper-proof data storage since way before the blockchain. All you need is checksums, public-key cryptography, and a way to publish your signed checksums.

I'm not saying that this isn't a good project but it's a bit strange to frame it as if it was a major technical breakthrough.

If anything what catches my eye in this announcement is the "time travel" feature as well as the wire protocol compatibility with Postgres, that's pretty cool.
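A sketch of that pre-blockchain recipe; the HMAC is a stdlib stand-in for a real public-key signature (e.g. Ed25519), which would let anyone verify with just the published public key:

```python
import hashlib, hmac

signing_key = b"publisher secret"  # stand-in for a private signing key

def publish(snapshot: bytes):
    # Checksum the data, then "sign" the checksum; publish both somewhere
    # third parties can see (a newspaper ad worked long before blockchains).
    digest = hashlib.sha256(snapshot).hexdigest()
    sig = hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()
    return digest, sig

def verify(snapshot: bytes, digest, sig):
    expected = hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()
    return (hashlib.sha256(snapshot).hexdigest() == digest
            and hmac.compare_digest(sig, expected))

d, s = publish(b"ledger as of 2021-05-25")
assert verify(b"ledger as of 2021-05-25", d, s)
assert not verify(b"ledger, quietly edited", d, s)
```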


Getting it all for free in the product is kind of a huge win though. It's always been possible with other systems, but you still have to implement it.


> All you need is checksums, public key cryptography and a way to publish your signed checksums.

I agree that’s sufficient.

One non-technical addition I see in Bitcoin is the incentive to verify the checksums.


Does immudb offer mechanisms for distributed consensus? That is one of the top features of blockchains, and they provide it while remaining P2P.


The order of changes is not subject to consensus, but clients have the tools to ensure no history rewrite has happened.


sounds like git :)


I think both blockchains and git are based on the concept of merkle trees, so that sounds about right.

https://en.m.wikipedia.org/wiki/Merkle_tree


immudb has a website where you can visualize the Merkle tree in real time as you insert data: https://play.codenotary.com/


Can immudb work in a decentralized network while remaining secure from attacks in such networks, or is it meant for centralized systems? If the latter, I don't think you can compare it to blockchains; maybe a better comparison is Git.


immudb is not meant for public decentralized networks, although it might be possible to use embedded immudb to build a public blockchain... but that's a different story. immudb server is tailored to provide a database where any tampering will be subject to detection by any single client application consuming its data.


Yeah, this interests me because I'm thinking about how to use Grafeas - its role is critical for reliable software development going forward - and storing its data in a backend like this would add one more layer of trust and verifiability to a software supply chain. There are some interesting possibilities in making e.g. public software repos' metadata clonable, verifiable, and queryable via local immutable copies.


Maybe take a look at rekor, part of the sigstore project, it's built specifically for software supply chain transparency (disclaimer I am one of the community). Being a transparency log, you get much better guarantees around inclusion proof (it uses a merkle tree):

https://github.com/sigstore/rekor


It sounds a lot like ZFS.

What I really want is a way to get a hash of a root node / snapshot.


How does this compare, feature-wise, to https://aws.amazon.com/qldb/ ?


There are many differences (speaking as an immudb contributor):

- immudb can be used as an embedded or client-server database, while QLDB is an AWS service
- immudb behaves as a key-value store but also provides SQL support, while QLDB provides a document-like data model with the PartiQL language
- immudb provides time travel features
- immudb is faster, with a built-in mode of operation designed for fast writes that works with eventual indexing

Finally but super important, immudb can be deployed anywhere and it's open source!


QLDB provides time travel features, too (if by "time travel" you mean being able to query the state of a record at an arbitrary point in the past): https://docs.aws.amazon.com/qldb/latest/developerguide/worki...


immudb already included history support for key-value entries in previous releases. But since v1.0.0, immudb provides query resolution at a given point in time, using the data current at that specific moment, and it can also combine data from different points in time in the same query. It's not clear to me whether that can be achieved with "SELECT * FROM history"; it requires at most one result per entry (the most recent one).


QLDB is a document DB, so you are limited to a single point or range per query. Also keep in mind `history` in QLDB is a function, not just a store of previous values; given a table "foo" and a key "bar", getting its immutable state from last Tuesday at 4 PM EDT would be:

    SELECT * FROM history('foo', `2021-05-18T20:00:00`, `2021-05-18T20:00:00`) as t WHERE t.metadata.id = 'bar';


The temporal features provided in immudb allow query (and subquery) resolution based on older states of the database. For instance, it can be thought of as retrieving documents in the state they had at a given point in a time range. Querying the history of changes of a given key or document is slightly different, and it's also covered by the history operation in immudb.


Ok, that sounds extremely similar to the history function in QLDB.

In the examples shown in the AWS docs, the results of a historical query are not the changes made to the document, but the fully resolved state of the document at the requested timestamp (or within the timestamp range). As other threads on this page mention, this is an unusual but not unheard-of DB feature these days.



> this product may actually fill the gap where tamper resistance is desired

I think in the future, all enterprise storage solutions will be append-only by default, to protect against cryptolocker malware, but with isolated functionality for actually deleting data - for example because of GDPR requests, or because of malware that tries to fill all writable storage with garbage. Data could still be deleted, but not from any of the regular servers that read and append data to the system; only from separate servers that are isolated and used for data storage management only.


> immudb is the first database which allows you to do queries across time.

I don't think it is.

e.g. Datomic has had this for a long time, no?


Several databases (MS SQL Server, MariaDB, Postgres with appropriate extension) support system versioned temporal tables (added in the SQL2011 standard, though I don't know if any DB entirely follows the standard) which I'm pretty sure counts as "queries across time".

Maybe they are claiming to be the first with it built in as a core part of the engine that is specifically optimised for it, but even that might not be true.


> even that might not be true

It's not. For example, see SAP HANA's "Timeline Index" https://websci.informatik.uni-freiburg.de/publications/sigmo...


Oracle has had flashback queries for a long time.

Though this does not do what immudb claims:

> immudb is the first database to provide tamper-evident data management, immutable history and client-cryptographic proof.

And:

> Clients do not need to trust the server and every new client adds trust to the deployment


We're building something similar to this at Splitgraph, at least in the sense that we have immutable data in a Postgres-compatible DB with point-in-time queries across versioned, addressable snapshots. In our case, we apply the idea of immutability to "data images" that are analogous to Docker images. You build and push them in the same way, and then you can reference any "image" (version) [0] of data by addressing it with the correct tag.

For example, here is a link to a live query on our Data Delivery Network (DDN) that runs a JOIN on two daily snapshots (20200809 and 20200810). [1] In this case, these images are the result of a daily script that builds and pushes a new image each day. The storage costs are minimal, as each new image only needs to store the changed rows, rather than a duplicative snapshot.

Each immutable image is composed of a set of small content-addressable cstore fragments uploaded to object storage, which we only load into the database when they become necessary to satisfy a query. When a query arrives at the DDN, we intercept it at the network level by scripting PgBouncer with embedded Python to orchestrate the infrastructure required to answer the query. The embedded code parses the AST of the query for table references, which it uses to "mount" a temporary schema for serving the query. The temporary schema includes an FDW that implements a "layered querying" protocol (think AUFS) to lazily download only the fragments required to satisfy the query.

(Also, we support live data. But that's for another time!)

[0] https://www.splitgraph.com/docs/concepts/images

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...


Doesn't Bigtable, according to the 2006 paper, allow for this too?

> Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent realtime in microseconds...

https://research.google/pubs/pub27898.pdf


Datomic does, so does Oracle, Snowflake and BigQuery.


Teradata Vantage, too.


Yeah, that line was a real head scratcher. I think someone in the marketing department got a bit ahead of their reality.


It’s not marketing as much as it is “try to get a patent for something that’s been done for decades by doing it slightly differently”.


Yeah, it’s not. Which makes the rest suspect.



There seems to be a growth in the number of time-traveling immutable-first databases available. We have OpenCrux, Datomic, TerminusDB, Noms, Dolt, and now immudb. Three use Datalog for queries and two force SQL (not sure about Noms).

What sort of use cases are most common? GitHub repository says:

> Companies use immudb to protect credit card transactions and to secure processes by storing digital certificates and checksums.

But I am not sure how people are building that into their architecture to be honest.


I have used "immutable" schema designs when there were strong requirements for full audit needs over time. It works very well even in a normal RDBMs system. It also allows some very neat reporting e.g. compare the same report at different points in time.

The basic idea was that every operation (create, update, delete) is actually a normal SQL insert, and all reads are against views defined such that the most recent tuples are returned unless they are flagged as "deleted."

I have typically used these types of designs in mostly simple applications with tables where the row counts are in the low millions of tuples. Dealing with this design in the billions of tuples (probably sharded somehow) might have motivated us out of normal RDBMs and into one of the specialized immutable DBs mentioned.
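The pattern above can be sketched in a few lines. This is an illustrative toy using SQLite (the table and view names are made up, not from any real system): every write is an INSERT into a log table, and a view projects the "current state".

```python
# Sketch of the append-only pattern: all mutations are INSERTs into a log,
# and a view resolves the current state. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts_log (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order = time order
    id      INTEGER NOT NULL,                   -- business key
    balance INTEGER,
    deleted INTEGER NOT NULL DEFAULT 0          -- tombstone flag
);
-- "Current state" view: latest row per id, excluding tombstones.
CREATE VIEW accounts AS
SELECT id, balance FROM accounts_log l
WHERE seq = (SELECT MAX(seq) FROM accounts_log WHERE id = l.id)
  AND deleted = 0;
""")

# Create, update and delete are all plain INSERTs.
conn.execute("INSERT INTO accounts_log (id, balance) VALUES (1, 100)")
conn.execute("INSERT INTO accounts_log (id, balance) VALUES (1, 250)")  # "update"
conn.execute("INSERT INTO accounts_log (id, deleted) VALUES (1, 1)")    # "delete"
conn.execute("INSERT INTO accounts_log (id, balance) VALUES (2, 50)")

print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(2, 50)]
# Full history remains queryable from accounts_log, which enables
# the "same report at different points in time" trick mentioned above.
```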


Thanks. Can you comment on how this differs from a “mutable” RDBMS model but one with automatic history based on triggers for example?


I would imagine there are a few differences. Events are the primary entity, and the current state is simply a projection of that not the other way around. Those events may come from other systems and are often defined in business terms, not SQL terms. For example an event may also constitute a business update which can update one or many tables. Think of a transaction event updating a balance for two accounts.

TL;DR My thinking it allows you to capture more the intent of that event. Although I'm not sure you need an immutable database to do this from scratch - I've seen this in schema designs in the past?


Reminds me a lot of Fluree[0], an immutable, cryptographically verifiable, temporal database, but with RDF as a query language, which I think is very nice. SQL is nice because it's familiar but it's honestly not that hard to improve on.

[0]https://flur.ee/


So is this something I would want to use for a basic CRUD application, and reap the benefits of time travel and immutability?

Or are there downsides that would relegate it to specific use cases? A what would those use cases be?


It wouldn't be suitable for any application where you care about GDPR (i.e. you store personal information and have users in the EU)

The "right to be forgotten" is not compatible with immutable data. You can't simply mark data as deleted; you need to 'purge' it from your system (and possibly backups, depending on how long you keep historic backups) - and that isn't possible in a system with immutable data.


There are solutions for this. In the context of CQRS/event sourcing, I've read that it's possible to solve it by encrypting the data with different keys and then rotating/throwing away the keys as needed. Seems a bit hacky, but there are probably more elegant approaches.
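This "crypto-shredding" idea can be sketched as: keep the ciphertext in the immutable store and the keys in a separate, mutable key store. Deleting a key makes the corresponding records unrecoverable without touching the log. The snippet below is a toy (XOR with a random pad stands in for a real cipher; a production system would use AES or similar, and would never reuse a pad):

```python
# Toy crypto-shredding sketch: immutable ciphertext, mutable key store.
# XOR with a random pad is NOT real encryption -- illustration only.
import secrets

append_only_log = []   # pretend this is the immutable database
key_store = {}         # mutable, kept outside the immutable store

def xor(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, pad))

def store(user_id: str, record: bytes) -> int:
    key = key_store.setdefault(user_id, secrets.token_bytes(64))
    append_only_log.append((user_id, xor(record, key)))  # append-only write
    return len(append_only_log) - 1

def read(index: int) -> bytes:
    user_id, ciphertext = append_only_log[index]
    return xor(ciphertext, key_store[user_id])  # KeyError once key is shredded

i = store("alice", b"alice@example.com")
assert read(i) == b"alice@example.com"

del key_store["alice"]  # "forget" alice: the log is untouched,
                        # but her records are now unrecoverable noise
```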


What happens if you have to delete some data e.g. due to law?


You have several options here:

- store the data encrypted using a secondary protocol, lose the key

- rewrite the whole db

If either of these is not feasible then you should have thought longer about what tech is suitable for which application. Operating your company in a legal manner is a pretty strong factor when making such choices.


Is losing the key sufficient to comply with the law? "We didn't actually delete anything but I promise I don't remember how to decrypt it" would be acceptable for the court to not e.g. seize your drives?


It's the same as "we actually deleted the data and I promise we didn't keep any backup copies", except it's probably even easier to enforce, since you already have to secure the key rather than the whole database.


IANAL. Under GDPR's right to be forgotten, you need to get rid of any identifiable subject information. If you can't tell a subject from the data, then you comply. Encrypted data without a key is just noise.

You are allowed to keep aggregations and hashes of the data, as long as these don't allow identifying a subject. E.g. you can keep a list of banned emails as MD5s to verify on sign-up, etc.
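A minimal sketch of the banned-list idea: store only digests and compare on sign-up. (Note that hashing unsalted emails is brute-forceable, so whether this counts as anonymised under GDPR is a question for lawyers, not for this snippet.)

```python
# Keep only digests of banned emails; compare the digest on sign-up.
import hashlib

def digest(email: str) -> str:
    # Normalise before hashing so casing/whitespace don't matter.
    return hashlib.md5(email.strip().lower().encode()).hexdigest()

banned = {digest("spammer@example.com")}  # example address, made up

def is_banned(email: str) -> bool:
    return digest(email) in banned

print(is_banned("Spammer@Example.com"))  # True (normalisation matters)
print(is_banned("alice@example.com"))    # False
```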


In this situation though, any client who still knows the key can access the data, since there is no way to remove data from the database server, or make it unavailable at the server level.

Assuming the clients and server are operated by different entities (otherwise the immutability and verifiability are not that interesting), if someone comes to the server operator with a court order and asks that data be removed, it seems like there is nothing they can do.


You can’t do much of anything if you’ve already given away the information in question — the same is true if someone copied the data itself.

You have to not give away the key in the first place, at least not to any clients that you don’t own.

E.g. following the rule “any problem can be solved with a level of indirection”, external clients get some Auth key A, which they feed to internal client, who internally maps it to some data key B, and decrypts the data and hands it back to the external client.

When the data is removed, you delete the mapping from your internal client.


> store the data encrypted using a secondary protocol, lose the key

Thing is, you have to do this upfront. I think it's very possible to get into a situation where the data you have to delete is in plaintext. Dropping the whole DB and recreating it from scratch is a bit hefty.


I love what you’ve done. I think you may have an issue with the TimeTravel trademark however. Snowflake uses it in your exact market segment (not to mention where else it may be used in a similar context). Good stuff though, I’ll be checking it out.


I would like to have such a database based on git, where every change is a git commit. This should then work with things like GitHub, where you can connect to your database via the GitHub API. The db git repositories could be either private or even public. You could then deploy a serverless webpage to gh-pages and use a serverless gh-gitdb as storage.

serverless := you don't have to operate the infrastructure yourself


You should check out https://www.dolthub.com/ then. They are working on something very similar.



It seems like this is somewhat in that direction. It looks like it is using merkle trees to store the history.


Seems like a database that stores content hashes. Very cool but what makes it better than simply adding a table to my database (or a DB specifically for this) and running `insert into content_hashes...`?

The above approach also allows me to choose any database because I can model this data however I want.


immudb can hold the actual data. An equivalent approach using an existing database without these features would involve creating a cryptographic data structure that captures not only individual content but the entire history of changes, plus the functionality to construct and verify the cryptographic proofs needed to validate read data.


How is this any different from taking every mutation, signing it using whatever signing mechanism you'd like, and storing the hash in an extra column alongside the ones you already have?

Then, if anything changes you know it's been mutated because the computed signature has changed.


In some way, it's basically that but on steroids… Note that if the signature includes the previous one, then you are protecting the history of changes. However, this simple approach may not scale when dealing with considerable amounts of data: proving that some older entry was not tampered with may require validating all signatures from that point up to the latest one. immudb employs hash trees to optimise these proofs.
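The "signature includes the previous one" scheme is a hash chain. A minimal sketch (plain SHA-256 chaining, no real signatures) shows both why it protects history and why verifying an old entry is O(n), which is the cost Merkle trees avoid:

```python
# Minimal hash chain: each entry's digest covers the previous digest, so
# rewriting any historical row invalidates every digest after it.
import hashlib

def entry_hash(prev_hash: bytes, payload: bytes) -> bytes:
    return hashlib.sha256(prev_hash + payload).digest()

def build_chain(rows):
    h, hashes = b"\x00" * 32, []  # genesis value for the first link
    for row in rows:
        h = entry_hash(h, row)
        hashes.append(h)
    return hashes

rows = [b"row-%d" % i for i in range(5)]
hashes = build_chain(rows)

# Verifying that rows[1] is intact means recomputing the chain from there
# up to the latest published hash -- linear work, hence the hash trees.
rows[1] = b"tampered"
assert build_chain(rows)[-1] != hashes[-1]  # tampering is detected
```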


Microsoft recently announced a preview of Azure SQL Database Ledger, which uses a mechanism similar to the one you suggested.

https://docs.microsoft.com/en-au/azure/azure-sql/database/le...


Your solution wouldn't handle the case of row deletion.

It's a little harder than you might think to make a database with tamper resistance.


Oh I'm sure - but without delving into philosophy, how would you know that something was deleted and tampered with, vs. immudb (for example) being compromised such that it's possible to delete something without you knowing, vs. it never having existed to begin with?

In my mind the only way to guarantee is to maintain a copy yourself and check against the "original", but if you're going to do that, then what I described is sufficient, no?

I only mention this because the project mentions that the history is protected by clients, which I imagine is similar to what I'm describing, e.g. copying and checking against the original.


> In my mind the only way to guarantee is to maintain a copy yourself and check against the "original", but if you're going to do that, then what I described is sufficient, no?

The attacker in that case could update your copy. But you have somewhat started to fix the issue.

To cover the case where a bad admin has access to the DB and any copies, you need to send a hash every so often to an outside source. In this case they use clients (I'm not sure exactly how they do this).

In fact you need a list of hashes, one for every 100 rows for example. Re-generating the hashes and checking against an external source should detect tampering.

In the case of Bitcoin (which is extremely tamper resistant) every node operator is a validator. The hashes are stored in a merkle tree.
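The "hash anchored at an outside source" idea can be sketched with a tiny Merkle root (this is a simplified construction, not immudb's actual tree or Bitcoin's): publish the root off-site, then later recompute it over the local rows and compare.

```python
# Sketch of external anchoring: recompute the Merkle root from local rows
# and compare it with the last root published to an outside source.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

rows = [b"tx-%d" % i for i in range(100)]
published = merkle_root(rows)  # sent to clients / an external log

rows[42] = b"evil-admin-edit"
assert merkle_root(rows) != published  # the external check catches the edit
```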


According to its own description, this database does not support deletion at all.

"You can [...] never change or delete records."


Aha, can one take nodes offline, or if I have PBs of data does it all have to stay online, always?


If this is deployed in a situation where record volumes are large, example: recording credit card transactions, there is going to have to be a process to "retire" old records (and perhaps, move them to external archives). The alternative is endlessly growing storage, and the resulting performance degradation.

At a first glance, I don't see anything like that in there.


The team will host a release party on Monday, May 31st at 6pm CET (18:00) - 10:00 AM PDT.

If you have questions about immudb, you are welcome to join us!

https://www.codenotary.com/blog/immudb-release-1-0-release-p...


Can someone ELI5 what is an "immutable database"? If you can add to the table, that means mutation, right? I am missing something...


> immudb is the first database to provide tamper-evident data management, immutable history and client-cryptographic proof. Every change is preserved and can't be changed without clients noticing.

Sounds like they are recording all changes (like SQL2011's system versioned tables, as implemented more-or-less by several common DB engines) but with some sort of hash-chain ledger so that history can be verified and therefore any tampering detected.

> If you can add to the table, that means mutation, right?

It isn't keeping the current view of the data immutable, but is keeping an immutable history of the data. It is immutable in the sense that nothing written to it is ever lost, and you can use the "time-travel" query functions (like SELECT stuff FROM atable FOR SYSTEM_TIME AS OF '2021-03-05') to retrieve it even if it looks to have been completely mangled or deleted if you use a non-time-travelling query.
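Such a time-travel query can be roughly emulated on any history table, assuming each write records a monotonically increasing version (a timestamp works the same way). A toy sketch with SQLite (table and values are made up):

```python
# Rough emulation of an "AS OF" query over a history table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (version INTEGER, id INTEGER, value TEXT)")
writes = [(1, 1, "draft"), (2, 1, "published"), (3, 1, "retracted")]
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", writes)

def as_of(version, id_):
    # Latest row for id_ whose version is <= the requested point in time.
    row = conn.execute(
        """SELECT value FROM history
           WHERE id = ? AND version <= ?
           ORDER BY version DESC LIMIT 1""", (id_, version)).fetchone()
    return row[0] if row else None

print(as_of(2, 1))  # 'published' -- the document as it looked at version 2
print(as_of(9, 1))  # 'retracted' -- the current state
```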


It basically means "append only". You can add new data to the database, but you can't change or delete existing data.


It's immutable in the same sense a purely functional data structure is immutable. You represent mutation by making a new version of the data structure. Of course you don't literally do that in the database, because it would be inefficient, but there are several algorithmic tricks that can expose an interface that works as if you did.
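The "new version per mutation" idea can be sketched in a few lines (naive full copies here; real engines share structure between versions, e.g. via path copying in a tree):

```python
# Toy versioned map: every set() yields a new version; old versions stay readable.
class VersionedMap:
    def __init__(self):
        self.versions = [{}]           # version 0 is the empty map

    def set(self, key, value):
        new = dict(self.versions[-1])  # naive copy-on-write for illustration
        new[key] = value
        self.versions.append(new)
        return len(self.versions) - 1  # the new version number

    def get(self, key, version=-1):    # default: latest version
        return self.versions[version].get(key)

m = VersionedMap()
v1 = m.set("balance", 100)
v2 = m.set("balance", 250)
print(m.get("balance", v1))  # 100 -- "time travel" to the older version
print(m.get("balance", v2))  # 250
```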


that makes sense on a language level, when you hold a reference to some data and you can assume nothing can be changed about it. how does that hold on DB level?


In the same way. A database is basically just a giant data structure, a table is not unlike a B-Tree (in some engines it literally is a B-tree). Data warehouses already do something like this informally, as they are structured in a star schema around a single "append-only" fact table.


You would be able to query and INSERT but not DELETE and UPDATE.

This is useful for example in banking applications that keep an audit trail for example.

A sysadmin would not be able to update or delete items in the audit table and so can't cover up a crime.

If the database is tampered with at the file level, they have a way to detect that. (Probably some kind of merkle tree.)


allright, makes perfect sense. thank you!


SQL system versioned tables but with git hash tree versioning for every mutable command.


The QLDB performance comparison looks quite dodgy, but I can't find their QLDB benchmark code to see what they are doing wrong.


> This new functionality allows travel back in time through the data change history, and even compares these values in the same query!

So we can actually treat our databases like immutable infrastructure and roll back changes now, without the hulking kludge that is snapshots/restores and database migrations? That's game-changing.


this is hugely interesting, i have to look into this, but... for dev/test environments, can i have an "unverified" version, where clients reget/reset the state?


Definitely not the first database to allow time travel, TM or not.


it's the combination of cryptographic client verification, SQL (which includes verification for every return value, present and past), and being able to travel in time


I think it's the first to allow it with TM.


Any major customers using this and if so how?


GDPR requires you to erase user data if users withdraw their consent, or if their data is no longer required for the purpose for which you originally collected or processed it.

Therefore, you must carefully check that no personal data is stored in immutable databases.


stay tuned - this is on the immudb roadmap not too far in the future


the important thing is to maintain full history and verification of your actions as well, so you have proof of the deletion.


So should you also delete userdata from existing backups? :)


> For any question contact us on Discord.

Hard no.


Alternatively, the team is hosting a virtual release-party on May 31st, 2021 at 6pm (18:00) CET.

https://www.codenotary.com/blog/immudb-release-1-0-release-p...


not exactly immutable is it? their docs say you can do UPSERT for example. the key is that once you update something, the clients can check using crypto that something was changed. you can't do this in regular databases.


Immutable in the sense that the old value is preserved, even if you update it, and you can't change the history (tamper-evident).


There, I did it for you in PostgreSQL: ALTER TABLE table_name SET (autovacuum_enabled = false);

Snark aside, it's still not 100% clear what the upside of using a completely different database is, just for that use case.


Huh? Dead tuples are not queryable in Postgres.



