The linked blog post is making the limited claim that the posts on archive.org accurately represent the posts present on Reid's site at the time they were archived, and do not appear to have been altered post-archiving.
This might actually be a good use case for a blockchain. Hashing the data that's added to the archive and then putting the hash in the blockchain would reasonably prove the data in the archive hasn't been modified at a later date.
I do agree that a tamper-resistant store would be useful for things like journalism, legislature, official government communication, campaign content for politicians, etc. A distributed ledger for these would also be good because then you’re verifying that store in public view.
It’s too bad all you can build with that is a meager, profitable SaaS business, not a wild speculative crypto-billionaire rocket ride.
The hash chain doesn't contain the data, only a hash of the data. So the original article can still be altered, and the hash chain would only prove that it had been changed. I believe nothing in these "right to be forgotten" laws forbids noting that an article has been edited to remove names.
An interesting alternative would be to hash "chunks" of the original article so that a future verification could be applied to particular parts of the content. Say you hashed every 32 bytes: you could then determine which chunks changed at what times, without revealing the plain text content.
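A rough sketch of that chunking idea, assuming SHA-256 and fixed 32-byte chunks (the function names are made up for illustration):

```python
import hashlib

CHUNK_SIZE = 32  # bytes per chunk, matching the 32-byte example above


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Return the SHA-256 hex digest of each fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def changed_chunks(old: list[str], new: list[str]) -> list[int]:
    """Indexes of chunks whose digests differ or exist in only one version."""
    longest = max(len(old), len(new))
    return [i for i in range(longest)
            if i >= len(old) or i >= len(new) or old[i] != new[i]]


original = b"A" * 32 + b"B" * 32 + b"C" * 32
edited = b"A" * 32 + b"X" * 32 + b"C" * 32

# Only the middle chunk was rewritten, so only index 1 is reported.
print(changed_chunks(chunk_hashes(original), chunk_hashes(edited)))
```

Two caveats: with fixed-size chunks, inserting or deleting a single byte shifts every later chunk, so everything after the edit shows up as changed. And the digest of a short, guessable chunk can be brute-forced, so "without revealing the plain text" only holds loosely unless the chunks are salted or larger.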
The question of how to identify large complex works, of potentially variable forms (markup or editing format, HTML, PDF, ePub, .mobi, etc.) such that changes and correspondences can be noted, is a question I've been kicking around.
Chunk-based hashing could work. Corresponding those to document structure (paragraphs, sentences, chapters, ...) might make more sense.
Yeah that's an interesting question. How to parse the content into meaningful pieces and then hash in such a way that the content is not known, but the hash can be mapped to where it was in the document at an earlier time.
Keep in mind that at the scale of a large work, some level of looseness may be sufficient. Identity and integrity are distinct: the former is largely based on metadata (author, title, publication date, assigned identifiers such as ISBN, OCLC, DOI, etc.), while integrity is largely presumed unless challenged.
As it pertains to private citizens, I would not recommend something like this to archive or verify their personal data. But for government records, campaign records, etc, I would think that those laws do not apply to that information.
They will be as effective as anti-piracy law would be if pirates were paid to seed. At best they will prevent respectable publications from directly using distributed archives as a source.
> Hashing the data that's added to the archive and then putting the hash in the blockchain would reasonably prove the data in the archive hasn't been modified at a later date.
Unfortunately, the files themselves aren't public, and each file contains dumps from hundreds of websites, so even if they were public they're not the easiest thing to verify.
Still, being the guy behind OpenTimestamps I should point out that in this case I don't think timestamp proofs really add that much: Reid's claims seem dubious even without cryptographic proof.
RFC3161 has very poor security, as it blindly trusts certificate authorities.
You really need better auditing than that, which is why the certificate authority infrastructure now relies on a blockchain - Certificate Transparency - for auditing. Similarly, for timestamping specifically, Guardtime has used a blockchain for auditing their timestamps since well before blockchains got called blockchains.
So here's something I can't get a straight answer on:
Surely if content is served over HTTPS with a valid certificate, it should be possible to save (possibly as part of a WARC) a "signature" of the TCP stream that would go beyond proving that a web archive was created at a certain time, but also that it was served using that person's private key and thus from that person's web server. To claim otherwise, the subject would have to claim that a fraudulent certificate was generated for their domain or that their web server was broken into.
Basically, the way the crypto math works in HTTPS is it's a symmetrical proof that only proves that either the sender or the receiver sent the TCP stream. Normally that's OK, because you trust yourself. But in this case the problem you're trying to solve is to prove what happened to a third party who doesn't trust the receiver, so your idea doesn't work.
It's the same with the RSA key exchange. It's inherent in the fact that the TLS negotiation exists to make both sides agree on a common master secret (and some public cryptographic parameters like which cipher to use), from which all the keys used to encrypt and authenticate either direction of the stream are derived. Once the master secret is known, all keys are known and the rest of the connection can be decrypted and/or forged at will. (The "triple handshake" attack exploits this, by making two connections share the same master secret.)
The certificate is used to sign (parts of) the values used to create the master secret. It doesn't sign anything after that.
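To illustrate why a symmetric key proves nothing to a third party, here is a toy sketch using HMAC. It is not how TLS actually derives or uses its record keys; `shared_mac_key` is a hypothetical stand-in for a key both endpoints end up holding after the handshake.

```python
import hashlib
import hmac

# Hypothetical stand-in for a key that both client and server derive
# from the shared master secret. Real TLS key derivation is more involved.
shared_mac_key = b"derived-from-master-secret"


def tag(key: bytes, message: bytes) -> bytes:
    """Authentication tag over message under a symmetric key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verifies(key: bytes, message: bytes, candidate: bytes) -> bool:
    """True if candidate is a valid tag for message under key."""
    return hmac.compare_digest(tag(key, message), candidate)


server_record = b"HTTP/1.1 200 OK\r\n\r\noriginal post"
forged_record = b"HTTP/1.1 200 OK\r\n\r\naltered post"

server_tag = tag(shared_mac_key, server_record)  # computed by the server
forged_tag = tag(shared_mac_key, forged_record)  # computed by the receiver alone

# Both records verify, so a third party cannot tell which side authored them.
print(verifies(shared_mac_key, server_record, server_tag))
print(verifies(shared_mac_key, forged_record, forged_tag))
```

The receiver never needed the server's private key to produce the second tag, which is the whole problem for anyone trying to use a saved TLS stream as evidence.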
This depends on how you define “blockchain”. If your model is bitcoin-style with attempts at anonymous consensus it's definitely a negative contribution.
If you're not trying to get rich quick, however, something like a Merkle tree is a great fit and it seems like there'd be value in a distributed system where trusted peers can vouch for either having seen the same content (even if they cannot distribute it due to copyright) or confirm that they saw you present a given object as having a certain hash at a specific time. Whether that's called a blockchain is a philosophical question but I think it'd be a good step up over self-publishing hashes since it'd avoid the need for people to know in advance what they'd like to archive.
To make that concrete, imagine if the web archiving space had some sort of distributed signature system like that. The first time the integrity of the Internet Archive is called into question, anyone on the internet who cared could check and see a provenance record something like this:
IA: URL x had SHA-512 y at time z
Library of Congress: URL x also had SHA-512 y at time [roughly z]
British Library: We didn't capture URL x around time z but we cross-signed the IA and LC manifests shortly after they crawled them and saw SHA-512 y
J. Random Volunteer Archivist: I also saw IA present that hash at that time
That'd give a high degree of confidence in many cases since these are automated systems and it'd lower the window where someone could modify data without getting caught, similar to how someone might be able to rewrite Git history without being noticed but only until someone else fetches the same repo.
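A minimal sketch of what checking such a provenance record could look like. The `Attestation` shape, the one-day window, and the sample data are all assumptions for illustration, and the signatures each witness would attach are left out entirely.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    witness: str   # who is vouching
    url: str
    sha512: str    # hex digest the witness saw
    seen_at: int   # Unix time of the observation


def corroborating_witnesses(records, url, sha512, around, window=86_400):
    """Names of witnesses that saw this hash for this URL near the given time."""
    return sorted({r.witness for r in records
                   if r.url == url
                   and r.sha512 == sha512
                   and abs(r.seen_at - around) <= window})


z = 1_500_000_000  # made-up capture time
y = hashlib.sha512(b"page as originally captured").hexdigest()
other = hashlib.sha512(b"page after a later edit").hexdigest()

records = [
    Attestation("IA", "https://example.com/x", y, z),
    Attestation("Library of Congress", "https://example.com/x", y, z + 3_600),
    Attestation("British Library", "https://example.com/x", y, z + 7_200),
    Attestation("Someone else", "https://example.com/x", other, z + 30 * 86_400),
]

print(corroborating_witnesses(records, "https://example.com/x", y, z))
```

The more independent names come back for the same hash, the harder it is to argue that any single archive altered its copy after the fact.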
(Disclaimer: I work at LC but not on web archiving and this comment is just my personal opinion on my own time)
That won't work because any hash on its own would be trivial to regenerate after modifying the data. You need something that can't be changed retrospectively in order to trust it.
That's what makes the blockchain useful - to change anything you'd need to regenerate all the hashes after the point you want to modify. That's a lot more difficult. Having a proof that's generated by a network of parties (like a cryptocurrency) would add to the trust level, but it's not essential.
EDIT: If the archive published hashes of everything they added daily in the NYT (or any publication) it would become unprintably large. It would only work digitally, at which point we're back to something that's trivial to modify...
Is it still possible to place classifieds in the NY Times? I don't think there's still any way for someone to call up and have some random hash published, right?
I suspect the hardest part of doing that would be simply that you don't fit into their pre-existing categories.
FWIW, if you plan to do that, I'd suggest you put a Bitcoin block hash in the NY Times instead, which would prove the timestamps of everything that's been timestamped via Bitcoin. You can then timestamp your own stuff for free via OpenTimestamps, at which point your proof goes <your data> -> OpenTimestamps -> Bitcoin -> NY Times.
Timestamps are additive security, so it makes sense to publish them widely. But if you're going to do that, might as well strengthen the security of as much stuff as possible in one go.
Forgive my shallow understanding of block chain, but wouldn't that make the archive immutable? Surely there are times where the Wayback Machine needs to delete snapshots, in cases where there's copyright infringement or other illegal activity.
Yes, it would make the archive immutable, but that doesn't prevent the data from being deleted.
A very similar example is found in git repos: while normally you'd have every single bit of data that led up to git HEAD, you can use git in "shallow" mode, which only has a subset of that data. If you delete all but the shallow checkouts, the missing data will be gone forever. The missing data is still protected from being modified by the hashing that Git does - and you're guaranteed to know that data is in fact missing - but that cryptography doesn't magically make the data actually accessible.
> Forgive my shallow understanding of block chain, but wouldn't that make the archive immutable?
Kind of. The current state of the archive is mutable, but changes to that state are logged to an append-only edit history. It's that edit history that is the "blockchain", and starting from a known good state and replaying all those edits must produce the current state. In fact, this is how cryptocurrencies work too: the state is the balances/UTXO set, and the blockchain records transactions, which are effectively just mutations on that state.
In this situation, you'd look at the current state and find the deleted snapshot missing, but the edit log would have an entry saying the snapshot was added (and what its hash was at the time), then another entry saying it was deleted.
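Here's a small sketch of that state-plus-edit-log model. The `EditLog` class, the entry layout and the `|` separator are invented for illustration; a real system would also sign or publicly anchor the latest entry hash.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, op: str, snapshot_id: str, content_hash: str) -> str:
    """Hash an edit together with the hash of the entry before it."""
    payload = "|".join([prev_hash, op, snapshot_id, content_hash]).encode()
    return hashlib.sha256(payload).hexdigest()


class EditLog:
    def __init__(self):
        # Each entry is (op, snapshot_id, content_hash, entry_hash).
        self.entries = []

    def append(self, op: str, snapshot_id: str, content_hash: str) -> None:
        prev = self.entries[-1][3] if self.entries else GENESIS
        self.entries.append(
            (op, snapshot_id, content_hash,
             entry_hash(prev, op, snapshot_id, content_hash)))

    def replay(self) -> dict:
        """Rebuild the current state, checking every link along the way."""
        state, prev = {}, GENESIS
        for op, snapshot_id, content_hash, recorded in self.entries:
            if recorded != entry_hash(prev, op, snapshot_id, content_hash):
                raise ValueError("edit log has been altered")
            if op == "add":
                state[snapshot_id] = content_hash
            elif op == "delete":
                state.pop(snapshot_id, None)
            prev = recorded
        return state


log = EditLog()
log.append("add", "snap-1", hashlib.sha256(b"first capture").hexdigest())
log.append("add", "snap-2", hashlib.sha256(b"second capture").hexdigest())
log.append("delete", "snap-1", hashlib.sha256(b"first capture").hexdigest())

print(sorted(log.replay()))  # snap-1 is gone from the state
print(len(log.entries))      # but its add and delete are both still logged
```

The snapshot data itself can be thrown away after the delete; the log keeps only the hash it had when it was added.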
This is also an issue for major blockchains in deployment now, specifically Bitcoin. There is the potential for illegal content, or links to it, to be stacked on BTC’s blockchain [0], and so anyone who holds that blockchain would also possess it.
I believe this would also be an issue for things like Filecoin/IPFS but I’m not sure if the liability issues are different or nuanced.
IPFS works like torrents: users only host things that they choose to, so there's no issue of some people being stuck hosting content they don't want to.
If you put the data itself in the blockchain then that would be true. I'm suggesting putting a hash of the data in a blockchain; you could delete the data and keep the hash in the chain. You couldn't regenerate the hash to check it which might be a problem but if the data has been deleted you'd have to accept the hash regardless. It'd only affect that link in the chain. (This is from my limited understanding of blockchain math. I definitely could be wrong.)
Paper archives usually contain a ton of copyrighted material, e.g. "John Doe's papers" includes magazines, newspapers, letters written by other people, etc. that are not copyrighted by John Doe.
So I have an anonymous twitter account that tweets out various randomly located headlines, a couple times per day. Simply embedded in one of those tweets, each day, is a hash of the previous hash plus the current contents of some long-running data that I've been keeping and updating.
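For anyone curious, a scheme like that might look roughly like the following. The comment doesn't say which hash function is used or how the two inputs are combined, so SHA-256 and plain concatenation are assumptions, as are the function names.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for the very first day


def next_link(previous_hash: str, contents: bytes) -> str:
    """Hash of the previous day's hash followed by today's file contents."""
    return hashlib.sha256(bytes.fromhex(previous_hash) + contents).hexdigest()


def verify_chain(daily_contents: list[bytes], published: list[str]) -> bool:
    """Check that the published hashes match the files, day by day."""
    if len(daily_contents) != len(published):
        return False
    current = GENESIS
    for contents, claimed in zip(daily_contents, published):
        current = next_link(current, contents)
        if current != claimed:
            return False
    return True


# Three days of the long-running data, and the hash tweeted each day.
days = [b"notes v1", b"notes v1 + v2", b"notes v1 + v2 + v3"]
tweeted = []
link = GENESIS
for contents in days:
    link = next_link(link, contents)
    tweeted.append(link)

print(verify_chain(days, tweeted))
# Swapping in a different day-one file breaks every later hash as well.
print(verify_chain([b"backdated notes"] + days[1:], tweeted))
```

The timestamps come from wherever the hashes were published, so the proof is only as strong as the assumption that old tweets can't be quietly rewritten.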
It's not as robust as a blockchain (maybe!) but it's easy and I've been doing it a good bit longer than 'blockchain' has been talked about. More importantly, I can use it to prove that I possessed certain files at certain times, historically.
I consider the value of it being anonymous right now to be unknown or undefined. In the same way, I consider the value of it being non-anonymous right now to also be unknown or undefined. Since disclosure can only flow in one direction, I'm not aware of any reason to irrevocably transfer from one state to the next.
A blockchain solution is unnecessary for this kind of issue. The question is did Reid author the posts or a hacker? You just need signing to prove that. If all of Reid's posts were cryptographically signed, then a post by a hacker would be mysteriously missing a signature and the debate would be trivially resolved.