
What I'm really missing in this space is something like this for content addressed blob storage.

I feel like a lot of complexity and performance overhead could be reduced if you only store immutable blobs under their hash (e.g. Blake3). Combined with a soft delete, this would make all operations idempotent, blobs trivially cacheable, and all state a CRDT / monotonically mergeable / coordination-free.
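
To make that concrete, here is a toy, in-memory sketch of the semantics I'm after (Python just for illustration; assume the real thing hashes with Blake3 and persists blobs to disk rather than a dict):

  import hashlib

  class BlobStore:
      """Toy content-addressed store: every blob lives under its own hash."""

      def __init__(self):
          self._blobs = {}        # hash -> bytes
          self._deleted = set()   # soft-delete tombstones

      def put(self, data: bytes) -> str:
          key = hashlib.sha256(data).hexdigest()  # stand-in for Blake3, which isn't in the stdlib
          self._blobs.setdefault(key, data)       # re-putting the same blob is a no-op: idempotent
          return key

      def get(self, key: str):
          return None if key in self._deleted else self._blobs.get(key)

      def delete(self, key: str) -> None:
          self._deleted.add(key)  # tombstone only; the blob itself stays immutable

      def merge(self, other: "BlobStore") -> None:
          # State only grows: union of blobs, union of tombstones.
          # Replicas can merge in any order and converge, with no coordination.
          self._blobs.update(other._blobs)
          self._deleted |= other._deleted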

There is stuff like IPFS in the large, but I want this for local deployments as an S3 replacement, with the metadata stored elsewhere, like in git or a database.




I would settle for first-class support for object hashes. Let an object have metadata, available in the inventory, that gives zero or more hashes of the data. SHA256, some Blake family hash, and at least one decent tree hash should be supported. There should be a way to ask the store to add a hash to an existing object, and it should work on multipart objects.

IOW I would settle for content verification even without content addressing.

S3 has an extremely half-hearted implementation of this for “integrity”.


That's how we use S3 in Peergos (built on IPFS). You can get S3 to verify the sha256 of a block on write and reject the write if it doesn't match. This means many mutually untrusting users can all write to the same bucket at the same time with no possibility for conflict. We talk about this more here:

https://peergos.org/posts/direct-s3
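
A minimal boto3 sketch of the checksum-verification idea (not exactly our setup; it assumes a bucket on an endpoint that supports S3's additional checksums):

  import base64, hashlib
  import boto3  # assumes an S3(-compatible) endpoint that supports additional checksums

  s3 = boto3.client("s3")

  def put_verified(bucket: str, data: bytes) -> str:
      digest = hashlib.sha256(data).digest()
      key = digest.hex()  # content-addressed key, just for illustration
      s3.put_object(
          Bucket=bucket,
          Key=key,
          Body=data,
          ChecksumSHA256=base64.b64encode(digest).decode(),  # S3 recomputes and rejects a mismatch
      )
      return key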


Garage splits the data into chunks for deduplication, so it basically already does content-addressed storage under the hood.

They probably don't expose it publicly though.


Yeah, and as far as I understood, they use the key hash to address the overall object descriptor. So in theory, using the hash of the file instead of the hash of the key should be a simple-ish change.

Tbh I'm not sure content-aware chunking isn't a siren's call (toy chunker sketch after the list):

  - It sounds great on paper, but once you start storing encrypted blobs (which you have to if you want e2e encryption) or compressed blobs (e.g. images), it won't work anymore.

  - Ideally you would store things as blobs fine-grained enough that blob-level deduplication would suffice.

  - Spreading a blob across your cluster adds compute, lookup, bookkeeping, and communication overhead, resulting in worse latency. Storing an object as a contiguous unit keeps the cache/storage hierarchies happy and allows for optimisations like using `sendfile`.

  - Storing blobs as a unit also makes computational storage easier to implement: instead of reading the blob and processing it, you would send a small WASM program to the storage server (or drive? https://semiconductor.samsung.com/us/ssd/smart-ssd/) and only receive the computation result back.
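
Here is the toy chunker mentioned above: a rolling-hash, content-defined chunker roughly in the rsync/bup style (illustration only, not what Garage or restic actually ship). On encrypted or already-compressed input every chunk hashes to something unique, so nothing deduplicates:

  import hashlib

  B, W, M = 257, 64, 1 << 32         # base, window size, modulus of the rolling hash
  BW = pow(B, W - 1, M)              # B^(W-1) mod M, used to drop the outgoing byte
  MASK = (1 << 13) - 1               # boundary when the low 13 bits are all ones (~8 KiB target)
  MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

  def chunks(data: bytes):
      """Cut boundaries based on the bytes themselves, so a local edit
      only re-cuts nearby chunks instead of shifting everything after it."""
      start, h, window = 0, 0, bytearray()
      for i, byte in enumerate(data):
          if len(window) == W:
              h = (h - window.pop(0) * BW) % M
          window.append(byte)
          h = (h * B + byte) % M
          size = i + 1 - start
          if size >= MAX_CHUNK or (size >= MIN_CHUNK and (h & MASK) == MASK):
              yield data[start:i + 1]
              start, h, window = i + 1, 0, bytearray()
      if start < len(data):
          yield data[start:]

  def store_chunks(blob_store: dict, data: bytes) -> list:
      # Dedup falls out of content addressing: identical chunks get identical keys.
      refs = []
      for c in chunks(data):
          key = hashlib.sha256(c).hexdigest()
          blob_store.setdefault(key, c)
          refs.append(key)
      return refs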


Take a look at https://github.com/n0-computer/iroh

It's an open source project written in Rust that uses BLAKE3 (and QUIC, which you mentioned in another comment).


It certainly has a lot of overlap and is a very interesting project, but like most projects in this space, I feel like it's already doing too much. I think that might be because many of these systems also try to be user facing?

E.g. it tries to solve the "mutability problem" (having human-readable identifiers point to changing blobs); there are blobs and collections and documents; there is a whole resolver system with their ticket stuff.

All of these are interesting problems that I'd definitely like to see solved some day, but I'd be more than happy with an "S3 for blobs" :D.


you might be interested in https://github.com/perkeep/perkeep


Perkeep has (at least as of the last time I checked) the very interesting property of being completely impossible for me to make heads or tails of, while also looking extremely interesting and useful.

So in the hope of prompting someone to give me the missing link (maybe even a hyperlink) to understand it, here is the situation:

I'm a SW dev who has also done a lot of sysadmin work. Yes, I have managed to install it. And that is about it. There seem to be so many features there, but I really don't understand how I am supposed to use the product, or the documentation for that matter.

I could start an import of Twitter or something else and it kind of shows up. Same with anything else: photos etc.

It clearly does something, but it was impossible to understand what I am supposed to do next, both from the UI and from the docs.


Perkeep is such a cool, interesting concept, but it seems like it's on life-support.

If I'm not mistaken, it used to be funded by creator Brad Fitz, who could afford to hire a full-time developer on his Google salary, but that time has sadly passed.

It suffers from having so many cool use-cases that it struggles to find a balance in presentation.


I was curious to see if I could help, and I wondered if you had seen their mailing list? It seems to have some folks complaining about things they wish it did, which strangely enough is often a good indication of what it currently does.

There's also "Show Parkeep"-ish posts like this one <https://groups.google.com/g/perkeep/c/mHoUUcBz2Yw> where the user made their own Pocket implementation complete with original page snapshotting

The thing that most stood out to me was the number of folks who wanted to use Perkeep to manage its own content AND serve as the metadata system of record for external content (think: an existing MP3 library owned by an inflexible media player such as iTunes). So between that and your "import Twitter" comment, it seems one of its current hurdles is that the use case one might have for a system like this needs to be "all in", otherwise it becomes the same problem as a removable USB drive for storing stuff: "oh, damn, is that on my computer or on the external drive?"


Besides a personal photo store, I use the storage part as a file store at work (basically, with indexing turned off), with a simplifying wrapper for upload/download: github.com/tgulacsi/camproxy

With the adaptive block hashing (varying block sizes), it beats gzip for compression.


I agree 100%


Or some even older prior art (which I recall a Perkeep dev citing as an influence in a conference talk)

http://doc.cat-v.org/plan_9/4th_edition/papers/venti/

https://en.wikipedia.org/wiki/Venti_(software)


Yeah, there are plenty of dead and abandoned projects in this space. Maybe the concept is worthless without a tool for metadata management? Also, I should probably have specified that by "missing" I mean "there is nothing well maintained and production grade" ^^'


Yeah, I've been following it on and off since it was Camlistore. Maybe it tried to do too much at once and didn't focus enough on just the blob part, but I feel like it never really reached a coherent state and story.



Yeah, the subdirectories and mime-type seemed like an unnecessary complication. Also looks pretty dead.


Something related that I've been thinking about is that there aren't many popular data storage systems out there that use HTTP/3 and/or gRPC for the lower latency. I don't just mean object storage, but database servers too.

Recently I benchmarked the latency to some popular RPC, cache, and DB platforms and was shocked at how high it was. Everyone still talks about 1 ms as the latency floor, when it should be the ceiling.


Yeah, QUIC would probably be a good protocol for such a system. Roundtrips are also expensive; ideally your client library would cache as much data as the local disk can hold.


Sounds a little like Kademlia, the DHT implementation that BitTorrent uses.

It's a distributed hash table where the value mapped to a hash is immutable after it is STOREd (at least in the implementations that I know)
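
(For anyone unfamiliar: the "distance" Kademlia uses is just XOR over the ID space, and a value is stored on the nodes whose IDs are closest to its key. A minimal sketch of that metric, not a full DHT:)

  def xor_distance(a: bytes, b: bytes) -> int:
      # Kademlia's metric: node IDs and keys live in the same ID space.
      return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

  def closest_nodes(key_id: bytes, node_ids: list, k: int = 20) -> list:
      # A STORE for an immutable value goes to the k nodes closest to its hash;
      # since the value never changes, any copy found later is as good as any other.
      return sorted(node_ids, key=lambda n: xor_distance(key_id, n))[:k]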


Kademlia could certainly be part of a solution to this, but it's a long road from the algorithm to a binary you can start on a bunch of machines to get the service, e.g. something like SeaweedFS. BitTorrent might actually be the closest thing we have to this, but it sits at the opposite end of the latency-vs-distribution axis.


But you don't really handle blobs in real life: they can't really be handled, since they don't have memorable names (by design). So you need an abstraction layer on top. You can use ZFS, which will deduplicate similar blobs. You can use restic for backups, which will also deduplicate similar parts of a file in an idempotent way. And you can use git, which will deduplicate files based on their hash.


You might also be interested in Tahoe-LAFS https://www.tahoe-lafs.org/


I get a

> Trac detected an internal error:

> IOError: [Errno 28] No space left on device

So it looks like it is pretty dead, like most projects in this space?


Because the website seems to have a temporary issue, the project must be dead?

Tahoe-LAFS seems alive and continues to be developed, although it seems to not have seen as many updates in 2024 as in previous years: https://github.com/tahoe-lafs/tahoe-lafs/graphs/contributors


More like based on the prior that all projects in that space aren't in the best of health. Thanks for the GitHub link, that didn't pop up in my quick Google search.


Have a look at LakeFS (https://docs.lakefs.io/understand/architecture.html).

Files are stored by hash on S3; metadata is stored in a Postgres database. I run it locally and access it just like an S3 store.


Check out SeaweedFS too; it makes some interesting tradeoffs, but I hear you on wanting the properties you're looking for.


I am using Seaweed for a project right now. Some things to consider with Seaweed:

- It works pretty well, at least up to the 15B objects I am using it for, running on 2 machines with about 300 TB (500 TB raw) of storage each.

- The documentation can be sparse, specifically with regard to operations: how to back things up, or the different failure modes of the components.

- One example of the above: I spun up a second filer instance (which is supposed to sync automatically), and the master server emitted an error while it was syncing. The only way to know it was working was watching the new filer's storage slowly grow.

- Seaweed has a pretty low bus factor (essentially one dev), though the dev is pretty responsive and seems to accept PRs at a steady rate.


I use Seaweed as well. It has some warts, as well as some feature incompleteness, but I think the simplicity of the project itself is a pretty nice feature. It's mostly grokkable pretty quickly, since it's only one dev and the codebase is pretty small.


IPFS like "coordination free" local S3 replacement! Yes. That is badly needed.


The RADOS K/V store is pretty close. Ceph is built on top of it, but you can also use it as a standalone database.


Nothing is content-addressed in RADOS. It's just a key-value store with more powerful operations than get/put, and it's more in the strong-consensus camp than the parent's request for coordination-free things.

(Disclaimer: ex-Ceph employee.)


Can you point me towards resources that help me understand the trade-offs being implied here? I feel like there is a ton of knowledge behind your statement that flies right past me, because I don't know the background behind why the things you are saying are important.


It's a huge field, basically distributed computing, burdened here with the glorious purpose of durable data storage. Any introductory text long enough becomes essentially a university-level computer science course.

RADOS is the underlying storage protocol used by Ceph (https://ceph.com/). Ceph is a distributed POSIX-compliant (very few exceptions) filesystem project that along the way implemented simpler things such as block devices for virtual machines and S3-compatible object storage. Clients send read/write/arbitrary-operation commands to OSDs (the storage servers), which deal internally with consistency, replication, recovery from data loss, and so on. Replication is usually leader and two followers. A write is only acknowledged after the OSD can guarantee that all later reads -- including ones sent to replicas -- will see the write. You can implement a filesystem or network block device on top of that, run a database on it, and not suffer data loss. But every write needs to be communicated to replicas, replica crashes need to be resolved quickly to be able to continue accepting writes (to maintain the strong consistency requirement), and so on.
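
(A cartoon of that write path, with made-up classes, nothing like the real OSD code:)

  class Replica:
      def __init__(self):
          self.data = {}

      def apply(self, key, value):
          self.data[key] = value   # pretend this is a durable, ordered log entry
          return True              # ack

  def replicated_write(key, value, leader: Replica, followers: list) -> None:
      # Strongly consistent flavour, drastically simplified: the client is
      # acknowledged only after every replica has applied the write, so any
      # later read, served by any replica, is guaranteed to see it. The cost
      # is that a slow or crashed follower stalls writes until it is replaced.
      leader.apply(key, value)
      for f in followers:
          assert f.apply(key, value)
      # only now acknowledge the client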

On the other end of the spectrum, we have Cassandra. Cassandra is roughly a key-value store where the value consists of named cells, think SQL table columns. Concurrent writes to the same cell are resolved by Last Write Wins (LWW) (by timestamp, ties resolved by comparing values). Writes going to different servers act as concurrent writes, even if there were hours or days between them -- they are only resolved when the two servers manage to gossip about the state of their data, at which time both servers storing that key choose the same LWW winner.

In Cassandra, consistency is a caller-chosen quantity, from weak to durable-for-write-once to okay. (They added stronger consistency models in their later years, but I don't know much about them so I'm ignoring them here.) A writer can say "as long as my write succeeds at one server, I'm good", which means readers talking to a different server might not see it for a while. A writer can say "my write needs to succeed at a majority of live servers", and then if a reader requires the same "quorum", we have a guarantee that the write wasn't lost due to a malfunction. It's still LWW, so the data can be overwritten by someone else without noticing. You couldn't implement a reliable "read, increment, write" counter directly on top of this level of consistency. (But once again, they added some sort of transactions later.)
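
(The resolution rule itself is tiny; a sketch of the rule as described above, not Cassandra's actual code:)

  from typing import NamedTuple

  class Cell(NamedTuple):
      timestamp: int   # writer-supplied, e.g. microseconds since epoch
      value: bytes

  def lww_merge(a: Cell, b: Cell) -> Cell:
      # Last write wins by timestamp, ties broken by comparing values, so every
      # replica deterministically picks the same winner once it hears about both.
      return max(a, b)   # tuple comparison: (timestamp, value)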

The grandparent was asking for content-addressed storage enabling a coordination-free data store. So something more along the lines of Cassandra than RADOS.

Content-addressed means that e.g. you can only store "Hello, world" under the key SHA256("Hello, world"). Generally, that means you need to store that hash somewhere to ever see your data again. Doing this essentially removes the LWW overwrite problem -- assuming no hash collisions, only "Hello, world" can ever be stored at that key.

I have a pet project implementing content-addressed convergent encryption to an S3 backend, using symlinks in a git repo as the place to store the hashes, at https://github.com/bazil/plop -- it's woefully underdocumented but basically a simpler rewrite of the core of https://bazil.org/ which got stuck in CRDT merge hell. What that basically gets me is that e.g. ~/photos is a git repo with symlinks to a FUSE filesystem that manifests the contents on demand from S3-compatible storage. It can use multiple S3 backends, though active replication is not implemented (it'll just try until a write succeeds somewhere; reads are tried wider and wider until they succeed; you can prioritize specific backends to e.g. read/write nearby first and over the internet only when needed). Plop is basically a coordination-free content-addressed store, with convergent encryption. If you set up a background job to replicate between the S3 backends, it's quite reliable. (I'm intentionally allowing a window of only-one-replica-has-the-data, to keep things simpler.)
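
(To illustrate just the convergent-encryption part, not how plop implements it: the key is derived from the plaintext, so identical files encrypt to identical ciphertexts and still deduplicate. The sketch assumes the third-party `cryptography` package.)

  import hashlib
  from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305  # pip install cryptography

  def convergent_encrypt(plaintext: bytes):
      key = hashlib.sha256(plaintext).digest()           # content-derived 32-byte key
      nonce = b"\x00" * 12                               # a fixed nonce is fine here: the key is unique per plaintext
      ciphertext = ChaCha20Poly1305(key).encrypt(nonce, plaintext, None)
      blob_id = hashlib.sha256(ciphertext).hexdigest()   # the reference you keep in git/symlinks
      return blob_id, ciphertext, key                    # lose the key (i.e. the plaintext hash) and the data is gone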

Here are some of the more industry-oriented writings from my bookmarks. As I said, it really is a university course (or three, or a PhD).

https://www.the-paper-trail.org/page/cap-faq/

https://codahale.com/you-cant-sacrifice-partition-tolerance/

https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...


I upvoted this, but I also wanted to say that this summary gives me a better groundwork for an undoubtedly complex topic. Thank you for the additional context.


Thank you for this, I learned a lot from this comment.



