A strategy I haven't seen mentioned is to use the filesystem itself as an index. Directory structure can be used to implement a trie or prefix tree.

For example, if you want to check for the existence of a sample hash: `2aae6c35c94fcfb415dbe95f408b9ce91ee846ed`

Then simply check for the existence of the directory <data-root>/2a/ae/6c/35/etc...
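A minimal sketch of this layout in Python; the data root, the choice of four two-character levels (matching the example path above), and the helper names are all just illustrative assumptions:

    import os

    DATA_ROOT = "data-root"   # hypothetical root, standing in for <data-root>
    LEVELS = 4                # number of two-hex-character directory levels

    def hash_dir(hexdigest: str) -> str:
        # "2aae6c35..." -> "data-root/2a/ae/6c/35"
        parts = [hexdigest[i:i + 2] for i in range(0, LEVELS * 2, 2)]
        return os.path.join(DATA_ROOT, *parts)

    def hash_exists(hexdigest: str) -> bool:
        # Membership test is a single stat() of the leaf directory.
        return os.path.isdir(hash_dir(hexdigest))

    def add_hash(hexdigest: str) -> None:
        # Insertion is just creating the directory chain.
        os.makedirs(hash_dir(hexdigest), exist_ok=True)

    add_hash("2aae6c35c94fcfb415dbe95f408b9ce91ee846ed")
    print(hash_exists("2aae6c35c94fcfb415dbe95f408b9ce91ee846ed"))  # True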

I was looking at the directory structure of Gitea's Docker Container Registry, and this is how it stores container images.




I'm pretty skeptical of the performance with a billion files.

I'm sure lookups will run at an okay speed once you've actually constructed it, since it's basically a tree, but with so many entries I'd expect it to be much less pleasant than a database in many ways. The "preprocessing" is going to be especially awful.


There's probably a large storage overhead; the total would likely end up much larger than 37GB. And I agree preprocessing would be painful.
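A quick back-of-envelope on that overhead, assuming roughly a billion hashes, the four two-character levels from the example above, and at least one 4 KiB block per directory (all assumptions; real filesystems vary a lot):

    # Rough estimate only; inode and block overhead differ per filesystem.
    N = 10**9           # about a billion hashes
    DIR_BLOCK = 4096    # assume each directory costs at least one 4 KiB block

    # Levels 1-3 hold at most 256, 256**2 and 256**3 directories; at level 4
    # nearly every hash ends up with its own leaf directory.
    dirs = 256 + 256**2 + 256**3 + N

    print(f"{dirs * DIR_BLOCK / 1e12:.1f} TB just for directories")  # ~4.2 TB
    print(f"{N * 20 / 1e9:.0f} GB if the same hashes were packed as 20-byte records")  # ~20 GB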

But I'm curious how lookup speed would compare to the author's 1ms.

I'm also curious how adding a new hash would compare against inserting one into the single sorted file used by the author.

Leveraging any database is probably better in any case :)


If we're running on a SATA SSD, we can probably chain 5 or fewer accesses to the drive before we go over a millisecond. And each level of directory depth likely requires 2 accesses (roughly one for the directory itself and one for its contents).

> I'm also curious how adding a new hash would compare against inserting one into the single sorted file used by the author.

In a fair fight with that requirement, the sorted file would be allowed to add a few percent of extra blank entries and then it could insert in a millisecond too.
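A rough sketch of that padded-sorted-file idea; the raw 20-byte records, the all-0xFF sentinel for blank slots, and the file name are assumptions, not anything described in the thread. The search skips blank slots, and an insert only shifts records up to the nearest blank, so with a few percent of slots left blank each insert touches only a handful of blocks:

    import os

    RECORD_SIZE = 20               # assumed: raw 20-byte SHA-1 records
    BLANK = b"\xff" * RECORD_SIZE  # assumed sentinel for an unused slot

    def read_rec(f, i):
        f.seek(i * RECORD_SIZE)
        return f.read(RECORD_SIZE)

    def next_nonblank(f, i, end):
        # Index of the first non-blank record at or after i (or end if none).
        while i < end and read_rec(f, i) == BLANK:
            i += 1
        return i

    def find_slot(f, new, size):
        # Binary search that tolerates interspersed blank slots; returns the
        # index where `new` belongs among the non-blank, sorted records.
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            j = next_nonblank(f, mid, hi)
            if j == hi or read_rec(f, j) >= new:
                hi = mid
            else:
                lo = j + 1
        return lo

    def insert(f, new, size):
        pos = find_slot(f, new, size)
        # Find the nearest blank slot at or after the insertion point.
        gap = pos
        while gap < size and read_rec(f, gap) != BLANK:
            gap += 1
        # Shift [pos, gap) right by one record into the blank slot,
        # then drop the new record into place.
        for i in range(gap, pos, -1):
            rec = read_rec(f, i - 1)
            f.seek(i * RECORD_SIZE)
            f.write(rec)
        f.seek(pos * RECORD_SIZE)
        f.write(new)

    with open("hashes.bin", "r+b") as f:  # hypothetical padded, sorted file
        size = os.fstat(f.fileno()).st_size // RECORD_SIZE
        insert(f, bytes.fromhex("2aae6c35c94fcfb415dbe95f408b9ce91ee846ed"), size)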


Same strategy as git uses for objects, .git/objects/d6/70460b4b4a...

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects


Git chooses to pack them into far fewer files when there are lots of loose objects.

Actually, these days many filesystems perform surprisingly well with lots of little files. What doesn't work with huge directories is basic utilities like ls, or anything that likes to collect the whole file list in memory and sort it. I have some directories that essentially hang ls, while find is still happy listing the files (because it just streams the output).
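A small illustration of that difference; the directory path is hypothetical. os.scandir yields entries as the kernel returns them (roughly what find does), while building and sorting the full list up front (roughly what ls does by default) is where huge directories hurt:

    import os

    HUGE_DIR = "/path/to/huge/dir"  # hypothetical directory with millions of entries

    def stream_listing(path):
        # Streams entries as soon as the kernel returns them, like `find`.
        with os.scandir(path) as it:
            for entry in it:
                print(entry.name)

    def sorted_listing(path):
        # Collects the whole listing in memory and sorts it first, like `ls`;
        # nothing is printed until every entry has been read and sorted.
        names = sorted(entry.name for entry in os.scandir(path))
        for name in names:
            print(name)

    stream_listing(HUGE_DIR)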



