I know it's crazy talk, but glancing at my own profile, I count maybe 100 bytes of data? Yet to represent that data in memory, it's going to blow up to 4096 bytes plus structs to represent the inode and directory entry/entries because you put each profile in its own file.
By that count, you might get somewhere near a 40x cache utilization improvement if you just used a real database like the rest of us do - even just an embedded database.
This is of course before saying anything about the transactional safety of writing directly to the filesystem.
Sometimes it isn't worth the effort to fix the old, and instead just go to the new and improved.
When the load balancer for reddit broke once, we didn't bother fixing it, we just replaced it with better (though untested) technology on the assumption it would work better. We figured it couldn't be any worse than it was, and we'd rather spend our limited time moving forward instead of treading water.
We considered this pretty seriously, and it might be required some day, but we think we'll be able to incrementally move toward a more highly available, better performing architecture without a continuity break.
The HN code base is as much an experiment in Arc as it is a social news site. Also the feature set is surprisingly different (although I wish some of the features were here like comment collapsing and async comment submission).
Maybe it would be too much work to rewrite the specific logic that prevents HN 'manipulation' for reddit? Although I suppose much of the behavior and target audience is similar...
See followup comment below. You're confusing logical atomicity with physical consistency and durability: yes, these operations may have certain atomicity guarantees from the perspective of the application, but they are entirely asynchronous from the perspective of the storage medium unless you explicitly fsync(), and on Linux, for example, even then the default behaviour of ext4 is to allow metadata updates to complete prior to data updates (no "write barrier").
I understand the problem scenario with ext3/ext4 journalling you're referring to here and below.
However, HN runs on FreeBSD, and my understanding is that the combination of soft updates + journalling there actually does provide atomic rename, even in the case of catastrophic failure. McKusick talks about it here: http://www.mckusick.com/softdep/suj.pdf
Also, just to anchor the discussion a bit, the HN code does use the "write foo.tmp; mv foo.tmp foo" trick all over the place. (Or at least, the most recent version of news.arc I've seen does.)
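For anyone who hasn't seen the trick, a C sketch of the pattern under discussion looks something like the following. This is not the actual news.arc code (which is Arc); save_profile() and the paths are made up for illustration, and the fsync() calls are exactly the part the ext3/ext4 discussion is about: without them, the rename can reach disk before the data it points to.

    /* Sketch of the "write foo.tmp; mv foo.tmp foo" pattern, with explicit
     * fsync() calls; the function name and paths are hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int save_profile(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {  /* data on disk before the rename */
            close(fd);
            return -1;
        }
        close(fd);
        if (rename(tmp, path) < 0)      /* atomically replace the old file with the new one */
            return -1;
        int dfd = open(".", O_RDONLY);  /* should be the directory containing path; "." for brevity */
        if (dfd >= 0) {
            fsync(dfd);                 /* make the rename itself durable */
            close(dfd);
        }
        return 0;
    }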
You said POSIX, which makes no such guarantee. Soft updates are cool, though as far as I know they still don't provide durability. Still, that's far better than the default Linux behaviour.
And this is one of many reasons why you should use ZFS for your data. ZFS guarantees the atomicity of renames and would not have this problem. On Solaris and FreeBSD at least. I don't know about ZFS on Linux.
Using less virtual memory to store user data doesn't imply better cache use, only a smaller cache size. The tradeoff is memory versus CPU: with a database you're using more CPU. It's a fair bet that your memory capacity will increase at a greater rate than your CPU, not to mention that memory costs less to power and/or cool, and the simpler software architecture is easier to support. Wasting memory is a simple hack to increase performance and decrease complexity.
> This is of course before saying anything about the transactional safety of writing directly to the filesystem.
1) transactions aren't made or broken by what or when they're written, they're made or broken by being verified after being written, and 2) this is a user forum for people to comment on news stories, not an e-commerce site. Worst case the filesystem's journal gets replayed and you lose some pithy comments.
That's incorrect: with the right database, you're potentially trading several system calls (open, read, close) and their associated copies, which have high fixed costs, for no system calls at all. I've spent most of the past year working with LMDB, and can say decisively that filesystems can in no way be competitive with an embedded database, simply by virtue of the UNIX filesystem interface.
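To make the comparison concrete, a read from LMDB looks roughly like the sketch below. The database path and key are made up and error handling is omitted, but the calls are the stock liblmdb C API:

    #include <lmdb.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;

        mdb_env_create(&env);
        mdb_env_open(env, "./profiles.lmdb", MDB_RDONLY, 0664);  /* maps the whole DB once */

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);   /* read txn: a slot in shared memory */
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = strlen("pg");
        key.mv_data = (void *)"pg";
        if (mdb_get(txn, dbi, &key, &data) == 0)      /* returns a pointer into the map */
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);

        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }

The environment is mmapped once at open; after that a lookup is a B-tree walk in userspace, and mdb_get() hands back a pointer into the map rather than copying through read().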
> this is a user forum for people to comment on news stories, not an e-commerce site
That much is true, though based on what we've learned in the parent post, until today all passwords on the site were stored in one file. Many popular filesystems on Linux exhibit surprising behaviour when rewriting files, unless you're incredibly careful with fsync and suchlike. For example, http://lwn.net/Articles/322823/ is a famous case where the decades-old traditional approach of writing out "foo.tmp" before renaming it to "foo" could result in complete data loss should an outage occur at just the right moment.
So you're saying LMDB looking up a user-specific record and returning it will always be faster than either an lseek() and read() on a cached mmapped file [old model] or an open(), read(), close() on a cached file [new model]? Is the Linux VFS that slow?
In terms of transaction guarantees, I thought the commenter was talking about the newer model where each profile is an independent (and tiny) file; if that's the case, then deleting and renaming files wouldn't be necessary, and any failure in writing could be rolled back via the journal rather than leaving a file that's now non-existent or renamed. From what I understand, the worst the ext4 issue could do to this newer model is revert newly created profile files, which again I think would be a minor setback for this forum.
A serious database can use raw partitions with no filesystem for storage. Even when storing data on a filesystem, a database is unlikely to be using a single file for each entry; the database might make one mmap system call when it starts, and none thereafter (simplified example). The point is that the database can do O(1) system calls for n queries, whereas using the filesystem with a separate file for each entry you're going to need O(n) system calls.
You could of course avoid this problem by using a single large file, but that has its own problems (aforementioned possibility of corruption). Working around those problems probably amounts to embedding a database in your application.
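Here's a sketch of the O(1)-system-calls point: one open(), one fstat(), one mmap() at startup, then any number of lookups with no further system calls. The file name and fixed 256-byte records are assumptions made up for this example, and a real database obviously does far more, but the syscall pattern is the same.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define RECORD_SIZE 256  /* hypothetical fixed-length profile records */

    int main(void)
    {
        int fd = open("profiles.dat", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED)
            return 1;

        /* From here on, looking up record i is pointer arithmetic, not I/O. */
        for (off_t i = 0; i < st.st_size / RECORD_SIZE; i++) {
            const char *record = base + i * RECORD_SIZE;
            (void)record;  /* parse/serve the record here */
        }

        munmap(base, st.st_size);
        return 0;
    }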
In the read-only case, pretty much any embedded DB with a large userspace cache configured won't read data back in redundantly.
In the specific case of LMDB, this is further extended since read transactions are managed entirely in shared memory (no system calls or locks required), and the cache just happens to be the OS page cache.
Per a post a few weeks back, the complete size of the HN dataset is well under 10GB; it comfortably fits in RAM.