I know it's crazy talk, but glancing at my own profile, I count maybe 100 bytes of data? Yet to represent that data in memory, it's going to blow up to 4096 bytes plus structs to represent the inode and directory entry/entries because you put each profile in its own file.
By that count, you might get somewhere near a 40x cache utilization improvement if you just used a real database like the rest of us do - even just an embedded database.
This is of course before saying anything about the transactional safety of writing directly to the filesystem.
Sometimes it isn't worth the effort to fix the old, and instead just go to the new and improved.
When the load balancer for reddit broke once, we didn't bother fixing it, we just replaced it with better (though untested) technology on the assumption it would work better. We figured it couldn't be any worse than it was, and we'd rather spend our limited time moving forward instead of treading water.
We considered this pretty seriously, and it might be required some day, but we think we'll be able to incrementally move toward a more highly available, better performing architecture without a continuity break.
The HN code base is as much an experiment in Arc as it is a social news site. Also the feature set is surprisingly different (although I wish some of the features were here like comment collapsing and async comment submission).
Maybe it would be too much work to rewrite the specific logic that prevents HN 'manipulation' for reddit? Although I suppose much of the behavior and target audience is similar...
See followup comment below. You're confusing logical atomicity with physical consistency and durability: yes, these operations may have certain atomicity guarantees from the perspective of the application, but they are entirely asynchronous from the perspective of the storage medium unless you explicitly fsync(), and on Linux, for example, even then the default behaviour of ext4 is to allow metadata updates to complete prior to data updates (no "write barrier").
I understand the problem scenario with ext3/ext4 journalling you're referring to here and below.
However, HN runs on FreeBSD, and my understanding is that the combination of soft updates + journalling there actually does provide atomic rename, even in the case of catastrophic failure. McKusick talks about it here: http://www.mckusick.com/softdep/suj.pdf
Also, just to anchor the discussion a bit, the HN code does use the "write foo.tmp; mv foo.tmp foo" trick all over the place. (Or at least, the most recent version of news.arc I've seen does.)
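For anyone who hasn't seen the trick, a C sketch of the pattern under discussion looks something like the following. This is not the actual news.arc code (which is Arc); save_profile() and the paths are made up for illustration, and the fsync() calls are exactly the part the ext3/ext4 discussion is about: without them, the rename can reach disk before the data it points to.

    /* Sketch of the "write foo.tmp; mv foo.tmp foo" pattern, with explicit
     * fsync() calls; the function name and paths are hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int save_profile(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {  /* data on disk before the rename */
            close(fd);
            return -1;
        }
        close(fd);
        if (rename(tmp, path) < 0)      /* atomically replace the old file with the new one */
            return -1;
        int dfd = open(".", O_RDONLY);  /* should be the directory containing path; "." for brevity */
        if (dfd >= 0) {
            fsync(dfd);                 /* make the rename itself durable */
            close(dfd);
        }
        return 0;
    }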
You said POSIX, which makes no such guarantee. Soft updates are cool, though as far as I know they still don't provide durability. Still, that's far better than the default Linux behaviour.
And this is one of many reasons why you should use ZFS for your data. ZFS guarantees the atomicity of renames and would not have this problem. On Solaris and FreeBSD at least. I don't know about ZFS on Linux.
Using less virtual memory to store user data doesn't imply better cache use, only a smaller cache size. The tradeoff is memory versus CPU: with a database you're using more CPU. It's a fair bet that your memory capacity will increase at a greater rate than your CPU, not to mention that memory costs less to power and/or cool, and the simpler software architecture is easier to support. Wasting memory is a simple hack to increase performance and decrease complexity.
> This is of course before saying anything about the transactional safety of writing directly to the filesystem.
1) transactions aren't made or broken by what or when they're written, they're made or broken by being verified after being written, and 2) this is a user forum for people to comment on news stories, not an e-commerce site. Worst case the filesystem's journal gets replayed and you lose some pithy comments.
That's incorrect: with the right database, you're potentially trading several system calls (open, read, close) and their associated copies, which have high fixed costs, for no system calls at all. I've spent most of the past year working with LMDB, and can say decisively that filesystems can in no way be competitive with an embedded database, simply by virtue of the UNIX filesystem interface.
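To make the comparison concrete, a read from LMDB looks roughly like the sketch below. The database path and key are made up and error handling is omitted, but the calls are the stock liblmdb C API:

    #include <lmdb.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;

        mdb_env_create(&env);
        mdb_env_open(env, "./profiles.lmdb", MDB_RDONLY, 0664);  /* maps the whole DB once */

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);   /* read txn: a slot in shared memory */
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = strlen("pg");
        key.mv_data = (void *)"pg";
        if (mdb_get(txn, dbi, &key, &data) == 0)      /* returns a pointer into the map */
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);

        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }

The environment is mmapped once at open; after that a lookup is a B-tree walk in userspace, and mdb_get() hands back a pointer into the map rather than copying through read().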
> this is a user forum for people to comment on news stories, not an e-commerce site
That much is true, though based on what we've learned in the parent post, until today all passwords on the site were stored in one file. Many popular filesystems on Linux exhibit surprising behaviour when rewriting files, unless you're incredibly careful with fsync and suchlike. For example, http://lwn.net/Articles/322823/ is a famous case where the decades-old traditional approach of writing out "foo.tmp" before renaming it to "foo" could result in complete data loss should an outage occur at just the right moment.
So you're saying LMDB looking up a user-specific record and returning it will always be faster than either an lseek() and read() on a cached mmapped file [old model] or an open(), read(), close() on a cached file [new model]? Is the Linux VFS that slow?
In terms of transaction guarantees, I thought the commenter was talking about the newer model where each profile is an independent (and tiny) file; if that's the case, then deleting and renaming files wouldn't be necessary, and any failure in writing could be rolled back via the journal rather than leaving a file that's now non-existent or renamed. From what I understand, the worst the ext4 issue could do to this newer model is revert newly created profile files, which again I think would be a minor setback for this forum.
A serious database can use raw partitions with no filesystem for storage. Even when storing data on a filesystem, a database is unlikely to be using a single file for each entry; the database might make one mmap system call when it starts, and none thereafter (simplified example). The point is that the database can do O(1) system calls for n queries, whereas using the filesystem with a separate file for each entry you're going to need O(n) system calls.
You could of course avoid this problem by using a single large file, but that has its own problems (aforementioned possibility of corruption). Working around those problems probably amounts to embedding a database in your application.
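Here's a sketch of the O(1)-system-calls point: one open(), one fstat(), one mmap() at startup, then any number of lookups with no further system calls. The file name and fixed 256-byte records are assumptions made up for this example, and a real database obviously does far more, but the syscall pattern is the same.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define RECORD_SIZE 256  /* hypothetical fixed-length profile records */

    int main(void)
    {
        int fd = open("profiles.dat", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED)
            return 1;

        /* From here on, looking up record i is pointer arithmetic, not I/O. */
        for (off_t i = 0; i < st.st_size / RECORD_SIZE; i++) {
            const char *record = base + i * RECORD_SIZE;
            (void)record;  /* parse/serve the record here */
        }

        munmap(base, st.st_size);
        return 0;
    }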
In the read-only case, pretty much any embedded DB with a large userspace cache configured won't read data back in redundantly.
In the specific case of LMDB, this is further extended since read transactions are managed entirely in shared memory (no system calls or locks required), and the cache just happens to be the OS page cache.
Per a post a few weeks back, the complete size of the HN dataset is well under 10GB; it comfortably fits in RAM.