Hacker News was down all last night. The problem was not due to
the new server. In fact the cause was embarrassingly stupid.
On a comment thread, a new user had posted some replies as siblings
instead of children. I posted a comment explaining how HN worked.
But then I decided to just fix it for him by doing some surgery in
the repl. Unfortunately I used the wrong id for one of the comments
and created a loop in the comment tree; I caused an item to be its
own grandchild. After which, when anyone tried to view the thread,
the server would try to generate an infinitely long page. The
story in question was on the frontpage, so this happened a lot.
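
To make the failure concrete, here is a minimal sketch in Python of a thread renderer that refuses to follow a link back into its own ancestors. It is not HN's code. The `kids` dict (item id mapped to a list of child ids) and the name `render_thread` are assumptions made up for illustration; the post does not describe how HN actually stores items.

```python
# Hypothetical model: kids maps an item id to the ids of its child comments.
def render_thread(kids, root):
    """Return (lines, cycles).

    lines  -- one indented line per item reached from root
    cycles -- (parent, child) links that point back at an ancestor
    """
    lines, cycles = [], []

    def walk(item, depth, ancestors):
        lines.append("  " * depth + str(item))
        path = ancestors | {item}  # this item plus everything above it
        for kid in kids.get(item, []):
            if kid in path:
                # Following this link would repeat the same items forever.
                cycles.append((item, kid))
                continue
            walk(kid, depth + 1, path)

    walk(root, 0, frozenset())
    return lines, cycles


# Item 2 is its own grandchild: 2 -> 3 -> 2.
kids = {1: [2], 2: [3], 3: [2]}
lines, cycles = render_thread(kids, 1)
```

Without the `kid in path` test, the walk over this example would never finish, which is roughly what the server was attempting on every view of the thread.
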
For some reason I didn't check the comments after the surgery to
see if they were in the right place. I must have been distracted
by something. So I didn't notice anything was wrong till a bit
later when the server seemed to be swamped.
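
A check along these lines, run before moving a comment, would probably have caught the wrong id. This is only a sketch under assumptions: `parent_of` (item id mapped to parent id, with top-level items absent) and `would_create_cycle` are hypothetical names, not anything from HN's source.

```python
# Hypothetical model: parent_of maps an item id to its parent's id.
def would_create_cycle(parent_of, item, new_parent):
    """True if making new_parent the parent of item would put item
    somewhere above itself in the tree."""
    seen = set()
    node = new_parent
    # Walk up from the proposed parent; stop at the root, or if the
    # data already contains a loop that does not involve item.
    while node is not None and node not in seen:
        if node == item:
            return True
        seen.add(node)
        node = parent_of.get(node)
    return False


# Tree: 1 is the story, 2 and 4 reply to it, 3 replies to 2.
parent_of = {2: 1, 3: 2, 4: 1}
bad = would_create_cycle(parent_of, 2, 3)   # move 2 under its own child
good = would_create_cycle(parent_of, 4, 3)  # move 4 under 3
```
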
When I tailed the logs to see what was going on, the pattern looked
a lot like what happens when HN runs short of memory and starts
GCing too much. Whether it was that or something else, such problems
can usually be fixed by restarting HN. So that's what I did. But
first, since I had been writing code that day, I pushed the latest
version to the server. As long as I was going to have to restart
HN, I might as well get a fresh version.
After I restarted HN, the problem was still there. So I guessed
the problem must be due to something in the code I'd written that
day, and tried reverting to the previous version, and restarting the
server again. But the problem was still there. Then we (because
by this point I'd managed to get hold of Nick Sivo, YC's hacker in
residence) tried reverting to the version of HN that was on the old
server, and that didn't work either. We knew that code had worked
fine, so we figured the problem must be with the new server. So
we tried to switch back to the old server. I don't know if Nick
succeeded, because in the middle of this I gave up and went to bed.
When I woke up this morning, Rtm had HN running on the new server.
The bad thread was still there, but it had been pushed off the
frontpage by newer stuff. So HN as a whole wasn't dying, but there
were still signs something was amiss, e.g. that /threads?id=pg
didn't work, because of the comment I made on the thread with the
loop in it.
Eventually Rtm noticed that the problem seemed to be related to a
certain item id. When I looked at the item on disk I realized what
must have happened.
So I did some more surgery in the repl, this time more carefully,
and everything seems fine now.
Sorry about that.
Might be a good time to mention Rubber Duck Debugging. http://en.wikipedia.org/wiki/Rubber_duck_debugging