Why HN was down

DanielBMarkham · on Feb 18, 2013

Amazing that such a large percentage of debugging involves determining exactly what you are debugging. The definition of the problem, many times, is the solution.

Might be a good time to mention Rubber Duck Debuggging. http://en.wikipedia.org/wiki/Rubber_duck_debugging

davesims · on Feb 18, 2013

A few times a month, I'll look up at one of my colleagues and say, "hey, got a sec? I need to talk to the duck," and they know this means I'm going to talk to their head but they can basically keep doing what they're doing and nod occasionally.

This serves several purposes:

(1) It's less insane-sounding than actually talking to an inanimate object in an open work environment.

(2) It actually feels better and forces me to think more clearly when I'm talking to an actual person -- the cognitive focus is higher when the object of conversation can actually, in theory, think and talk back (YMMV).

(3) And finally, although it does require some focus on the part of the other coder, it's not nearly as taxing to them as actually helping me solve the problem or pairing up with me.

So it's a good compromise somewhere between pair programming and talking to an actual rubber duck. Again, YMMV. Maybe I'll call it "Pair Ducking."

confluence · on Feb 18, 2013

I call it the House method :)

You bring a detailed problem and break it down, and talk about it to someone else (who often isn't qualified to answer your questions due to knowledge/time constraints) - and in doing so - resolve the problem by challenging one's own assumptions.

This was effectively how every House episode was resolved.

bbrizzi · on Feb 19, 2013

When I'm stuck on a problem for way too long, I start typing it out in Stack Overflow. Usually by the time I'm done describing it, I've already solved it.

restlessmedia · on Feb 19, 2013

I've lost count the amount of times I've done that. Also, I'm probably the top of the pops in answer replies to my own questions.

junto · on Feb 19, 2013

I also feel guilty when I do this, but at least the answer helps others who might have the same question.

autotravis · on Feb 19, 2013

I think stackoverflow encourages[1] this, so no need to feel guilty.

1. http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answ...

troebr · on Feb 19, 2013

Haha, we called it the House method too, we ended up making a cardboard humanoid for when nobody was available.

kamjam · on Feb 18, 2013

This is so true, probably about 90% of the time my colleagues call me over about a problem they are facing, explain in detail what the problem is and then eureka! Most of the time it would actually take me much longer to figure out the exact issue since I don't know the ins/outs and subtleties of the code but it's exactly as you say.

Of course, it makes me look really good cos I just "helped" them solve their issue :)

cientifico · on Feb 18, 2013

Maybe because the solution is in asking the good questions.

kamjam · on Feb 18, 2013

Yup, for sure it is, sometimes just having someone else there and having to run through all the steps for them points out the obvious. There was another user further down that said he solved a lot of his own problem just typing them out with enough detail to be able to post on StackOverflow. Same principle.

jsmeaton · on Feb 18, 2013

On occasion, I'll write a question on stackoverflow and re-read it a few times before hitting submit just in case I get that eureka moment. I think I've written way more non-submitted questions than submitted questions.

kyrias · on Feb 18, 2013

You should consider submitting the question and answering it yourself, might help someone else.

Brandon0 · on Feb 18, 2013

I do this a lot too. In fact more often than not I end up not posting the question because either I solve the problem or I think of a possible solution I should go try first. I think we should start calling it Digital Rubber Ducking.

hfsktr · on Feb 18, 2013

I am looking for an excuse to make digitalrubberduck(y/ie).com

If I was a bit more clever I feel like there is a use there.

tekromancr · on Feb 19, 2013

Just make a programming themed chatterbot,drop some ads (If you are so inclined), and you are golden! Someone posted a plugin for IntelliJ that allows you to do just that. I would use a web based one of it existed.

danohuiginn · on Feb 19, 2013

I have occasionally used eliza for this (the basic chatbot in emacs and elsewhere). I'm sure with a bit of tuning you could make a debugging-centred variant of it.

tekromancr · on Feb 19, 2013

What do the Unit Tests say?

Have you tried running a debugger and stepping through the code?

Hmm, go on.

Wait a sec, can you reexplain that last bit?

enjo · on Feb 19, 2013

I have a whiteboard in a closet I use for this... seriously...I go in there and start jotting down notes and talking to myself.

It fascinates me how quickly I usually find the answer.

lemming · on Feb 18, 2013

Amusingly, IntelliJ has a plugin to do exactly this:

https://sites.google.com/site/codeconsultantplugin

davesims · on Feb 19, 2013

That's fantastic. I use IntelliJ for Android stuff, installing now. They should add a 'Duck' mode that just brings up a big picture of the rubber duck.

Swizec · on Feb 18, 2013

Working from home I tend to just write out my thoughts on a piece of paper. It works perfectly.

timv · on Feb 18, 2013

When I work from home (and I did it exclusively for 8 months of last year) I ended up talking to my wife (my 1 yo daughter wouldn't stand still for long enough).

My wife is a social worker by training, so it was pretty rare (though not unheard of) for her to be able to give me real input, but over the years I've trained her well enough to follow most of what I'm saying and nod at the right points :)

pseut · on Feb 18, 2013

I send myself an email for the same purpose and set up an alias for /dev/null. It makes keeping notes a lot easier since I have the copy in sent mail and can reply to it as needed in the future. (this approach seemed less crazy before I typed it out...)

z-e-r-o · on Feb 18, 2013

For me, this should be called stackoverflow debugging. I genuinely solved a lot of my problems by trying to write a _good_ question on SO about my problem. The problem seems really difficult when I try to ask it in one sentence, just out of my head. However once I try to describe the background, what I'm trying to achieve, what I'm using, when does the problem happen, simplified down to sub-cases, usually by the time I'd be 80% ready with writing the question, I realize the answer.

Semaphor · on Feb 18, 2013

That happens to me a lot. Most of the time I just formulate the question I have in my head into something coherent and by that point I either have the solution, know what to search for or, in case it's not a question but a comment, I realize it's not worth saying.

confluence · on Feb 18, 2013

I'm a serial SO self-answerer. I write really in depth, complicated questions for complicated problems, with code, data and testable cases - and by the time I've finished the question and posted it - I've figured out the solution - or I'll have it a few hours later.

I usually just leave the question/answer online so that others can benefit for it.

robotic · on Feb 18, 2013

I've made several posts to SO and then realize the answer moments later. I usually just self-answer.

elcodedocle · on Feb 18, 2013

Same here: I've been working on a couple of projects by myself for the most part of last year, and when even the duck failed, I could usually figure out an answer just by trying to find the words to post my problem in SO in a way somebody would take the time to read it and be able to answer it. I don't recommend it as a first approach, though, since it's quite time consuming (Or maybe I should blame it on not being a native speaker...)

aroman · on Feb 18, 2013

Likewise for me, but with IRC. Though I suppose I should try asking on SO first to save myself the semi-public embarrassment ;)

s_baby · on Feb 18, 2013

Yup, the incentive is there to state your problem as clearly as possible to get back a good response. By doing this I answer my own question half of the time.

furyofantares · on Feb 18, 2013

There is a line from Futurama that perfectly applies ton a lot of debugging.

Farnsworth: My God, is it really possible?

Fry: It must be possible, it's happening.

Fry: By the way, what's happening?

mhurron · on Feb 18, 2013

Extremely appropriate as Fry is his own grandfather and the site software can't handle that relationship.

Aardwolf · on Feb 18, 2013

That's one of my favorite lines from Futurama, 'Ohh, a lesson in not changing history from Mr. "I'm My Own Grandfather"!'

robinh · on Feb 18, 2013

Curiously, that episode was on TV where I live just an hour ago.

brehaut · on Feb 18, 2013

http://en.wikipedia.org/wiki/Pareidolia

robinh · on Feb 18, 2013

Oh, I know. Just wondered if the person who posted the comment above mine had just seen that particular episode, too.

banachtarski · on Feb 18, 2013

Is this the forward time machine episode??

I love futurama more than any man could love any tv show.

short_circut · on Feb 18, 2013

I feel like your name reflects that fact. It seems to be a reference to the Banach-Tarski duplashrinker

ztravis · on Feb 18, 2013

More likely it is a reference to its eponym, the Banach-Tarski paradox.

banachtarski · on Feb 19, 2013

It's both :) I studied math too.

furyofantares · on Feb 18, 2013

It is -- it's when they are observing the second big bang.

Someone · on Feb 18, 2013

Amazing? for anyone who has read Polya's "how to solve it" (http://en.wikipedia.org/wiki/How_to_Solve_It), that is hardly surprising.

If you don't understand your problem, you can't make a plan. If you can't make a plan, you can't execute it.

Another interesting lesson from that book is that one should spend time on evaluation (how did this come about? Could We have fixed this sooner? How are we going to prevent it in the future?)

biot · on Feb 18, 2013

People at work are amazed when I successfully debug an issue over the phone. In reality, it amounts to 50% experience plus another 50% of Sherlock Holmes: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth". Once you've identified what you're dealing with via a few strategic questions, it becomes simple quite rapidly.

kamjam · on Feb 18, 2013

One of my favourite debugging tips has always been "give everyone full access to the folder/service" and see if the problem is "fixed". If so revert and now apply the correct permissions. I've seen this come up so many times although my superiors always complained "it's not the right way to do it", whilst I agree "every full access" is bad, this was for debugging purposes only!

65a · on Feb 18, 2013

Debugging is often best accomplished as a binary tree search aimed by familiarity/experience. Once you can put bounds on the search, it becomes possible to get the answer in just a few questions.

Totally agree.

hobs · on Feb 19, 2013

This is the best way to work through debugging/troubleshooting as far as I can tell, amazingly a skill many people lack, and others that just understand it intuitively without it ever thinking about it. That is one of the big divisions between hackers and everyone else in my mind.

chii · on Feb 19, 2013

i think perhaps some people just don't see the world in a hierarchical way, so in their frame of mind, the problem is intractable.

snowwrestler · on Feb 19, 2013

It's amazing how often I am able to fix problems by simply trying all the possible solutions--often while colleagues are saying things like "stop wasting time, it can't possibly be that." But of course often it is "that".

jgrahamc · on Feb 18, 2013

The sort of debugging seems to have been around since the very beginning: http://blog.jgc.org/2010/05/talking-to-porgy.html

RyanMcGreal · on Feb 18, 2013

I'm not sure if rubber duck debugging would have helped here. The problem was in the data, not the code. (I know, I know: in Lisp code is data.)

chernevik · on Feb 18, 2013

That's exactly the sort of thing a duck will tell you. "So what has changed? Let's see, new code, new server, and I fixed the commenter's comment. That was simple I just hard-hacked the comment id and . . . excuse me a second."

RyanMcGreal · on Feb 18, 2013

Good point. I was thinking in terms of going through the code line by line, which if anything would lead you away from the trail.

DanielBMarkham · on Feb 18, 2013

Yep. I thought this through as I was typing my comment.

(There must be some joke involving the use of a meta-duck, but I can't come up with it. :) (Same principle applies, of course, just LISP makes the determining of "what" a bit more tricky. (insert discussion here about the general differences between debugging imperative and functional code)))

cube13 · on Feb 18, 2013

Rubber duck debugging may have actually been the distraction that caused pg to make the mistake, too!

_d8fd · on Feb 18, 2013

This. Even with the best test coverage in the world, you still bump into edge cases that you couldn't have predicted. As a former QA Engineer, I used to say there's still room for QA in a test driven environment. Now I say there's no replacement for a sharp mind with enough knowledge, curiosity, and good judgement.

mvzink · on Feb 18, 2013

This is also why pair programming is so great.

ryen · on Feb 18, 2013

Not easy on a Sunday while at home.

lifeisstillgood · on Feb 18, 2013

There are a number of comments that add up to "what steps will you take to ensure this does not happen again" - akin to a incident review. As speculation that's fine, as advice, I don't think it should be listened to.

I am reminded of an long-in-the-tooth sysadmin of my acquaintance who logged in everywhere as root. His theory - "they are my boxes. I screw it up, I fix it." I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

An awful lot of sites at 1.2m views would have outsourced the running and development of the whole thing - there are entreprenuers who say its not even worth our time to code up the MVP. I find this approach sensible from a business point of view, but still it does not sit right with me.

I am supposed to have a nice website with lots of good content to attract inbound marketing - so I tried getting someone on textbroker to write an article for me. It read like a High School essay - no life, no anime. And so I will probably write my own CMS and my own content.

And pg sits there and writes his site in his own language, with his own moderation tools. Apart from the hilarious idea he could find a ten person ruby shop to outsource to, its nice to see someone taking the time to play again. Its why I like to see jgc on here too.

I am not entirely sure those thoughts are joined up (I am procrasting like crazy) but if they come to mean anything its we are playing in pg's sandbox. If the sand leaks it's his sand, and the only company this is mission critical to is YC.

lmm · on Feb 18, 2013

Typing sudo won't save you, but using a higher-level interface will. Everyone I've ever known to change something in the database by hand, everyone at all, even on a hobby project that they know like the back of their hand, has screwed it up sooner or later. At some point the pain tells you you should stop doing that, and you create an admin tool that lets you do what you need to repeatably and safely.

3pt14159 · on Feb 18, 2013

I've never screwed it up on a live database, but I do take about 5 mins, first reviewing the keys, the type, whether or not something can be null, checking to see if critical columns have

    select count(distinct column_name) having count(distinct column_name) > 1;

To make sure that there isn't an underlying uniqueness assumption.

Sure I could do it in 10 seconds and save myself 290 seconds (a 97% savings!) but then one day I'd have to scramble like crazy in the middle of the night trying to figure out what I screwed up for hours on end.

I'm not saying don't build an admin tool, obviously those are needed for things like banning users, but just get in there and carefully fix the data if something is wrong.

mafro · on Feb 18, 2013

This. Back in the day when I was in more of an analyst role, I ended up /having/ to hack on the live DB frequently (reasons for this were myriad).

1. Always, always make a backup just before the hack.

2. Write a small set queries like 3pt14159's to check uniqueness and other pertinent properties.

3. Write a SELECT query to show the data you are going to change.

4. Borrow the WHERE clause from 3, and write your UPDATE statement.

5. Run 4, and then run 3 again to see that you successfully fixed it.

6. When it goes wrong, restore the backup from 1 :0

grey-area · on Feb 19, 2013

Don't you have a dev db somewhere that you can replicate the live db to? Time spent setting that up will be more than repaid by the time and stress saved when you have to do a quick fix - you can simply run your changes, check it all works on your replicated site, and then make the changes on your live db (preferably with some sort of migration tool which applies the same sql and backs up first). If you have a regular backup process you could tie into that to populate the dev database.

Even if you can't replicate the entire live db, if you can automate backup, deployment of changes and test first elsewhere it makes the entire process far less fraught.

stusmith1977 · on Feb 19, 2013

I'm going to add step 5b - save what SQL you executed (against what server, and for what reason), ideally in source control, as an audit trail.

Otherwise, I end up having this conversation (which actually happened):

Him: <Big Client> is having troubles! Features X, Y, and Z aren't working! Me: Hmm, has anything changed? It was all OK on Friday. Him: No, nothing's changed. Me: Really? Him: Well I ran a bunch of scripts on Saturday while I was visiting them. Me: OK, so what exactly did you run? Him: Just a bunch of scripts.

mkubler · on Feb 19, 2013

As a tip - Also backup your staging database and have all backups using something like Rsnapshot or maybe even in a version control system, something which does point in time backups.

I learnt this after I inherited a project which had been written by some Romanians and it was pretty horrible. There was no MVC framework, it was a hacked together mess.

Somehow the live site started using the staging database instead of the production database, both were on the same server. Every time we (the devs) pushed to staging a script would grab the latest version of the live database and overwrite (drop tables) the staging database. The assumption being that the staging database is a bit like a demo server, changes made to it are temporary and just for testing, but that it should look as similar to the main website (but updated) as possible. The production database was backed up in about 5 different ways, but the staging database wasn't backed up at all.

After about a week of vanishing books, books which authors had uploaded to the self publishing portable with descriptions and other information, we realised what was wrong. Their files stayed but their accounts and book details were wiped.

In another epic fail on the same server I later moved the root folders by running the following as root (I'd probably have been stupid and run the same command if not as root but I'd have put sudo in front of it). > cd /home/<username>/public_html/public_html > mv /* ../

I was meant to mv ./* (files from the current directory into one below cause they'd been copied across into the wrong folder. Needless to say moving the root folders such as /etc and especially /lib and /bin is a BAD idea. Although is fixable, but that's another story.

gbog · on Feb 19, 2013

6. Should be ROLLBACK, a life saver. Works in postgres.

furyg3 · on Feb 19, 2013

Maybe I'm old school, but shouldn't this be done in an dev or acceptance environment?

I hack on the "live" DB every day, and by live I mean i sync this DB to another environment, try it out, run it on prod.

gav · on Feb 19, 2013

One of the things I prefer to do is to only write UPDATE statements that update a single row. For example instead of:

UPDATE line_items SET quantity = 1 WHERE quantity < 1;

I'd script the following updates:

UPDATE line_items SET quantity = 1 WHERE quantity < 1 AND id = 123;

For each of the individual rows that needed to be changed. Then I have a check that I'm really updating just the rows I expect, this is especially important to me where the UPDATE involves joins, as I find this is the trickiest to get right.

Swannie · on Feb 19, 2013

Is there any other way??? :)

I thought everyone did this - well... for small datasets, skip the back, use a transaction. Rollback if your step 5 failed and try again.

jonob7 · on Feb 18, 2013

This is pretty much exactly how I do it.

I still sometimes get that sinking feeling in the stomach that I have screwed something up, usually just after I hit the 'execute' button. And I really don't want to have to take the site down to run the restoration.

MartinCron · on Feb 19, 2013

This reminds me of a feature that I wish that database systems supported: Make it impossible to execute DELETE or UPDATE statements without a WHERE clause.

luser001 · on Feb 18, 2013

> I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

IIRC, sudo logs all commands to syslog. Which might come in handy. Yes, root commands will be logged by bash to .bash_history, but there are limits of # of commands lots, what happens if you are logged in multiple times into same account etc.

Anyway, that's why I like sudo.

wpietri · on Feb 18, 2013

Beyond the logging, which I love, I use it for the differentiation of states. I'm just a little more attentive when I type "sudo" before something.

At work, a relatively young engineer accidentally typed a command meant for a test database server into a production window. There was a big rush to restore from backups, and there was a small amount of data loss.

One thing that came out of the retrospective, requested by the engineer in question, was the production hat. Before you opened up a connection to a production machine, you had to put on the large pirate hat. You could only put it back when you had closed the connections. I didn't really need it, but it was a great way for people to learn the necessary caution.

It also ended up being a nice exclusive lock on futzing with production, and seeing it in use led to some good discussions that otherwise might not have happened. But the main thing was developing a strong differentiation of states in everybody's heads.

darkarmani · on Feb 18, 2013

Another cheap solution is colorizing the prompt red for dangerous consoles and green for dev machines. This makes it very easy to notice when you selected the wrong terminal.

Spoom · on Feb 18, 2013

I use a different background color for the terminal in question; in this case, I use a dark red to signify a production system, and a dark blue to signify a development system. I find it quite useful. You can do this through Xterm profiles (Edit -> Profiles, Terminal -> Change Profile) in Linux and OS X, and I'm sure there's a way to do it in Windows / PuTTY.

lifeisstillgood · on Feb 18, 2013

I have tried the red / green console but never a pirate hat.

My son now definitely thinks work is like his school :-)

UnoriginalGuy · on Feb 18, 2013

Plus security. With root login disabled a remote attacker won't have a known username to attack.

kisielk · on Feb 18, 2013

I find most of the time I'm using sudo I don't want to type it before every command and so I use sudo -i which pretty much negates the benefit of logging anything other than to tell that I was sudo at some point.

stcredzero · on Feb 18, 2013

> It read like a High School essay - no life, no anime. ... I am not entirely sure those thoughts are joined up (I am procrasting like crazy)

Your procrasting like crazy has much anime.

philsnow · on Feb 18, 2013

s/anime/animus/ ? or is this a new usage of "anime"?

lifeisstillgood · on Feb 18, 2013

s/anime/anima - as in soul, vitality

(Not so much Jung's inner woman)

I think the sentence does read better if it is complaining there are not enough cyberpunk Japanese comics on my site though :-)

lalc · on Feb 18, 2013

Here I was thinking you were referring to the Japanese meme "No ___, No Life!"

dasil003 · on Feb 18, 2013

I'm not sure whether it's terrifying or relieving to realize that if all I dream of comes to pass and I achieve something akin to the legendary status of pg in the hacker community that I will still be susceptible to the inevitable facepalm moments that come with direct database access.

In any case I am thankful for the detailed explanation.

larrys · on Feb 18, 2013

Some of the most spectacular airplane crashes are by the most experienced pilots.

If you've ever tried something new as a hobby you tend to be very careful. Once you gain confidence you take more chances and don't do what even a beginner might do.

dasil003 · on Feb 18, 2013

Too bad we rarely get a postmortem on batshit insane production hackery that actually goes off without a hitch.

shoden · on Feb 19, 2013

I would like to turn this into a poster or a t-shirt.

stcredzero · on Feb 18, 2013

There must be a rare personality type that never experiences this kind of overconfidence. Perhaps a less glamorous cousin to the Buddhist beginner's mind?

danielweber · on Feb 18, 2013

There are certain classes of autistic people who are very good at always following the rules and finding people who are not. And in certain places they are exactly what you want.

http://www.nytimes.com/2012/12/02/magazine/the-autism-advant...

larrys · on Feb 18, 2013

I find that if I am doing something "dangerous" (example might be using power tools) I have to say to my self "be careful this is dangerous" to avoid being on autopilot and making casual errors. Maybe a better example is the way you train yourself after you've picked up a box the wrong way and pull something to try to remember each and every time to watch your specific movements.

kamjam · on Feb 18, 2013

But at some point, you will become complacent, and it will taken a mistake to remind yourself again.

We've all done it. I shut down an NT4 production server because I was connected via remote desktop and clicked shutdown rather than log off. This was back in the day when there was no pop-up asking for reason you want to shut down and confirmation.

Luckily it was just our internal intranet server!

yuhong · on Feb 18, 2013

You were running NT4 Terminal Server Edition?

kamjam · on Feb 19, 2013

Yes, i think so. It was so long ago and it was my first programming role!

yuhong · on Feb 20, 2013

AFAIK that edition had logoff instead of shutdown on the Start menu for that reason. You can still access shutdown by hitting Ctrl-Alt-Del on the console or clicking Windows NT Security.

kamjam · on Feb 20, 2013

It was so long ago I have no idea. I defo vividly remember that I shut it down via the start menu, just one of many moments that stick out :)

dasil003 · on Feb 18, 2013

I have to remind myself this every time I start up Sequel Pro now since in the last release they switched the command keys for Run Selected... and Run All...

ansible · on Feb 18, 2013

I agree. I've been told that with motorcycle riding, the first 10K miles are the most dangerous. This is when you've gotten out of the newbie stage, but don't yet understand your own limits nor the bike's limits.

Eliezer · on Feb 19, 2013

I thought it was a pretty cool error. I mean, you've got to not screw up on a lot of boring things before you can screw up this interestingly. Most failures are much more boring.

Joeri · on Feb 18, 2013

The amusing part is that no matter how legendary you become, restarting the server is always a good idea to solve problems. Software is rarely designed to run forever. Last week i had a moment of madness because a line of code remained buggy even after i debugged it. Turned out it was php's opcode cache that just needed a reset to get its wits back.

hnriot · on Feb 18, 2013

he's not legendary for his IT skills.

mark-r · on Feb 19, 2013

It was his IT skills that got him the big sale to Yahoo that got him the bucks to start YC. Not sure where this comment came from.

papsosouid · on Feb 19, 2013

You really think yahoo bought viaweb because of pg's legendary ability to reboot servers and type "./configure && make && sudo make install"?

jbverschoor · on Feb 18, 2013

Now he is ;)

dasil003 · on Feb 18, 2013

Low blow.

sehugg · on Feb 18, 2013

Great postmortem and good lessons to learn here:

* Don't manually modify database without a well-tested procedure and another pair of eyes

* Don't leave persistent problems (e.g. memory problems) uninvestigated so that you miss new problems with similar symptoms

* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem)

I'm pretty sure I've repeated this exact same sequence before with similar results...

JonM · on Feb 18, 2013

* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem) ^^ absolutely!

scotty79 · on Feb 18, 2013

* while you are displaying a tree keep track of the items you already displayed so you can detect a cycle

sukuriant · on Feb 18, 2013

I think the assumption there was that it was safe, since the code disallowed this from happening, naturally.

scotty79 · on Feb 18, 2013

Assumption is the mother of all screw ups.

Even if you think that the code that creates and modifies your data will not put it in some undesired state, the code that uses this data should assume that the data may be in all undesired states you can dream up and should do its best not to do something seriously bad when that happens (like landing in endless loop/recursion or executing possibly user provided strings).

zzzeek · on Feb 18, 2013

use CHECK constraints to prevent invalid data patterns when possible.

erichocean · on Feb 18, 2013

Sadly, even that isn't enough.

In our production database, I used CHECK constraints religiously. Worked great.

Then one day, I was no longer able to commit ANY transactions to a particular table, even completely innocuous ones.

The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

Moral: not even CHECK constraints will save you.

----

P.s. This was a proprietary database, and when I reported the problem to the vendor (eventually, I figured out how to reproduce it), the vendor actually refunded our (expensive) support contract rather than fix the bug -- they couldn't figure out how to fix it despite having a small bug report that reproduced the problem.

In the end, I actually had to remove the CHECK constraint altogether. :(

zzzeek · on Feb 18, 2013

> The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

I've never heard of that, and unless you're using a really buggy, broken database, it should not be possible.

> P.s. This was a proprietary database, and when I reported the problem to the vendor

well there you go. I think in practice, a simple CHECK constraint like the one we'd do here (literally, comment_id > parent_comment_id) is pretty easy to put one's faith into.

papsosouid · on Feb 19, 2013

That's hard to do when you are afraid of databases and just store everything in files.

neurotech1 · on Feb 18, 2013

This should serve as a example template for how to accurately and transparently explain to users what went wrong. No deflecting blame, no useless platitudes.

Credit to PG, RTM and the rest of the team for keeping the sites uptime as high at it is.

jgrahamc · on Feb 18, 2013

"No deflecting blame"

Who were they going to blame?

wpietri · on Feb 18, 2013

He could have blamed the new server. Or whatever distracted him. Or the user, for being dumb. I've seen people do all of those.

Or he could have just dodged the blame entirely.

_euac · on Feb 18, 2013

Some people are just freaking difficult to work with. I've worked with people that wouldn't accept responsibility even after every other possible cause was ruled out. I've even gotten this reply: "well, you must have been unlucky to get the faulty e-mail, because it seems to work most of the time". Yeah, because that's how programming works: cowboy coding and hoping for the best.

This guy actually called bugfixes "optimizations": "hey, X the feedback widget on the front page isn't working", "oh, yeah, I haven't worked on that because that's code that needs to be 'optimized', so it's low on my issues list". Ugh.

I've learned my lesson now. In fact, that's an incredible lesson for a startup founder: never, ever, hire someone who dodges a question on an interview. And the first time they avoid taking responsibility for something that was clearly their fault, fire them. The last thing you want is someone who'll blame everyone and anything else for their issues. It's a great way to kill morale and create rifts in a small team.

paganel · on Feb 18, 2013

> And the first time they avoid taking responsibility for something that was clearly their fault, fire them.

I guess there's still difficult for a lot of people to acknowledge their own mistakes, maybe because they're afraid of getting fired for that (acknowledging the mistake), which in the case of startups/small companies happens very rarely.

From my own experience of working at startups for my entire professional career as a programmer (7 and a half years) I can tell you that the first step when noticing you f.cked something up is to take immediate responsibility and then asking yourself "how can I/we fix this?" (you might need the help of other people to fix your mistake). After you've fixed the issue the question should be "how can we make so that this doesn't happen again?". That being solved I'd say nobody cares anymore whose fault was it to begin with, there's always other more important stuff to do.

I agree that maybe at larger companies this kind of thing might happen exactly the opposite way, i.e. you can get fired for making a mistake and nobody really cares to fix other people's stuff, because their next paycheck/financial well-being does not depend on that (or so they think).

_euac · on Feb 18, 2013

Exactly! When everyone is on the same boat the priority is fixing the stuff, then wondering whether attributing responsibility is important (most of the time it isn't. Who cares who fucked up the e-mail template, as long as it's fixed next time it runs.)

I think in this guy's case the causes for his reluctance (or rather, incapability) to accept his own mistakes had much deeper roots. He was literally the most self-centered person I've ever met, to the point that he wouldn't accept anyone's opinion on anything. He went out of his way to find doctors that'd go with his suggestions and run all kinds of tests on him to determine why he had a blood pressure problem, when he was clearly way overweight and had the unhealthiest diet I've ever seen. He'd dress up in shorts and t-shirts during the worst days of winter, and then take a niacin pill to force a capillary rush so his hands wouldn't feel cold (?)

Basically, he just thought the world had to bend to his will. Why use common sense, when you can just say "fuck it" and find a workaround that fits your mindset. Of course you can't expect someone like that to 'accept' his own shortcomings.

The scariest part of all this is he tried, for a while, to become a cop. Yep, imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.

lotyrin · on Feb 19, 2013

> imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.

Where are you from where that's something you have to imagine, because I want to move there.

artursapek · on Feb 18, 2013

pg has an essay where he says the smartest people he knows are always willing to take blame or admit they don't know the answer to a question.

vertis · on Feb 18, 2013

Sequoia Capital or Andreessen Horowitz

reddit_clone · on Feb 18, 2013

The user who replied incorrectly?

smackfu · on Feb 18, 2013

I don't know, it's a lot easier to be transparent when the stakes are so low. Most service providers have a real incentive to not put out quotes that can later be used against them, which tends to make explanations very technical or deflecting.

larrys · on Feb 18, 2013

"But then I decided to just fix it for him by doing some surgery in the repl."

I've always found it's a good idea to not deviate. Whether it be running, parking or anything else once you deviate from some regular behavior you run into potential problems that you hadn't anticipated.

"For some reason I didn't check the comments after the surgery to see if they were in the right place. "

More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery" because it would have either become habit or the shear number of times you tried a fix resulting in an error would have made that more likely to occur.

irahul · on Feb 18, 2013

> I've always found it's a good idea to not deviate.

Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

> More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery" because it would have either become habit or the shear number of times you tried a fix resulting in an error would have made that more likely to occur.

How about the opposite scenario? He has done it so many times with desired results, that he didn't bother checking?

badgar · on Feb 18, 2013

> Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

This is a big difference between engineering and hacking. An engineer would never regularly do something so dangerous.

But I suspect pg isn't an engineer when he works on HN, I suspect he is a hacker, and just does whatever he wants to, whenever he wants. Which is his prerogative.

prawks · on Feb 18, 2013

> I've always found it's a good idea to not deviate.

> you run into potential problems that you hadn't anticipated.

The second statement is no reason to live by the first. In fact, I think you'd be doing yourself a disservice by staying so comfortable. Being comfortable with the unanticipated, however, is a powerful quality to have.

rwallace · on Feb 19, 2013

That's fine, but try to become comfortable with the unanticipated on a test server, not the production server.

scotty79 · on Feb 18, 2013

If you don't deviate from what you usually do you don't learn.

Obviously don't deviate from routine (or rather prescribed procedure) when you are running nuclear power plant or airplane maintenance. But when tinkering with the site that gives you no money and won't cost any lives you can loosen up a bit.

tolmasky · on Feb 18, 2013

Why do "self posts" like this show up in the same light gray as posts with negative vote counts? My eyes aren't great and I find it hard to read

emillon · on Feb 18, 2013

The rationale for this is that if you need to post a long text post, it should be in the form of a blog post instead. I agree with you that it's not really adapted for a meta post.

hayksaakian · on Feb 18, 2013

Maybe post color is based on some get_text_post_color method that applies to self posts and comments, where the color depends on comment karma. Given that self posts like the OP are votable as if they were a normal link post, their comment karma value is probably 0.

roryokane · on Feb 18, 2013

I don’t know pg’s reasons for making self-posts light gray, but you can fix problems like that with the bookmarklet Zap Colors: https://www.squarefree.com/bookmarklets/zap.html

vl · on Feb 18, 2013

I'm using "Hacker News Enhancement Suite" Chrome extension - it fixes multiple problems, including this one.

irahul · on Feb 18, 2013

Disclaimer: Hindsight is 20/20, and stuff.

If reverting code didn't fix it, reverting server didn't fix it, incorrect data is the most likely culprit(I am not claiming this should have outright occurred to you; just thinking out loud). I take it you introduced non terminating recursion by making a thread its own parent, and you made the change on disk.

But this analysis is the last thing that comes to mind when you already have introduced 2 new variables the same day - new code, new server. And an old, recurring variable(GCing too much) is in play as well.

benatkin · on Feb 18, 2013

So what do you do to avoid this in the future? Do you stop doing surgery in the repl, or do you do the surgery with functions that check for cycles from now on?

ww520 · on Feb 18, 2013

This reminds me of the countless conversation I had with people after a crisis. What can we do to prevent this from happening again? What process can we put in place? What restriction needed to be tightened up?

And that's how processes are born.

tantalor · on Feb 18, 2013

> And that's how processes are born.

Not necessarily. Processes are implemented by people, so they can break at any time.

The correct solution is more code, or less bad code.

Someone · on Feb 18, 2013

And how do you get less bad code? Magic dust or process? My money would be on the latter, as in http://www.fastcompany.com/28121/they-write-right-stuff.

tantalor · on Feb 19, 2013

A newsfeed with ranking is a bit far off a space shuttle.

badgar · on Feb 18, 2013

> So what do you do to avoid this in the future?

It's HN... there's no SLA, there's no postmortems, there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.

larrys · on Feb 18, 2013

"pg just runs this site out of the good of his heart"

Don't think it's a "good of his heart situation". HN provides a benefit to YC and YC companies and attracts people to YC. As another example Fred Wilson has a very popular blog AVC and has said many times that he considers it "his secret weapon" (or something like that) because the value it provides over his competition.

jlgreco · on Feb 18, 2013

> there's no postmortems

1. Press [Home] key.

2. Read postmortem.

3. ???

benatkin · on Feb 18, 2013

> It's HN... there's no SLA, there's no postmortems

I didn't mean to imply that there were. I was just curious.

> there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.

Since there are multiple volunteers, I think that the site always feels important to at least one of them. I imagine that some of them have gone more than a week without giving a shit about HN, but not all of them at once. So I think there is doing things better in the future. In fact, HN keeps getting improvements behind the scenes, to keep it running, keep it interesting, and keep it from getting overrun with trolls.

oh_sigh · on Feb 18, 2013

> pg just runs this site out of the good of his heart

Hilarious. I would have believed you if you appended "and his wallet"

badgar · on Feb 18, 2013

One core. One HD. Bandwidth is trivial with no images. How much do you think this site costs to run?

jlgreco · on Feb 18, 2013

He means that PG runs the sight because it makes business sense, not out of charity.

oh_sigh · on Feb 18, 2013

You didn't understand what I said. I was implying that this site is a money-maker for pg, not that it costs him money.

gknoy · on Feb 18, 2013

His time. Maybe the core and HD are Very Nice ones, too, of course.

benatkin · on Feb 18, 2013

How much do you think the time of YC partners is worth?

martinced · on Feb 18, 2013

I don't think parent did mean that he had a "right" to expect some level of quality or anything.

It's just that we, as programmers, tend to take measures so that silly bugs do not happen anymore or that, at least, we leave big clues as to what went wrong.

In a project I had a similar issue: I was wrapping lists inside immutable lists but, due to a silly bug, I kept wrapping immutable lists inside immutable lists at every save made. So saved files would grow bigger and bigger.

And I did fix the bug and also added a big fat warning logs in case too many nested lists were detected.

pg might just as well have now added something preventing infinite recursion inside the comment tree or some WARN logging telling when a generate page is getting too big, etc.

I'd still find it very interesting to know what pg did, if any, to dodge / minimize / make it easier to determine if such an issue happens in the future.

Legion · on Feb 18, 2013

"We'll do it live!"

jbuzbee · on Feb 18, 2013

"Hey, Hold my beer and watch this!"

eytanlevit · on Feb 18, 2013

gruseom · on Feb 18, 2013

This is a particularly endearing piece of "hacker news". It's so easy to relate to.

lucb1e · on Feb 18, 2013

Are you saying you manually modify the database? Like, shifting around things by id instead of just making admin buttons next to posts?

sgt · on Feb 18, 2013

I think I get what you're hinting at.

Ok, so this is Hacker News, it's in the name, and most of us are aware that HN is also a research/hobby project. It's not made to be an rock-stable enterprise system doing bank transactions or what not, so I think what pg did was prefectly excusable. People make mistakes. Nobody will die without HN for a day or two, and it won't affect the site's popularity one bit.

lucb1e · on Feb 18, 2013

No that's right, but I worry about apache2 being down for potentially one or two users or bots that visit/crawl my website during a one-minute reboot. Meanwhile the big boys are down for 16 hours because they do things that any other person would have gotten a decent scolding for. Just look at the points per hour this thread is getting, if I had posted this about my website on my website people would have said I was stupid.

You are right though, making mistakes is human as they say, and nobody dies because of this. In fact, less popularity might be good for the site's content quality. I'm just surprised by how much they care about thousands of hourly users, that what I would dream of having.

RyanZAG · on Feb 18, 2013

I'd say this is a pretty important lesson: you don't need flawless technology and zero downtime to be popular and/or profitable. You need content worth viewing. People are more than willing to put up with technical errors if it's something they want/need.

Focus on providing people what they want/need, and don't worry so much about having flawless technology until you can employ a horde of PHDs.

artursapek · on Feb 18, 2013

I would say that's a observation you just made there.

chc · on Feb 18, 2013

Patrick McKenzie had a great horror story on his blog a couple of years back. He runs a service that provides appointment reminders to businesses' clients (e.g. "Don't forget, you have an appointment to get your hair colored at Best Little Hair House tomorrow at 3"). Long story short, an attempt to manually correct a hangup in the live system resulted in his product spamming his customers' clients (that's right — not just his customers, but their customers) with up to 40 phone calls back-to-back.

So, how many customers do you think he lost because of this? The answer is two, and one of them signed back up because they were impressed by the great job he did in handling the fiasco.

Moral of the story: As long as you really are making your best effort, you might be surprised how willing people are to deal with human error. Yes, they might be be mad, but a mistake is (usually) not the end of the world.

alex_c · on Feb 18, 2013

>Just look at the points per hour this thread is getting, if I had posted this about my website on my website people would have said I was stupid.

For what it's worth, I upvoted this thread specifically because we've all done something this stupid (or worse) :)

irahul · on Feb 18, 2013

HN runs on plain files. He wasn't modifying database, but calling functions(I believe) in the repl to change the parent id of the thread.

But that apart, even if there were an admin button to change the parent id of a thread, he would still have made the same mistake.

Unless the code in question was checking for loops. In that case, repl would have worked the same.

gummadi · on Feb 18, 2013

Glad that you bought this up. If you have more knowledge regarding this, can you please explain how exactly the posts & nested comments are stored directly using flat files. How are concurrency issues handled?

lucb1e · on Feb 18, 2013

I sort of meant that you shouldn't modify things like that directly. Be it a filesystem, database, or any other place that makes it possible to mess things up to bring a rather strong server down.

irahul · on Feb 18, 2013

I see where you are coming from. But I am saying this didn't happen because he did things live. This happened because he entered incorrect id making a thread its own parent(or grandparent; doesn't matter).

This is the kind of mistake one would make even if you were writing proper migrations. He was doing things live isn't an issue; neither is an incorrect id. The issue is the code doesn't check for loops.

chernevik · on Feb 18, 2013

I am but an egg, I have two questions.

One, if the data were held in a database, should a change like this be captured in the database logs? I am seeing more and more situations where I want these, I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor? But I suppose there is a slight performance hit in checking such constraints, and the case arises so rarely that this hit isn't generally worthwhile.

irahul · on Feb 18, 2013

> I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

I think it's more like your application is doing the logging already(probably; most of the frameworks do). If you really need it, turn it on yourself.

> Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor?

Copy pasting the table from another comment.

    create table post (id int primary_key, parent_id int references post(id), child_id int references post(id), created_at timestamp)

There isn't a simple check constraint you can place to ensure a parent's, or a grand-parent's, or a grand-grand-parent's parent_id isn't child.id You will have to write a trigger.

This isn't really a big problem to solve. pg simply overlooked this problem. Had he not, he would have checked child.created_at > parent.created_at in his mutator method. So, when you do a post.parent = some_post(assuming mutator is parent=; replace it with post.setParent or (send post set-parent some-post) or whatever), it checks if post.created_at > some_post.created_at, and then assigns post.parent_id = some_post.id

lmm · on Feb 18, 2013

Databases, at least the SQL kind, really aren't good at dealing with hierarchical data, and I don't know how you'd even begin to express that kind of constraint. I don't think a traditional database is the answer here. (If it were me, once I'd done it more than twice I'd write a "move thread" admin tool in the UI, and after I screwed it up like this I'd have a place to add such a check to).

mr_luc · on Feb 18, 2013

If you were using some kind of representation for Nested Sets -- left-to-right depth-first numbering, or a human-readable id.id.id chain -- then it's really easy to write a constraint for that: parent left < myleft, right > myright, or dotted_id.split('.').filter{|first, rest| return false if rest.contains first} (yeah, yeah, that second pseudocode would be unrealistically PITA for some DBs).

More generally:

I'm not a big SQL wonk anymore, but I find a lot of people have the intuition that relational databases are ill-suited for trees.

An intuition that is much closer to the truth is that almost all databases can handle trees pretty well, because there's still an unambiguous concept of ordering and containment, and you can usually arrange things so as to do range/ancestor/inclusion queries efficiently.

It's graphs with loops/without unambiguous concept of ordering/containment that are really hard.

EEGuy · on Feb 18, 2013

Found this: The excellent Postgres documentation includes an SQL graph search with two different ways of graph cycle checking, here: http://www.postgresql.org/docs/9.0/static/queries-with.html

One way involves accumulating an array of nodes already visited as the tree gets walked, checking each node as-visited for membership in the array-to-date.

The other method, a bit more of a hack, is just adding a LIMIT clause.

I think the 'WITH' clause is a great addition to the SQL standard, very much worth the learning the weirdness of its syntax and its optional 'RECURSIVE' term (which, as the Postgres documentation points out, isn't really recursion, it's iteration).

irahul · on Feb 18, 2013

I think if you want the family tree, you can write a self referential(assuming post table is self referential as it should be) recursive query.

But in this case, writing a before insert/update trigger which ensure some_post.created_at < parent.created_at before setting parent.parent_id = some_post will do the trick.

triplepoint217 · on Feb 18, 2013

Yes, I was trying to make posts editable on the HN instance I run, so I got clever and started messing with the files in emacs. Then I learned that the HN code does not like files in the story directory with ~ on the end of their name (emacs backup files), oops ;).

RKoutnik · on Feb 18, 2013

Sometimes that's easier (albeit more dangerous, as we just saw).

martinced · on Feb 18, 2013

"Are you saying you manually modify the database?"

Oh manually modifying production database on the fly ain't unheard of.

However it's still not "very Chuck Norris" on a scale of Chuck Norrisness compared to the modification of a running app directly in the REPL. I mean: it doesn't matter if you manually modify the DB itself or not when you directly modify the app from the REPL itself (the app being anyway "in charge" of the DB).

Sure, modifying manually the production DB might be an issue to some. But I can guarantee you that it's the last of your worries when you're actually modifying production code directly from the REPL ; )

birken · on Feb 18, 2013

Do you have munin monitoring on the production HN server?

That would really make situations like this easier to debug. First, it can pinpoint exactly when something started happening, which in this situation might have helped you realize the problem was caused by your change. Secondly, in this specific situation it probably would have been easier to differentiate a situation where you are running low on memory vs this completely different situation.

As somebody who spent a lot of time professionally debugging large software systems when they were misbehaving (as a Google SRE), I can tell you that looking at graphs of many key metrics (disk IO, CPU, memory, then application specific things) was always the place to start when debugging a situation, because you can learn so many things right away. When did it start? Was it a slow buildup or an immediate thing? What is the general problem (Memory?, Disk IO?, CPU?, none of the above?)? Has a similar pattern happened in the past?

Then you can start to get fancy and plot things like "messages/minute" or something and then it becomes easy to see when issues are affecting the site performance and when they aren't.

stcredzero · on Feb 18, 2013

That and something like the Smalltalk Change Log would have made this a no-brainer debug. (Yes, every REPL action in Smalltalk got logged by the same mechanism that logged every code change.) Such mechanisms aren't trivial, but they're not rocket science either, and they have tremendous ROI.

znowi · on Feb 18, 2013

I wonder what exactly did distract you :) When I do surgery on a production server, I triple-check making sure everything works properly.

I have two assumptions: 1. HN has a low priority in the overall scheme of things, 2. Self-confidence overflow :)

coderintherye · on Feb 18, 2013

Happens to a lot of us. Great reason to always write tested cleanup scripts for this stuff instead of editing directly on the server. The only time I brought down my product last year was from a similar screwup, I was removing users by hand and somehow managed to end up with a 0 in my list of user ids, thus deleting the anonymous user, and causing havoc to my server, which took a long time to track down.

dap · on Feb 18, 2013

Thanks for the detailed explanation.

It sounds like everything was done to fix the problem except try to figure out what the problem actually was. Why not use tools to see what the program is doing, form a hypothesis, gather data to confirm or reject the hypothesis, repeat until cause found, and then take corrective action that by this point you have high confidence will work?

I realize HN is more of a side project than a production service, but the goal is the same in both cases: to restore service quickly so you can move on to other things. It feels like a more rigorous approach would allow restoring service much faster than randomly guessing about what could be wrong and applying (costly) corrective action to see if it helps.

Besides that, in many cases (including this one), you cannot randomly guess the appropriate corrective action without finding the root cause.

luser001 · on Feb 18, 2013

I use assertions to protect against things like this.

I liberally sprinkle my code with assertions (CS theory calls them pre-conditions and post-conditions, iirc) to crash early if the system is an invalid state.

One my pet peeves is that few programmers seem to love assertions like I do. Would love to see to comments on this.

timothya · on Feb 18, 2013

What assertion would you have used in this case? For every comment you'd have to iterate through all it's parents to check if there is a cycle, which seems pretty inefficient to do for something that should never happen (there are other ways that you could check for this problem as you go, but the only other ways that I can think of require holding extra state just in order to perform the assertion).

I'm for assertions when they are simple and don't cost much (especially during development), but it's not feasible to check every condition that should not happen.

petercooper · on Feb 18, 2013

You could assert a limit on depth, perhaps. Then the cycle would still exist but after X number of comments, the rendering ends.

timothya · on Feb 18, 2013

This is a reasonable solution. While it will (almost) never provide the correct result (it might print out a cycle of comments until X is reached, or it might cut off a very long but legitimate comment thread), it would provide a reasonable guarantee on this sort of problem not generating infinite pages.

petercooper · on Feb 18, 2013

At the risk of being accused of flame-baiting, I'd say it's the engineering solution rather than the mathematical one.. ;-)

For some reason I tend to be a fan of the "stick it in a secure box" rather than "get it right in the first place" approach..

zzzeek · on Feb 18, 2013

typically, if you're operating upon a particular comment, you've gotten there by traversing to it from the parent. Ensuring that traversals don't encounter cycles is easy, keep hold of a hashtable (or a set) of comment ids as you traverse. As the traversal encounters a comment, its id is added to the hash, and as you complete traversal of each comment, the id is removed. If you encounter an id that's already in the set, assertion failed - or better yet, log the condition and then cease the traversal. That way everything keeps running and the error is visible in the logs.

If the code is organized (as it should be) such that all functions which require traversal of hierarchical comments pull this from a single function, then the hash check only need be applied in that one place in the code, where it need not be visible anywhere else.

badgar · on Feb 18, 2013

>iterate through all it's parents to check if there is a cycle,which seems pretty inefficient to do for something that should never happen

The number of parents is almost always under 3 or 4 and never over 100. Writes occur a few times a second at peak. You are prematurely optimizing.

swah · on Feb 18, 2013

The kind of assertion he needed though, could only be ensured by the database, not application code (my impression).

luser001 · on Feb 18, 2013

Agreed, infinite loops are a little hard to protect using asserts.

When I hit the first infinite loop bug on a code path, I frequently add code to assert that the number of calls is less than $A_LARGE_NUMBER to catch future occurrences of the same root cause.

mpweiher · on Feb 18, 2013

I dimly remember a language that just hard-limited loops. I thought it was John Pane's HANDS system, but I can't seem to find a reference in the thesis...can anybody refresh my memory?

http://www.cs.cmu.edu/~pane/research.html

http://www.cs.cmu.edu/~pane/thesis/

Pretty cool work regardless, I really like the way it deals with aggregates, for example.

swah · on Feb 18, 2013

This is similar to the "while with timeout" that is common in embedded code (of course, watchdogs are better...)

badgar · on Feb 18, 2013

> Agreed, infinite loops are a little hard to protect using asserts.

    assert(is_tree(comment_graph))

Typically, a composite entity (like an "item" on HN which has many "comments") will define invariants to ensure data integrity. In this case, the invariant is that an "item"'s comments form a tree.

The database layer often contains this logic, but it depends on how you're building your application; NoSQL backends for example typically must put validation in the application layer. Since HN just uses files, a well-developed application layer should be riddled with invariants like this.

irahul · on Feb 18, 2013

The kind of assertion he needed could not be ensured by the database. The kind of assertion he needed was there are no cycles in the graph. How would you ensure that in a database?

Also, HN uses flat files, not database.

Someone · on Feb 18, 2013

A constraint on the parent-child link table "Child creation time stamp > Parent creation timestamp" would do it.

Might not be a bad idea, if the site were to have the two requirements "maintenance must be done on the live site from a repl" and "5 nines availability".

irahul · on Feb 18, 2013

How are you modelling your data? I think this should be a self reference.

    create table post (id int primary_key, parent_id int references post(id), child_id int references post(id), created_at timestamp)

How will you place the check constraint? You only have parent_id and child_id, not parent and child entities. You will have to write a trigger.

I am not saying this can't or shouldn't be done. I am saying a db won't directly solve it.

However, you example will work perfectly for enforcing constraints in the code via the mutator which can compare child and parent timestamps, provided pg was doing it via a mutator, and not directly changing the ids.

Someone · on Feb 18, 2013

Following http://stackoverflow.com/questions/3438066/check-constraint-..., and assuming that ID's get doled out in increasing order:

    create table post (
      id int primary_key,
      parent_id int references post(id),
      CONSTRAINT foo CHECK (id > parent_id)
    );

swah · on Feb 18, 2013

I was thinking something like (supposing the comments were stored as "closure tables" like Karwin suggests):

  CREATE TABLE comment_tree (
   ascestor_id REFERENCES comments(id) NOT NULL,
   descendant_id REFERENCES comments(id) NOT NULL,
   CHECK ( ascestor_id <> descendant_id )
  )

but I'm probably overlooking something. (I'm aware that HN uses flat files, I was just making a counter-point to the "simple assert" solution...)

irahul · on Feb 18, 2013

That will prevent a child being its own parent. It won't work for more than one level i.e a post being its own grandchild. Assume (post_id, parent_id) sequence: (1, 3) -> (2, 1) -> (3, 2).

zzzeek · on Feb 18, 2013

you can assert that "post_id > parent_id", assuming comments are always created subsequent to the creation of their parents (as is the case here) and that integer identifiers are always increasing (otherwise use timestamps). (1, 3) above would indicate an invalid case (not necessarily a cycle, but a precondition for one).

swah · on Feb 18, 2013

Please note that the "Closure Table solution involves storing all paths through the tree, not just those with a direct parent-child relationship."