Knightmare: A DevOps Cautionary Tale (2014) (dougseven.com)
207 points by strzalek on Feb 4, 2015 | 85 comments



I once shut down an algorithmic trading server by hastily typing (in bash):

- Ctrl-r for reverse-search through history

- typing 'ps' to find the process status utility (of course)

- pressing Enter... and realizing that Ctrl-r had actually found 'stopserver.sh' in history instead. (There's a 'ps' inside stoPServer.sh.)

I got a call from the head Sales Trader within 5 seconds asking why his GUI showed that all Orders were paused. Luckily, our recovery code was robust and I could restart the server and resume trading in half a minute or so.

That's $250 million to $400 million of orders on pause for half a minute. Not to mention my heartbeat.

Renamed stopserver.sh to stop_server.sh after that incident :|

P.S. typing speed is not merely overrated, but dangerous in some contexts. Haste makes waste.


Might be better having something like the following at the top of your script:

    echo "This will STOP THE SERVER. Are you sure you want to  do this?"
    echo "Type 'yes' to continue:"
    read response

    if [[ $response != "yes" ]]
    then
      echo "You must type 'yes' to continue. Aborting."
      exit 1
    fi

    echo "Stopping server ..."
It barely takes any time to type 'yes' but it makes you stop and think.


That'll make it impossible to use in scripts. This reminds me of the time I tried to fix my "rm -r * .o" with a CLI trash system, instead of doing proper backups.

Might be better to start reviewing EVERY command one sends to important servers, and testing them if viable. That's probably the line that vijucat took... and it's the line that everybody ends up taking; the only thing that changes is the number of accidents needed.


Then you add an optional CLI argument that makes it skip the prompt, and use that version in scripts.


As 'euid' said, you can just add a check. If you don't want to add proper option checking (because this is the only option) you can do something like:

    response=$1

    echo "This will STOP THE SERVER. Are you sure you want to do this?"
    echo "Type 'yes' to continue:"
    [[ $response ]] || read response

    if [[ $response != "yes" ]]
    then
      echo "You must type 'yes' to continue. Aborting."
      exit 1
    fi

    echo "Stopping server ..."
Which will allow you to run the script with "./scriptname.sh yes" to bypass the check.

EDIT: Though, as you say, it is much better not to be running anything like this on a live server anyway. Any change should be part of a deployment procedure that is carefully checked and tested, as well as having a rollback procedure in the event of something horrible happening.

Of course, in the real world, you sometimes don't have that luxury and you just have to hope and pray :-)


Yes, this solution would have worked for our situation.


Am I wrong to assume that you weren't using an init script of some sort because this stopped multiple services, perhaps in a specific order?


Repercussions?


Luckily, and just that one time, there was nothing but disapproving glares.


I can't read something like this without feeling really bad for everyone involved and taking a quick mental inventory of things I've screwed up in the past or potentially might in the future. Pressing the enter key on anything that affects a big-dollar production system is (and should be) slightly terrifying.


I have the same fear, but I wonder if that fear stems from a lack of training and/or documentation and/or time.

When I ask myself why I am afraid of deploying to production servers, it's always because I don't fully understand what the deployment process does. If I had to deploy manually I would be lost at sea.

Is it just naivete to think that enough documentation and training makes that fear go away?


It is, ultimately, a management problem.

Computer systems are rarely useful on their own. They need to be attached to some other process in order to derive any value from them. Sometimes it's a network, sometimes it's a factory, sometimes it's just a process run by humans.

In most domains that you attach computers to, it's possible for one person to understand both the computer side of it and the other side of it. Web development is like this: you can easily understand both the web and computer systems, as one is really just another version of the other.

If you're attaching the computer system to another very hard-to-understand system, like, say, a surgical robot, then the person that understands both domains well enough to avoid problems like this is a unicorn, there might well be only one person in the world that's built up expertise in both fields.

To get around this you need careful, effective management of the two pools of labor, and robust dialogue between the two teams. The second this starts breaking down is the second you start marching down the road to disaster.

In the aforementioned example, everyone would be well-aware of the failure modes. So it's a bit easier to manage. In finance, failure modes can be so subtle, particularly as the system grows more complex, that they can escape detection by both teams unless they're both checking each other's work and keeping each other honest.

Institutional knowledge transfer has to constantly be happening, areas of ignorance on both sides have to constantly be appraised and plans undertaken to reduce said ignorance. The more everyone knows about the entire system, the more likely it will be that critical defects like this can be discovered before they strike.

This kind of effective interaction of those at the bottom can only be organized and directed at the top. It's very much a "captains winning the war"-type situation, but captains can't lead without support from the generals.


Do you have any advice with regards to institutional knowledge transfer, or have you seen any examples of when this was done exceptionally well?

Knowledge transfer problems have been a running theme at many of my previous workplaces. I'm interested in what I can do to help; I'm a documentation proponent, but there must be more.


I've always thought that it's beneficial to wear multiple hats and sit with multiple teams/people either throughout your employment or at the beginning. I think the more hands on you are with every aspect of a business the less likely you are to insulate yourself or create silos and barriers.


It's a really hard problem. From a personal standpoint as a coder, the problem with documentation is that it has to be maintained the same as the rest of the application it's documenting. That's why you see the push towards self-documenting code in the Ruby world, where you can just look at a code file and know exactly what it's doing because of convention. Every tool you add on to your workload doesn't just impose an initial dev cost, but also an ongoing maintenance cost.

When you're also dealing with humans, you have to pay a management cost too. Say you have some documentation: where are you going to put it? Is there a repository somewhere where you could put it, where anyone who wants to work on the program later will know to look? Often there isn't. So you have to make one. It will need the use of company resources. You need to educate people that there's this place for documentation and everyone needs to use it, because it won't make any sense for a documentation repository to have just your own documentation in it.

The size and scope of what it takes to be effective at this makes it a management problem, resources need to be allocated, and directions have to be given. Someone has to drive the project, to make it happen above and beyond his job duties.

Most of the time this happens when something big fails. Today, we discovered that an important scheduled email hadn't gone out for the last eight weeks. The stakeholders did not notice that the email was not hitting their desks every MWF as usual. It only got fixed because another, less important process, which did have someone on the ball enough to notice it was failing, generated a complaint; once that was fixed, the important email started coming in again, prompting a giant WTF. My reaction is a giant shrug: if it's important to you, you need to be monitoring it. I'm not omniscient. That pushes the responsibility back onto management.

So the knowledge transfer in this situation goes as follows. First, I need to know which business processes are important; "all of them" is not an acceptable answer. Second, other teams need to be aware that when systems are automated, they're not really being monitored; that's what automation means. Whoever is in charge needs to delegate a human to do it manually. That person can't be me; my time is too valuable for that shit.

Eventually, I can build a system for getting the kind of feedback that wasn't built into the system in the first place, maybe some kind of job verification system that, after enough tweaking, makes the system as a whole more reliable. But that still won't remove the necessity for some human to have the job at the end point of the system to ensure that the information is flowing on time and on target. No matter how robust the system is, there will still be silent failure modes that can go for months or years unnoticed.
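For what it's worth, a minimal sketch of that kind of job-verification idea, assuming each job writes a heartbeat file when it completes successfully (the directory, job names, and intervals here are all hypothetical):

    import json
    import time
    from pathlib import Path

    # Hypothetical heartbeat directory: each job writes <name>.json with a
    # "finished_at" timestamp when it completes successfully.
    HEARTBEAT_DIR = Path("/var/run/job-heartbeats")

    # How long each job is allowed to go without a successful run (seconds).
    EXPECTED_MAX_AGE = {
        "important-mwf-email": 3 * 24 * 3600,   # runs Mon/Wed/Fri
        "nightly-report": 24 * 3600,
    }

    def overdue_jobs(now=None):
        """Return (job, age_in_seconds_or_None) for every job that is overdue."""
        now = now or time.time()
        late = []
        for job, max_age in EXPECTED_MAX_AGE.items():
            beat = HEARTBEAT_DIR / f"{job}.json"
            if not beat.exists():
                late.append((job, None))   # never ran at all
                continue
            age = now - json.loads(beat.read_text())["finished_at"]
            if age > max_age:
                late.append((job, age))
        return late

    if __name__ == "__main__":
        for job, age in overdue_jobs():
            last = "never" if age is None else f"{age / 3600:.0f}h ago"
            print(f"ALERT: {job} last completed: {last}")

The point being that the alert comes from a separate process that knows what "on time" means, rather than from the job that has already silently died.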


Wow, thanks for your thoughtful response; you've helped me realize I already have it rather good by comparison. E.g., we already have a designated One Source Of Truth for documentation.

Having been a coder, I do understand that documentation is just something else to maintain. And this is a sentiment I hear from many people in development. "Read the code" is a common response.

As someone who shares part of the on-call rotation, though, I've grown to see it differently. While that documentation is extra overhead to maintain, having it ready will save me from having to call you at 3 AM when your project breaks in production. If your documentation was written without consideration for on-call, I have no choice, and you're getting woken up at 3 AM. I've found that, when cast in this light, I have yet to meet a developer who wasn't eager to avoid that early-morning phone call.

Thank you for your comment, it's good to know the color of the grass on the other side of the fence sometimes :)


Every time I read this story, there is one question I've never understood: why couldn't they just shut down the servers themselves? There ought to be some mechanism to do that. I mean, $400 million is a lot of money to not just bash the server with a hammer. It seems like they realized the issue early on and were debugging for at least part of the 45 minutes. I know they might not have had physical access to the server, but wouldn't there be any way to do a hard reboot?


Violently shutting down trading also exposes the bank to significant risk, eg. massively leveraged trades on stocks that were meant to be held for 3 milliseconds suddenly hanging on until the system's back up.

In hindsight, this would still have been preferable to losing $400 million, but quite obviously nobody at the time realized just how catastrophic this was going to be.


With the benefit of hindsight, I think a company-wide realtime dashboard would be a high priority. I guess actual deployment procedures would be higher, though.


If the system was not counting its trades, would they show up in the dashboard?

A reconciliation would be necessary, coming from wherever the orders were being sent (the exchange). But with millisecond delays, a realtime dashboard seems necessary only for a case like this (not that it is a bad case). End-of-day reconciliations are needed, but I'd be interested if anyone knows of exchange requirements for intra-day trading reconciliations.


They had a wire-protocol they were incrementally updating without a version-byte.

That is caused by bad engineering, and they needed to do less of that.

"Realtime dashboard" and "deployment procedures" are more engineering, not less.


Given the types of transactions Knight was involved in, it's unlikely they had physical access as the servers would be in lower Manhattan to keep latency down. Couple that with the lack of any established procedures to kill their systems, and it's one hell of a nightmare. If the idea of just pulling the plug even came to them, I can't imagine how well that phone call would be received by the datacenter techs even if they believed it wasn't a hoax. And that's assuming that pulling the plug wouldn't cause other problems.

But really, the 45 minutes probably flew by faster than you or I could really imagine. You're in a crisis situation, you tell yourself you just need another five minutes to fix something. Five becomes ten, becomes twenty, and before you know it, your company is looking at a $400M nightmare.


In the story he points out that there was no kill switch.

And, as has been found in other disasters in other industries, kill switches are hard to test.


Long ago in a previous life, I worked in a factory that made PVC products, including plastic house siding. One of my co-workers got his arm caught in a pinch roller while trying to start a siding line by himself. There was a kill switch on the pinch roller - six feet away and to his left, when his left arm was the one that was caught. Broke every bone in his arm, right up to his collarbone.

He screamed for help, but no one could hear him over the other noisy machinery. Welcome to the land of kill switches.


Yikes. That reminds me of The Machinist.


It feels more and more like the only responsible way to engineer systems is with a built-in always-on-in-production chaos monkey, to always be killing various parts of them. Normally this is done to ensure that random component failure results in no visible service interruption, but in this situation, you'd also be able to reuse the same "apoptosis" signal the chaos monkey sends to just kill everything at once.


Everything should be written crash-only[1]. That way they don't have to worry about pulling the plug at any time.

http://en.wikipedia.org/wiki/Crash-only_software


Crash-only is nice and all but you can't crash the other side of a socket...

Like you couldn't crash a steel mill controller and expect the process equipment to be magically free of solidified metal. It only means the servers will come back up with a consistent state.


"Crash-only engineering" is a method of systems engineering, not device engineering; it only works if you get to design both sides of the socket.

In the case of a system that needs hard-realtime input once it gets going (like milling equipment), the "crash-only" suggestion would be for it to have a watchdog timer to detect disconnections, and automatically switch from a "do what the socket says" state to a "safe auto-clean and shutdown" state.

In other words, crash-only systems act in concert to push the consequences of failure away from the site of the failure (the server) and back to whoever requested the invalid operation be done (the client.) If the milling controller crashes, the result would be a mess of waste metal ejected from the temporarily-locked-up-and-ignoring-commands process equipment. The equipment would be fine; the output product (and the work area, and maybe the operators if they hadn't been trained for the failure case) would not be.
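A rough sketch of that watchdog idea, assuming the controller receives periodic commands over a socket (the timeout value and the two helper functions are made up):

    import socket

    HEARTBEAT_TIMEOUT = 0.5  # seconds of silence before we assume the client is gone

    def run_controller(conn: socket.socket):
        conn.settimeout(HEARTBEAT_TIMEOUT)
        try:
            while True:
                try:
                    command = conn.recv(1024)
                except socket.timeout:
                    break                  # client went silent: leave the command loop
                if not command:
                    break                  # client closed the connection
                apply_command(command)     # "do what the socket says" state
        finally:
            safe_shutdown()                # "safe auto-clean and shutdown" state

    def apply_command(command: bytes):
        ...  # drive the process equipment (hypothetical)

    def safe_shutdown():
        ...  # purge material, park the tooling, power down (hypothetical)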


At first I thought you had written "rocket" instead of "socket", which would also make much sense.


And as a general rule of thumb, what the other end of the rocket should be doing is pointing down.


A bit trivial, but actually a rocket flies sideways, not straight up.

If you go straight up you just fall back down - you need to go into orbit, which means flying sideways.


Nah. Sometimes you want it to crash too.


The story also points out that there were no emergency procedures. While not as instantaneous as a kill switch, known good procedures could have significantly reduced the final effect.


As a designated market maker, they probably face regulatory requirements that force them to be in the market. Granted, if they'd known the full extent of the damage, they almost certainly would've pulled the plug. But I'm guessing that was a factor.


Hedging.


While articles like this are very interesting for explaining the technical side of things, I am always left wondering about the organizational/managerial side. Had anyone at Knight Capital Group argued for the need for an automated and verifiable deployment process? If so, why were their concerns ignored? Was it seen as a worthless expenditure of resources? Given how common automated deployment is, I think it would be unlikely that none of the engineers involved ever recommended moving to a more automated system.

I encountered something like this about a year ago at work. We were deploying an extremely large new system to replace a legacy one. The portion of the system which I work on required a great deal of DBA involvement for deployment. We, of course, practiced the deployment. We ran it more than 20 times against multiple different non-production environments. Not once in any of those attempts was the DBA portion of the deployment completed without error. There were around 130 steps involved and some of them would always get skipped.

We also had the issue that the production environment contained some significant differences from the non-production environments (over the past decade we had, for example, delivered software fixes/enhancements which required database columns to be dropped; this was done on the non-production systems, but not on production, because dropping the columns would take a great deal of time). Others and I tried to raise concerns about this, but in the end we were left to simply expect to do cleanup after problems were encountered. Luckily we were able to do the cleanup, and the errors (of which there were a few) were fixed in a timely manner. We also benefited from other portions of the system having more severe issues, giving us some cover while we fixed up the new system.

The result, however, could have been very bad. And since it wasn't, management is growing increasingly enamored with the idea of by-the-seat-of-your-pants development, hotfixes, etc. When it eventually bites us, as I expect it will, I fear that no one will realize it was these practices that put us in danger.


You should read Charles Perrow's "Normal Accidents" and all will be revealed. This is hardly a new problem.

http://www.amazon.com/Normal-Accidents-Living-High-Risk-Tech...


The post is quite poor and suffers a lot from hindsight bias. The following article is much better: http://www.kitchensoap.com/2013/10/29/counterfactuals-knight...


Great link. Thanks for posting.


If you fill the basement with oily rags for ten years, when the building goes up in flames, is it the fault of the guy who lit a cigarette?


In the case of wildfires, why yes, yes indeed we do.


The conditions for wildfire do not arise out of neglect and laziness.


The conditions for risk from wildfire do arise out of a multitude of factors, including discounting hazards, poor construction, insufficient warning or evacuation capabilities, and more.

But more to the point: the conditions in which wildfires are common involve such insanely high risk of conflagration that the least spark can set them off. And do.

When you've got to tell people not to mow lawns or trim brush for fear of stray sparks setting off tinder-dry brush, you're simply sitting on top of (or in the midst of) a bomb. And that is the reality in much of the world -- Australia, the western US, Greece, and elsewhere. The spark can be a hot car exhaust from a parked vehicle, or broken glass, or, yes, flame in the form of cigarettes, a campfire, or a barbecue.

But except for the case of deliberate arson (which does happen), I find the practice of convicting people for what's essentially a mistake waiting to happen to be quite distasteful.


Did the article specify that some one person was at fault?


This is a story about the importance of DevOps, i.e., this company went bankrupt because of poor DevOps practices.


Repurposing a flag should be spread over two deployments. First remove the code using the old flag, then verify, then introduce code reusing the flag.

Even if the deployment was done correctly, during the deployment there would be old and new code in the system.


I used to work in HFT, and what I don't understand is why there were no risk controls. The way we did it was to have explicit shutdown/pause rules (pause meaning that the strategy will only try to get flat).

The rules were things like:

- Too many trades in one direction (AKA a big position)

- P/L down by X over Y

- P/L up by X over Y

- Orders way off the current price

Whenever there was a shutdown/pause, a human/trader would need to assess the situation and decide whether to continue or not.
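Rules like the ones above can be as simple as a periodic check before quoting; a rough sketch (all names and thresholds hypothetical):

    from dataclasses import dataclass

    @dataclass
    class RiskLimits:
        max_position: int        # too many trades in one direction (big position)
        max_drawdown: float      # P/L down by more than X over window Y
        max_runup: float         # P/L up by more than X (also suspicious)
        max_price_offset: float  # orders way off the current price (as a fraction)

    def should_pause(position: int, pnl_over_window: float,
                     order_price: float, market_price: float,
                     limits: RiskLimits) -> bool:
        """True if the strategy should pause, i.e. only try to get flat."""
        if abs(position) > limits.max_position:
            return True
        if pnl_over_window < -limits.max_drawdown or pnl_over_window > limits.max_runup:
            return True
        if abs(order_price - market_price) / market_price > limits.max_price_offset:
            return True
        return False

Tripping any of these only pauses the strategy; as described above, a human still decides whether to resume.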


To add insult to injury, Knight got fined for not having appropriate risk controls after this incident.


I remember reading a summary of this when it occurred in 2012. It's obvious to everyone here what SHOULD have been done, and I find this pretty surprising in the finance sector..

Also your submission should probably have (2014) in the title.


"I find this pretty surprising in the finance sector.."

...not to anyone who has worked in financial software


The art of rehashing old stories for exposure. Without mentioning where you got your material of course.


https://news.ycombinator.com/item?id=4333089

I believe it was this, as it was a Guardian article - which has since been consigned to the digital graveyard.


Why would they repurpose an old flag at all? That seems crazy to me unless it was something hardware bound.


To keep the messages as short as possible, to reduce the time-costs of transmitting and processing them. It's HFT, they do things like that.


Which is why I don't feel bad at all. Live by the sword, die by the sword.


I was honestly surprised the NYSE/NASDAQ didn't step in and reverse the trades. They've done so in the past when automated trading systems went off the rails. I am glad they didn't do so, as such favoritism is completely unfair and sends a terrible message to HFT companies, but it was still surprising.


Okay, yeah, if the flag is in the message, not just an internal flag, that makes MUCH more sense.


I assume "flag" in this context means something akin to a command-line flag.


I assumed it was a single bit in a string of them that describes a message.


You're correct. Single bits representing boolean values are often called "flags".


Except it's probably a byte or less.


For highly efficient messages I'll bet they did bit packing, so it was probably only a single bit, which is why they'd want to reuse it rather than completely change the message format.
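For anyone who hasn't worked at that level, bit-packed flags look roughly like this (the flag names are invented):

    # Each option is one bit in a single flags byte, so the message stays tiny.
    FLAG_IMMEDIATE_OR_CANCEL = 1 << 0
    FLAG_SHORT_SALE          = 1 << 1
    FLAG_POWER_PEG           = 1 << 2   # "retired" bit that later got a new meaning

    def pack_flags(ioc: bool = False, short_sale: bool = False,
                   power_peg: bool = False) -> int:
        flags = 0
        if ioc:
            flags |= FLAG_IMMEDIATE_OR_CANCEL
        if short_sale:
            flags |= FLAG_SHORT_SALE
        if power_peg:
            flags |= FLAG_POWER_PEG
        return flags

    def is_set(flags: int, bit: int) -> bool:
        return bool(flags & bit)

Reusing bit 2 keeps the message format unchanged, but any server still running the old code will happily interpret the new meaning as the old one.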


Okay, I can see that if they're binary messages and a single bit is no longer being used.


I've seen it done on a regular basis at a large financial institution. Like, so often that almost half the fields now have a different meaning or a dual purpose on some old systems.


It's nice to see a more detailed technical explanation of this. I've used the story of Knight Capital as part of my pitch for my own startup, which addresses (among other things) consistency between server configurations.

This isn't just a deployment problem. It's a monitoring problem. What mechanism did they have to tell if the servers were out of sync? Manual review is the recommended approach. Seriously? You're going to trust human eyeballs for the thousands of different configuration parameters?

Have computers do what computers do well - like compare complex system configurations to find things that are out of sync. Have humans do what humans do well - deciding what to do when things don't look right.
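A toy version of that kind of automated comparison, assuming you can dump each server's configuration as key/value pairs (the server names and keys just echo the article's account of an eight-server deployment):

    def config_drift(servers: dict[str, dict[str, str]]) -> list[str]:
        """List every key whose value differs across servers (or is missing on some)."""
        all_keys = {key for cfg in servers.values() for key in cfg}
        drift = []
        for key in sorted(all_keys):
            values = {name: cfg.get(key, "<missing>") for name, cfg in servers.items()}
            if len(set(values.values())) > 1:
                drift.append(f"{key}: {values}")
        return drift

    # e.g. seven servers got the new deployment, one silently didn't:
    servers = {f"server-{i}": {"code_version": "new", "power_peg": "disabled"}
               for i in range(1, 8)}
    servers["server-8"] = {"code_version": "old", "power_peg": "enabled"}
    print("\n".join(config_drift(servers)))

The humans only get involved once something like this prints a non-empty report.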


Somebody was on the other side of all those trades and they made a lot of money that day. That's finance. No money is destroyed, no physical damage gets done, and somebody on the other side of the poker table gets all the money somebody else lost.


It's probably weirder than that. As soon as their system "lost containment", all of that money probably just diffused into the market like smoke amid hundreds of thousands of trades. There's no "big winner" on the other side of the table to grumble at.

HFT systems aren't like a bunch of poker players slyly eying one another across a table, they're more like electro-chemical gold miners concentrating gold from the rivers of cyanide sludge pumping up from the financial system mines. Heaven help them if they depolarize.


Also - if they had accidentally made a $450m profit, none of this would ever be public.


This must be an old wives' tale. I live in Chicago, and a trading firm on the floor beneath us went bankrupt, at roughly the same time, with a similar "repurposed bit" story.

Maybe it's the same one .....


Knight Capital. It was a big deal. Not an old wives' tale. http://www.nanex.net/aqck2/3522.html


He was joking.


In that case, "whoosh".


Ah yes, this story is legendary. I discuss it in my JavaScript Application Design book[1]. Chaos-monkey server-ball-wrecking sounds like a reasonable way to mitigate this kind of issue (along with sane development/deployment processes, obviously).

[1]: http://bevacqua.io/bf


Wasn't Knight in trouble for some other things as well?


"Power Peg"? More like powder keg.


What really looks broken to me in this story is the financial system. It has become a completely artificial and lunatic system that has almost nothing to do with the real - goods and services producing - economy.


Trickle-down economics = financialization of capital. Those who influence the policy are beneficiaries of the financialization of capital: think of grads from elite law schools and b-schools!! If there were no financialization of capital, these grads from elite law/b-schools wouldn't get fat bonuses.


As usual in catastrophic failures, a series of bad decisions had to occur:

- They had dead code in the system

- They repurposed a flag that had been used for previous functionality

- They (apparently) didn't have code reviews

- They didn't have a staging environment

- They didn't have a tested deployment process

- They didn't have a contingency plan to revert the deploy

It could have been minimized or avoided altogether by fixing just one of these points. Incredible.


> They (apparently) didn't have code reviews

I don't get that. There was no code issue. The old and new code both worked as intended, it was a deployment and deployment-verification problem.

> They didn't have a staging environment

Yes they did. They staged the new code and tested it. They did a slow deployment also.

> They didn't have a contingency plan to revert the deploy

They did revert the deploy within the 45 minutes. It made it worse.

I think you need to re-read the article. Your assessment is strange given the event.


> I don't get that. There was no code issue. The old and new code both worked as intended, it was a deployment and deployment-verification problem.

A code review could raise the issue of repurposing a flag in case they had to revert the deploy. Changing the semantics of a flag is a big no-no anyway, and there are ways to guard against that.

> Yes they did. They staged the new code and tested it. They did a slow deployment also.

But they didn't have a staging environment that matched their live environment, apparently. You want a staging environment that is 1:1.

> They did revert the deploy within the 45 minutes. It made it worse.

If you think reverting a deploy by simply pushing an older version is the same as a contingency plan, think again.


Code review could have been another set of eyes to predict the problem of reusing a flag.


If the message was as compact and low level as possible it was probably a bit flag, so in that context it makes sense to repurpose it.

Being so removed from binary and bit level interactions it can be easy to forget things like this.


I agree with the GP; I don't think code reviews or testing was the problem.

I think the best practice they violated is that they deprecated and repurposed a flag within a single release cycle. That sort of activity should take two release cycles at least: one to remove the old functionality and one to add the new functionality.
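A sketch of what that two-cycle discipline can look like in the message-handling code (the flag and routing names are hypothetical, loosely after the article's account):

    FLAG_POWER_PEG = 1 << 2   # legacy bit

    # Release N: the old functionality is gone, and the bit is explicitly dead.
    def handle_order_release_n(flags: int, order) -> None:
        if flags & FLAG_POWER_PEG:
            raise ValueError("retired flag set; refusing order")   # fail loudly
        route_normally(order)

    # Release N+1, only after every server is confirmed on release N: reuse the bit.
    FLAG_RLP = FLAG_POWER_PEG

    def handle_order_release_n_plus_1(flags: int, order) -> None:
        if flags & FLAG_RLP:
            route_to_rlp(order)
        else:
            route_normally(order)

    def route_normally(order): ...   # hypothetical
    def route_to_rlp(order): ...     # hypothetical

If release N is ever only partially deployed, the worst case is rejected orders, not years-old dead code waking up.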


And if you do it all well, you are paid the average dev salary.

The value of a good dev is realised only when someone screws up.




