Ask HN: What's the worst thing your code has done?

gozur88 · on July 18, 2017

I wrote some code to manage kits in a warehouse. Like, a customer would order a kit that required A, B, and C. Then the picker would get sent to those three locations and put it all in a box and onto the conveyor belt for shipping.

The problem was the warehouse owner wanted partial kitting. So if the warehouse only had (in our example) A and B, the code would send the picker to put A and B into a box, then direct him to drop it off in a special partial kit area. When C was back in stock, the system would have the workers fill out the partial kits and ship them. This way if a kit required a dozen items and you were just waiting for one to arrive, you could get most of the work done beforehand.

The problem was now A and B are in boxes and not in "inventory". So when someone orders a kit that contains A, B, and D the A and B bins are empty (as all items A and B are already part of a kit and thus not available) and the code would direct him to put D in a box and put it in the partial kit area. Eventually the D bin is empty, so when an order comes for a kit that requires D and E, we get another flood of partial kits, all going to the same location (which was just a square painted on the warehouse floor).

Anyway, long story short, if the right few items were out of stock and the right orders came in the right sequence, nearly the entire inventory of the warehouse ended up in a giant pile of boxes that was too large for the workers to sort through even when the needed items arrived.

Everything was humming along just peachy for weeks and then BAM! Red faces all around. It took days for them to put all the inventory back into the proper bins and fix all the data, and that probably cost into seven figures, all told.

In my defense, I wasn't the last one to touch that module.

AnimalMuppet · on July 18, 2017

Sounds like the problem was the requirements, not the implementation...

mapster · on July 18, 2017

What was the best solution for them? Out-of-stock creates a work backorder and a potential logjam requiring overtime and temp workers, but avoids the out-of-stock false positive. In the end, more happy kit owners.

gozur88 · on July 18, 2017

It's been a while, but I think we just put a hard limit on the number of partial kits.
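
A minimal sketch of what such a cap might look like, in Python. The function name `plan_pick`, the constant `MAX_PARTIAL_KITS`, and the limit value are all hypothetical; nothing here comes from the original warehouse system.

```python
# Hypothetical sketch: cap the number of partial kits in the staging area.
MAX_PARTIAL_KITS = 50  # assumed limit; the real value is not given in the thread


def plan_pick(kit_items, stock, partial_kits_staged, max_partial=MAX_PARTIAL_KITS):
    """Decide what to do with a kit order.

    kit_items: list of item names the kit needs
    stock: dict mapping item name -> units available in bins
    partial_kits_staged: number of partial kits already in the staging area
    Returns one of "ship", "partial", or "backorder".
    """
    available = [item for item in kit_items if stock.get(item, 0) > 0]
    if len(available) == len(kit_items):
        return "ship"
    # Only start a partial kit if something can be picked and the
    # staging area is below its cap; otherwise leave stock in the bins.
    if available and partial_kits_staged < max_partial:
        return "partial"
    return "backorder"
```

Once the cap is reached, new incomplete orders become plain backorders, so the remaining stock stays in its bins instead of joining the pile.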

jnord · on July 17, 2017

In the early days of my career I had to modify some code for a PLC that operated on a car production line. The modified code took too long to run so a watchdog process assumed the code had frozen and performed an emergency shutdown of the hydraulics of the line's welding robots. Six cars were damaged when the heavy robot arms crashed and buckled car roofs, and the one-car-every-45-seconds production line ground to a halt for 15 minutes.

vortico · on July 18, 2017

A similar PLC story: A friend was an intern at a well known coffee manufacturer. He was writing code for new sensors to use which were to be installed on the assembly line machines later that year. There was a staging machine that his team used in the same room, and they would log into the machine and push their code to run tests.

He sent the machine an update, rebooted it, yet the staging machine was acting unchanged from the previous version. Moments later a supervisor ran into the room, yelled to put his hands up from the keyboard, reached over him, ran some commands, and disappeared into another room. A few minutes later he announced that my friend logged into the main assembly line machines and rebooted them with code that used a sensor that didn't exist yet, which stopped the entire chain for ~20 minutes. The company suffered $250k production losses during that time.

Jugurtha · on July 18, 2017

How come an intern has access to the main production line PLCs?

imhoguy · on July 18, 2017

This reminds me of the poor intern's story posted a few days back on Reddit: https://www.reddit.com/r/cscareerquestions/comments/6ez8ag/a...

vortico · on July 18, 2017

It must have been an error on the network's access restrictions, because I agree, he shouldn't have had the ability to get to that part of the network. Maybe it was a routing error, since he claims he didn't switch the machine destination before uploading.

euyyn · on July 17, 2017

Emergency shutdown meaning all the huge robot arms fall down, instead of just freezing in place, doesn't sound like a good idea to start with :)

bigger_cheese · on July 18, 2017

Hydraulics are scary things. There can be a lot of stored energy released very quickly if one of them gives out suddenly. I've heard horror stories about limb amputations etc. due to sudden release of hydraulic pressure causing shrapnel to go flying across the room.

If there is a fault, the safest thing for them to do is bleed out energy slowly. Unfortunately, in this case it sounds like this crushed the 'obstruction' in the process.

My industrial plant code screwup story was not caused by me but was pretty impressive. What was supposed to be a "simple firewall change" knocked out communication between two interlinked parts of our plant, which caused the line to stop and a big delay with a few million dollars lost. I believe the root cause was that someone fat-fingered the addition of a new firewall rule and we ended up dropping every incoming packet.

vacri · on July 18, 2017

Cutting power = you know that the arms will go slack.

Freezing = you hope that whatever the problem is, maintaining power doesn't make it worse.

CapacitorSet · on July 18, 2017

Ah yes, PLC programming. I expect there will be a lot of stories from PLC programmers in this thread ;)

jnord · on July 18, 2017

"It is easy to make a mistake but if you really want to stuff it up you need a PLC". :)

eat_veggies · on July 18, 2017

Hey! I'm working at my first software internship and I will very soon be doing some PLC programming on moderately important chemical equipment. Do you have any tips on how to not spectacularly fuck everything up?

CapacitorSet · on July 19, 2017

I found it helpful to not change "complex" outputs directly, but rather call a function that does that for me and handles the complexity in one place. For instance, if a brake must be removed before activating a motor, rather than pasting the same code every time you need to use the motor, wrap that in a small function block and use that.

In general, make the code easy to interpret, especially when debugging - which means organizing your code and using simple abstractions. State machines can be useful because they're easy to interpret if you comment your states, leading to high-level understandings like "the machine is waiting for an item" rather than "it will do something when I0.5 goes high".
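
A rough illustration of both ideas, written in Python only as a stand-in (real PLC code would be ladder logic, structured text, or similar). The `Conveyor` class, its states, and the single item sensor are hypothetical; the point is that the brake/motor ordering lives in one wrapper and the states have readable names.

```python
from enum import Enum


class State(Enum):
    WAITING_FOR_ITEM = "waiting for an item"
    MOVING_ITEM = "moving the item"


class Conveyor:
    """Hypothetical machine with one brake, one motor and one item sensor."""

    def __init__(self):
        self.brake_engaged = True
        self.motor_on = False
        self.state = State.WAITING_FOR_ITEM

    def start_motor(self):
        # The brake/motor ordering lives in exactly one place.
        self.brake_engaged = False
        self.motor_on = True

    def stop_motor(self):
        self.motor_on = False
        self.brake_engaged = True

    def step(self, item_present):
        """Run one scan cycle; item_present stands in for an input like I0.5."""
        if self.state is State.WAITING_FOR_ITEM and item_present:
            self.start_motor()
            self.state = State.MOVING_ITEM
        elif self.state is State.MOVING_ITEM and not item_present:
            self.stop_motor()
            self.state = State.WAITING_FOR_ITEM
        return self.state
```

When debugging, printing `conveyor.state.value` gives the high-level reading ("waiting for an item") rather than a raw input bit.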

Procrastes · on July 17, 2017

That sounds pretty spectacular, alright. Although, I can't help wondering what it would have sounded like to set the robot arms to playing the William Tell Overture as a restart test. I mean the cars were already messed up...

chrisbennet · on July 18, 2017

Many years ago, when the earth's crust was still cooling, I wrote an application to generate tool paths for the milling machines my employer made. Milling machines use a cutting tool that looks something like a drill bit except that it cuts on the side of the tool instead of the tip.

One day I was told that my software had a bug. The tool wasn't being retracted (pulled out of the material being cut) before being rapidly moved to a new location. As a result, the cutting tool was being broken off.

I asked [I think it was our application engineer] if we sold the replacement tools to the customer and I was told "yes". Then I asked him: "Then isn't breaking off tools kind of a feature?"

"Just fix it Chris. Just fix it."

LarryPage · on July 18, 2017

And as the earth's crust finally cooled, you fixed it.

AnimalMuppet · on July 17, 2017

We had a microwave generator that was used to cook cancers in living patients. We'd ask for a given power, and we had the ability to read back how much power we actually got. But we didn't check that the power we read back was something reasonable. When an op amp failed, the generator produced full power whenever we asked for any power at all. The patient literally got hot enough to emit smoke.

Thank God, the patient was a pig. We hadn't made it into clinical use yet.

CapacitorSet · on July 18, 2017

>When an op amp failed, the generator produced full power whenever we asked for any power at all.

Huh, I'd expect sensitive systems like these to have some sort of hardware redundancy/voting system.

sjg007 · on July 18, 2017

The Therac-25 case is well known here: you want a hardware interlock.

kazinator · on July 18, 2017

Although that is true, the point is that the software could have done something (in this particular failure case) since it had the means to monitor the actual power. Like "Oh shit, there is too much power; something is wrong; shut everything down".

ill0gicity · on July 18, 2017

I'd been looking for a new way to roast whole hogs... My search has ended here.

Procrastes · on July 17, 2017

Holy crap! That kind of failure (where some governing component drops out) is one of my favorite nightmares. Did you add an RF meter to the design?

AnimalMuppet · on July 17, 2017

No. We added code that, if the power was too far off from what we asked for, tried to kill power three different ways, plus alerted the operator. It was a bit tricky, because the power is never exactly what you ask for (variable impedance match, plus noise).
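
A minimal sketch of that kind of check, in Python. The tolerance values, function names, and the callables passed in are assumptions made for illustration; the actual device code and its three shutdown paths are not described in the thread.

```python
def power_is_plausible(requested_w, measured_w, rel_tol=0.25, abs_tol_w=2.0):
    """Return True if the measured power is close enough to the request.

    Tolerances are made-up placeholders; real ones would come from the
    hardware's impedance-match and noise behaviour.
    """
    allowed = max(abs_tol_w, rel_tol * requested_w)
    return abs(measured_w - requested_w) <= allowed


def check_and_shutdown(requested_w, measured_w, kill_switches, alert):
    """Call every kill switch and alert the operator if power is off.

    kill_switches: list of independent callables that each cut power.
    alert: callable taking a message for the operator.
    Returns True if a shutdown was triggered.
    """
    if power_is_plausible(requested_w, measured_w):
        return False
    for kill in kill_switches:
        try:
            kill()  # keep going even if one shutdown path fails
        except Exception:
            pass
    alert(f"Power fault: asked {requested_w} W, read {measured_w} W")
    return True
```

Using both a relative and an absolute tolerance is one way to cope with readings that are never exactly what was asked for, especially at low power settings.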

wfunction · on July 17, 2017

> We added code that, if the power was too far off from what we asked for, tried to kill power three different ways, plus alerted the operator.

My first worry was that your measurement would be wrong, not that the power wouldn't be killed! Any redundancy on that side? Or was it not necessary?

AnimalMuppet · on July 18, 2017

The specific issue that we encountered was that the power was measured correctly, but was out of control. At that point, not being able to kill the power is a very real concern.

If we measured wrong, we could either be high or low. If we measured high (that is, the reading is higher than reality), we would either turn down the power until it read right, or else kill power completely. If we read low, though, IIRC we would limit how high we'd turn up the gain to try to get the power we were asking for.

There was also a feedback loop based on temperature. If the power was double what we asked for, the temperature would quickly climb, and we'd reduce power. It would have worked, even with inaccurate power readings, though not as smoothly as it should with accurate power readings. But when we got 20 times the power we asked for (due to the power control failure), it was too much too fast.

AstralStorm · on July 18, 2017

Congratulations. You've found out there is value in having real programmable fuses instead of control electronics.

karthikb · on July 18, 2017

Usually the failure mode in these kinds of systems should be to emit nothing....

kazinator · on July 18, 2017

Did the police show up?

AnimalMuppet · on July 19, 2017

For a pig? No.

ohquu · on July 20, 2017

WOOSH

throwawaysntc · on July 18, 2017

My code probably contributed to the financial crash of 2007/8.

Unfortunately, I cannot share many details except that I wrote code that was meant to manage the amount of risk that a certain really big financial institution was supposed to take. My code may or may not have shipped after I left that institution. If it did ship, maybe it did not do what it was supposed to do. If it did not ship, maybe it failed to replace the broken system that it was supposed to replace. Either way, months after I left, the head of the institution acknowledged on TV that they were taking on more risk than they intended to.

vortico · on July 18, 2017

Not me, but https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...

amingilani · on July 18, 2017

Oh my god, I laughed for a good 5 minutes on this one. The comments just add to the hilarity.

Procrastes · on July 18, 2017

That is a thing of beauty.

davimack · on July 18, 2017

Not mine, but one I ran into. This is on an automated testing rig for microwave devices, which are odd things - you don't have wires for microwaves, you have wave guides, which are basically tubes which you can pipe the microwave through, and which are incredibly fiddly to get situated properly. So, to test one of these things, you're likely to get a failure and not have any discernible reason for it failing - you'll tear it down and not find any problems, put it back together and it'll work just fine.

Well, the engineer writing the test code knew these devices were odd, and that sometimes they'd just fail. So, s/he put in an if block to the effect that, "if this fails once, run the test 30 times and, if it passes 25/30 times, call it a pass." So, every now and again, the entire automated testing line comes to a halt and sits there for 31x the amount of time it should take, and it's not a short test (maybe sat there 30 minutes each iteration).

zaptheimpaler · on July 18, 2017

I wrote some code that was pulling batches of events off a queue, doing some processing and then writing them out to HDFS.

The inner loop was something like:

    while message:
        converted_event = Event()
        for event in message.events():
            converted_event.set_fields(event)
            write_to_hdfs(converted_event)

Can you spot the bug? Led to a month of corrupted data before I noticed..

The `set_fields` method does not clear all fields, so every event had more and more junk data than the one before it. All because I thought I would be clever and get some performance gains by initializing `converted_event` outside the inner loop.
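
A small sketch of the straightforward fix, assuming a stand-in `Event` class whose `set_fields` overwrites but never clears (the real class and the HDFS writer are not shown in the thread, so both are simplified here):

```python
class Event:
    """Stand-in for the real event type; fields live in a dict."""

    def __init__(self):
        self.fields = {}

    def set_fields(self, source):
        # Like the method in the story: overwrites, never clears.
        self.fields.update(source)


def convert_all(events):
    """Convert each raw event into a fresh Event so no fields carry over."""
    converted = []
    for raw in events:
        event = Event()  # new object per event, not one shared across the loop
        event.set_fields(raw)
        converted.append(event.fields)
    return converted
```

With the object created inside the loop, a field present in one event cannot leak into the next one that lacks it.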

flukus · on July 18, 2017

Working on school software I forgot to add "and IsDeceased = 0" to a query. Turns out parents don't like getting notifications about their dead child's truancy.

kazinator · on July 18, 2017

A database with dead kids that have to be tested for in every damn query is a pretty nasty database.

Maybe there should be a separate database of historic students who used to go to that school, and currently enrolled.

It's not just "isDeceased", but "goesToThisSchool". Nobody wants to get some notification from a school about something when their kid doesn't go there any more for any reason.

dragonwriter · on July 18, 2017

> Maybe there should be a separate database of historic students who used to go to that school, and currently enrolled.

Or, rather than duplicating data, just use a view with appropriate criteria to limit to currently active, living students for most queries. But a developer that's called in to build a query generally isn't going to get a lot of mileage out of suggesting rearchitecting the database in either of those ways.
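
A rough sketch of the view idea using Python's built-in `sqlite3` and an in-memory database. The table layout, the `IsEnrolled` column, and the view name `v_current_students` are hypothetical; only `IsDeceased` appears in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        IsDeceased INTEGER NOT NULL DEFAULT 0,
        IsEnrolled INTEGER NOT NULL DEFAULT 1
    );
    INSERT INTO students (name, IsDeceased, IsEnrolled) VALUES
        ('Ann', 0, 1),
        ('Bob', 1, 0),
        ('Cy', 0, 0);
    -- The filter is written once, here, instead of in every report query.
    CREATE VIEW v_current_students AS
        SELECT id, name FROM students
        WHERE IsDeceased = 0 AND IsEnrolled = 1;
    """
)


def current_student_names(connection):
    """Return names of living, currently enrolled students, sorted."""
    rows = connection.execute(
        "SELECT name FROM v_current_students ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Report queries that select from the view cannot forget the filter, which is the failure described above.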

flukus · on July 18, 2017

> But a developer that's called in to build a query generally isn't going to get a lot of mileage out of suggesting rearchitecting the database in either of those ways.

It's like you were there :)

It was a third party product so changing the structure was out of the question. We had some views but they pulled in the entire database and ended up with so much duplicate and irrelevant information they were unusable.

I tried creating a clean set of views like "v_currentStudents" that could then be joined on for information relevant to the current report. I even built a small test suite for them, but getting the support devs (who I was covering for when this happened) to change their cowboy ways was too hard. Management didn't like the idea either; it cut into the billable hours.

faitswulff · on July 18, 2017

Oh, holy shit. This wins the thread, as far as I'm concerned.

carvin · on July 18, 2017

I was an intern at a university security lab working on a 7-month project. Early on, I figured it would be a good idea to use SVN to save my work so I set up a repository and did a few commits but quickly stopped maintaining the repo.

One hour before the end of my internship, I was ready to leave, my work done, ready to be used by the next person taking over the project. I wanted to clean up my files and documentation so it was all tidy, and I tried to commit my work. Of course SVN could not commit because the repo and my work had nothing left in common. So I typed (on a Linux system): svn delete to clean up the repo so that I could push my files... I lost months of work and I was not able to recover my lost files from the file system... I had to leave for my country of origin since this internship was part of an exchange program. I felt so bad about it, it still haunts me.

acidus · on July 18, 2017

Don't let it drown you!

tatersolid · on July 18, 2017

I once wrote a server "clean up" script that moved all .log files older than a few hours to an archive.

Someone else added it to a group policy for all corporate servers, including all our Exchange servers, where the active database transaction logs are named .log.

hluska · on July 18, 2017

If we're ever in the same city, I owe you a beverage! Great story.

istotex · on July 18, 2017

On the last project I was working on, I built a backend on Node.js v4 for an online course site. For a long time I was trying to convince our team leader to switch to Node v6, since it supported ES6 and I couldn't wait to use the new JavaScript features like, e.g. classes. However, he was always reluctant to make the switch, since there were other priorities at the time.

At some point, I found out that inserting 'use strict' at the beginning of each Node.js module enabled the experimental ES6 (harmony) features in Node v4. Needless to say, I was super excited and immediately started using classes and other ES6 goodies everywhere, even refactoring already existing modules.

Shortly after that, we noticed that our servers were leaking memory and started crashing almost every day. At the time, I had no idea what the problem was - and believe me I tried everything to find a solution - until a couple of months later we switched to Node v6, and everything miraculously returned to normal. In the meantime though, during those 2 dreadful months between v4 and v6, we had to set up cron to restart our servers every single day at 04:00...

Never use experimental features.

jefozabuss · on July 18, 2017

Never use experimental features ... in production

Procrastes · on July 17, 2017

I'll kick it off with my own. I've had a few, but the most dramatic was when I once changed the wrong line in a configuration script and ripped a three ton(U.S) mixer out of a concrete floor.

euyyn · on July 17, 2017

Don't leave it there! Details!

Procrastes · on July 18, 2017

I was working on a control system for cattle feed mills. We had to wire into the system sensor-by-sensor and actuator-by-actuator as they continued to make feed. We started out with the entire system simulated, then gradually ended up with a fully live system.

I (thought I) set an actuator running an auger (screw) that offloaded the feed from the mixer into a leg (12 meter tall vertical screw) to "run always." That should be safe right? The auger runs all the time, carrying away anything that dumps into it. What I had actually done was set a hay belt to "run always," and it was stuffing the mixer with more and more hay until it was a solid mass inside the box.

Everything seemed fine when we started that next batch of feed... then the mixer started. The lights dimmed and there was this shriek of metal and a bang from the mill floor. We shut down and went out to see this huge mixer hanging off a drive chain at a 45-degree angle from the floor. Bolt heads the size of manhole covers had sheared off and were lying nearby. Fortunately no one had been standing nearby. I don't know if my memory matches reality, but I recall a light from one of the skylights shining down on it in the grain dust like a spotlight.

I was pretty sure this was going to be my last day on the job.

I walked over to stand next to the Mill Manager, a salty fellow named Marvin with three fingers on his right hand. Marvin looked up the chain and back down to the bolts on the floor and said "Yep, it'll do that."

Two workers lowered it down and welded the bolt heads back in place like they did it every day.

I was with the company for five years. I don't recall ever having a support call from that mill after we finished the installation.

vacri · on July 18, 2017

> three ton(U.S)

tons and tonnes/'metric tons' are roughly interchangeable, like yards and meters. :)

LorenPechtel · on July 18, 2017

Wasn't actually my fault: My code ordered the factory to errantly produce several thousand dollars worth of left-hinged doors. (A guy who should have known better set a bunch of flags that messed up its hinge-determination logic. Anything that was supposed to be produced as one left and one right got produced as two left instead.) As everything was build-to-order it's unlikely any got used, at least for their intended purposes. (I still have a few unused doors around--put some casters on them and you have a nice looking rolling wooden platform. The laser printer on the floor beside me is sitting on one of those.)

ioddly · on July 18, 2017

When I was a teenager, I crashed a MUD hosting server by forking a process in a loop. The admin kindly explained ulimit to me. (This was before VPSes were a thing).

I was so mortified, I guess it stuck well enough that that's the worst thing off the top of my head.

But it seems like I'm an underachiever based on this thread.

AnimalMuppet · on July 18, 2017

Being an underachiever on this thread may make you an overachiever at writing code...

MarkMMullin · on July 18, 2017

Desperately sought just an extra 4K of RAM to see if a LISP expert system would get through a diagnosis on a Huge Aircrash Firefinder maintenance guide - had a kernel license, dug around and found a magic flag for a 4K block - tested it, seemed OK, put it out in the field, and the first time it ran, it grabbed that extra 4K and was instantly rewarded with a "Panic: out of swap space" and the whole damn thing dropped dead :-(

kazinator · on July 18, 2017

> Huge Aircrash

Is that a jab at Hughes Aircraft? :) Looks like "Firefinder" is some kind of radar system developed by them.

MarkMMullin · on July 20, 2017

Friendly jab - it was old when I started my career in the '80s :-) And yeah, Firefinder is a radar

sidlls · on July 18, 2017

Helped the armed forces of my country kill people.

amingilani · on July 18, 2017

So your code worked as intended, and the intention was to kill people since it was used by the military? You sir, have written the most destructive code on this thread.

I'm sorry for how you must feel.

sidlls · on July 18, 2017

Yeah. I had no idea at the time what its purpose was, either and found out about it after the fact.

It isn't a pleasant thing to live with.

szemet · on July 18, 2017

Then you did not want to hurt anyone.

Imagine someone who works as a knife grinder. If he does his job right the knives will be much more dangerous, they may cause accidents or even some will be used intentionally as a deadly weapon.

Then considering these possibilities: an ethical knife grinder should do a shitty job, should quit, or should live in self-reproach?

It may be more complicated if you are a gunsmith. But those guns are used by your customers - so to what extent will your ethical evaluation depend on their actions in this case?

For example if your guns are used in an arms race, and eventually they help avoid war then you are a saint? If your guns are used in a victorious war then you are a hero? And if they are used for killing innocents then you are an evil person? Or you should be judged by the average probabilities of the global gun usage? Or what?

dragonwriter · on July 18, 2017

> Imagine someone who works as a knife grinder. If he does his job right the knives will be much more dangerous, they may cause accidents

Poorly sharpened knives are far more likely to cause accidental injuries, and serious ones, than well-sharpened knives. At least in kitchen use, though I'd expect the “a dull knife is more likely to fail to cut what you meant, slide off, and strike something else” effect would apply in most uses of knives.

szemet · on July 18, 2017

Maybe. It is also possible that better knives cause fewer but more serious accidents, and then it is hard to compare the two. But I stop now, because what we have now are just plausible hypotheses without any real evidence - theoretical knife science waiting for confirmation... ;)

bmelton · on July 18, 2017

OSHA[1] recommends keeping knives sharp to prevent restaurant and kitchen maladies from occurring.

The Ohio Bureau of Worker's Compensation[2] recommends the same.

The Bureau of Industrial and Labor Statistics[3] cites dull knives as a common cause for injury, and recommends keeping knives 'sharp and in good trim' to prevent accidents.

In short, "a sharp knife is a safe knife" isn't hokum. When you're pushing a knife into something, you're storing and releasing kinetic energy. A sharper knife requires less kinetic energy to begin cutting the object, which is ostensibly dangerous, but not as dangerous as a failure to cut, which releases all that kinetic energy in uncontrollable fashion.

Past that, in the event that you do get cut by a knife, a sharper knife makes a cleaner cut, which means easier healing, easier care, and (if dire enough) easier reattachment. Oh, and less scarring to boot.

[1] - https://www.osha.gov/SLTC/youth/restaurant/knives_foodprep.h...

[2] - https://www.bwc.ohio.gov/downloads/blankpdf/SafetyTalk-Preve...

[3] - https://books.google.com/books?id=W0M4AQAAMAAJ&pg=PA190&lpg=...

sidlls · on July 18, 2017

I don't feel responsible. I do feel like having a bit of remorse has been useful for me. It's helped me be much more discriminating in my choice of employer.

kazinator · on July 18, 2017

Almost anything can be repurposed for killing, so take it easy.

Suppose you work on the GNU grep program. Harmless, right?

Some regime could use that to grep out a list of innocent people to put on a hit list.

If you had no idea what the purpose was, that means the program had conceivable purposes of various kinds, not related to killing, just like grep.

If you develop something that is pretty much only for killing, obviously so, then you know, right? Or else are capable of incredible denial.

sreya · on July 18, 2017

This is interesting. Or is this just a "joke" (albeit dark) using an alternative definition of "worst"?

sidlls · on July 18, 2017

The person asking wasn't specific about what he or she meant. I understand usually that means in the "how did your code fail in some spectacularly bad way?" sense but I took some liberty to answer.

canada_dry · on July 18, 2017

Almost got me fired on the spot.

One of my first implementations at a bank many years ago... bunch of 'C' levels are in the main branch for my first big launch demo...

Tap a few keys...

      **ERR ** HELL FROZE OVER!

LPT: never use this in an else case.

kazinator · on July 18, 2017

So this was something like:

    default:  /* unreachable case */
       assert(0 && "hell musta just froze over");

that type of thing? Impossible case throwing funny error message?

tejtm · on July 18, 2017

Exactly what I told it to do. Which seemed perfectly reasonable to me ... but had my boss running down the hall muttering something about damage control; seems not all biologists liked receiving letters introducing them to other biologists whose results on some marker or another differed in some non-trivial way.

kafkaesq · on July 18, 2017

Made people rich, who definitely didn't deserve it.

donatj · on July 18, 2017

I wrote code for a domain squatter ad control system as my very first task at my very first job out of college. I am not proud and honestly didn't realize what it was until I got pretty far into it.

ams6110 · on July 18, 2017

Not my code, but I was involved in cleaning up the aftermath. Financial company, a programmer had made a one line change to clean up some working directory at the end of a program. Something like

  "rm -rf /var/scratchdir /"

Yeah the space was a typo. Wasn't running as root but was able to make a pretty big mess regardless.

seanwilson · on July 18, 2017

I move things to /tmp now instead of deleting them. Where the margin of error is a single character "rm" is just too risky.

kazinator · on July 18, 2017

I did something like that in some code in a from-scratch embedded Linux distro, maybe nine years ago:

  rm -rf $MISPELED_ROOT_DIR/lib/

Oops, the script didn't have "set -u", and I happened to run that as root. So, /lib directory gone.

I managed to recover that machine by copying libs from another one running the same distro.
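
In shell, `set -u` (or a `${VAR:?}` expansion) makes the script abort when the variable is unset. The same guard can be sketched in Python; the helper name `safe_rmtree` and its checks are hypothetical, just one way to refuse a delete whose base path is empty or the filesystem root.

```python
import os
import shutil


def safe_rmtree(base_dir, subdir):
    """Delete base_dir/subdir, refusing empty or root-like base paths.

    Raises ValueError instead of deleting when base_dir is unset, empty,
    or resolves to the filesystem root.
    """
    if not base_dir or not base_dir.strip():
        raise ValueError("base_dir is empty or unset")
    base = os.path.realpath(base_dir)
    if base == os.path.realpath(os.sep):
        raise ValueError("refusing to delete under the filesystem root")
    target = os.path.realpath(os.path.join(base, subdir))
    # The target must be strictly inside base, not base itself or outside it.
    if os.path.commonpath([base, target]) != base or target == base:
        raise ValueError(f"{target} is not inside {base}")
    shutil.rmtree(target)
```

With a check like this, a misspelled or unset root variable produces an error instead of a delete of `/lib`.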

aivarsk · on July 18, 2017

I developed and maintained CI scripts for a large modular C++ application 10+ years ago. Someone added `rm -rf $(SOME_TEMP_DIR)/` to a global Makefile that was run before building anything. My CI scripts did not set SOME_TEMP_DIR...

Came to work the next day: the nightly build still had not finished on the slave servers, and I got errors about a non-existent home folder when I tried to log in.

What made it worse was that every server mounted an NFS share that contained fingerprints and binaries of different versions of software modules built on different platforms.

Killed all slaves and restored the NFS share from week-old backups on tape; tens of developers could not create new versions of software or send previous versions/patches to customers for a while.

vortico · on July 18, 2017

This is about the third time this has happened in this thread. What's the reason for writing the final "/" and not just `rm -rf $(SOME_TEMP_DIR)`?

allenrb · on July 18, 2017

Really hoping there's at least one Ariane 5 avionics engineer who reads HN...

tj-teej · on July 18, 2017

This one's a doozy

I was working on Cloud Management software for a Private Cloud at a major tech company in SV. We had software which would reserve Prod IP space for hypervisors, e.g. this hardware SKU can support up to 5 VMs, therefore it needs to reserve 5 IP addresses in the corresponding subnet.

Turned out the API call to reserve the IP space from the IP Manager wasn't asynchronous and because the manager tried to get consecutive space, the runtime increased exponentially with the requested # of IP addresses.

In preparation for Holiday traffic, we were onboarding a new SKU of Hardware. This hardware supported more tenants and so instead of requesting 7 IP addresses per HV, now we're asking for 15. This took the latency of a call to the IP Manager from 3-5 seconds to 5-10 minutes. To round off the perfect storm, the code was retrying requests which failed, without propagating the failure to the Cloud Admins using the software.

One day in October, I received a panicky call from our Capacity manager, customers are trying to spin-up VMs but are being told there's no IP space left. He knows we've onboarded all the racks, and he's done the math on the subnets (which are showing as fully reserved), and there still isn't IP space...WTF!!

Turned out the IP manager's VIP was cutting off requests after a few minutes, (never a possibility when reserving only 7 spaces) but the reservation process wasn't stopping, the IP was being reserved, marked as in-use, but never actually making it to the networking service to be used by VMs.

Solution: At 2am on a Friday night I ran a script to manually mark tens of thousands of production IP records as not-in-use in the IP manager, purely based on grepping through logs from my service, and nslookups. But don't worry, we pinged each IP just to be safe :)

kazinator · on July 18, 2017

I ran a BBS on an 8-bit microcomputer in the 1980s. I wrote everything myself, including low-level modem drivers in assembly code.

I had some code which handled a temporary loss of carrier. It would poll for the carrier to come back for a few moments, otherwise indicate to layers higher up that carrier is lost, so the user can be logged out.

Problem is, in that piece of code, I forgot to pop something off the stack that I pushed onto the stack. I had a user who was a bit of a cracker. I got a note from the guy, "I got into your operating system by dialing touch tones while connected".

Dialing a touch tone interrupted the carrier sense in the modem, triggering that code with the bad stack handling that would crash the BBS program, leaving the I/O hooks still connected to the modem driver, giving the caller full access to the system.

This didn't reproduce during the usual case when the carrier was lost permanently, only when it recovered.

baccredited · on July 20, 2017

Can anyone top this one:

  -  function initMultiowned(address[] _owners, uint _required) {
  +  function initMultiowned(address[] _owners, uint _required) internal {

This bug led directly to over $30 million being stolen yesterday. Not my code, but impressive nonetheless.

Hackers have stolen $32 million in Ethereum in the second heist this week http://www.businessinsider.com/report-hackers-stole-32-milli...

Fix initialisation bug. https://github.com/paritytech/parity/commit/e06a1e8dd9cfd8bf...

arunmp · on July 20, 2017

About fifteen years back I wrote a shell script that ran in the background and was supposed to email the log file to the administrator every time it hit an error. The trouble was, it was an infinite loop (being a background process!) and there was some error. I forgot to tell the code to end in case there was an error. Very dutifully, the program clogged up the company mail server completely with thousands of mails with error logs over the weekend: no emails coming in or going out, and one very angry administrator.
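
The missing exit can be illustrated with a short sketch. The original was a shell script; this is a Python stand-in under stated assumptions, where `run_until_error`, `step`, and `notify` are hypothetical names: `step` does one unit of work and raises on failure, and `notify` sends the report. The loop sends exactly one notification and then stops.

```python
def run_until_error(step, notify, max_iterations=None):
    """Run step() repeatedly; on the first error, notify once and stop.

    `step` and `notify` are hypothetical callables: `step` does one unit of
    work and raises on failure, `notify` sends the report (e.g. an email).
    Returns the number of successful iterations.
    """
    completed = 0
    while max_iterations is None or completed < max_iterations:
        try:
            step()
        except Exception as exc:
            notify(f"job stopped after {completed} iterations: {exc!r}")
            break  # the missing piece: leave the loop instead of looping forever
        completed += 1
    return completed
```

If the job has to keep running after an error, a rate limit on `notify` would be the alternative design; stopping is simply the smallest change that avoids flooding the mail server.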

khedoros1 · on July 18, 2017

I investigated this bug: Backup system, using a tree data structure where the root was a hash describing a backup, and the leaves were variable-size chunks of data. Backing up a virtual machine, it would process only the changed areas, and re-build that section of tree. Roughly 1 in a few million backups silently lopped off a branch of the tree, a couple levels up. Customers have thousands of VMs, we have thousands of customers. Silent data corruption, somewhere, every day. Rarely-triggered off-by-one errors in unreproducible data suck.

oldsklgdfth · on July 20, 2017

I was tasked to write a restart function for a desktop application. At the time I was straight out of college with no idea what I was doing, so I asked the lead for some direction.

He told me to:

- write out a script that waits 1 second and then runs the application
- run the script in a separate process
- kill the application

I bet that code is still there. It works, but damn is that cringy.
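
For reference, the approach described can be sketched roughly as follows. This is a Python approximation, not the original code: `relaunch_command` and `restart` are hypothetical names, and the "script" is an inline helper passed to the interpreter with `-c` instead of a file written to disk.

```python
import subprocess
import sys

# Inline helper: sleep for argv[1] seconds, then start the command in argv[2:].
_HELPER = (
    "import subprocess, sys, time; "
    "time.sleep(float(sys.argv[1])); "
    "subprocess.Popen(sys.argv[2:])"
)


def relaunch_command(app_argv, delay_s=1.0):
    """Build the command line for a helper that waits, then starts the app."""
    return [sys.executable, "-c", _HELPER, str(delay_s), *app_argv]


def restart(app_argv, delay_s=1.0):
    """Start the helper in a separate process, then exit the current process."""
    # start_new_session detaches the helper on POSIX so it outlives this process.
    subprocess.Popen(relaunch_command(app_argv, delay_s), start_new_session=True)
    sys.exit(0)
```

Calling `restart(sys.argv)` from a script would relaunch that same script after the delay, assuming `sys.argv[0]` is directly executable.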

seanwilson · on July 18, 2017

Not mine but I've seen someone do the classic of having a Bash script with something like "rm -rf $PATH/" where if you run the script without $PATH set it'll wipe out the whole drive if it has permissions. Took out a CI server but luckily we had backups.

Edit: OK, this seems like a very common issue!

PhasmaFelis · on July 18, 2017

The Linux version of Steam had one of these for a bit. People were not happy.

seanwilson · on July 18, 2017

Hmm, what's the lesson here to stop this common and very high impact bug then? Never delete directories using Bash scripts + whatever delete function you do use should be locked down to only ever being allowed to act in your app's subfolder + empty path strings aren't allowed?
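
The last two ideas (confine deletion to the app's subfolder, reject empty paths) can be sketched like this. It is a minimal Python illustration under stated assumptions, not a hardened tool: `safe_rmtree` and `allowed_root` are hypothetical names, and the check relies on resolving both paths before comparing them.

```python
import os
import shutil


def safe_rmtree(target, allowed_root):
    """Delete `target` only if it is a strict subdirectory of `allowed_root`."""
    if not target or not allowed_root:
        # An unset variable typically shows up here as an empty string.
        raise ValueError("empty path refused")
    root = os.path.realpath(allowed_root)
    path = os.path.realpath(target)  # resolves "..", symlinks, relative paths
    if path == root or os.path.commonpath([root, path]) != root:
        raise ValueError(f"refusing to delete {path!r}: outside {root!r}")
    shutil.rmtree(path)
```

In a Bash script, the closest equivalents are `set -u` (error on unset variables) or the `${VAR:?}` expansion, which aborts when the variable is unset or empty.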

mattbgates · on July 20, 2017

I was testing on a shared hosting server once and got stuck in a loop, crashing the entire server and everyone on it. I had to get the host to reset it because it just wasn't ever going to end. They weren't mad and didn't penalize me or anything, just told me to be careful.

juli1pb · on July 18, 2017

system("rm -rf $dir/")

I forgot to check my inputs. Ran in production for a backup system.

andrewstuart · on July 18, 2017

Been unused and irrelevant.

SirLJ · on July 19, 2017

Had a bug in my stock market scan and missed a trade that would have netted me 20% - easily the biggest trade of the year...

twovi · on July 18, 2017

rsync -avz project_files/ root@192.168.0.1:/

Essentially production was not acceptable for a little bit....

imaginenore · on July 18, 2017

Accidentally removed our corporate ID from the ad code, very high traffic website. So the ads displayed, but we were not getting paid for the clicks. $140K lost in a few hours. At the time that was almost double my yearly salary.

Nobody got fired, because we had a QA team, and their testing procedure didn't test for something like that.