Ask HN: Strange bug workarounds?
148 points by porjo on Sept 12, 2016 | 106 comments
Software bugs are a fact of life and, sadly, many never see a (timely) fix. This can lead to some unusual workarounds in order to continue using the software.

What are some unusual/quirky/bizarre workarounds to software bugs that have been encountered by the HN crowd?

A recent one I struck was with the Google Earth desktop app on Linux. It has a tendency to crash on startup unless your mouse is contained within a small rectangle in the middle of the screen [1].

[1] http://askubuntu.com/questions/642027/google-earth-crashes-when-opened#comment1071599_677717




I worked on health record software. An elusive bug in the custom SQL Server crypto plugin led to very occasional corrupted entries, which was very bad.

The guy who wrote the crypto plugin had of course quit and nobody knew how it worked.

Fine-combing the C++, I found an off-by-one error that would cause the predicted failures: after rebooting SQL Server, the first entry would get encrypted with a zero key. (Hooray, we could now also fix all the corrupted data.)

For various reasons it would have been difficult to ship new DLLs to the affected customers. Only a handful used this particular crypto and it would be much easier to patch the existing binary DLLs on their servers.

Well... looking at the machine code, I found that the troublesome off-by-one operations were actually in the printable ASCII range... so I just taught my friend in tech support to do a particular obscure search and replace in Notepad++, something like changing ",}" into ",~" in the binary DLL... and then hot-reload it with an SQL Server command... worked perfectly.
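(In script form, the same patch is a couple of lines; the DLL name here is invented, and the bytes are just the "something like" values from above:)

    # Sketch of the same patch as a script (hypothetical filename; the
    # ",}" -> ",~" bytes are the "something like" values mentioned above).
    with open("crypto_plugin.dll", "rb") as f:
        data = f.read()
    assert data.count(b",}") == 1  # refuse to patch unless the site is unique
    with open("crypto_plugin.dll", "wb") as f:
        f.write(data.replace(b",}", b",~"))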


Nice! Must have made that tech feel and look like a hero :)


What a great little story. Thanks for sharing!


Not my workaround:

http://spectrum.ieee.org/aerospace/space-flight/titan-callin...

http://descanso.jpl.nasa.gov/seminars/abstracts/viewgraphs/H...

This was an extremely serious bug in NASA/ESA's Cassini-Huygens probe, in the S-band link between Huygens (landing on Saturn's moon Titan) and Cassini (acting as radio relay).

It was a timing bug. There'd be a very high relative velocity between Cassini and Huygens, creating a significant (~2e-5) Doppler shift in the link. This shifted the frequency of the 2 GHz carrier (by 38 kHz). Likewise, it shifted the symbol rate of the 16 kbps bit stream (by 0.3 bps). The second effect was overlooked. On the demodulating end (Cassini), the bit-synchronizer expected the nominal bit rate, not the Doppler-shifted bit rate. Since its bandwidth was narrower than the 0.3 bps Doppler shift, it was unable to recognize frame syncs; this was proven in experiments post-launch. The parameter that set the bitrate was stored in non-modifiable firmware.
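The two figures quoted are consistent - both are the same ~2e-5 fractional shift applied to different rates, which you can sanity-check in two lines:

    # Sanity check of the figures above (plain arithmetic, not mission data):
    v_over_c = 38e3 / 2e9       # 38 kHz shift on the 2 GHz carrier -> ~1.9e-5
    print(v_over_c * 16e3)      # the same fraction of 16 kbps -> ~0.3 bps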

As it was when launched, Huygens would be unable to return any instrument data. For some context, this was the only probe that's ever visited Titan, at a cost of about $400 million.

The workaround

[spoiler]

The workaround was a major change in the orbit trajectory of Cassini (a $3 billion probe). Details aside, it set up an orbit geometry with this feature: at the time Huygens was descending in Titan's atmosphere, Cassini would be flying at a ~90° angle to their separation. The relative velocity was still 20,000 kph, but tangential velocity doesn't contribute to Doppler shift.


That's a truly epic workaround!


Do they always use a star tracker when making these kinds of trajectory changes?


I worked on a social news product and part of our look was to have an icon for every story - either an image pulled from the page, a user-uploaded image, or, in the case of Flash content (say, a video player), a screen capture.

We had it all up and running - loading the content, waiting for the player to initialize, taking the snapshot, generating sizes - on a Windows machine when, one day, the request came in to migrate that machine to a VM. After the migration, things were fine - until we disconnected RDP. Snapshots were coming back at the right size, but totally white.

The eventual "solution" was a laptop in the engineering area RDP'ed into this VM to keep the snapshots from going white. It got unplugged one holiday weekend, earning it a red hand-sharpied sign - "PRODUCTION LAPTOP: DO NOT UNPLUG". It was unplugged again one fateful weekend, this time prompting a healthcheck to be written that looked for all-white images in its output.
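(Such a check is only a few lines with Pillow; a sketch, with path and threshold invented:)

    # Sketch of an "all-white snapshot" healthcheck (hypothetical path/threshold):
    from PIL import Image

    def snapshot_is_blank(path, threshold=250):
        lo, hi = Image.open(path).convert("L").getextrema()  # darkest, brightest pixel
        return lo >= threshold  # even the darkest pixel is near-white

    if snapshot_is_blank("/snapshots/latest.png"):
        print("ALERT: snapshots have gone white -- check the RDP session")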

That rig ran that way, I believe, until someone had the insight to make a second VM, this one RDP'ed into the first.

Turtles, all the way down!


That's awesome and the solution is not as uncommon as you'd imagine.

At "a large telecom" I used to work at, we had a specific process that handled billing that relied on a DOS application which was written targeting a specific modem's hardware. They'd tried to migrate it to something else for quite some time but the guy who wrote it lived in a different state and was let go from the company when we closed that site down and moved all of its equipment to Detroit. It ran on an old Compaq (not HP Compaq, Compaq) desktop PC and in 2014 or our VP received a frantic call that the drive had failed and the computer wouldn't boot (from a younger tech who was used to working on server class hardware). The code for this application had been lost forever and nobody had any idea how it actually worked but my understanding was that with it not functional, we were losing enough money to make it a "drop everything priority".

They brought the machine over to my building and the VP of my department called me to assist[0]. Sure enough, the system wouldn't even see the drive. It was at this point that I noticed three numbers with the letters "C", "H", "S" next to each. This had happened before, apparently, and someone discovered the BIOS battery had died. Thankfully, they were kind enough to put the drive parameters on a label for me. I popped into the BIOS, put 'em in and it booted. The computer remained powered on in the cubicle I repaired it in (just outside said VP's office) for a year until the dev team got around to modernizing the code.

[0] I was not a support person at this time but was in the past and it wasn't unusual for them to call me in on strange problems. I was also known for having recovered a hard drive with important data on it using the break-room fridge (though I'm not sure this VP was aware of that).


You sound like a kindred spirit. I have put hard drives in freezers to release stiction; I have baked motherboards in the oven to re-flow questionable solder. I wonder if anything in our kitchen is sacred! Sometimes I wish I had "MacGyvering goofy tech junk" as a full time job!


No doubt! Yup, I've done the oven thing, too (several PS3 motherboards as well -- used to buy 'em broken on Craigslist when there was a chance they'd be running older firmware and resell them).

Trick with the freezer hard drive: if you ever order perishable items over the internet, they sometimes ship in boxes with large bags of "blue goo". Pop those in the fridge and the next time you need to keep a drive spinning long enough to get one last copy out of it, sandwich it between two of those. They don't get cold enough to pick up condensation and short the drive and the blue goo keeps cool for a long time if the bags are large enough.


My father-in-law started calling me MacGyver in the late '80s when I repaired his CB radio using a ball-point pen and modeling cement ... The name stuck.


Not so much a software bug, but back in my early days (late 1990s) supporting an office network in London, there was a computer where the mouse was making the cursor behave erratically during roughly the same period every afternoon. We swapped out the mouse, the controller card, even the computer - effectively replacing all the physical equipment - and nothing seemed to stop it. We went through all sorts of ideas - too near the microwave, heavy fax machine usage, someone's mobile phone - until we realised that it was an optical mouse, and the sun would shine through the window each afternoon at the same time and screw up the sensor. We stuck a bit of cardboard to the side of the desk and it never happened again.


Haha, awesome. I was once fooled by the sun, too. I noticed an unusually high power consumption of several kWh in my logs. It always appeared at the same time, almost to the minute.

It turned out there was a very small time slot where the sun could reach through a window into the hallway. That was enough to throw off the light sensor that I had attached to the power meter inside the closet. The threshold was set too tight.

Think about the possible sources that influence this 'bug':

- the month
- the time of day
- the weather / state of the clouds
- open/close state of the bathroom door
- reflectivity of the hallway (objects, doors open/closed)


Towards the end of summer, one of my Raspberry Pi security cameras starts detecting "motion" in the form of sunlight dancing on the wall when the fluffy clouds float by :)


Alert! Either break-in, or fluffy cloud!


Alert! What a lovely afternoon!


We had an office alarm system that would occasionally trigger incorrectly. It was movement and heat sensitive. The alarm would trigger on weekend afternoons. We were puzzled for weeks until it turned out to be a combination of the second hand on a clock and the sun shining through a window and warming it up.


Sun shining through a window in London? Everything else in the story is believable, but... ;-)


Too funny. Though I'd be willing to bet "the same time every afternoon" was more that, when it failed, it happened at the same time in the afternoon, which probably made it even more painful to isolate, since it relied on the sun appearing in an area not known for sunlight.


Samsung laptops would fail to boot if the UEFI variable store was 100% full. The original solution to this in Linux was to leave at least 5K of free space. However, on several systems, removing UEFI variables didn't actually free up space - it was marked as free internally, but the reported amount of free space didn't increase, and so Linux would refuse to allow you to create new variables. The "solution" was to attempt to create a variable larger than the available free space, which forced the firmware to trigger a garbage collection run and re-synchronise the internal and external views of the amount of available free space. Doing something that we knew would fail was a requirement for avoiding killing laptops.
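For the curious, the shape of that trick on Linux is roughly the following; a hypothetical sketch via efivarfs (variable name, GUID and size are made up, and running this on the affected hardware would be a terrible idea):

    # Hypothetical sketch via Linux's efivarfs; name, GUID and size invented.
    import os, struct

    EFIVARS = "/sys/firmware/efi/efivars"
    ATTRS = 0x7  # NON_VOLATILE | BOOTSERVICE_ACCESS | RUNTIME_ACCESS

    def force_garbage_collection(reported_free_bytes):
        # Try to create a variable larger than the reported free space.
        # The write is expected to fail; per the story above, the failed
        # attempt is what triggers the firmware's garbage-collection pass.
        path = os.path.join(EFIVARS, "Dummy-01234567-89ab-cdef-0123-456789abcdef")
        payload = struct.pack("<I", ATTRS) + b"\0" * (reported_free_bytes + 1)
        try:
            with open(path, "wb") as f:
                f.write(payload)
        except OSError:
            pass  # expected failure; the GC side effect is the point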


Interestingly, there has recently been a similar one: https://github.com/Microsoft/BashOnWindows/issues/976


Many years back, I was working on a web application that, among other things, could generate PDF user reports. These reports were generated from HTML web pages using a third-party library. Normally this worked well (as well as such a tool could be expected to work, anyway); however, once a month or so the fonts on the reports would come out super tiny. This would then happen in random reports until we rebooted all of the app servers. The bug occurred in production only, never in our dev, staging or QA environments.

Many hours of investigation were committed, many emails to the vendor were written, much hair was torn out. No luck whatsoever. Months passed, and the bug recurred at random intervals and did not consistently affect all reports. One day I logged in remotely to one of the Windows app boxes as an admin/console user and was annoyed to once again discover that it forced my screen resolution to change. That's when I had an epiphany, and 10 minutes later I was able to reproduce the bug in my local environment.

Turns out the third-party library had some funky rasterization logic that took into account both the resolution of the machine when the library/service was started as well as the current resolution, pretty much expecting both to be the same. Logging in remotely as a console user has the behavior of taking on the resolution of my local machine, which was always higher than what the remote box ran at. Another thing to note is that the console user logged into the same running instance of Windows that was generating the PDFs. BAM! The cached value used by the library no longer matched the runtime resolution and the reports now generated screwy tiny fonts. This happened rarely because logging in as admin/console was not the recommended approach, and it was inconsistent because we had multiple app boxes and the other ones continued to work OK.

Solution - disallow admin/console remote logins. This was one of the most obscure bugs I have had the pleasure of solving.


The Motorola iDEN [1] series of phones were pretty sweet back in their day and had a JVM you could actually write and deploy apps on.

I worked on Loopt, an early mobile location sharing app, and we talked to our server over HTTPS. Things were working great on a few LG and Sanyo phones, and worked fine in the iDEN emulator, but POSTs would fail consistently on the device itself. GETs worked fine.

After watching traffic on the server for a bit, I noticed the POST requests all advertised HTTP/1.1 and sent the Expect: 100-Continue header. On a whim I configured the server to treat all incoming connections as HTTP/1.0 so it would never send the 100 (Continue) response [2].

It worked!

Or did it? Turns out the iDEN phones were now happy, but the other phones were not and would refuse to send POST bodies if they didn't receive the 100 (Continue).

This well and truly sucked, and we thought for a bit we'd need to have two different endpoints with different configurations to support the differently incompatible phones. Lame.

But then I remembered the format of an HTTP request:

    POST /path HTTP/1.1\r\n
    Expect: 100-Continue\r\n
    [Header: Value]\r\n
    \r\n
    [Body]
What if I supplied a malformed URL? Something like "/path HTTP/1.0\r\nX-iDEN-Ignore:"? Then, if there's no validation or encoding, the request will look like this:

    POST /path HTTP/1.0\r\n
    X-iDEN-Ignore: HTTP/1.1\r\n
    Expect: 100-Continue\r\n
    [Header: Value]\r\n
    \r\n
    [Body]
Turns out that worked. The JVM was never updated or fixed, the hack shipped, and it worked consistently for the lifetime of those phones.

[1] https://en.wikipedia.org/wiki/IDEN

[2] "An origin server ... MUST NOT send a 100 (Continue) response if such a request comes from an HTTP/1.0 (or earlier) client" https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8....


I remember iDEN phones! I had a Motorola i355 for a while. Software sucked but the thing was an absolute tank. Plus, it was one of the rare "dumbphones" of the time to have integrated GPS, so I remember using it as a GPS tracker for a while afterwards.


My favorite:

>Wing Commander was originally titled Squadron and later renamed Wingleader. As development for Wing Commander came to a close, the EMM386 memory manager the game used would give an exception when the user exited the game. It would print out a message similar to "EMM386 Memory manager error..." with additional information. The team could not isolate and fix the error and they needed to ship it as soon as possible.

>As a work-around, one of the game's programmers, Ken Demarest III, hex-edited the memory manager so it displayed a different message. Instead of the error message, it printed "Thank you for playing Wing Commander." However, due to a different bug the game went through another revision and the bug was fixed, meaning this hack did not ship with the final release.

https://en.wikipedia.org/wiki/Wing_Commander_(video_game)#De...


I worked on an HSM system (hybrid disk/tape archival) which suddenly started having lots of I/O errors writing to tape. We tried new media. We tried new drives. We double-checked cables and SFPs. No luck.

Finally we tracked down the issue: when the contents of a particular file were archived to tape, the tape drive crashed. I suspect it was a tape firmware issue, maybe to do with the native compression.

The workaround was to mark that particular file as "not to be archived" and we stopped having media and drive errors.


Ah, yes, the "best" OCR software back in the '90s - as in fastest by far, with the highest-quality results - was owned by a company that was rumored to have pissed off their core technical team, who left and only occasionally deigned to do consulting for them.

So they had a wonderful core wrapped in baroque APIs, but the real problem was that the core wasn't entirely wonderful: occasionally, when you presented it with a "Death TIFF", as we called the images for their file type, it would reliably crash. This was true of both the software and firmware versions of the code (they had a hardware-accelerated box with one or more Intel RISC chips); on the PC platform at least, e.g. Windows 3.x using a DOS box, this would entirely lock up the machine.

To get around this for a client that had 500,000 images to OCR on a tight deadline for a legal case (and this was the golden era of legal document imaging; back then lawyers would pay 50 cents per OCRed page, because a full text search could e.g. impeach a witness on the stand in real time), I created a system where the PC would always be printing out asterisks while it was OCRing pages. That allowed an operator to tour the machines and easily see when he had to manually reboot one stuck on a Death TIFF, after which my software would recognize what had happened and continue with the next image.


I bet I worked with the same company when I was with the government. We had a subcontractor who'd been hired to digitize something like 200 million paper records (they made it about 50 million in before we ran out of funding). But a small fraction of the TIFF files they generated wouldn't work with any of the tools we had on hand.

It turned out that Windows 98 shipped with an Imaging program (Licensed by MS, not written by them) which predated the standardization of the JPEG-in-TIFF subformat, but they'd basically guessed at how it would work and shipped that. The final spec (and the version of JPEG-in-TIFF nearly everyone else implemented) ended up being different. So basically nothing could read it.

We ended up calling them up every time a customer found one of these files and having them print out that image on one of their Windows 98 machines, and scan the printout back in using one of the newer machines. Sure, we lost some quality, but at least the customers could access the data now.

For a time reference, these broken images were still showing up in newly scanned documents in 2011 (when we stopped working with them due to massive fraud), so they must have been using their Win98 scanner systems even then.


No, as best we could determine (and we had a guy who liked to get into the weeds of CCITT Group 3 and 4 compression), it was the raw images themselves, and there was nothing wrong with them; some just tickled a bug. If I remember correctly, their API required stripping off the header and presenting the OCR code with some metadata and the compressed image. It's been way too long for me to remember the details, except that it was fairly obnoxious to interface to; I couldn't just hand it a TIFF in some way (helped us VARs really "add value" and earn our keep :-).

We were producing our own TIFF files using our own software that drove monster Kodak ImageLink scanners (software I in fact took over, redid the SCSI driver of, and eventually did a clean rewrite of the engine on Sun workstations), so the images and their compression came straight from Kodak, and going further, I don't recall those 600 pound beasts ever screwing up at that level.

And this was way before Windows 98 - it was Windows 3.0, or by then 3.1, like in 1992; Windows was utterly naive about document image files. Which I can see was a blessing (although maybe it was losing quality; I'd long since switched to NT by the time 98 came out).


We also had weird CCITT Group 4 issues, because of someone trying to be extra smart and convert TIFF to PDF without a recompress (PDF supports Group 4 compression too, so you can turn a Group4 TIFF into a Group4 PDF by just swapping the header!)

I didn't mean it was definitely the same company, just a similarly annoying TIFF issue.


That's a quality hack right there


Yep, this is by some margin the hackiest thing I've ever done in my career. If I'd been doing it on Sun hardware, though, I would have been able to include power cycling hardware for the accelerator box.


In a similar vein, I was messing around with Windows filenames. Using the extended path syntax, it's possible to use reserved words in a Windows file name (com1.txt in this case)[0]. This breaks most tools that use the Win32 API (Explorer, Notepad, most COM components, IIS, etc). I showed everyone around the office, and laughs were had.
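(The extended path syntax being the `\\?\` prefix, which bypasses Win32 path normalization; a minimal sketch, with a hypothetical path:)

    # Minimal sketch (hypothetical path): the \\?\ prefix bypasses Win32
    # path normalization, so the reserved device name is accepted.
    with open(r"\\?\C:\temp\com1.txt", "w") as f:
        f.write("good luck opening this in Notepad\n")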

Fast forward a bit, and a new backup system is being put in place. But it keeps breaking, and only on this box. While researching the issue, Explorer keeps breaking when doing searches, and third-party search tools keep breaking too.

Took me a little bit to remember what I'd done and fix it.

[0] - https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


Gosh, this takes me back. There was some special filename you could give a directory on Win98SE which would result in it being truly hidden, but the contents still accessible via some arcane workaround. I made such a directory on my first computer, then forgot about it, and--as in your story--eventually remembered it when trying to back up the filesystem.

Does anybody know the nature of the hidden directory hack I've referred to?


A friend showed me a trick where he typed a couple of assembly instructions in debug.com to change the DOS AUX device to something like BUX. That allowed creating or accessing an AUX directory. The converse assembly instructions would restore the AUX device and completely hide the AUX directory.


We used to make directories with just the character 255 (a hard space) as the filename. It would totally mess up File Manager in Win 3.1 (the tree would collapse when you touched the entry) and behaved very oddly in Win95/98 if I remember correctly. Maybe that is it?



My high school (back in 2000) had a visit from a German gentleman who uploaded porn to their public FTP server(!?).

When we (another student and I) tipped the teacher-cum-admin off, the folders masqueraded as a printer in the NT file explorer. They couldn't delete them.

We recommended that they wipe the machine and disable public FTP. (It wasn't that big of an issue, as it was mostly a print server.)


Grepping my checked-out source trees quickly:

1. spiped re-binds SIGINT if it is launched as pid 1, in order to work around a Docker bug: https://github.com/Tarsnap/spiped/blob/master/spiped/main.c#...

2. In my POSIX-violation-workarounds script, ironically enough, I work around a bug in bash which makes 'command -p sh' run with the incorrect path (this has since been fixed, but continues to be present in older installed versions of bash): https://github.com/Tarsnap/spiped/commit/e3968941c9c1b20c63d...

3. In my getopt code, I use a (non-C99-compliant) computed goto in order to work around a bug in LLVM's handling of sigsetjmp/siglongjmp: https://github.com/Tarsnap/libcperciva/commit/92e666e59503de...

4. Many years ago, I added a spurious 'volatile' into some Tarsnap code in order to prevent a buggy LLVM optimization step from running (it was making the Tarsnap build hang on on OS X 10.7): https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape...


This was several years ago and I don't remember many of the specifics, but we had an issue with static content served from our site being randomly truncated (polluting the cache, etc).

We eventually traced the issue down to the Nginx server that was serving the files and one of its cache buffer size config options (I don't remember which one anymore). We noticed that if the file being served was larger than a certain size it would occasionally truncate the file, but not always. We tested increasing the buffer size by repeatedly doubling the default value, which was a power of two, up to a size of several GBs. But the files kept being truncated for some small percentage of the requests. At this point we knew it wasn't directly related to the size of the buffer, since the buffer was larger than any file being served. Finally someone suggested that we test a value that wasn't a power of two, and the issue was gone.

We figured it was an internal bug in Nginx where it was growing an allocation buffer and used powers of two, but had an off by one error that didn't copy the second half of the buffer or something. We dug through the code but never found anything and so we left the cache setting at +1 from the default power of two value and never had an issue again.


Wireshark let me find out that Unity's WWW class ignored request HTTP headers on iOS, causing our usage of S3 to fail. I worked around the problem by switching to URI based authentication.

On-screen keyboards displayed Chinese after visiting a system menu. We freed the async operation when the system menu "canceled" the keyboard operation (it wasn't supposed to be even displaying), but apparently the system had a use-after-free bug. I worked around the problem by switching to a 4 entry LRU allocator, keeping the past 3 or 4 canceled operations around untouched (1 would've probably sufficed, but I'm paranoid.)

A WinRT API to check internet connectivity would exit(3) our app without error messages or related callstacks - but only if the Charm bar was open for more than 10 seconds, assuming you called it once per frame on the main thread. I had to bisect our history to figure that one out - and repro in a new test app to confirm it was the real cause.

EDIT: Third-party injected DLLs crashed our app at least twice - once for some monitoring software on a coworker's computer (crashed when closing file handles as build tools tried to clean up and exit), once for an old Microsoft Word IME that predated the Win8 app sandbox whose restrictions it was violating. The monitoring software was uninstalled; the IME I couldn't think of a reasonable workaround for and left to Microsoft to fix.


I used to have a Commodore 64. I had one specific game that would not load successfully unless my monitor ( a TV actually ) was turned off. So I had to type "LOAD *,8,1" or whatever, then turn off the monitor, then press RETURN. I'd turn the monitor back on after the disk drive lights went off.


In 1999 I had an old (at the time) Pentium-133 that wouldn't let me reinstall Windows while the network card was plugged in. If I tried, the mouse, the graphics card, the network card, and the secondary hard drive wouldn't work.

If I unplugged the network card when I installed, there were no issues.


There was a security camera with a built-in HTTP server at a previous job. The built-in server would respond without a problem when viewed from one computer, but would force close the connection without a response when viewed from another computer.

I used Fiddler to compare the requests from the two computers and eventually discovered that the request would fail if the `Accept` header was longer than some value (might have been 255 characters -- I don't remember).

Turns out when you install Microsoft Visio and Project, Internet Explorer's Accept header gets really long.


In the 1980's I had a client that manufactured cheques, and the typesetting was done by four or five "Wescode 1420M" systems. These technological marvels used an 8" floppy drive to input order data -- the customer name, account number and so on. The output was rendered onto a single web of fan-fold material which successively threaded its way through two Diablo daisy-wheel printers. The key point is it was a pipeline, with multiple orders in flight simultaneously.

Floppy swapping was a normal part of the work flow, and there was an obscure vulnerability in this regard. In some circumstances if the disk was changed at an incorrect time it was possible for data to leak between orders. (For example, Ted's cheques might bear Alice's account number! To call this intolerable is putting it mildly.) Disk-swap prompts were displayed on a terminal for the operator's benefit, but the environment was hectic and humans are fallible.

Did I alter the software so it'd preview the data and verify that every disk change occurred as prompted? No. The 1420M computer featured three 8080 microprocessors mucking around in shared memory, and the code was a spaghetti monolith written in assembly language. I've reverse-engineered lots of stuff before -- there are a coupla stories here [1][2] -- but some challenges you need to walk away from. The time frame would've been open-ended, and that wasn't acceptable.

What I did was supply the client with a gory hack. No apologies -- it was the best way to serve their needs! On each 1420M I installed an 8741 microcontroller that monitored program status by eavesdropping on the RS232 line that carried text strings to the terminal. If those messages failed to agree with observed disk-change activity (relayed by the Door_Open signal on the floppy drive), the microcontroller would yank the 1420M Reset line low. This would crash the pipeline and force the operator to reboot -- a considerable nuisance... and yet, enormously preferable to allowing the error to go undetected!

[1] http://laughtonelectronics.com/Service/Embedded%20Computer/e... [2] http://laughtonelectronics.com/Projects/uCtlr%20Interfacing/...


"Human in loop" swapping floppies. Byzantine.


Yup. And the triple 8080's contributed a lot to the character of the thing, too.


Found deep in the guts of some shared library at Amazon (many years ago; probably still exists):

    #define private public;
    #include "something";
    #define private private;
(Not to fix a bug, but certainly a hacktastic workaround)


Looks like it will produce a compilation error to me ;-)


Yeah the trailing `;` would likely cause problems.
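The classic form of the hack has no semicolons and ends with an #undef, which does compile (though it remains gloriously fragile):

    #define private public
    #include "something.h"
    #undef private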


Many years ago, one place I worked at had the following setup: a closed-source application would generate a CSV file, which was then FTPed to another server, where a Perl script translated it from CSV to fixed-column-width format (which happened to be identical to the output format of an old mainframe application that we'd migrated off), and then the fixed-column-width file was FTPed to yet another server which loaded it into a database.

Now, the CSV file had a number of fields - name, address, etc; but it also had an encrypted password field. We didn't use the encrypted password for anything; we didn't even know what format it was in (hashed or reversibly encrypted or so on). The CSV format was fixed by the vendor and we couldn't change it. However, rather than being output in hex or Base64 or similar, the closed-source app just put the binary data of the encrypted password into the CSV file, which would therefore randomly contain comma or newline characters. The author of the Perl script wasn't aware of this possibility, so the Perl script would die, complaining it had got an invalid input line (wrong number of fields), whenever that randomly happened (sometimes several days in a row, other times it could go weeks without happening).
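The fix would have been straightforward, since we ignored the password anyway: split the fixed fields off each end of the record and discard whatever binary junk sits between them. A sketch of the idea in Python (the real script was Perl, and the field counts here are hypothetical):

    # Sketch (hypothetical field counts): with 3 fields before the password
    # and 4 after, split from both ends and discard the blob in the middle.
    # Records would also need re-joining first, since the blob could contain
    # newlines as well as commas.
    def parse_record(record, lead=3, tail=4):
        head = record.split(",", lead)        # lead fields + remainder
        remainder = head.pop()
        rest = remainder.rsplit(",", tail)    # password blob + tail fields
        return head + rest[1:]                # drop the blob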

I proposed to modify the Perl script to fix this issue. However, management refused to let anyone modify the Perl script. The guy who wrote it was a contractor who had moved on years ago. This Rube Goldberg file conversion and transfer formed part of a critical business process. A couple of years earlier it had failed, and its failure resulted in bad press and reputational damage. So they were way too scared to let anyone modify the code of the Perl script.

Instead, what happened was that each day a person would manually check whether the script had run successfully the previous night. If it hadn't, they'd fix up the data issue in the input CSV file using a text editor and then manually start the Perl script again. Management agreed that we could automate that checking process, so that if the Perl script failed they would get an alert on our service availability dashboard. But no way would they let anyone fix the bug in the Perl script.


I can deduce this place was a bank.


At a previous company, we had a legacy application written in PowerBuilder which crashed on some of the client's computers. We couldn't reproduce the crash on our own computers, no matter how much we tried.

We finally got access to one of the crashing laptops, and (with the client's permission) installed a debugger on it. After a few false starts, we found that some code deep within PowerBuilder's framework crashed when it received a particular accessibility window message, and that this window message was being sent by some Microsoft touch screen component. All of us techies had avoided buying touch screen laptops (this was when touch screen laptops were Microsoft's latest fad), which is why it had never happened on any of our machines.

The solution was to do a binary edit of the import table of the relevant PowerBuilder DLL to route all Windows calls to a helper DLL, which forwarded them to the real Windows DLL after replacing the window message callback with a small thunk. Said thunk then filtered out the offending window messages, before forwarding the rest back to the real window message callback within the PowerBuilder DLL. Hacky, but worked perfectly.


A recent encounter was a fix for playing the original BioShock under Windows 7 - the sound would not function beyond the first intro. The trivial fix is to plug something into the microphone port. [1]

Another good one - back around the era of the original NVIDIA Ion boards, I was helping to run a cluster of these boards as an experiment in low-power computing. [2]

Some ran Linux, some ran Windows. Running CUDA code under Linux headless is fine; running it under Windows with a non-Tesla GPU was nontrivial at best (and involved hacking up the Tesla variant of the driver to add some PCI IDs). Unfortunately, it turns out that this breaks if you don't have an actual display attached to the machine.

The solution that was implemented was to take 36 naked male VGA headers and solder resistors across just enough pins to convince the system that there was a display there, and then install them.

Or the Samsung SMART IDENTIFY hard drive bug - which meant that the advice "disable SMART to keep your data safe" was sometimes valid. (The drives had a FW bug that caused them to drop data in the write cache if they got a SMART command before flushing it.) [3]

I'm sure I'll think of more later.

[1] - http://forums.steampowered.com/forums/showthread.php?t=10931...

[2] - http://www.nvidia.com/content/gtc/documents/sc09_szalay.pdf

[3] - https://www.smartmontools.org/wiki/SamsungF4EGBadBlocks


The resistor-VGA trick is one I use somewhat regularly. When traveling to give a talk, I sometimes like to practice my talk with PowerPoint's "presentation mode" in front of me. You can't easily get that mode without a second monitor plugged in. I keep a single bare resistor that can be stuck into two holes in the VGA connector. I keep the resistor inside a folded slip of paper so it doesn't get lost, and the paper has a reminder on it too showing which two pins to connect. Works perfectly.


The 'plug-in microphone' fix was common to other games like Call of Duty as well (http://forums.steampowered.com/forums/showthread.php?t=21964...). From one of the Steam forum entries - "The reason plugging in a microphone works is because 'Stereo Mix' is automatically turned on when you plug in a mic."


This would probably not work on some shitty audio cards with shitty drivers that have explicitly removed "stereo mix" for "copyright purposes".


Would it? I'm not entirely sure what the origin of the technology involved is, but as they said, it implicitly enables it even on drivers that don't have it as an explicit option (like the audio driver stack on my Win7 box, at the moment).


The garret: glen; CSS bug, circa 2006 or 2007.

Starting out with a bunch of existing CSS, a developer added 2 new properties somewhere in the middle, but forgot a trailing semi between them. Reloading the page showed the first change, but not the second. He tried ten different variants of the second property name, spelling, values, etc. and nothing was showing up. He added another property before the broken one to help debug, and it started working. He then tried several variants on that to see if it was some arcane ordering bug, and eventually ruled that out by using two developers' names for property:value.

Because all of the intermediary versions included a semi, and because the first property allowed some kind of extended content that was ignored, it took half a dozen developers looking at the "weird bug" before someone noticed the missing semi on the first property.
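In other words, something like this (selector and properties invented for illustration):

    .story-card {
        border: 1px solid #ccc      /* <-- missing semicolon */
        background: #eee;           /* swallowed into the previous declaration
                                       by error recovery, so it never applies */
    }

Depending on the parser's error recovery (especially in that era's browsers), the first declaration could still take effect while everything up to the next semicolon silently vanished.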


A registration company I used to work at was using .NET 1.1. Being the super-ambitious junior developer I was, my first move was to upgrade our software to the latest and greatest: .NET 2.0. After it passed all the tests and was signed off by QA, we moved it to production and patted ourselves on the back, having done A Good Thing (tm).

Soon afterwards, however, we started receiving reports of our users not being able to refund or charge credit cards. All that information should have been in the DB, encrypted! We quickly discovered that, on occasion, the encrypted data was getting corrupted. Immediately we did what every engineer would do in our place - blame the previous engineer's code, then try and find the bug that would prove our theory right. After days of studying source code and testing theories, nothing explained the occasional corruption.

Eventually we traced the beginning of our problems back to our server/framework upgrade, and found a Backwards Incompatible Change: invalid unicode code points would now be silently dropped, rather than being allowed. It turns out that all of our credit card numbers were being encrypted properly, but then DECODED using the UTF-8 Encoding and stored in an NVARCHAR column in the DB! Everything was fine in .NET 1.1 (and SQL Server 2000) but .NET 2.0 silently drops the invalid UTF-8 code points. With those code points missing, it was impossible to decrypt the data and do anything with it.
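The failure mode is easy to demonstrate (a sketch in Python standing in for the .NET behaviour; the bytes are made up):

    # Raw ciphertext bytes round-tripped through a lossy text decode:
    ciphertext = bytes([0x41, 0x99, 0xFE, 0x42])
    lossy = ciphertext.decode("utf-8", errors="ignore")  # invalid bytes dropped
    print(lossy.encode("utf-8"))  # b'AB' -- 0x99 and 0xFE are gone for good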

... I suppose that makes it more secure though, so there's that ...

We felt that .NET 2.0 was a big enough upgrade that it was worth adding some new warts to our system. The final hack: we found an unused PC and built a .NET 1.1 web service with two functions: encrypt/decrypt. We'd store credit card numbers in the database in plain text, make a call to this web service with the row id, and it would encrypt the data. This solution lasted almost 5 years before our boss accepted the pain of an hour of downtime and we exported/decrypted/encrypted/imported the entire DB.


I've got Intel graphics and a 4K monitor on Linux. With the Intel drivers, I have no vsync (I can't use TearFree because of strange video corruption issues), but things mostly run correctly. With modesetting drivers, I have triangular tearing and serious performance issues in Sublime Text, but _do_ have vsync in fullscreen.

My workaround for watching movies with vsync? Use Intel drivers in my main X session, modesetting in a secondary X session just for mpv.


Ah, the joys of Linux. Truly the world's greatest operating system.


My "favourite" bug workaround is for the KDE Plasma 5 desktop wallpaper changer which degrades the pictures being used (by blurring, almost ruining) whenever downscaling them (when they are larger than the desktop's native resolution), something lots and lots of KDE users are complaining about. There is no fix released yet but, being a creative user, I resorted to installing "variety", a very cool desktop wallpaper changer (and downloader).

As Variety can apply ImageMagick filters on the fly to the wallpaper being set, I set it up so that it just scaled down and cropped the image to my exact desktop resolution. This fixed the issue for me... at least, temporarily :)

To set up the filter, I edited the ~/.config/variety/variety.conf, and changed the line:

  filter1 = ...
to

  filter1 = True|Keep original|-scale '<my desktop resolution, eg. 1920x1080>^' -gravity center -extent <my desktop resolution, eg. 1920x1080>
Then I configured Variety to generate a single wallpaper file in a folder which is "watched" by the KDE Plasma desktop wallpaper changer, with the same interval. Voilà!


Not a "real" problem on a running system, but back in my first year of undergrad I had a computer science assignment that kept faulting with an "Illegal instruction" error on our Solaris systems.

I had a C compiler on my personal computer and the same program ran and compiled fine there, but we had to submit our solution in source code form on the Department's shared system to plug into the class' automated build and test scripts.

Eventually, I discovered that adding an extra space to a comment fixed the error. I wasn't experienced enough at the time to know how to use GDB to disassemble and debug binaries, but, looking back, I think I must have triggered a compiler bug that misaligned an instruction (SPARCs were 4-byte aligned, IIRC) and adding the extra space somehow fixed the alignment of the generated code.


Sadly, I don't think that's true. A first-year CS undergraduate would not be writing code that triggered a compiler bug; the real problem was most likely your code.

I suspect you had an error in your program, an off-by-one or other type of overflow, that caused the stack to be executed. Compiling without debug would mean that the code executed was harmless, compiling with debugging symbols (the -g option in gcc) enabled caused a different memory layout, which triggered an attempt to execute data that contained an illegal instruction. Since in debug mode comments are included in the data segment, adding a space to a comment further changed the memory layout making the error innocuous again.

// EDIT After thinking about this a bit more, I'm not entirely convinced by my explanation since comments aren't included in the debug symbols. However, I still think it's more likely that a debug (versus optimized) build had different memory layout, and therefore different behaviour in the presence of a stack/heap smashing bug...?


Brings to mind this absolutely classic old story:

http://thedailywtf.com/articles/ITAPPMONROBOT

And pics of a build it inspired:

http://thedailywtf.com/articles/The-Son-of-ITAPPMONROBOT


A friend of mine had a similar thing, where a desktop-box-turned-server essentially locked up after just over 24h of uptime. Solution: an outlet/timer thing which cycled power around 2am when nobody was looking.

Similarly - there were some minor issues with the cooling for my compute cluster at my previous job; it wasn't really designed to function in climates with temperatures that varied too much. Notably, it'd turn off the compressors on hot summer days and cold winter days. While waiting for the techs, tiny rocks found on the roof were used in conjunction with some tape to force the mechanical relays on.

http://www.pvv.ntnu.no/~kjetijor/images/tape_rocks.jpg


A few years back, I was part of a group in the early days of commissioning a piece of research equipment that consisted of many racks of FPGA and GPU computing equipment in a specially modified shipping container. This thing was installed in a desert area, and had to be cooled by a couple of AC units.

The issue was similar. On nights where the temperature dropped too close to the dew point for too long, the units would freeze over. However, at the time, there wasn't any temperature monitoring. So someone figured out how to monitor the die temp on the FPGAs without changing the running code. Took them a few days. By the time they finished, someone realized they could tie streamers to the AC vent, which could be seen in the remote video stream.

Anyways, the fix was to connect to the network, switch the AC unit to fan only for a couple of hours, then switch them back on. If I remember correctly, it was like this for about 6-8 months before they finally had someone replace the AC system with a more commercial unit that could handle the condensation.


Not mine, but a classic. Emails that can only be sent 500 miles: http://www.ibiblio.org/harris/500milemail.html


Christ, I should be keeping a list over the course of my career, I'm sure I've forgotten some gems.

Some that stand out: We had a NOSQL-esque backend that stored CSVs, as part of a data pipeline. (CSV in, data "Activity", csv out). You specified the file, if it had headers, separator, etc. As it turns out, you could not define a null separator, if you wanted to have a single column file. I needed something that would properly split what I knew to be well formed all alpha-numeric inputs within the valid ascii range, and would avoid spurious splits. The sep I used was naturally (the snowman unicode character, unicodesnowmanforyou.com, which as it turns out HN sanitizes on posting!) (The punchline comes when I started seeing this pattern show up in production code elsewhere in the company, using this exact same character choice.) Snowman separated files++ (.ssv?)

Another fun bug was working on a very large platform that had a common telemetry library that used perf counters. The original authors, and all of the platform authors consuming the lib, had gone on their merry way without realizing that perf counter instances have a disallowed character set, which the custom lib was _embedding by default_ when it added metadata to the instance name (#foo or something IIRC). Fixing the metadata appending was easy enough, but to fix every place where the consumers had named something with an invalid char (and then consumed with the same invalid char on the read side), I ended up writing a shim that sat between the perf counter lib and the world and silently replaced the invalid chars with something strange like _<charID> (basically reinventing the wheel of slash-escaping, but within the perf counter allowed charset).

And to end on an abysmal note: a large project had a VERY consistent naming scheme, had gotten quite deep file-wise, and was hitting max path length limitations on Windows. Rather than break the consistent naming on a new, slightly longer file that needed to be added, or rename everything else, we changed root paths from Workspace->w, Main->m, Release->r, etc. I am not proud of this one...

Even as I type this I know there are tons of hacks I'm forgetting (using plastic knives as hard drive stabilizers in a significantly sized datacenter deployment) and will gladly expound if there's interest but for now I'll let this nostalgia get reburied :)


I always wondered why the Unit/Record/Group separator characters were virtually never used. In the case of human editable files, I get it (a comma is actually on the keyboard, after all). But I'm curious, in your case, why you went with the snow man over the built-in options[0]? (and I have to admit that I got a laugh out of the "pattern show[ed] up in production code elsewhere in the company" -- I've seen that so many times)

[0] http://stackoverflow.com/questions/8695118/whats-the-file-gr...


An exceedingly stupid act of paranoia; I knew the input _could not_ go above the normal ASCII character set without errors elsewhere in the pipeline, so it seemed more robust to choose one that could, by other invariants, never be hit. That being said, your group separators, had I thought harder about it, might still have been a more valid answer (but then I wouldn't be able to talk about it as quite so much of a dirty hack!). I imagine they aren't used much because frankly I hadn't even thought about their distinct function more than two to three times in my entire post-programming life.


Windows only allows a limited number of Explorer icon overlays installed. If you install a lot of programs that install Windows icon overlays, some stop working.

There are ways, though, to make sure that your icons have priority over "Joe's poorly designed explorer plugin." :)


Reminds me of the maximum PATH length issue still present in most versions of Windows (I think Windows 10 Anniversary resolves it).

It was particularly painful because when you'd hit it (by, say, installing Sybase drivers or some other awful application that insisted on putting nearly every subdirectory it had in PATH), nothing would tell you that it was specifically the PATH being truncated that was at fault, you'd just get a large number of applications that would stop working and return obscure error messages.


Windows 10 Anniversary has code to resolve it, but it's opt-in.

http://winaero.com/blog/how-to-enable-ntfs-long-paths-in-win...


Microsoft Excel 2003 (or at least the copy I was stuck with) had a weird bug: if the final column of a CSV spreadsheet with headers was empty (column header there but no data), then the output CSV file would only contain the correct number of commas for affected rows up until the 16th line, after which it just started omitting the commas that indicate an empty field at the end of the table.

This would cause all sorts of errors with the program I had to upload the files to.

My only workaround was a series of regex-based find-and-replaces in Notepad++. I could perhaps have scripted something automatic, but I was on a very locked-down machine at the time.
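(Where scripting is available, the fix is just padding each row back out to the right width; a sketch, with a hypothetical column count, and assuming no quoted commas, same as the regex approach:)

    # Sketch: pad short rows back to the expected comma count
    # (hypothetical column count; assumes no quoted commas).
    def pad_rows(lines, n_columns=10):
        want = n_columns - 1
        return [line + "," * (want - line.count(",")) for line in lines]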

It was one of many weird MS Office bugs I had on an A3 sheet pinned to my cubicle wall.


My favorite is not a workaround for a bug, but for a limitation in the GUI library used.

I worked on an enterprise job scheduler that was initially outsourced to an Indian company, but the project started failing and so we took back development. The software was required to be able to schedule tasks with a delay of up to a hundred or so hours, but the GUI library only had a control for time of day up to 24 hours. The code we received had an interesting solution - they changed the format string to place the milliseconds part first, and then added some code in the data access layer that swapped hours and milliseconds back and forth on reads and writes. And there you have it: delays up to 999 hours.
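So the data access layer presumably did something on the order of this sketch (the original wasn't Python; names invented):

    # Sketch of the swap: the GUI control's milliseconds box secretly
    # holds hours (0-999), so reads and writes swap the two fields.
    def swap_hours_and_ms(t):
        t["hours"], t["ms"] = t["ms"], t["hours"]
        return t  # applied symmetrically on both read and write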


In Windows 10, resizing the command window would break npm: https://github.com/npm/npm/issues/12887

The workaround: not resizing the command window...

And response from someone in Microsoft: https://github.com/npm/npm/issues/12887#issuecomment-2225253...


Tried reproducing it with TCC/LE or ConEmu?


At a previous job I was asked to debug a large (inefficient) cronjob that was suddenly taking 24+ hours instead of the usual ~8 hours. (We had just migrated infrastructures but noticed this days later)

Being relatively new to that particular codebase, I looked at it and saw nothing that stood out to me... after an unfruitful day, and not wanting to get too deep into the code without necessity, I fired up a profiler. Logging (syslog) statements were taking HUGE swathes of time. Neither I nor the person supervising me could believe that was it, so we put it on the back burner.

The next day I took another look at the log statements, fired up a Python shell, and found the logging statements on that server were returning instantly 4/5 times. Every 5th (or so) time, it would block for 5 seconds or more. Given that the cronjob writes thousands of log statements in the course of a run, this became a cause for concern.

I didn't manage to look into it deeply enough (I guess DNS caching plus crappy DNS), but the quick workaround was to toss the syslog server's address into the hosts file; the cronjob ran 'smoothly' after that.
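(i.e. one line along these lines, with address and hostname invented:)

    # /etc/hosts
    10.20.30.40    syslog01.internal.example.com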


I remember working as a help desk tech; our company used ACT, the CRM software. At the time it was very poorly designed (might still be) and used an MSSQL database to store all of its information. We wanted to port all of the information in the DB to a web app that would allow us to do different stuff with the data that ACT wouldn't let us do (number crunching, sending email reports, etc). Part of the problem was that an ACT install automated the MSSQL part of the setup and set the root user (I forget what they call it in MSSQL now) with a password, so you couldn't see any of the internal tables. I remember spending that night after everyone went home learning how to shut down the database and force a reset on the root user so that we could add a user with read access on all the tables.

Everyone had been talking about getting at that data for a year or so and one night I was just like fuck it, I'll give it my best shot. Honestly it wasn't that impressive, but I certainly do remember how cool it felt to tell "the man" to F off and this was our data :).


The best/closest I have is that where I once worked, we had a NetApp that allowed itself to be upgraded to a version it didn't support (it wouldn't boot), which was not how it was supposed to work... Anyway, we should have been able to fall back, but the jump we tried to make screwed with the paths used by bootstrapping/startup, and while normally the previous version should be recoverable... well, it was not, because of where the upgrade process failed.

So we were trying to recover it and I had a "It's a Unix System, I know This!"-moment and was able to manually type in the path to the previous binary during an emergency/rescue prompt (based on deductions from forums, the current failed loading message, and some obvious things like architecture) and got it up and going again.

Documented that, internally, to the best of my ability.


This is a little different, but I always think about it when someone says bug workarounds. It's literally a bug workaround from an unknown coder back in the days of BASIC...

    390 ...some basic code here...
    395 GOTO 405
    400 REN HOUSEKEEPING
    405 ... more basic code...


The most recent one is a bug in Lubuntu based on 16.04 where the mouse cursor disappears after the system goes to sleep (but is still functional).

Workaround is ctrl-alt-f7 to switch to console then ctrl-alt-f1 to switch back to GUI, and the mouse cursor reappears.

https://bugs.launchpad.net/ubuntu/+bug/1573454

Another one is a sweet widget in OS X called iStatPro, which was no longer working as of Mountain Lion. But there is this workaround, which for me still works on El Capitan: http://hints.binaryage.com/istat-pro-for-mountain-lion/


We needed to print a log file on a VMS station, but the end of the file was never printed (11 pages instead of 17). The file contained many '%' characters. I suggested replacing them with '#'. That solved the issue.


Not entirely the same kind of workaround, but an ingenious way to get game patching on PS2 through self-exploitation. From Insomniac:

http://www.gamasutra.com/view/feature/194772/dirty_game_deve...

Also this on their site (but requires flash): http://www.insomniacgames.com/self-exploitation/


Unfortunately the swf doesn't seem to be there: Failed to load resource: the server responded with a status of 404 (Not Found)

http://web.archive.org/web/20160310003012/http://www.insomni... has it, though.


That's just a PowerPoint presentation in Flash form.


Just had one. Not as strange as most of these, but annoying. We have a custom P4V tool which often needs to be run simultaneously for two different changelists via the changelist context menu. However, we noticed that after the first instance of the script finishes on the first changelist, the second one running in parallel exits along with the first, never finishing the work for the second changelist. I noticed that if you terminate the second started instance, the first is unaffected; it only happened the other way around.

At first I thought it was something wrong with handling multiple processes in our tool, or some weird multiprocess tkinter or cx_Freeze issue. Then I realized that starting these two instances of the script from two _separate_ p4v windows resolves the issue and they can run at the same time, not hindering each other. But we can't ask users to have multiple P4V windows open just to run this on multiple changelists.

The workaround, for now, is having the custom tool run a batch file instead which then runs the frozen python app exe, ensuring that each actual instance of the tool starts in its own parent process and not as a p4v subprocess.


In a map project, I had markers stored in PostgreSQL + PostGIS database.

As the amount of markers got too heavy for the browser, I tried only querying markers within a certain range of a coordinate I was visualizing.

For some reason, no matter what coordinate systems, data type casting and PostGIS functions I tried, I would always get an ellipse-shaped area of markers, where the north-south distance was twice the expected, and the west-east distance was as it should be.

As I realized that the issue was consistent, and always exactly double, I decided on a crazy workaround: I added math to the distance query to divide the latitude coordinates by 2, then ordered the results by that distance and took the 1000 closest markers with LIMIT.

Voilà, perfect circle on the map!

Even though the resulting coordinates were completely off, it did not matter, because only the distance comparison used the wrong coordinates.
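
A rough reconstruction of the trick (not the original query; the table, columns and connection details are made up). The latitude of both the marker and the reference point is halved before the comparison, cancelling out the consistent 2x north-south stretch:

    import psycopg2

    # Hypothetical schema: markers(id serial, geom geometry(Point, 4326))
    QUERY = """
        SELECT id
        FROM markers
        ORDER BY ST_Distance(
            ST_MakePoint(ST_X(geom), ST_Y(geom) / 2.0),
            ST_MakePoint(%(lon)s, %(lat)s / 2.0))
        LIMIT 1000;
    """

    conn = psycopg2.connect("dbname=mapdb")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(QUERY, {"lon": 24.94, "lat": 60.17})
        nearest = cur.fetchall()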


Not exactly a bug workaround, but in a similar vein:

http://www.gamasutra.com/view/feature/132500/dirty_coding_tr...

Scroll down to 'The Programming Antihero'.


Our team was porting our middleware product to an appliance environment (stripped linux os, hardened image).

We had a config script that we used internally for test environments, and we were hoping to use it on the box until our own code covered that part of the setup process.

It relied on starting several services in order, and checking certain things were running at various points, by parsing the output of 'ps'.

Unfortunately, the appliance used a BusyBox version of 'ps' that truncated the output.

I ended up writing a shellscript that checked /proc manually and echoed a string that would match the main offenders, aliased 'ps' to the new script, ran the setup and it worked first time.

I used it on our nightly test runs for ~ 3 months without issue, until it was properly replaced.
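
The original was a shell script, but the idea translates directly. A minimal sketch in Python that prints untruncated command lines straight from /proc (Linux only; assumes /proc is mounted):

    #!/usr/bin/env python3
    import os

    # Walk /proc for numeric entries (PIDs) and print each full command
    # line, sidestepping the truncation in BusyBox's ps output.
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode().strip()
        except OSError:
            continue  # the process exited while we were looking
        if cmdline:
            print(pid, cmdline)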


For the last couple weeks, I've been doing some work through the following chain:

- Windows VM (to isolate VPN connections)

- RDP to a Windows VM (jump box in the cloud network)

- VMware vSphere client (to perform the initial appliance ISO installation)

The bug I've encountered: the first keypress is echoed several times, while keys typed immediately after are only sent once. Any short (< 1 second) pause in input will cause the next keypress to echo several times again.

Leading to input like the following:

login> rrroot

password> pppassword

My workaround to get through the initial configuration (so I could ssh in) involved remembering to press and release shift before I typed anything. (An on-screen keyboard also worked, but where's the fun in that?)

It ended up feeling like the habit of tapping esc before entering a command in vim :P


This isn't a software bug, but since a lot of these aren't, I thought I'd share because it was a fun one with an unexpected cause.

I worked on a floor, shared with about 10 people, that was otherwise entirely occupied by phone switch equipment (raised floor, wires/racks, Halon fire suppression, and big enough to seat several hundred people were it not for the equipment). For two weeks, about every 3 days or so in the middle of the night, the power would cut off. This was particularly surprising since the entire floor had dedicated battery and diesel backup (regularly checked/tested) and they never kicked in. Our facilities guy was going bald troubleshooting it -- he brought in electricians and had the techs checking everything. There was just no explanation.

In a last-ditch effort to get some information, he set up a laptop with a built-in webcam and placed it high enough in the air to capture most of the site[0].

A little history is necessary for the facility's design to make sense. At one point this room housed our mainframe -- we were a local phone company and had a ton of data. That necessitated a very elaborate near-line storage device, custom built for the company: a multi-million dollar robot (the exact kind you see on commercials building cars), an arm about the size of an adult man coming out of the floor, running on a track from one wall to about the middle of the space. It was enclosed in glass and would move tapes from a large shelving unit into drives and back, but it was an open-loop system: it never truly knew whether it had picked up a tape or whether the tape made it to the drive and back. Being an imperfect mechanical device, every once in a while it dropped a tape, and someone would have to disable it, go in and pick the tape up off the floor (or, more often, the pieces of what was once a tape).

This robot moved very fast and was very powerful, so in a scenario of person vs. "big moving robot"... well, there'd be pieces of person on the floor instead of tape. Since we liked our employees (and OSHA probably mandated it), the interior of the robot housing was filled with exposed "big red buttons" that would cut the power in an emergency. The exterior walls of the switch room had the same switches, though those buttons had a large acrylic cover with a hole in it so that you couldn't accidentally power anything down. A choice few of them killed power to the entire site and bore a sign indicating as much, with something along the lines of "OH PLEASE GOD DON'T TOUCH THIS BUTTON"

Janitorial staff were used to turning the lights out on their way out if they'd been left on, and a new member of the janitorial staff discovered, at some point, that hitting that big red button took care of all of the lights at once (along with all of the normally blinking LEDs on the thousands of switch cards, but hey -- it got dark at least!). So on his way out the door, he'd walk over to it, look at it for a second, then push it ... powering down ... everything.

The workaround was easy: we were now responsible for our own garbage, dusting and cleaning from that point forward (which, I think, happened once during my 7 or so years on that floor), and a permanent camera was installed in the ceiling, powered from a circuit not affected by the buttons. The buttons remained, though.

[0] I think after ruling out everything else, he suspected sabotage of some kind. Our doors used RFID badges and visitor logs were accessible, but at that time the doors interior to our office space didn't require badge access, and there were no entries for the doors one would have checked.


First thing that came to my mind - check the logs. What, no access logs for critical infrastructure, no physical access control, "anybody could use the door, no biggie"? I had a hunch about your issue from sentence 3 onwards - I thought the story "janitor unplugs server, plugs in vacuum, replugs server when done" was universally known. Apparently, "those who don't know history are doomed to repeat it." ;)


Yeah, that was the painful part. Almost nobody had access to that entire suite and those that did underwent stringent background checks and were very technical, so physical security once you were in the suite was limited.

IIRC, it was discovered that the janitorial staff used building keys rather than the RFID locks, so they weren't even logged when they arrived in the suite.

I was a little surprised that hitting the emergency button didn't trigger an alarm of some kind, but that's how it was installed in the 80s and I'm fairly certain it's still that way, today (though I don't work there any longer).

Outside of those omissions, things really were kept in order: monthly battery tests, quarterly diesel/full system and disaster recovery tests. It's right when you think you have a solid process that someone comes along and pushes the wrong button, or burns some toast and triggers a floor evacuation/unexpected Halon test (that happened, too -- at some point they took away all of our nice things).


Not really a bug, but I just ran into this: a linter for Ruby that only allows double quotes when there's string interpolation, and a lint failure blocks the build. Never mind if you want to avoid escaping single quotes for readability. Here is a workaround ;)

    fuck_linters = ''  # empty string, interpolated below solely to appease the linter
    linted_string = "#{fuck_linters}don't stop apostrophes"


> a linter for Ruby that only allows double quotes when there's string interpolation, and a lint failure blocks the build

Is that Rubocop? Put this in `.rubocop.yml`:

    Style/StringLiterals:
        EnforcedStyle: double_quotes

More here: https://github.com/bbatsov/rubocop/blob/master/config/defaul...


It is Tailor. Thanks for the info.


In the late 2000s I worked for a small NZ company, Innaworks, who developed a tool to automatically port J2ME mobile phone apps (mostly games) to BREW, Qualcomm's C++ environment for phones.

The number of handset bugs we had to work around was immense. One handset, the Samsung A790, would reboot if you drew text on an offscreen buffer. Another, the Samsung N330, which we nicknamed the "shaver phone" for obvious reasons[1], ignored a few of the least significant bits of the source x coordinate when you did a bitblt from an offscreen bitmap to the screen, IF the offscreen bitmap had fewer than 4 bits per pixel.

We ended up writing our own graphics code that wrote into the BREW backbuffer, set the damage rectangle, and asked BREW to blit that to the screen for us. This was much faster than the BREW runtime's graphics code, so games ported via our automated system often ran faster than "hand-ported" games.

The LG AX260 would crash with an error screen if you used threading -- I suspect an ISR would notice the stack pointer was in the heap and halt the phone. This was a BREW 3 phone, and BREW 3 actually had a threading API, so we thought maybe the solution was to use the real threading API instead of setjmp/longjmp. No, BREW 3 threads froze the phone too. We worked around the problem with some help from memcpy and some rather evil stack pointer manipulation. Our stacks were pretty small as all Java objects were allocated on the heap, so this wasn't as bad a performance issue as you'd think. I refactored the scheduler to avoid stack copies if it decided to keep running the current thread.

The worst bug I remember, though, was in the ARM RealView C++ compiler. It optimized out a null pointer check -- you could log the pointer's value, and log from the exception-throwing code ... which never ran. I eventually got the compiler to emit an assembly listing for the function in question and discovered that no null pointer check was there at all. One volatile keyword later and we were back in business.

Our customers loved the product because it just worked. We supported full Java semantics, all the way down to static initializer ordering. It was a simple choice to make -- the more robust our system was, the fewer support incidents for us and the happier the customers. We produced human-readable C++ code so you could run your app in a debugger if need be, and did some clever whole-program optimization. Our runtime was a real memory miser as a "400k" Java handset would have 400k of heap -- code and images tended to live outside that. We could compile a game for a 400k Java handset to run on a 400k BREW handset -- 400k for our runtime, the user's code, the heap, image data, audio data... I vividly remember the time I saved a whole kilobyte of RAM -- that was a major win.

The people at that company were the smartest I've ever worked with. I've never been in another environment where everyone was just brimming over with technical adeptness. And we weren't just a company of young things; there were a few over-40s there too.

[1] http://www.cnet.com/au/products/samsung-sch-n330-verizon-wir...


The most painful bug I encountered had to do with a visitor access kiosk I had designed and written the software for at my previous company. The workaround was to block access to a set of sites for the entire company to keep the two kiosks from failing.

Every few months, the web cam would just ... randomly stop working. This would cause the kiosk application to crash while attempting to take the visitor's badge photo, rebooting the machine. Because of the nature of the device[0], it was very difficult to identify the root cause, and the fix was to physically visit the kiosk, unplug the web cam, remove the driver, install the latest driver and plug the camera back in. Eventually, I took some time, set one up in my office and watched it.

Something odd about the web cam was that the driver would never work if the camera was plugged in while the driver installation ran; the installer clearly instructed you, on a separate page, to unplug the camera before proceeding. Yet in what I have come to believe is one of the dumbest designs for driver software, it would periodically look for updates over the internet and silently install them, yielding a completely broken web cam. I spent about a month diagnosing the problem, mainly because it wasn't where I expected it to be, since I had other, more likely suspects[1] (and I hadn't handled the OS install/driver setup).

To make matters more entertaining, the guy who maintained the hardware had added the update servers' hostnames and IPs to the hosts file, resolving them to 127.0.0.1, but the driver service helpfully ignored that file (as far as I was told[2]), and turning on the corporate firewall (Symantec Endpoint Protection) caused blue screens. Since fighting this driver started to feel a lot like fighting malware, we ended up attacking it as such and shut down all communication with its update servers and IPs at the perimeter ... for the whole company[3].

[0] It was an Office Communications Server solution written in a very old API and the kiosk ran Windows XP, which we stripped of nearly everything and forced the device to use the kiosk application as its shell (which would boot itself if it encountered any problem).

[1] I had to write a component for the software to work with the web cam in C++, a language I hadn't touched in years, so my gut feeling was that it was related to that component.

[2] It could be that we missed some of the IP addresses it polled, or it could be that it simply ignored the hosts file in Windows. I didn't do this work, so I'm not entirely sure.

[3] For whatever reason, security wouldn't/couldn't block the IPs just for the kiosk itself (something about it being setup to not require authentication to access the internet and our perimeter proxy server -- at the time -- being unable to be configured to block specific external IPs for specific internal IPs. My bet is that it was more a "not willing to" than an "unable to", but who knows?). The practical upshot is that we had some of these devices on peoples' desks within the company and they experienced the same problem so once it was banned, we received fewer help desk calls for broken web cams.


Why couldn't you have just gotten different web cams?


Sounded like a logical thing to do, and we thought of that as well. There were really two reasons. The "dumb" one was "Corporate Standards", which always seemed to serve as a method to ensure the worst possible product was forced upon everyone; I could have worked around that with a bit of political effort.

The bigger reason, though, was that the code was written targeting specific vendor APIs. Other cameras simply weren't compatible with that code and it would have been a bigger nightmare working that out, unfortunately.


Search for SimCity in this item: http://www.joelonsoftware.com/articles/APIWar.html

It's kinda strange for an OS to be maintained for a long time with that style of backwards compatibility....


Only if by "strange" you mean "fucking lucrative".

A lot of folks are used to the crazy ship-all-the-time-regardless-of-cost world of webdev, but there is a lot of business value in not breaking things randomly.


Back in the Windows 2000 days, if you ran something like SoftICE or DebugView and looked at the live debug trace from the kernel, you'd see various funny messages referring to this IE bug and that Outlook quirk being worked around. That is, instead of fixing their userspace mess, they dealt with it in the kernel.



