The Bug Nobody Is Allowed to Understand

sgentle · on Dec 26, 2014

Yes and no. I think Stallman is right in that it's important to realise the relationships between components are often a bigger and more pernicious source of bugs than the components themselves. It's not a problem when you can pretend the components are actually one single component, which is obviously not possible with service architectures or proprietary code.

I'd expand that to say there's a kind of CAP theorem analogue for code, where if you want it to be sufficiently modular (partition-tolerant) you'll need to either sacrifice availability (before you know if it works you have to check its interaction with every other module) or consistency (you write code assuming the other modules work the way you expect and maybe they don't)

Unfortunately, beyond that core idea this article is just sort of befuddled. It is possible to understand the behaviour of a component without seeing its code (contracts, APIs, specs and tests are all examples of doing this). Further, it's a problem not at all specific to proprietary software. Even if you have access to every line of code ever written you're still not going to be able to understand the million interactions that could exist between all the things currently running on your computer in their various languages, frameworks, architectures and programming styles. And, to the extent that you can, it's probably not a very good use of your time compared with just emailing whoever wrote it and saying "hey, your software's not doing what I think it should do".

qwerta · on Dec 26, 2014

I upvote, but disagree.

Studying source code is first step. There is difference between "full understanding", and fixing a bug when software crashes under very specific conditions. Modern IDEs (and hopefully unit tests with VCS history) are great, 1+ MLOC is not really a problem.

Good luck with contacting original author, I usually give up after a few weeks.

And even proprietary software can expose enough "source" without compromising its monopoly. Microsoft bundles debugging symbols, so you can debug their software and roughly understand whats going on.

ademarre · on Dec 26, 2014

> It's not a problem when you can pretend the components are actually one single component, which is obviously not possible with service architectures or proprietary code.

Yes. It seems Stallman is mostly trying to highlight a problem with closed-source software, and he's right, but this doesn't acknowledge that it's not always possible to treat a software system as one big component, even if everything is open source. If the APIs are no more complex than necessary, and do a good job of encapsulation, then it should be easy enough to identify the misbehaving component and submit a meaningful bug report without seeing the source.

I believe that with distributed systems and service architectures, there are higher level design principles that we should be adopting. Many design patterns from other disciplines, for example OO APIs, can be applied to distributed components in large systems. These principles encourage encapsulation, which usually results in elegant communication protocols between distributed components.

kcorbitt · on Dec 26, 2014

If multiple proprietary software packages are talking to each other at all, there must be either an implicit or explicit specification they're talking over. And if that interaction is broken, that implies that either (1) the spec is ambiguous/wrong or (2) one or both parties are implementing the spec wrongly.

It seems to me that engineers from the relevant companies ought to be able to get together, talk over the problem and figure out which of those is the case, even if they're not looking at the same source code.

In any case, a well-defined spec/API is critical to effective integrations between pieces of software maintained by different teams, even if both components are open source.

Animats · on Dec 27, 2014

If multiple proprietary software packages are talking to each other at all, there must be either an implicit or explicit specification they're talking over. And if that interaction is broken, that implies that either (1) the spec is ambiguous/wrong or (2) one or both parties are implementing the spec wrongly.

Yes. Aerospace takes that seriously. So does the military. That's why Rolls-Royce engines can be taken off a B-787 and replaced with GE engines. The software industry rarely takes interoperability seriously enough to make such things work.

Many years ago, at the dawn of the TCP/IP era, I worked for a large aerospace company. We had a pre-Berkeley TCP/IP implementation from 3COM, originally intended only to talk to itself over Ethernet I brought it up to compatibility with the spec, and added a logging mechanism which logged every packet which was either wrong or which did not advance the connection.

Every day, I read through this bit bucket, and wrote emails along the lines of "Your TCP implementation is not compliant with para. xxx subparagraph yyy of the standard. See this dump of packet zzz where value aaa is incorrect due to bbb." After a while, interoperability improved.

We could do this was because, as a large aerospace company, we were bigger than the networking companies we bought from. So we could pound them into complying with the published spec. We also had backing from DoD in this; they wanted interoperability. Back then, many vendors didn't; they wanted their own dialect of networking. This was the era of SNA (IBM), DECnet (DEC), ARCnet (Wang), and a whole bunch of incompatible PC-oriented LAN systems. Mixed-vendor shops were unpopular with the old-line computer makers. The idea of everything talking to everything else, and it all just working, was new then. It took a lot of effort on the buyer side to make that happen.

angersock · on Dec 26, 2014

In the field I'm in, companies that are effectively running a lot of infrastructure are comically bad at providing developer support. Like, they actively go out of their way to avoid providing documentation (that they themselves certainly have).

The reason? They do big enterprise installs and it is probably safer just to treat all their gear as black-boxes than to allow any other companies to potentially botch site integrations.

Then again, everyone else ends up just packet sniffing production networks anyways, and that's a lot safer, right?

...right?

:(

percept · on Dec 26, 2014

"...you have teams as well as individual employees in companies able to manage the technology, but these tech-savvy people are never high enough in the company's hierarchy. What happens time and again is that those who have the expertise get overruled by those who don't."

micampe · on Dec 26, 2014

Remember that that's been written by someone who never worked at that kind of job, and then Stallman, who is even further far removed from that environment, making a generalized statement that he wants to apply to all companies and employees.

Aldo_MX · on Dec 26, 2014

Even when tech-savvy people is high enough in the company's hierarchy, the business culture usually is to avoid fixing a bug until the bug costs more money than its fix.

In my honest opinion this sucks, but oh god, most companies live with a "only money matters" mindset...

yourad_io · on Dec 26, 2014

> Even when tech-savvy people is high enough in the company's hierarchy, the business culture usually is to avoid fixing a bug until the bug costs more money than its fix.

That's not quite 100% correct.

> In my honest opinion this sucks, but oh god, most companies live with a "only money matters" mindset...

That's quite 100% incorrect.

Good businesses will do risk assessment. If the risks are communicated correctly, the choice of mitigating them for a few man-hours is obvious.

There's bad businesses, but I don't think that's most businesses.

Aldo_MX · on Dec 27, 2014

I completely agree with you... in theory.

But in my experience, companies tend to look down at issues which affects customers, but don't affect the company directly, the risks are understood perfectly, but the choice almost always is "we will be as negligent with our customers as we can be, no matter if our customers die".

An old example which features "no matter if our customers die" is the Ford Pinto[1].

Fortunately, there is not a current example as fatal as the Ford Pinto, but the practice to overlook what affects customers remains in effect. At the end of the day, Risk Assessment is just a way to say "what matters to partners and investors", which is was what I tried to summarize with my previous point of "only <s>money</s> profits matters".

[1] http://myweb.whitman.syr.edu/pjcihon/LPP%20255/Ford.Pinto.Me...

sokoloff · on Dec 27, 2014

The Takata airbag and GM ignition switch issues are comparable, IMO.

"If the air bag housing ruptures in a crash, metal shards from the air bag can be sprayed throughout the passenger cabin—a potentially disastrous outcome from a supposedly life-saving device."

omnibrain · on Dec 27, 2014

Not at all.

We had Windows XP Systems that had explorer.exe crashing randomly when specific Users ran Outlook 2003, the release of SAP GUI we used and the release of SnagIt we had licensed at the same time. These crashes only occured when the users had all 3 programs running at the same time. The programs did not talk to each other in the sense you think of.

It did not depend on the hardware installed, so we suspect it was something in the roaming profile of the specific user, but we never figured out what it was. Perhaps a registry key related to some shell extensions? In the end we cycled those people through various PCs, deleted the temporary and configuration files several times till they could work again.

Someone · on Dec 26, 2014

Today, with "agile" being popular, that specification almost certainly is implicit, and it _was_ in the heads of both parties when they were writing the systems (but probably not completely; if both parties fully understood the protocol, chances are there would be little reason to look into it).

Also, the heads the specification _was_ in may have left the company.

But I agree that "closed source" most of the time is only a small part of the problem. Having the source of your shell script that talks to my perl script helps, but only a bit.

SideburnsOfDoom · on Dec 26, 2014

As someone who has worked in various agile ways, I can assure you that agile certainly does not mean "do not provide your users with explicit specification of a public api".

In the Agile manifesto (1), the phrase "we value ... Working software over comprehensive documentation" refers to the big upfront waterfall design and architecture documents that precede implementation and are usually obsolete before the software is working.

If you are working in an agile way and you find that your system's behaviour is under-specified to people trying to use it, then you have a problem, and agile methods encourage you to change your ways of working to fix the problem. For instance, by providing sufficient specification with each piece of working software. See: Definition of Done (2)

You could however just be using "agile" to mean something else such as "lazy", "sloppy", "hasty" or "inexperienced".

1) http://agilemanifesto.org/

2) http://guide.agilealliance.org/guide/definition-of-done.html

kordless · on Dec 26, 2014

The 'spec' is how I want my data stored, in the way I want it stored, with fields I want present, and fields I don't want no where to be seen. Period.

h_r · on Dec 26, 2014

Such a "spec" would not be able to describe how systems interact: protocols, failure modes, SLAs, etc. In other words, the kinds of things you want specified when engineering a system. These are also specs

cmdkeen · on Dec 26, 2014

The Guardian like to run these stories about how terrible competition is without acknowledging that no other system in reality actually fixes the problem, no-one is actually ruled by philosopher kings. IT security is notoriously bad in many governmental areas, non-profits, universities etc as Manning showed us. The Soviet Union after all gave us the Stakhanovite movement.

The difference is that in a competitive environment you get creative destruction, where things that go wrong benefit competitors who then learn from the problem. Yes there are all sorts of problems when banks become too big to fail, it's far worse when there is only one bank.

Competition is an amazing thing - this salesman who moaned who the Guardian has an opportunity to start selling better encryption software, disaster planning and testing consultancy services, virtualisation software or toolkits that stop random people messing with production server they don't understand.

Retra · on Dec 26, 2014

Competition is only valuable as a means to attain greater cooperation. Once you start sacrificing cooperative results by taking shortcuts, worshipping selfishness, or keeping your failures secret, then your competitive environment is doing more harm than good.

It is good when a team _gets together_ to beat an opponent. It is good when a company _works with it's clients_ to build a better product.

So cooperation is the other system that, in reality, actually fixes the problem. Almost every significant human achievement has been a result our mastery of language, which is clearly a tool of great cooperation, not of competition.

cmdkeen · on Dec 26, 2014

Why do we cooperate though? Internet Explorer didn't get better through cooperation, despite having a big team, until there was competition. The Space Race was competitive with the Soviets.

Cooperation isn't the inverse of shortcuts, selfishness or keeping failures secret. Cooperation between companies that are supposed to compete, the forming of cartels, is a really bad thing after all.

Humanity is competitive, social creatures. Cooperation starts at the family and tribal level which morphed into nation states (eventually), but it is fundamentally driven by competition.

gsnedders · on Dec 26, 2014

It's on Comment is Free, which is relatively open-access (they exert no real editorial control over what gets published).

geographomics · on Dec 26, 2014

One doesn't need access to source code to fix bugs. However, it does make it a great deal easier, and saves time that would otherwise be wasted reverse engineering the system to understand exactly what is going on. But it's not a necessity.

userbinator · on Dec 27, 2014

I think this is something that should be understood far more - I've seen far too many developers basically give up when the execution of the code they're debugging goes into something they didn't write or an exception happens in there, and are shocked when I'll just keep on tracing in Asm. In fact for languages like C++ where a single statement can do a lot of implicit things, I prefer to debug in Asm since I can see exactly where things went wrong.

Maybe education is partly to blame, as many students are taught that the only thing they interact with is the source code, the binary being an opaque blob that comes out of a one-way process, and trying to go the opposite way is somehow seen as "wrong" (legal issues notwithstanding); the emphasis on low-level architecture and Asm as "you're not supposed to know this" further adds to the notion, along with the increasingly closed nature of software and hardware. Contrast this with those who grew up with early computers of the late 70s/early 80s, where it was almost second-nature to use a disassembler and analyse the firmware even without its source code (the source was usually Asm in those days, so you didn't miss much, but my point still stands.)

name_censored_ · on Dec 27, 2014

> I think this is something that should be understood far more - I've seen far too many developers basically give up when the execution of the code they're debugging goes into something they didn't write or an exception happens in there

Those other devs may be suffering from a kind of "appeal to authority" fallacy - they assume that this release-grade third-party code section must be correct.

Or, it may be that they know that even if the problem does lie in the third-party binaries, the only fix is to submit a bug report. Since a bug report with a comprehensive steps-to-reproduce is more likely to get addressed than one with an ASM trace, it makes sense to focus effort there.

Or, it may be that they're clock punchers/overworked, and tracing someone else's code jeopardises them leaving on time/shortening their "in" tray.

yourad_io · on Dec 26, 2014

Are we discussing this[1] or did I somehow get lost?

> Then he tries to copy the contents back, which is impossible with encrypted files and this is how he discovers what he's done [...]

facepalm

> To unlock the encryption you need special keys, which are stored in one central place [...] They went through the system and thank God, the switches had not yet been reset, meaning the keys could be retrieved

Thankfully, God duplicated the keys onto... switches?

After a while, my eyes rolled too much and I stopped for fear of epilepsy.

[1] http://www.theguardian.com/commentisfree/joris-luyendijk-ban...

lisper · on Dec 26, 2014

I actually posted both this and the non-link-baity version at the same time:

https://news.ycombinator.com/item?id=8799505

It was kind of an experiment to see which one would be upvoted more. Can't say I'm surprised at the result.

chrismcb · on Dec 27, 2014

The premise of the redirectable blog post is interactions if a variety of systems. The original post talks about new technology being a problem and no one has procedures for it. YET the anecdote that is being solved is an uninformed user doing something he knows little about. Yes new technology should make it harder to break systems. But a user mixing around in something he knows little about us an age old problem. And it had nothing to do with proprietary systems. Interoperability, nor new technology

andybak · on Dec 26, 2014

And how does OSS fix this? It's still possible to create impenetrable and poorly understood systems even if you take proprietary software and SaaS completely out of the equation.

aceperry · on Dec 26, 2014

The working assumption is that anyone who has a problem somewhere along the software chain will have access to the source and have the ability to fix problems when they need to. They're not dependent on one company to make fixes when there is a problem.

SixSigma · on Dec 26, 2014

My biggest shock in this story is the people in the comment section that find the premise of the story incredulous; that many institutions - small, big, large and very large - have inadequate disaster recovery plans and it is only a matter of time until one makes it into the national news.

jamesaguilar · on Dec 27, 2014

In reality, when this happens, the companies in question call each other up and fix it, generally speaking.

lotsofmangos · on Dec 26, 2014

After reading the linked article, I am thinking of writing a bash script called 'lookBusy' that makes a machine look as though it is doing vital work.

This can then can be run on any machines that are idling but important, to reduce the chance of idiots appropriating them.

yourad_io · on Dec 26, 2014

Or, you know, have a super-critical "zone" which is accessible by people of the highest clearance in the company.

The One And Only Private Key Holding Server would be super-critical.

Also, back ups.

There's hardly any new lessons here - if anything at all.

myhf · on Dec 26, 2014

   hexdump -C /dev/random | grep  --color '\W\(\w\)\(\w\)\W*\(\w\)\(\w\)\W*\4\3\W*\2\1'

dexen · on Dec 27, 2014

Looks great! On Linux, /dev/urandom gives better results.

Incidentally, this resembles jwz's homepage quite a bit: http://www.jwz.org/

click170 · on Dec 26, 2014

You need to put more effort into being lazy. Google, my good man, Google. Those scripts already exist, all you have to do is download and run them.

Karunamon · on Dec 26, 2014

Is that more or less effort?

lotsofmangos · on Dec 26, 2014

Much less fun.

lotsofmangos · on Dec 26, 2014

I know, but it will then look like someone else's idea of vital work. It is much better to make a bespoke one for each instance.