Thoughts on Conway's Law and the Software Stack (jessfraz.com)
103 points by rbanffy on March 25, 2019 | 22 comments



Not sure if the comparison to Conway's Law works. The key goal of layering is abstraction, so that one can be productive without having to know details about the layers below, but much of optimization is about exploiting details in the layers below for gain. Clearly, these goals are in conflict.

After posing a hypothesis, the post talks about security and process isolation. But the problems raised aren't in line with the hypothesis: the challenge in these cases isn't "insufficient communication" between the levels of the stack, but rather a discrepancy between the abstraction's actual behavior vs. a human's desires and expectations about big-picture topics.

These abstractions are often compromised by information leakage through side effects that executing code can observe or deduce, leading to a class of vulnerabilities that has been around for a long time but has received far more attention since Spectre.

Protecting against timing attacks and other side-channel attacks requires the observable state of the system to not vary due to execution in a different security domain. Timing attacks are particularly frustrating, because a process can estimate its own execution time even without external timers, and so can compare the time taken by different calls. Cryptographic operations often take special care to avoid leaking information through timing, but the same discipline isn't commonplace in system calls or userland code. And shared caches greatly improve performance but leak information through timing.
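
As a concrete illustration of that discipline (a minimal sketch; names are made up): a comparison that bails out at the first mismatch leaks, through its running time, how many leading bytes of a guess were correct, while a constant-time version always touches every byte:

    #include <stddef.h>

    /* Leaky: returns at the first mismatch, so the running time
       reveals how many leading bytes of the guess were correct. */
    int leaky_compare(const unsigned char *a, const unsigned char *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }

    /* Constant-time: accumulates all the differences and only checks
       the result at the end, so timing doesn't depend on where (or
       whether) the inputs differ. */
    int ct_compare(const unsigned char *a, const unsigned char *b, size_t n)
    {
        unsigned char diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= a[i] ^ b[i];
        return diff == 0;
    }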

Hardware isolation is an effective solution for curbing timing attacks for systems that don't communicate over a network. It's not sufficient in the case of networked systems, because the network and its connections form another source of observable state that's likely full of unrelated side-effects.


> the challenge in these cases isn't "insufficient communication" between the levels of the stack, but rather a discrepancy between the abstraction's actual behavior vs. a human's desires and expectations about big-picture topics

Isn't the thesis of the post that that exact discrepancy is due to insufficient communication between the people at different levels of the stack?

An abstraction is an interface between two levels of the stack, right? I think the thesis of the post is that the exact discrepancy you describe, between the abstraction's actual behavior and the desires and expectations of the people one level up, is due to insufficient communication between those people and the people implementing the abstraction, who, due to Conway's Law, are separate groups.

You mention "information leakage", but that's leakage between software components at different levels of the stack; perhaps you have that confused with the "insufficient communication" referred to by the post, which is between groups of people at different levels of the stack?


This idea seems surprising: « you’d be crazy to think hardware was ever intended to be used for isolating multiple users safely »

Surely the era where it was common for multiple users to be logging into one computer was long enough, and central enough, that it still informs a great deal of current hardware design.


I thought the consensus was more like: multi-user security is a solved problem except for side channels which are everywhere, there's nothing you can do about them, and they're impractical to exploit. That last part turned out to be wrong.


> That last part turned out to be wrong.

Well, it was true when the abstract machine described by a CPU datasheet was closer to the one actually running the code. But people decided it was better to run all the code through a real-time optimizing interpreter, and didn't check the interpreter for security.

We'll get secure machines back at some point. One or two manufacturers may die in the process.


> That last part turned out to be wrong

...and that upended everything about CPU multi-tenancy.


Whether the hardware could be secured was known as far back as the early 1990's with the VAX VMM security kernel. The only thing they couldn't eradicate was the timing channels in CPUs (esp. caches) and the rest of the system; they could only minimize their bandwidth per TCSEC certification requirements. Stopping the leaks had negative impacts on performance, though. They found plenty of those channels in Intel processors and later called for making them leak-proof. Physical partitioning with optical connections is still best. Add TEMPEST shielding and power filtering if on the high end.

https://en.wikipedia.org/wiki/Trusted_Computer_System_Evalua...

http://lukemuehlhauser.com/wp-content/uploads/Karger-et-al-A...

https://ieeexplore.ieee.org/document/213271

https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f653...

Industry and hackers just ignored this stuff for different reasons. The TCSEC certification, despite its problems, caught those along with all kinds of other vulnerabilities by forcing systematic design and evaluation with methods proven to get the job done. NSA pentesters shredded most commercial products during evaluation; today's pentesters still do the same to lots of F/OSS. The B3/A1, or high-assurance, systems went through 2-5 years of pentesting with hardly any issues found. Details were published at conferences and the products were marketed. Aside from formal verification, only apathy explains security-focused folks totally ignoring all of that outside of CompSci, which often cites it.

What's causing the problem Jessie Frazelle speculates about is simply that the people running companies are incentivized, in all kinds of ways, not to give a damn. That includes adding insecure features to drive sales (mgmt bonus), cutting or ignoring security along with other costs (mgmt bonus), and facing no liability for the damage that happens (externalities). The solution is, like TCSEC or DO-178C, carefully-designed regulations that force them to use known-good methods to knock out preventable problems, with fines higher than the cost of prevention if they don't.

That, and/or liability in court for not using whatever protections they could afford. Obviously, we might make provisions for smaller businesses with less money for consultations. A lot of what we see in the marketplace, though, is avoidable with almost no effort. Free guides on the basics, kept up to date by experts, might constitute the minimum, with liability going up based on the resources they had available to do more. Reusable solutions and solution providers would show up to make it easier, like they did for TCSEC and DO-178C (and still do for the latter). There are also existing offerings that are more secure by default and can be extended. On the native side, you have stuff like OpenBSD, the lwan web server, and TrustDNS in Rust. Even PHP has Airship CMS to help people throw web apps together with fewer problems.

These companies will come around when they're forced to, under the threat that CEOs, project managers, and others will lose their money and/or positions for not having security. Before that, they'll try to bribe, argue in court, bullshit, and so on. There will be some fight to get there. Lots of improvements, such as no telnet on routers, come with almost no money spent. We might make progress by just reminding decision makers during those battles how inexpensive certain improvements really are versus the regulatory and legal actions.


You can have physical partitioning on very high-end machines.


You can have physical partitioning on low end machines, too. The cpuset cgroup allows you to partition processes so that they only run on certain cores, and then you can prevent them from accessing memory through other cores, and and and... And you'll still find side channel attacks that you didn't think of, because the system is complex.
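
For flavor, here's a rough sketch of the per-process version of that partitioning using sched_setaffinity (the cpuset cgroup is the declarative, group-level equivalent; the core numbers are arbitrary, and by itself this only pins the caller, it doesn't keep other processes off those cores):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* allow core 0 */
        CPU_SET(1, &set);   /* allow core 1 */

        /* Pin the calling process to cores 0 and 1. A cpuset cgroup does
           the same thing for whole groups of processes, and can also be
           used to keep everyone else off those cores. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... run the isolated workload here ... */
        return 0;
    }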

Or you can buy a dozen cheap systems and run each of your customers on one of those. There's still the opportunity for side channels there -- we were finding timing exploits in crypto libraries long before VMs.

It's an arms race, and we're neither winning nor losing, but as the market grows the spread gets wider.


Hardware built for multiuser operation certainly existed, but it lost out to up-jumped desktop CPUs that pasted multiuser security over the top of a single-user core design. More importantly, the tricks employed to get around the physical limits of straightforward speed scaling worked by going around that pasted-on layer of security.


IBM seems to have no official statement about whether z is vulnerable to Spectre but Power7/8/9 are/were definitely affected. Fujitsu SPARC is also vulnerable but interestingly Oracle SPARC is not. If Alpha, MIPS, or PA-RISC still existed I would expect most of them to be vulnerable as well.

Are there specific hardware mechanisms in "real" server processors that can provide high performance without side channels? What are they?


I wouldn't trust them. One might mitigate a bit using the classic patterns against side channels: randomization and masking. I came up with an idea for randomizing execution at the CPU level, but the person who beat me to it patented it. Possibly something like ARINC partitioning combined with randomization, plus operations similar to those that process the secrets but that process fake data.

One of the simplest methods, though, is focusing the leak-proofing on the barrier between the partitions handling secrets and everything else (esp. anything with comms capability). Then you just make sure you have trusted, hard-to-hack code on as much of the system as possible. No web browsers and such. A safe systems language with pointer and overflow protections, along with the performance hit that comes with them. Otherwise, back to physical separation. I used that with a KVM switch back in the day.


> high performance without side channels? What are they?

One I can think of is cache partitioning by process id and security context. Not sure any CPU implements that. We can also use cpusets to assign processes to physical cores or sockets, limiting how much opportunity you have to see what other processes are doing. This can even increase performance for high priority processes because they may compete less with other processes that get fewer cores.

Now that I mention that, I'll probably give Slack a single core.
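
For what it's worth, something in the direction of that first idea does exist on some recent x86 server parts: Linux's resctrl interface, backed by the hardware's cache-allocation support, can carve the shared L3 into per-group slices. A rough sketch, with the group name and way mask made up, assuming resctrl is mounted at /sys/fs/resctrl:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Create a resource group (ignore failure if it already exists). */
        mkdir("/sys/fs/resctrl/tenant_a", 0755);

        /* Restrict the group to a few L3 ways on cache domain 0; with
           non-overlapping masks, other groups can't evict or probe the
           lines this group brings in. */
        FILE *f = fopen("/sys/fs/resctrl/tenant_a/schemata", "w");
        if (!f) { perror("schemata"); return 1; }
        fprintf(f, "L3:0=00f\n");
        fclose(f);

        /* Move this process into the group. */
        f = fopen("/sys/fs/resctrl/tenant_a/tasks", "w");
        if (!f) { perror("tasks"); return 1; }
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);
        return 0;
    }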


MIPS still exists, it's not uncommon in cheap routers.

Here's their statement on vulnerabilities: https://www.mips.com/blog/mips-response-on-speculative-execu...


In order to be vulnerable to Spectre, a CPU needs out-of-order speculative execution. Not many low-end CPUs do that.
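
The canonical illustration, loosely paraphrased from the Spectre paper (names here are made up), is a bounds check the core speculates past:

    #include <stddef.h>
    #include <stdint.h>

    /* If the branch predictor guesses "in bounds" and the core executes
       the load speculatively, an out-of-bounds byte gets encoded into
       which line of probe[] ends up cached, where a later timing probe
       can recover it. A core that never speculates simply never performs
       the bad access. */
    uint8_t gadget(size_t x, const uint8_t *array, size_t array_len,
                   const uint8_t *probe /* 256 * 4096 bytes */)
    {
        if (x < array_len)
            return probe[array[x] * 4096];
        return 0;
    }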


The answer is to accept it. Problems happen. We are dealing with systems that are inherently complex.

You can try to alleviate some problems, but in the end this lack of communication is also what creates the success. It allows for "slack". It builds learning.

We have "incidents". This is good. This is where we can learn. What we need is to understand that and learn better.

I suppose Jessie knows John Allspaw...


We can try not to build abstractions that are easy to misunderstand.

But yeah, communication won't scale.


> Or is the answer simply, own all the layers of the stack yourself?

Like Apple?


That doesn't make Conway's Law inapplicable. There is still organizational structure within the self (Apple, in your hypothetical).

Consider, for example, Windows 2000. In the years during which it (NT5) was developed, Microsoft merged its security and directory teams. That yielded Active Directory, and it was a resounding success. Other companies, like Sun, failed to see the brilliance and importance of this, and treated their directory products as a cash cow -- that it was by then an ever-decreasing cash cow wasn't good, but hey, milk it for all it's got.

Now imagine Apple owning the whole stack. They too could make a mistake like, e.g., Sun's, or they could get it right like Microsoft did in my example. There's no guarantee either way, and it's not necessarily easy to spot the mistake / opportunity as it comes up.

Besides, owning the whole stack is really difficult. Once you get to where you own 90%, the benefit of each additional 1% is more and more outweighed by the cost.


I think AWS and the other cloud providers owning the whole stack is what Jessie is thinking of with that statement.

AWS has its own ARM processor (Graviton) and hardware designs, all completely proprietary. GCP has TPUs, and so on.


This is also a nice way to differentiate and not compete in price/performance alone.


> Like Apple?

The way Bose succeeds is that they are willing to market to majority tastes, and they are willing to use vertical integration to deliver a great user experience with OK-to-good performance on cheaper hardware.

I have no idea if this can apply to security and Conway's Law. Apple is serious about security, because they are serious about payments. They are willing to put clean architecture before backwards compatibility.



