System design and the cost of architectural complexity (2013) (dspace.mit.edu)
464 points by damethos on April 6, 2023 | 167 comments



In the late 90s I travelled the world with a couple of CD-ROMs installing NT 4.0 on individual physical servers. I understood the entire stack. I was the only engineer and I was often very remote.

In the mid 2000s we installed Server 2003/8 on a whole bunch of physical servers in a data centre. This was for one of Australia’s ‘big four’ banks. Me and a team of less than ten, most of whom I still know, managed the entire thing. We were the ‘3rd level infrastructure’ team. Most of us understood most of the stack end-to-end, although there were some specialities.

In 2023 I work for an IT integrator. These days we use Azure, because Microsoft has fooled us into thinking that it’s cheaper.

My current project is, essentially, a website migration. Not even a big or complex website. The project has been running for 6 months. I only joined recently so I can’t be sure but we’ve probably burned AU$2m. The schedule I’m managing has us doing the cutover in June. That is optimistic.

The architects are currently trying to work out how Azure [details] connect to Azure [details] while still [security] and being able to [complex integration].

Every day some new issue appears that the architects have to figure out how to work around.

No single person has a goddamned clue how the end-to-end thing fits together. Not one, not a clue. That scares me.

‘Architectural complexity’ has crept down to the level of infrastructure. It’s painful to see. I hope it stops soon but I am not hopeful.


We might be working on the same project... or the "same" project, if you know what I mean.

The level of abstraction that is commonplace these days is insane.

At $dayjob, there is a reverse proxy in front of an API gateway that is itself a load-balanced service. There are load-balancers behind it, pointing at a reverse proxy on a service fabric. Within that, some "architect" decided to add additional "mid tier" servers because that was the cool thing to do in the 1990s. Behind all of this is an Azure App Service, which is in turn a load-balanced server that includes a reverse proxy.

They want to move this to Kubernetes, so they can now have proxies seven layers deep, nested hypervisors running containers, and services ping-ponging across six zones (data centre locations) in two different clouds. That connectivity will go through a software virtual router platform, so it's not even direct point-to-point connectivity.

I pity the poor fool who will have to diagnose an operational issue in this madness.

No mere mortal will be able to make it go fast.
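
To put rough numbers on the stacked layers described above, here is a minimal sketch. It assumes every hop fails independently, and the figures (seven hops, 99.9% availability each, 2 ms added per hop) are hypothetical, not measurements from any real system. With hops in series, availabilities multiply and latencies add.

```python
from math import prod


def chain_availability(hop_availabilities):
    """Availability of hops in series, assuming independent failures."""
    return prod(hop_availabilities)


def chain_latency_ms(hop_latencies_ms):
    """Latency added by hops in series: the sum of the per-hop latencies."""
    return sum(hop_latencies_ms)


# Hypothetical figures, chosen only for illustration.
HOPS = 7
availability = chain_availability([0.999] * HOPS)
latency = chain_latency_ms([2.0] * HOPS)

print(f"end-to-end availability: {availability:.4%}")
print(f"added latency: {latency:.1f} ms")
```

Under these assumptions, each extra layer can only lower the end-to-end availability and raise the latency floor, which is part of why such a chain is hard to make fast.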


Someone is paying for all of that, and someone is getting paid.

Look no further than those getting paid to find the reason such monstrosities exist.


No I don't think so. That explanation only works for simple systems, if at all. But all the many players and their very diverse needs and ideas are very complex.

I think it's more likely that it's something described in studies about human problem solving, which leads us to prefer overly complex solutions because we prefer adding something.

https://www.nature.com/articles/d41586-021-00592-0 (you may need Sci-Hub - paywalled)

> Adding is favoured over subtracting in problem solving

> A series of problem-solving experiments reveal that people are more likely to consider solutions that add features than solutions that remove them, even when removing features is more efficient.

https://www.nature.com/articles/s41586-021-03380-y

> People systematically overlook subtractive changes

https://www.scientificamerican.com/article/our-brain-typical...

https://www.washingtonpost.com/business/2021/04/16/bias-prob...

I think we also have forces in play that add to this bias. Or they are the result of it, for some of them. For example, when there is an easy and cheap or even free solution there is no incentive for marketing and selling it. It's hard to tell a customer that they don't actually need you, especially if it isn't a one-off, or that they should go somewhere cheaper and more simple. It is much easier to get paid for adding than for doing nothing or for subtracting, both as an employee and as a business.

It certainly looks that way when I look around. We have frequent complaints about a lack of skilled labor - but that's also because everything keeps getting more and more complex.

For example, it was once enough to update a piece of paper with the schedule at each bus, tram or train station every half year. Now it's some highly complex networked computer solution and display to show when the next bus is going to arrive. Sure there is added value - but there also is a gigantic amount of new complexity for every small "convenience". When it comes to something like waiting for the next tram, especially in an urban environment with a 10 minute schedule, the benefit of knowing if it's 2 or 7 minutes that you'll have to wait is negligible compared to the effort of providing and maintaining that service. (We have this situation in my town.)

Of course there is added value for much of it, maybe all. But I see the complexity and size of systems I knew in a much more simple form when I grew up exploding.

The problem is, continuously educating new people to know ever more and to be able to handle more and more sophisticated systems has a continuously increasing cost to society too, since we don't live any longer than before. Never mind everything we do, from planning to maintenance, is ever more complex too.

In addition, the new system may not be as robust as the more simple ones of the past. It all works as long as it does. If we see another chip crisis for example, and I just read a pretty dark forecast for China/Taiwan, all those solutions that require working world-wide supply chains, here, chips from Taiwan, are a problem.


In enterprise systems, it’s always easier (organizationally) to add a feature than to subtract one. Subtraction usually puts some work in the queue of some team, which is usually not politically viable.

For instance, say Team A is used to receiving a report in Excel every week. If an upstream system is added or changed, Team A will fight like hell to keep their Excel report even if it’s no longer needed. Whereas “the system shall generate a weekly Excel report” is just another feature in the backlog and much easier to sell to the organization.


Excellent response, but I suspect you were agreeing with the person you were replying to, instead of making a counter-argument.

I just watched a simple four-building network get over-architected into a monstrosity involving multiple consultancies, vendors, and products.

Why? Nobody gets paid to not add things. Especially as a third-party, you get no money at all for doing nothing, even if that’s the best solution for the customer.

Worse still, in this case the vendor relationship was managed by a customer staffer who is themselves a direct contractor! If the work to manage external consultants dries up, their own contract won’t be renewed.

“You can’t convince someone of something if their salary depends on them not believing it.”


I’m the OP of this thread. Many, many of the people on this gig — yours truly included — are contractors…


I'm a consultant also, and more than a few times I've advised the customer not to overcomplicate things. I can only "afford" to do this because I'm already over-worked and not in need of additional projects. Some other consultants in those meetings looked at me like I had grown a second head that started speaking in Klingon.


There will always be lots of advice and action to simplify this or that. That does not change the aggregate outcome. It's not a never-decreasing number, and it's a chaotic, messy life.

I had a doctor trying to make me pay less and come less. I had many more trying to sell me useless garbage, from network marketing vegetable juice to magnetic mats, after they did not find an obvious disease and concluded "it's in my head" (I'm fine now, reason was eventually found at a university clinic and solved for good).

As I already mentioned in my post, to actively market things you need to make money. The teams making the most money are able to do the most marketing and the most sales. Those pointing to free solutions are unable to hire sales people. No pharma company can sell unpatentable, already easily available substances, or ones with only minor improvements such as more convenient packaging or a cleaner form, whereas the ones making billions can run lots of ads, send out sales reps, and give gifts and benefits to those making purchase decisions.

Web forum and word of mouth compete with large companies with dedicated and very sophisticated marketing and sales organizations for reach and ability to convince, and you can only make contracts with the latter.

The forces out there don't prevent you from "doing the right thing", they just slightly, or maybe not so slightly, favor those who don't.

The chemical company that finds their product does not really help much and potentially has very severe environmental effects is much more likely to go out of business or to be bought than the ones lying about it and increasing the marketing budget and sales effort (see tobacco, oil companies and climate change research, etc.).

"Let's do nothing instead of this" is highly disfavored by our system - and by our mindset. The people now found responsible for making and selling products and substances they already know are bad, if they were out of work, society would shun them and "paying them to do nothing" (even if it would be much better) is frowned upon by most of society. Everybody is forced to do something, anything. On the other hand, it's actually quite hard with all the basics long taken care of - for those who can pay. Or, the problems are so intricate it's hard to impossibel to solve in a private initiative, often even for billionaires, who too have to chose their battles. So we end up with lots of cheaters and people selling products, substances, services with little to no regard for consequences. They may be bad people, but our system creates a lot of pressure on everyone and people react.


Most architects have limited influence on the entire stack, so they try to achieve results in the sub part they can control. This leads to overly complex sub parts, as you can’t optimize the entire stack. Probably another example of Conway's law.


How high is the load? (From what I'm seeing between the lines, it's not much?)

> I pity the poor fool who will have to diagnose an operational issue

Why? They're getting paid, have a job, it's great

Pity the business owners instead maybe. But they're clueless and happy too?

Hmm. I start thinking these things are the natural course of events. Techies adding more techies and complexity, maybe just like middle managers can hire more managers and bureaucrats and make not the tech, but the organization, more complex (and thus better, from their own personal perspective?)


>> I pity the poor fool who will have to diagnose an operational issue

> Why? They're getting paid, have a job, it's great

Since they won't be able to diagnose the issue, except by pure luck, let alone fix the issue, they may be reprimanded and fired, even though they did nothing wrong. Not really great to tell their next employer that they were fired, or getting references.


Yes if that happens. The company can't keep firing everyone who touches that part of the system though? Maybe bad luck for the first ones, and later, when the company has come to accept how things are, it'll mean more jobs and money for the tech people


> Pity the business owners instead maybe. But they're clueless and happy too?

A federal government department so yeah, that statement is accurate enough.

Unfortunately we all pay the bills at the end of the day.


well LLMs have proved themselves by stacking layers upon layers - maybe your architects have figured out the magic also (without interpretation/comprehension obviously).

Good luck as we all live in interesting times


Automation moves complexity from one place to another. In this case the complexity moved into silos.

Before, all the pieces of the architecture were in a couple hundred *.cpp files in one directory hierarchy. Now it's millions of files across thousands of directories each turned into a service run by a dedicated team, and none of them can see each other.

You can become an expert in all those pieces. But it requires actually using them all together, to discover the parts that don't fit and how to work around them. This is why the modern ideal of completely independent APIs is a terrible design. There is literally no way to know if anything works with anything else until you try to run it in production. Monoliths are terrible at scaling, but easy to understand. SOA apps are great at scaling, but impossible to understand.

Good luck on your migration. Changing the wheels on a moving tractor trailer always sucks.


Isn't the point that there is a line somewhere that one app stops being one single app and becomes "eco system" - and you stop having deterministic understanding and start having "town planning" and "social expectations".

I mean my use of bad metaphors kind of underlines the point we don't really understand this problem - but the large organisation that builds these is itself an example of such impossible to understand interactions - maybe we will learn from Azure etc and take those learnings into our own orgs.


Well anything designed as SOA is just Conway's Law enshrined in architecture. Most older software was too. The org is the problem.

The super bizarre outlier is the Open Source community, which doesn't really follow any organizational plan (except the corporate projects and foundations and committees that pretend they're open source). For whatever reason, Open Source succeeds despite no hierarchy, no clearly defined architecture, no particular strategy.

But then again, maybe that's why it works so well. An organization has a fixed structure, rules, priorities. It can fail, lose its intellectual property, its funding, be competed against. Open Source has no such limits. It's the wild west. Do whatever you want, whenever you want, and whatever works, survives. Like natural evolution. An organic ecosystem, rather than a plastic one.


This is the essence - that a system needs to fit inside one person's head. All of it. There may be a few people who understand all of a Boeing airplane's systems - there certainly were when they first flew 747s. Maybe a couple now.

but once it stops fitting inside one person's head then the thing is literally only possible to design by committee - it cannot fit together and perhaps should stop being called a single system


> it cannot fit together and perhaps should stop being called a single system

The magic is the enterprise message bus. It is meme crap until you actually need one. When it works, it really works. This is the only thing I've ever seen realistically tie together certain industrial environments.


There is no one who understands FULLY how a single computer works (at least not to the specific levels of detail on how to make every part from scratch... how to mine the heavy metals, how to chemically process them, etc). That has been impossible for hundreds of years.

The important part is where you draw up the system boundaries


> That has been impossible for hundreds of years.

You mean tens of years?


For computers specifically, sure, but I was talking about almost any complex machine in the last few hundred years.


I'm fairly certain the chief engineer on the System 360 understood everything down to the transistors.

Anything past that though... maybe QNX down to the bytecode? It's at least elegant enough to suggest so.


You think they knew how to mine the ore and process it into usable material to manufacture those transistors?

You might argue that that isn't required to fully understand the system, but that argument is no different than saying you don't need to understand the hardware at a detailed level to fully understand your system.


I lived across the park from the best mining school in Canada so yes I believe that's totally possible. A few months of intensive study, or a year or 2 of more casual study is all that's needed with the right guidance.


> These days we use Azure, because Microsoft has fooled us into thinking that it’s cheaper.

All the cloud providers do that. You could say that peoples' judgment has been clouded...

More seriously, this is how they operate; complexity is a good thing to them, because it means more lock-in and ways to charge you in non-obvious ways, and the whole "consulting" aspect of the industry feeds off that.


I wouldn't be surprised if most people don't learn it in-depth because:

A) you can google or ChatGPT things more easily these days

B) there's a truck load of services and ways to connect and do things. It's hard to learn all of these or pick a best option for everyone

C) SSO + identity models + API permissions generally made everything more complex

D) Many things in Azure are changing often, so you don't want to get too invested in a particular "version"

So things got more in-depth while expanding your available options (which are usually just as complex)


> you can google or ChatGPT things more easily these days

This should not be a valid excuse to not understand your job.

> there's a truck load of services and ways to connect and do things. It's hard to learn all of these or pick a best option for everyone

Agreed, and I think this is where people go wrong. Giving devs the ability/permission to pick whatever random tooling they want is not a good path for maintainability. "It's in AWS, so we can use it." I'd go so far as to argue that if you want to use Managed Service X, you need to first successfully launch it on an EC2 with no other help than the official docs; otherwise how can you possibly hope to understand it when it goes wrong?

> SSO + identity models + API permissions generally made everything more complex

Fully agree, IAM is a nightmare. But again, if it's your job, that's not really an excuse for anything other than higher pay.


I think people don’t learn it [all] in-depth because to do so is essentially impossible.

Back in the bank days all of us understood Windows, multi-tier AD, DNS, DHCP, SMS* (now MECM via SCCM), that distributed file system whose name I forget, file & print, Exchange, and all the other things that had a plugin to mmc.exe; we could do ‘em all. Oh and a handful of networking because that was also much simpler. Oh and all of the client configuration.

You could give me a server and a day and I could stand you up a basic infrastructure. I could know it all.

(*Though the maniacal way that SMS 2.0 did its magic with text-based log files shuttled in from the endpoint always drove me bonkers.)


And, by the time you "know it all", the playing field has changed.

RESTful APIs? Nope, we are going full-gRPC.

No hand-rolled OAuth - we are using AuthSolution1. Forget that, AuthSolution1 has a security vulnerability, we are going with AuthSolution2. Wait, AuthSolution2 won't work with cloud providers database offering... let's try AuthSolution3.

Let's use IFTTT. No, how about Zapier. But, we need gRPC and now GraphQL.

The databases are over-provisioned, let's look at serverless as a critical path item. Executive VP #2 thinks we should use a NoSQL solution.

BTW: Cloud provider X is dropping support for database version 1.1, we need to migrate to 2.0... or skip ahead to the just-released 3.0.

And, then the CEO comes and asks: "Can you spend some time trying to get ChatGPT to build this whole thing for us?"


Every god dammed day of my life.

Opened up an AWS app earlier in the week, asked wtf is this and how does this even work. Logged out and said not my problem.

I feel like the complexity and the knowledge pulled back and hidden by the layers of abstraction still must reside in a mind to make things work. I am troubleshooting issues in code that I can’t write or fully read, so I ask how it works, what it relies on, what it writes to, what its intended function is. By the time we get through explaining the problem to me, we usually come up with a solution or understanding of why it doesn’t work and what needs to change.

Some really smart folks are writing this stuff, but they cannot keep it all in their heads or understand how these systems connect and work together. 300 lines of code can hit 10-20 different technologies and it made perfect sense when you wrote it, but when it’s broken…


I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

In my situation, I probably will have to do that so there's a selfish reason there for sure.

I recently had a whole series of frustrating situations where I dug through rediscovering how old code / systems work to make small changes or to find out the small change was enormous. Really deflating stuff. It's not my fault but it can be so demoralizing. Feels like a weight on you... I was done for the day after both of those horror shows.

Then yesterday I had a 3 day project start and in 2 hours I ... did the thing. It was super flexible / powerful, handled errors gracefully, and easy to change / test. All because a year ago someone (well myself and another person) took the time to simplify the spaghetti code that originally existed and break it into more digestible functional-esque chunks. Dropping something "in between the chunks" (fancy technical terms here) was easy to do, test and read. Completely the opposite experience, it was energizing and fun.


For my consulting, I primarily practice "reference first architectures."

The idea is we identify the rough shape of what we are going to build and the components needed to deliver it (Linux? Terraform? K8S? HTML/CSS/JS? etc.).

Next we measure up what we can "take for granted" for the engineering skillset the organization hires for. Then we pick books, official project documentation, etc. that will act as our "reference." We spend our upfront time pouring ourselves into this documentation and come away with a general "philosophy" of an approach to the architecture.

Then we build the architecture, updating our philosophy with our learnings along the way.

At the end of the project, we commit the philosophy to paper. We deliver the system we built, the philosophy behind it, and the stack of references used to build the system.

What this means is I can take any engineer at the target level they hire for, hand them the deliverable and say "go spend a week reading these, you'll come back with sufficient expertise to own this system."

It also acts as documentation for myself for future contracts if I get brought back in. Prior to starting the contract I can go back in and review all of those deliverables myself to hit the ground running once I'm back on the project.


Sounds like an architecture decision record. Here's an example ADR template: https://github.com/joelparkerhenderson/architecture-decision....


This sounds like the right way to do it. For me it has been tough to come up with principles that don't sound like they apply to any system. You start off with a generic CRUD app but as it grows the default/usual web framework constructs tend to leave you with a ball of mud. You can couple anything in there together and since you're pressed for time, you tend to do it. Abstractions feel premature and when they start emerging there's lack of conviction to push through with them and clean up the whole thing.

Do you have any starter resources to come up with principles for a system? Maybe something showing how certain principles lead you to implementation choices that would've been different under another philosophy.


For a 2 week Terraform audit, these are the high level philosophy points I put together. The final doc was 10 pages. Each point lists the reason for choosing this approach and any trade-offs that come with it.

* Small composable Terraform modules

* Don't manage IaC declarations alongside code in a polyrepo

* Direnv for managing env configs across repos

* Manage k8s using k8s manifests and not terraform files (kubectl provider gives us this)

* Delegate flux management to flux-cli

* Auto-unseal Vault to capture and protect the vault token

Then a list of recommended reading:

* Terraform Up and Running

* Building Microservices

* Site Reliability Engineering

This list is more tactical since we didn't build the system, we were auditing their current setup.


> Don't manage IaC declarations alongside code in a polyrepo

Can you elaborate on this one?

(And thanks for the interesting comments)


> The source code for services should not be coupled to their deployments when managed in SCM. The lifecycle of changes for infrastructure is different from the lifecycle of artifacts for services. Any artifact should be able to be configured and deployed into any (supported) infrastructure configuration. For example, running git revert on a service should be able to yield a deployable artifact regardless of how the infrastructure is configured. By coupling these, you tie changes to infrastructure to changes in services. A rollback for your service can also unintentionally roll back how that service gets deployed – and avoiding that requires an engineer to hold both the context of the infrastructure and the context of the service in their head whenever they are managing git history. It becomes difficult to deploy an older version of a service for testing. It also breaks git-bisect since, now, searching for a regression in software also changes how that software is getting deployed. This is an extension of managing IaC as small composable modules. The source code for a service should, itself, be viewed as a composable module. That module may take on a different format than other pieces of IaC (i.e. a .tar.gz, a .deb, or a docker image instead of a terraform module or a terraform provider) – but its API contracts are still drawn around the unit of deployment and not the monolithic infrastructure stack that will be deployed with it.

This, of course, does not apply to projects using a monorepo.
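
A minimal sketch of the decoupling described above, in Python rather than any real IaC tool. Every name here (`Artifact`, `InfraConfig`, `Deployment`, `rollback_service`, the version strings) is hypothetical and only illustrates the idea that a deployment is a pairing of two independently versioned inputs, so rolling back one leaves the other untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Artifact:
    """A built service, e.g. a .tar.gz, .deb or docker image (hypothetical)."""
    service: str
    version: str


@dataclass(frozen=True)
class InfraConfig:
    """A revision of the separately managed infrastructure declarations."""
    environment: str
    revision: str


@dataclass(frozen=True)
class Deployment:
    """A deployment pairs one artifact with one infrastructure revision."""
    artifact: Artifact
    infra: InfraConfig


def rollback_service(current, previous_artifact):
    """Roll back only the service; the infrastructure revision is kept."""
    return replace(current, artifact=previous_artifact)


current = Deployment(
    Artifact("billing", "1.5.0"),
    InfraConfig("staging", "infra-42"),
)
rolled_back = rollback_service(current, Artifact("billing", "1.4.0"))
print(rolled_back)
```

Because the artifact and the infrastructure revision are separate values, an older service version can be paired with the current infrastructure for testing or bisecting without also reverting how it is deployed.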


how would one manage independent infra and services in a monorepo?


The philosophy is the same, but the implementation is different.

You still keep them separate but your monorepo tooling handles that separation. When multiple changes go out together, some being infra some being code, the tooling should be aware of those dependencies (just like any other) and handle resolving the infra first.

The muscle memory of devs in a monorepo tends to be different too. Folks are used to scoping their SCM changes to folders instead of working at the top level of the git repo (i.e. you usually don't find yourself doing a `git reset --hard HEAD~10` in a monorepo outside of a feature branch - other teams get grumpy when you blow away their changes on the mainline branch).

I make this distinction between polyrepos and monorepos for IaC because I've seen this advice result in folks splitting their monorepo into a birepo, or using IaC adoption as a driving reason to migrate their company to a polyrepo. There isn't anything wrong with the birepo approach, but it can be accomplished inside the monorepo all the same.
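
As a rough illustration of tooling that is "aware of those dependencies" and resolves the infra first, here is a small sketch using Python's standard `graphlib`. The folder names and the dependency map are hypothetical; real monorepo tools derive this graph from their own build metadata.

```python
from graphlib import TopologicalSorter

# Hypothetical change set in one monorepo commit.
# Each key maps to the set of things that must be applied before it.
changes = {
    "services/api": {"infra/database", "infra/queue"},
    "services/worker": {"infra/queue"},
    "infra/database": set(),
    "infra/queue": set(),
}


def apply_order(dependencies):
    """Return an order in which every item comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = apply_order(changes)
for step in order:
    print("apply", step)
```

The exact order among independent items is not guaranteed, only that each infra change is applied before the services that depend on it.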


> All because a year ago someone (well myself and another person) took the time

I've been saying for half a decade or longer:

"Going slower today means we can go faster tomorrow".

It took a long time for some of my team members to process this, but I believe they've all taken it to heart by now. The aggressive, rapid nature of a startup can make it very difficult to slow down enough to consider more boring, pedantic paths. Thinking carefully about this stuff can really suck today, but when it's 3am on Saturday and production is down, it will all begin to make a lot of sense.

Having empathy for your future self/team is a superpower.


“Slow is smooth, smooth is fast.”


This has won countless races for just about every top F1 driver you can name for decades, prolly WRC too. That old analog world transfers nicely to digital in video gaming. Sadly, it's not more widely accepted in software development though software design and software deployment seem to have caught on.

As an old C++ hacker, I'm waiting for the day when modern C++ shops read Accelerated C++ from Koenig and Moo circa two decades ago. Then, I could rejoice in someone anywhere writing C++ code that more closely resembled the python-esque C++ masquerading as pseudo-code in that book.

More sadly, I just keep seeing people emulate bit-twiddling from yesteryear when the compiler likely optimizes a fair bit of this.

The cyclomatic complexity scores in the paper look off by an order-of-magnitude but they may be better than the laugh riot I've measured in the last few years and my math may be failing me at runtime.


Racing is a very flawed analogy: the big difference between software development and racing is that:

1. F1 paths are known in advance.

2. The major unknowns in F1 are your competitor behaviors.

Compare that to a typical startup: you're mostly riding in the dark on a track you see for the first time, and your major unknown is customer behavior.


Interestingly, I learned this adage in object manipulation (festival fire or LED dancing - hoops, poi, staff, etc). The community breaks things down into “flow” and “technique”. Flow is highly improvisational, tech is highly practiced, and you really cannot do one without the other, even if everyone has a lean.

So: the steps in the path are known in advance, but not the order, presence/absence, quantity, arrangement, etc. The major unknown is what you will do next. The best performers are highly reactive to, and involved with, the audience and colleagues (musicians).

(This ofc changes for choreographed performance)

As a full stack dev, I’ve got a stack of patterns (techniques) in my pockets to pull out for this or that situation, but I don’t really get to know which one will be the next one I’ll need. And I do my best work when I can get involved with the end users, interacting with them to grok their needs; and with my coworkers, so we’re a team.

Slow is smooth, smooth is fast.


Reductionist history doesn't help here. Software development goes twenty or thirty years beyond startups.

I've crashed and burned startups while never comparing any of them to driving half-blind without my glasses at night.

Human sciences and user research provide excellent solutions to customer behavior. Like F1 cockpits, the risk scales with the domain.

F1 is not just a vector sport. If it were, math may be enough to win. Turns out F1 takes engineering, mechanics, and a driver.

However, viewed through a macroscope, F1 dynamics are closer to a cooperative game, as in software development.

While an unknown in software is competition, much larger unknowns are given by shifts in teams, machines, and their methods.

Turns out that F1 and software development from 1970 to 2020 are remarkably the same story, for much the same reason: neither exists in stasis.

F1 and software development have more in common than is obvious from the grandstand.


Yuup. Unfortunately, there are profit disincentives to this. Time to market for new features is a thing. Getting features out fast gets you kudos from the suits. So you get a class of dev that spins out code wickedly fast while at the same time leaving a mess for others to clean up.

It's hard to correct that sort of behavior (without being an actual manager that knows code and can spot bad architecture).


There’s a point in a company’s trajectory where quality becomes more important than quantity (speaking specifically of software features here). Early on it usually makes sense to throw things at a wall and see what sticks. But once there’s a sense of product market fit, the engineering org needs to buckle down and focus on doing things slowly, methodically, and correctly.

There are also engineers who prefer one or the other of these kinds of work. Some thrive on quick wins and kudos from founders. Early engineers probably need to be ok with bugs and edge cases that they’ll never go back and fix. Personally I don’t like doing work like that, but I’m definitely in the second class of engineers who need systems to be modular, composable, and well defined.


Have you ever measured this alleged speedup when "tomorrow" comes?


OP did a 3 day task in 2 hours.


Measured relative to what?


You said:

"Going slower today means we can go faster tomorrow".

So I guess, relative to yesterday?


> I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

As I age, my memory is getting worse and worse and I realize that quite clearly. Therefore, I always try to write documentation as I'm writing code, so that I can remember why I did something. It helps a lot so that 6 months later, I can do exactly that... but I also know that anyone else looking at my stuff will also realize why things are the way they are.


I’m the same way, notes, good documentation, etc.

Sometimes I think I get some tasks done faster than when I was younger…


> I'm always thinking about "Can I (or anyone) get back into this easily 6 months from now?"

People I work with get very annoyed with me because of this, but I am obsessive about documentation for this reason. Sure, it requires a lot of tedious writing and screenshots, etc., but it has saved me countless times. I still can easily get back into things years later thanks to documentation.

The caveat is when people who are not as passionate as you maintain a product and seemingly forget about documentation.

In the old days, documentation was a very strict requirement on many of the projects I was involved with. Now, in modern agile projects, it’s an afterthought at best, despite having amazing documentation tools that we’ve never had before.


What ways have you found for keeping the documentation in sync across frequent changes?


Leadership, Process and Discipline


Exactly. As in, the documentation doesn't actually stay in sync.


Do you mind expanding on which tools you are using for documentation (creating, maintaining etc) please?


I’m a big fan of wiki-type tools such as Confluence, but Markdown is even better because it’s just code and can often be stored in the repo along with the code. There are of course pros and cons to both. Wikis are easier to use for more complex cases and especially for screenshot support, tables, charts, etc. On the other hand, Markdown is far more portable and better for long-term maintenance since it’s not subject to the whims of the documentation provider tool itself.

One thing I’ve done is to maintain a separate Git repo that only hosts documentation. This in combination with a simple UI that dynamically converts .md to HTML on-the-fly (or renders a cached version) seems to be a good compromise.


Like real clouds, the cloud won't look the same in 6 months. I get a stream of daily emails from Azure: end of life this, upgrade that, secure this, etc. Those bits will rot, unfortunately.

There might be sense in renting that machine from Hetzner and sticking Ubuntu on it after all.


"we found that differences in architectural complexity could account for 50% drops in productivity, three-fold increases in defect density, and order-of-magnitude increases in staff turnover."

I think I can speak for many of us technical professionals when I say, been there, done that.


This is exactly why I think the "myth of the 10x engineer" is so obviously false.

An average software engineer will create an overly complex system.

If a skilled engineer can come in and create something that doubles team productivity, decreases bugs by 60% and improves retention across the org by 10x, that's huge! That's not a 10x engineer, that's a 100x engineer.


Engineers that make simple systems are not rewarded in most organizations. In a ten person company, sure thing! In a large corporate org, no way.


Man, the dead weight around development in corporations is mind-numbing.


The institutionalized brain damage is real, I see bright new hires coming in, within 3 months they are towing all the lines, spouting tautologies, seeking alignment with corporate velocity vectors to gain momentum, generate impulse. Talking in non-specific platitudes that could literally mean anything and thus mean nothing.

The biggest problem with a simple architecture and often why someone would complexify it up, is as a protection mechanism so you don't get some performance review motivated yokel doing a drive by to add in a feature and look like a rock star because it was so damn easy. NSS, that was the next step. You see this in open-source as well. Someone builds a beautiful foundation, and Steve rolls up with a submarine pull request to slam that cherry on top.

A complex architecture ensures that you can adequately gate keep.

The next aspect is you have to relentlessly say no to feature requests that literally destroy the system to add some new feature, the pressure to smash it to bits and make it a no longer simple system is such a perverse incentive that unless you have support from on high, every simple system will decay into a total mess, then the finger-painters will have moved on to some other system to predate on.


FYI, it's "toeing" the line. It's an expression from track and field, where at the start of each race the competitors place the tip ("toe") of their foot on the start line when they're ready to run.

Similar to the expression "Up to scratch." In a boxing ring, there is a line in the ring that boxers must come to to begin the fight. To be "up to scratch" is to be ready and worthy of the fight, and from that, to be of acceptable quality or capability.


Minor correction for your correction:

"Toe the line" most likely has its roots in military tradition [1]. To toe the line is to stand at the line for inspection - it's still used that way today (I had to do it myself in boot camp).

Relevant, as in this context, it means being obedient to the hierarchy.

1 - https://en.wikipedia.org/wiki/Toe_the_line#:~:text=The%20mos....


Interesting, I hadn't seen #:~:text=abc before. It does nothing on my browser (Firefox 109.0b9 on macOS) but is it intended to highlight text?


It does on Chrome. I believe it’s used from Google search results to link you to the relevant place in the page.
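For anyone who wants to build such a link by hand, here is a minimal sketch. The helper name `text_fragment_url` is made up for illustration, and the phrase in the usage line is an arbitrary example, not the (truncated) fragment from the link above. It assumes the simplest form of the directive, with just a start text and no prefix, suffix, or end text.

```python
from urllib.parse import quote


def text_fragment_url(page_url, text):
    """Append a `#:~:text=` directive asking the browser to highlight `text`."""
    # Percent-encode everything, then also encode "-", which quote() leaves
    # alone but which has a special meaning inside a text directive.
    encoded = quote(text, safe="").replace("-", "%2D")
    # Simplification: drop any fragment already present on the URL.
    base = page_url.split("#", 1)[0]
    return f"{base}#:~:text={encoded}"


print(text_fragment_url("https://en.wikipedia.org/wiki/Toe_the_line", "toe the line"))
```

Browsers that don't support the directive simply ignore it and load the page normally, which matches the Firefox behaviour described above.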


Most interesting, I do believe you're right!


> non-specific platitudes

Anything specific that tends to be used often?


Appealing to "best practice" without concrete, specific reasoning is a big one.


It is org specific, I have already triggered myself and need to find a safe space.


"We need to build something that scales" to justify virtually every single source of complexity.


Especially when the understanding of scaling only distinguishes between "unmaintainable monolith" and "full-blown web scale".


"Move fast and break things".

At 48 I've stopped fixing other people's things.


Prob OT but ‘break things’ old Zuck sure did. A bit sad given the billions folks made in the meantime.

I almost said that I’m surprised they weren’t held accountable… but we all know better.


> The next aspect is you have to relentlessly say no to feature requests that literally destroy the system to add some new feature, the pressure to smash it to bits and make it a no longer simple system is such a perverse incentive that unless you have support from on high, every simple system will decay into a total mess, then the finger-painters will have moved on to some other system to predate on.

Would you rather have a simple system that does nothing, or a complex / difficult-to-work with system that solves customer needs? It's easy to go too far in either direction.


> Would you rather have a simple system that does nothing, or a complex / difficult-to-work with system that solves customer needs?

With respect, this is clearly a false dichotomy.

Creating the "simplest possible system" that solves customer needs is the entire point of this discussion. Avoiding as much complexity as possible brings enormous benefits. (Fewer bugs, less employee turnover, higher productivity, etc.)

Never accept the notion that systems must be "difficult-to-work with" in order to solve customer needs.


It would be a false dichotomy, if I meant to say that you must choose one of these paths. I suppose "it's easy to go too far in either direction" was not clear enough. The argument I'm making is that the OP is creating a false dichotomy: that you must reject feature requests, or ruin the system.

My point is that there are many, many developers who "relentlessly say no to feature requests that literally destroy the system to add some new feature" when the reality is that they do not want to put in the effort required to add the (meaningful) feature to the system.

They are choosing system purity and laziness over delivering customer value, which requires more work in order to keep things simple.


In my experience, most of the software complexity is accidental, i.e. it doesn't directly emerge from customer needs.

And even if your framework is to use customer needs as justification for complexity, people will still use something like Kubernetes for an app with 10 monthly users, because it "helps us deliver features faster, have zero-downtime deployments and be scalable".

Most (non-product) engineers have a natural inclination to introduce more complexity, and will use mental gymnastics to justify it.


The majority of human intellect is used to justify actions they have already taken.


I never really had to make this choice. But I often had to choose between a system that is reasonable and actually solves the needs of customers, versus a system that is complex, made in a hurry and full of badly made features that only some project manager cares about. I know which one I prefer.

Just today, for example, I removed several thousand lines of code from a very complex form that no customer ever really used other than for testing. My boss told me "just nuke it". Life goes on.


I don't think it has to be so nefarious. What I've experienced is developers taking the path of least resistance and failing to perform proper analysis to arrive at a less complex architecture.

Why are folks taking the path of least resistance becomes an interesting question.


[flagged]


On the chance that this is a real cry for help and you are in a really bad place, please seek help.


This is why small organizations produce stuff, then large organizations buy it and sell it as a product.

As an engineer you just have to ensure you have equity in the small company before it's purchased and/or hope the big company buys your company rather than just copying it and fighting you in court for a decade.


Is this because as groups grow in size, high level decision making power tends to move up to less informed management individuals?

You'll end up with a "I programmed a little bit back in the day" person who makes decisions based on how to currently keep their staff best utilized. Tools choices, boundary points, stacks, become chosen based on goals that are not rooted in design simplicity.

I've watched this story play out again and again and again at companies. Upper management will promote these types of individuals into lower/middle management because

a) they're deemed actually reassignable, whereas a really talented UX designer is obviously adding the most value doing that

b) people buried/invested in tech stack/design details are harder for owner/operators to relate to than individuals more like themselves.


As groups grow in size it becomes increasingly hard to be informed as management. Eventually it becomes near impossible.


And yet, somehow open source projects manage to have some degree of success with very little of anything representing “management”.


How many of the extremely successful open source projects have clear strong management that's one dude(ette)? Linux, Python, Ruby, Blender...

Great delegation and engaging with community feedback productively are of course part of that strong management.


How many do, and how many of those see success outside their core highly technical user bases?


At a place I worked, they explicitly rewarded ‘solving complex problems’. It's actually surprising there weren't more cobra farms.


Reminds me of “I apologize for such a long letter - I didn't have time to write a short one.”― Mark Twain


That was Blaise Pascal.


Funny… I’d actually always heard it was Newton. I checked and looks like you’re def right here.

I actually expected to find that it was a fake quote. Lately, I keep finding that many of the good quotes are actually fake. It’s like they actually survive on the value of the quote and, if we’re honest, whether someone decades or centuries ago actually said it is only tangential.

But thx for the correction!


Exactly this. I've done well for myself by focusing on building things that work, and shipping them on time. Anything else is extra and not to be done at the expense of the first two things.


Would you build something simpler and more reliable if it took you from shipping a week early to a week late?

I agree that anything else is extra, but most of my difficult professional engineering decisions are tradeoffs between those first two things: I can build on an older system less well adapted to current needs and deliver more quickly, or build something that fits current needs better but takes a bit longer. I sadly don't have enough days in life where there's a better and quicker option!

Put differently, it's a lot faster to put another layer of duct tape on top of the last duct tape patch of a pipe leak. But every layer of duct tape you put on makes it more difficult to replace the pipe segment underneath. Developing judgement on when to do which is the (difficult) trick.


If given that choice, we usually build both.

We deliver the duct taped one on time so the software folks can start their work, and then we build the non-duct taped one, wheel it in a week later, and quietly swap it in for the duct taped one.


> That's not a 10x engineer, that's a 100x engineer.

You're speaking of ChatGPT!


Yep, working with well-architected software systems is the difference between a low-stress well-paying job and rapid burnout. No surprise that the latter results in massive turnover.

It's remarkable how "add feature X" can be a 1 hour task or a 1 month task depending on whether the system was designed to evolve and scale in that direction. But the non-technical management or customer just sees it as a simple request, and the (Nth) software developer is left to pick up the pieces.


> order-of-magnitude increases in staff turnover.

I suspect this goes both ways. I have worked at a place characterized by frequent internal shuffles, and more than once spent ages reverse-engineering code when the original team could have handled the problem in a fraction of the time.


MIT has an amazing program for System Design and Management. Dan and I are both graduates of this program. Some of the courses I would recommend include System Architecture, System Safety, and System Dynamics. Most of the content is available on OCW.

What is taught is not software-specific, but is entirely applicable to software, outside of the world of 'throw everything on the wall and see what sticks' as long as the venture capitalist can be shown growth at all costs. I wish more software developers were mindful of complexity and architecture.


What the author is up to now https://www.silverthreadinc.com


Sounds super interesting. Can you link to the specific OCW courses? I see a few from different departments.


ESD.34, ESD.342, 16.863J, 15.871, 15.783J, and ESD.33, to name a few. Also 15.965, my favorite, offered by my thesis advisor Michael A. M. Davies. Based on that course, it is likely that OpenAI won't be walking away with the cake.


Can you elaborate on the cake comment?


That is the share of value capture from the market that they just made real. There are legitimate scenarios where first movers do not have an advantage.


Could you elaborate a bit more on what you got out of the program? I never heard about it before but I'm intrigued. Did you find a particular course memorable?


The tag line within SDM was that it is a program for those who want to lead engineering and not leave engineering (as with an MBA). I think what I got out of it was the meta framework for thinking: being able to step away from the madness of releasing a v1 product and having tools for thinking about the bigger picture. Also, MIT: it is a very rewarding ecosystem to be in.


I've gotta say, this study really hits home for me. As a developer who's worked on a few projects with some gnarly architectural complexity, I can totally see how that would lead to these kinds of costs. Just last year, I was part of a team that had to deal with a super convoluted codebase that felt like a patchwork of different styles and approaches, with no clear hierarchy or modularity. It was an absolute nightmare to navigate and make changes to, and our productivity took a nosedive. Not to mention the bugs that kept creeping in and the insane amount of time we spent debugging. I even saw a few of my colleagues jump ship because of the frustration. I wish management would've realized the potential benefits of investing in some proper refactoring efforts to improve the architecture. It might've saved them a lot of money (and headaches) in the long run!


Two things that get in the way of "long run" thinking.

What's the timeframe that teams should adopt? If it will pay for itself in three years, is that too long? What's a good argument to make here that will appeal to the bean counters that are used to thinking in terms of quarterlies? I personally am comfortable with "eventually/infinite" but that is a tough sell.

Capitalization vs Expense. Maintenance/refactoring is not capitalizable, and thus gets discouraged in businesses that care about P&L. New (capitalizable) projects/features are encouraged instead. What are some good ways to encourage maintenance/refactoring in this kind of environment?


Stuff like this contributes massively to a culture of low expectations too. When it becomes accepted that simple things take a long time to do the normal reward system goes flying out the window. Both internally and organizationally.

Internally because the sense of accomplishment from seeing the new bit of functionality is tiny compared to the effort it took to get done. Organizationally it becomes harder to reward productivity because with no sense of how long something should take there's no way of knowing if an engineer is fast or not.

It's a productivity death spiral. Engineers slack off because there is no internal or external reward for doing good work. That slacking off slows development and then that velocity gets accepted as normal. Engineers continue to slack off against the new "normal" and establish an even slower normal to slack off against. That's how hours long tasks turn to days and then weeks and then months until the project dies.


I still don't know what is better. Create quick throwaway code you replace in a few years. Or build a well structured system that lasts longer. I have spent my whole career in only the first 0-8 years of company startups. Most of the joy and success I got came from 'quick hacking together working systems'. A lot of people around me don't share that opinion and are better suited to structured companies. Maybe there is no better.


I've been a software engineer now for 12 years and I think there's an obvious answer: architecture. If you build software with a solid architecture from its inception, then "hackily building features" will happen much faster and cleaner than in a mess of a codebase. I wish more people realized that "build things fast now without caring about quality" translates to years of "God, I wish we could just throw all this out" and "no one knows how this part of the code base works but it does so we don't touch it." Plus, realistically speaking, adding that solid foundation shouldn't take _that_ much longer if you know what you're doing.


But you don't know the architecture you really need when you first start.

I think the key is that you accept that this is true, and that having a good architecture is a continuous process that never ends. You build the best architecture you can imagine, given the current level of knowledge. Then you have to be willing to refactor every single day after that, as you gain new knowledge. Too many people think that refactoring is only for special occasions, when things have gotten really bad. Every single PR can contain a small refactor. Small changes can accumulate and eventually lead to major architectural shifts.


Every messy project starts from ‘architecture’. At the beginning it is clean and super well organized. Then, instead of the extensions that were thought out beforehand, some breaking feature needs to be implemented. And it is implemented in the fashion most aligned with the starting ‘architecture’. After a couple of years of such implementations, a mess is created.

It is better to start with a simple solution. Then, after each breaking feature, the whole thing needs to be refactored, because the initial assumptions about the software may have changed.


This assumes you know where you are going. All code is a liability and all architecture decisions are tradeoffs.


When you have a savvy competitor that is taking the opposite tack and killing you in a feature bake-off, you don't have this luxury. Your Ferrari is being built in the garage while paying customers are driving around in your competitor's Yugo.


No plan survives first contact with reality: "well structured systems" usually get thrown away in the same few years. The only difference is that they often get thrown away along with the company that built them.

Quick throwaway code is orders of magnitude better (unless you're landing airplanes or building x-ray devices). Especially if you're consciously treating it as a throwaway.

The only way to build a well structured system that lasts long is to have vast domain expertise, meaning having done literally the same thing multiple times in the past. This rarely happens in general, and pretty much never happens if you're innovating.


I've never seen an innovation that wasn't the same as some other piece of code with different variable names. It's like the old joke about how every piece of software, given enough time, gains the ability to send email.


It's like saying that all languages are the same as other languages with just different tokens and semantics. Sure, but those differences are what defines a language.

When something is "the same as some other piece of code with different variable names" it becomes its own isolated solution, be it a library, framework, service, or software. We typically don't run our own SMTP servers, don't write our own databases and don't implement a TCP/IP stack from scratch.

When we're not doing all those things, typically what's left is innovation: i.e. building something that doesn't have an existing generic solution.


> Or build a well structured system that lasts longer.

What if you had the ability to build a system that was initially structured well enough such that it could be made to last indefinitely? From a cost/benefit standpoint, is the Ship of Theseus approach not the most ideal for a business owner?

Even for a developer, the notion that you have to constantly "trojan horse" your new product iterations into production via the existing paths means you will achieve mastery over clever workarounds and other temporary shims. Once you gain competence in this area, it is possible that you will never want for a new boat again. You start to get attached to the old one and all of the things it takes care of that you absolutely forgot about years ago.


> if you had the ability to build a system that was initially structured well enough such that it could be made to last indefinitely

If you could have that, it would obviously be amazing. Do you (or anyone) have that ability though? So far it seems the answer is no. People are just not smart enough to predict the future.

As a more nuanced answer, a system that is scalable enough to grow to 1000x the current load is usually way too expensive to build with the resources you have "right now". The best you can do is build a system for 10-100x the current load and hope you haven't forgotten any edge cases, but usually you encounter them way sooner than 10x the load. Building so that you can easily refactor your current system is the way to go, but even then you will sooner or later run into problems your original design did not consider.


The point I am trying to make is that if your system can survive in production under realistic workload for some period of time (i.e. the boat floats & makes it out of the harbor), then what is preventing you from taking that success and incrementally moving it further into a preferred direction?


In startup/MVP land there is a genuine tension between shipping it and over-engineering, at the two extremes. It is quite possible to correctly think “this is bad engineering”, still ship it, and for all of those decisions to be correct. Bootstrapping and early stage code almost inevitably gets replaced, so it isn’t worth polishing too much. It feels totally wrong and requires some real soul searching for some engineering personalities, but in the end it’s optimizing for the most important outcome: the actual business success. Speaking from the experience of not doing this a few times and then the whole thing failing.


Speed is the main reason I do architecture.

In the end it mostly comes down to reducing complexity, but the goal is always allowing new features to be added as fast/easy as possible.

Because I'm lazy.


it is basically never the case that the time you spend typing code into your editor is the bottleneck that impacts how quickly you deliver a system or feature

quick hacking together of prototype systems delivers a dopamine high that is quickly reduced to zero when those systems need to be maintained and extended into an indefinite future

building systems that are well structured and not terrible requires knowledge and experience, but absolutely 100% does not take more time than building shitty hackathon prototypes


It's a balancing act:

"There is no theoretical reason that anything is hard to change about software. If you pick any one aspect of software then you can make it easy to change, but we don’t know how to make everything easy to change. Making something easy to change makes the overall system a little more complex, and making everything easy to change makes the entire system very complex. Complexity is what makes software hard to change."

https://martinfowler.com/ieeeSoftware/whoNeedsArchitect.pdf


Worth pointing out that this study does -not- equate "architectural complexity" with abstraction. Many consider use of "hierarchies, modules, abstraction layers" to be 'unnecessary complexity', whereas the thesis clearly states they "play an important role in controlling complexity." OP is not a call to get rid of software architects; arguably it says ~'hire competent architects as bad architecture negatively impacts faults, staff turnover, and product success.'

"Architecture controls complexity", and under-/poorly- designed architecture while superficially "simple" will give rise to un-intended complexity. Microservices are the du jour poster child here.


Not a direct response, but some thoughts that come to mind without any direct conclusion:

Good abstractions make simpler software. Leaky abstractions multiply the complexity of software by a lot.

Some respond to this by making "simple" software that dispenses entirely with abstraction. This ends up in a lack of architecture where complexity still multiplies, though perhaps less than the typical mix of mostly leaky abstractions and a few sound ones.

However, it's kind of the nihilism of software and throws away the opportunity for us to actually improve our craft... so I'm not all too interested in it.


The uncomfortable truth is that software development is a series of decisions made day after day, and many people simply cannot reliably make one good decision after another. You see this in poker and chess too, there are many people who will never be good at those games (like me), and there are many people who will never be good at software. Demand for good developers outstrips supply and then you have the whole measurement problem on top of it. At least with poker and chess we know who is good at it, and who is not.


People regularly forget Rule Zero of patterns: Don't implement a pattern if it doesn't solve a problem. That's the difference between unnecessary complexity and controlling complexity.


IMHO people aren't forgetting it — it's pretty commonly the other way around.

Anecdotally, the people I see advocating for spaghetti architecture are often utterly convinced they're solving some critical problem. Conversations with stakeholders then turn into fearmongering — often helped by the fact that most stakeholders don't care about architectural nuance — and it becomes easy to wave off any competing simpler architectures by deriding them as "taking on tech debt." In general, it's surprisingly easy to play corporate politics to bring a convoluted architecture into reality.


Can't wait to read this, really resonates with me as I'm dealing with this right now at a big FAANG company.

In my experience the problem is that as with all things it's all about balance. We shouldn't throw away architecture entirely and write stupidly simple, quick solutions because they will be messy. But we also shouldn't over-abstract things so much that only the person/people who built the system can understand it and work with it. Both could have dire consequences for an organization, making building new features and delivering value to users slow, difficult and costly.

Once I finish reading this MIT paper, I want to dig further into exactly what makes software 'complex'. In my experience:

- too many layers of indirection
- overly theoretical concepts not grounded in real world thinking
- lack of documentation
- bad naming
- lack of testability
- tight coupling
- following 'best practices and patterns' without clear justification
- trying to solve for problems before they exist

We should be building systems that are grounded in concepts that are easily understandable - which is exactly why Object Oriented Programming has been so successful. We write programming languages as a means of communicating with each other about program logic, why not do it in terms that we as users already understand in the real world?


Software needs regular refactoring just like cars need regular maintenance. It is very difficult to determine how much time and effort to spend on making it extensible when you're writing it the first time (even the product owner might not know how it will be used initially), but after a few months in production and a few feature requests, you'll get a better idea of what the pain points are and will be in a much better position to refactor. The problem is convincing the people who cut the check to allow engineers the time to do it.


The hardest part here is the tradeoff between architectural complexity as you build systems and speed of shipping product. Earlier stage companies will ship ship ship and ignore good architecture practice. At some point, it will come back to bite you if your company lives to see another day.


This is why you need good people when you start: because good architecture is the difference between fast iteration and slow iteration.

A good architecture will allow you to make changes easily. A bad one doesn't. It's actually pretty simple, conceptually speaking.

If you believe that "late stage" companies make correct architecture choices you're probably incorrect. It's not about late stage or early stage, it's about knowing how to build software from scratch in a way that you don't hamstring yourself (and others) down the road.


I would say it goes beyond early stage companies and extends to later stage product-driven companies, especially those that value time-to-market more than anything else.


I would argue it's the later stage company who doesn't take the time to fix it / pay off that tech debt who fails.

I'm not against picking up some tech debt here or there if you pay it off.


Always a tradeoff. You can build a Ferrari but may end up caught in the garage while a competitor has paying customers doing laps around the feature track collecting $200 at every turn.


They find architectural complexity accounts for defects (x3), productivity (50%) and staff turnover (order-of-magnitude).

The "McCabe cyclomatic complexity metric" doesn't predict the last two. Instead they use the "MacCormack" method (who supervised this PhD...), so I wonder how it differs?

Fun fact: they reference an evaluation of Mozilla's refactoring (the one Joel said you must never do).

It's a PhD thesis, and it's long, detailed and academic (tautology), yet lacks an introduction, so:

  The Missing Introduction
"Architectural complexity" here doesn't mean the complexity of architecture features themselves (e.g. architecture astronauts making too many layers etc), but direct and indirect interactions between parts.

Instead of diagrams of arcs and nodes, they believe a matrix representation helps show complexity (vertical axis: using; horizontal axis: used - or maybe vice versa - with dots at that location), called a "Design Structure Matrix" (DSM).
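To make the DSM idea concrete, here is a minimal sketch of building and printing one from a list of "A uses B" pairs. The file names, the function names and the choice of `x` for a dependency, `.` for an empty cell and `-` for the diagonal are all my own assumptions for illustration, not the thesis's tooling or notation.

```python
def build_dsm(files, deps):
    """Return a square matrix where dsm[i][j] is True if files[i] uses files[j]."""
    index = {name: i for i, name in enumerate(files)}
    dsm = [[False] * len(files) for _ in files]
    for user, used in deps:
        dsm[index[user]][index[used]] = True
    return dsm


def render_dsm(files, dsm):
    """Render the matrix as text: rows are the 'using' file, columns the 'used' file."""
    lines = []
    for i, row in enumerate(dsm):
        cells = "".join(
            "-" if i == j else ("x" if cell else ".")
            for j, cell in enumerate(row)
        )
        lines.append(f"{files[i]:<10} {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-file project: main.c uses parser.c, parser.c uses util.c.
    files = ["main.c", "parser.c", "util.c"]
    deps = [("main.c", "parser.c"), ("parser.c", "util.c")]
    print(render_dsm(files, build_dsm(files, deps)))
```

Marks clustered above or below the diagonal show which way the dependencies flow; marks on both sides hint at cycles.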

The key sections seem to be:

  3.6-7 [p.34-42] about DSMs
  5.1 [p.67-83] about complexity by the "MacCormack approach" [pasted below, from p. 70]
[5.1.2.1] Capture a network representation of a software product's source-code using dependency extraction tools

[5.1.2.2] Find all the indirect paths between files in the network by computing the graph's transitive closure

[5.1.2.3] Assign two visibility scores to each file that represent its reachability from other files or ability to reach other files in the network. [Visibility Fan In, Visibility Fan Out]

[5.1.3] Use these two visibility scores to classify each file as one of four types: peripheral, utility, control, or core.

They also have "network density" and "propagation cost" measures [p.76]

---

Aside: A PhD thesis takes some time to read before commenting. I wish there was a way to facilitate discussion of the submission, in addition to the title-based opinion and experience that we have now. The patient_hackernews experiment (https://old.reddit.com/r/patient_hackernews/) didn't seem to work out.

Perhaps just a re-submission (or a bump?) labelled "for readers only" 24 hours later?


In my experience this problem arises from a lack of consideration of two related things, and not doing a third thing:

1. Not considering how inevitable development issues and bugs will affect the entire system. That is, if a bug gets introduced here, how bad will the systemic effects be, and how do we prevent it? A severe case is if you start writing bad data from one component that then needs to be manually backfilled or reverted (especially if this then generates yet more bad data further down the line).

2. Not considering how entire-system failures will present themselves, and not having a good way to diagnose them.

3. Failure to develop testing (integration or whole-system blackbox) that catches the first case, failure to develop tooling (tracing, logging, synchronizing changes across components) that assists in the second case. Or instead adjusting the system so that the first or second cases are considered.

It’s easy to get stuck in a local optimum where a few old hats who understand the entire system are the only ones capable of predicting and diagnosing failures across components. It’s also easy to say that less skilled or new engineers just need to put in the time to get good, but it’s often the case that the old hats have a potentially-automatible procedure for narrowing down the problem or tracking where the issue came from, or that the benefits of separate components don’t justify the increased rate of bugs and time spent tracking them down to fix them.


I've long had an allergy to accidental complexity but recently had some new aspects illuminated for me.

I had a block of code that had been added to by others, making it a bit of a recursive mess with a handful of bugs that were hard to reproduce, let alone fix. When I finally got tired of it all, I set about replacing the recursive code with some iteration, and ended up with DP code, which avoided a bunch of duplicate errors and the need for caching.

Without all the DP terminology, it's what other people would call building a (programmatic) plan and then executing it, rather than a depth first search algorithm. I used to use this quite regularly, but have only used it a couple of times on my current gig. Strictly speaking planning an entire action before starting it might result in a bit extra data hanging around because of building the data structures ahead of time, but at the end of the day it allowed me to eliminate a whole lot of steps that seemed necessary, and also remove duplicate warning messages. It wasn't necessarily faster than it could be, but it was way faster than I ever expected it to be.
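A rough sketch of the difference, since the original code isn't something I can share: everything below (the node names, the `legacy_` prefix check, the function names) is hypothetical and only illustrates the plan-then-execute shape, not the actual code I rewrote.

```python
def recursive_walk(node, children, warn):
    """Depth-first version: a node shared by two parents is processed twice."""
    if node.startswith("legacy_"):
        warn(f"{node} is deprecated")
    for child in children.get(node, ()):
        recursive_walk(child, children, warn)


def build_plan(root, children):
    """Planning phase: walk iteratively and record each node exactly once."""
    plan, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        plan.append(node)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(children.get(node, ())))
    return plan


def execute_plan(plan, warn):
    """Execution phase: one pass over the plan, so each warning fires once."""
    for node in plan:
        if node.startswith("legacy_"):
            warn(f"{node} is deprecated")


if __name__ == "__main__":
    # Hypothetical graph where both "a" and "b" depend on the same legacy node.
    children = {"root": ["a", "b"], "a": ["legacy_x"], "b": ["legacy_x"]}
    execute_plan(build_plan("root", children), print)
```

The plan is the extra data hanging around that I mentioned, but once it exists the duplicate work and duplicate warnings disappear without any caching.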

Writing obvious code with clear actors and data flow steps makes it a lot easier to add features, and make performance improvements. Obscure code leads to more obscure code (qualitatively and quantitatively).


Can you define your use of DP here? I'm not sure if you mean dynamic programming, or design pattern or something else, and am curious about your insight.


A wise engineer who I worked closely with told me "the job of the engineer is to manage complexity. Engineers don't like complexity".

There are those hard-learned sutras that just make life so much easier down the line.

Down the line always comes (unless you are playing career musical chairs and are willing to gamble that you won't be around to have to deal with the code mess when there's a sev2), but as they say, karma's a b...

Avoid premature optimization

TDD

Avoid complexity

Readable code

Documented code

Probably helps avoid most problems.


This is the economic value of refactoring, right on paper.

'Complexity is the mind-killer' - Linus Atreides


Architectures need to be judged by their difficulty to make code and data changes.

Changes in a stratified architecture are much simpler than in a layered design. (Changes in a layered design almost always mean changes to all the layers)


I cannot believe this is free. This is going straight to my read soon queue


This thesis provides empirical support that a specific measure of software architectural complexity is costly. Specifically, they look thru source code in an automated manner, construct the graph whose nodes are source code files and whose edges are the following cross-file relationships (page 73, section 5.1.2.1):

- The site of function calls to the site of the function's definition

- The site of class method calls to the site of that class method's definition

- The site of a class method definition to the site of the class definition

- The site of a subclass definition to the site of its parent class' definition

- The site at which a variable with a complex user-defined type is instantiated or accessed to the site where that type is defined. (User-defined types include structure, union, enum, and class.)

Then they compute the transitive closure of this graph.

Then they compute two metrics for each node by looking at the transitive closure graph (page 76, section 5.1.2.3):

- Visibility Fan In (VFI): how many other nodes have edges that go from the other node to this node?

- Visibility Fan Out (VFO): how many other nodes have edges that go from this node to the other node?

They observe that by looking at the VFI metric across various files, files tend to sharply cluster into either 'low VFI' or 'high VFI', and similarly for VFO (although some files may be high in one metric and low in the other) (page 79, section 5.1.3).

They then classify each file as:

- low VFI, low VFO: 'peripheral'

- high VFI, low VFO: 'utility'

- low VFI, high VFO: 'control'

- high VFI, high VFO: 'core'

They then find that 'core' files are the most costly, in terms of defect density, developer productivity, and probability of staff turnover.
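For anyone who wants to try this on their own codebase, here is a minimal sketch of the steps as summarized above, assuming you already have the direct file-to-file dependencies as a dict. The function names, the example graph and the cutoff values are my assumptions; the thesis picks its high/low split from the observed clustering, which this sketch does not attempt to reproduce.

```python
def transitive_closure(graph):
    """Map each file to the set of files reachable through one or more edges."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    closure = {}
    for start in nodes:
        seen = set()
        stack = list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        closure[start] = seen
    return closure


def visibility(closure):
    """Return (VFI, VFO) dicts, counting only *other* files as described above."""
    vfo = {n: len(reach - {n}) for n, reach in closure.items()}
    vfi = {n: 0 for n in closure}
    for n, reach in closure.items():
        for m in reach:
            if m != n:
                vfi[m] += 1
    return vfi, vfo


def classify(vfi, vfo, vfi_cut, vfo_cut):
    """Label each file; the cutoffs are caller-supplied assumptions."""
    labels = {
        (False, False): "peripheral",
        (True, False): "utility",
        (False, True): "control",
        (True, True): "core",
    }
    return {n: labels[(vfi[n] >= vfi_cut, vfo[n] >= vfo_cut)] for n in vfi}


if __name__ == "__main__":
    # Hypothetical project: core_a.c and core_b.c depend on each other.
    graph = {
        "app.c": ["core_a.c"],
        "core_a.c": ["core_b.c"],
        "core_b.c": ["core_a.c", "util.c"],
        "util.c": [],
        "lonely.c": [],
    }
    vfi, vfo = visibility(transitive_closure(graph))
    for name, kind in sorted(classify(vfi, vfo, 2, 2).items()):
        print(name, kind)
```

In this toy graph the two mutually dependent files come out as 'core', which matches the intuition that cycles are what push files into the expensive category.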


Having worked in and out of FAANG and with ex-FAANG in startups, FAANG engineers have a tendency to treat everything like a FAANG problem that demands FAANG solutions.

You'll see these massive systems with event driven architectures and layers of nested microservices deployed to hand rolled K8S clusters with custom plugins and Cloud Native CI/CD systems strapped over the top of a complex repository setup and a complex metrics/tracing/logging stack to make sense of all the inter-dependencies and lifecycles of a request in this system. All of this for a system that is taking <100 RPS and likely won't see more than that for many years.

I'm not exaggerating when I say you could host the entire company on a single R620 (+ a cold backup) with PID1 managing a couple of bash scripts and a few Python/Go/Java/JavaScript processes wired up to an SQLite db and a nightly backup.

But their architecture choice does create jobs in our industry. These services need teams to maintain them. You have a K8S team, a DevTools team, a CI/CD team, teams for the API services, teams for the backend microservices, SREs to respond to incidents, etc. Those teams all need to be managed, coordinated, staffed, paid, etc. so you hire on the management and administrators to handle the size of the engineering organization responsible for this thing.

So instead of two 4-5 figure servers and a few $100 a month in hosting, plus a handful of well compensated engineers who keep this thing running, your engineering department is burning 6-7 figures per month in cloud costs and millions in headcount.

To be clear, some companies need these setups. I do platform engineering consulting and help companies build these exact org structures and systems I'm talking about here. But I only take those contracts when the work is justified. There is a spectrum, and very few companies fall on the end of the spectrum that demands these solutions.

Today, a majority of the companies in the American market should be on serverless offerings that scale to near-zero - simply to reduce the need to staff an engineering org to maintain the "infrastructure" under the system. You buy that engineering org from the serverless vendor with a support contract. I'm not setting a mom-and-pop shop up with an R620 because they'll be 100% dependent on a high-skill engineer for the rest of their existence, which isn't likely to yield an ROI for them.

Past that you start migrating to VMs or bare metal. That'll get you pretty far. Modern computers are Very Fast, you can serve a lot of traffic from a single 1U slot.

Very few companies get to the scale where any of these FAANG architectures start to make sense.

If I were to take a guess, I think a lot of the FAANG stuff is salary chasing. To justify a $250k - $500k salary folks think they need to build these fancy architectures to demonstrate their value, and the junior engineers and management are all in because they're chasing the salary ladder and need to get this experience to get into FAANG. But the reality is, with back of napkin math, you can save a company _millions_ per year with conservative architectures.

For me at least, negotiating contract rates against that pool of savings is an unsolved problem.


> serverless offerings

Serverless winds up being complex, in my experience.


I've been using Cloudflare and have found the opposite.

You can accomplish an incredible amount for a small business with Pages and a single monolithic function. Once you outgrow that, IMHO, it's time to consider moving to a server - not to a more "complete" serverless offering like lambda.

The migration path for this is pretty straightforward since the API surface area of a Cloudflare worker is pretty small (depending on whether, and how, you are using Durable Objects, KV, D1, R2, etc. things can get a little harder). I can usually port a Cloudflare worker to a Node.js service in less than a day of work.


This is my jam. It’s really difficult to stop myself from chasing shiny things, but the last year running an app with a dead simple architecture has been very peaceful.

Other than some hairy legacy stuff here and there, it all just works and the run rate is exceptionally low.

When it occasionally breaks, it’s obvious what’s wrong and quick to fix.

We spend most of our time making it better.

Long live the 3-tier monolith.


I believe that the great examples always arise from OpenSource projects. The design and modular code always play a great role in increasing the ability to customize or add features and also have an increase in collaboration from more volunteers. Developer life gets more interesting with a massively improved code quality when some tough decisions are taken much earlier.


There are great examples in open source, but my main issue is open source projects are usually frameworks. Frameworks solve a lower level problem than end user application code. I often see excess abstraction and unneeded flexibility in business code that just makes the problem domain harder to understand, without ever providing value.


Looks like we need a better system architecture for the MIT website hosting the material as it’s overloaded with too many requests


What exactly did decrease the complexity? And if, apparently, we can measure it, shouldn't it be part of any development process, similar to a linter?


Should have (2013) in the title.


Added. Thanks!


Complex means many, not challenging - https://www.dictionary.com/browse/complex

The most practical measure of complexity is duplication. Are there two, or more, pieces of code accomplishing the same or a similar job? That is complex. The solution is to refactor the many parts into fewer parts.


composed of many interconnected parts; compound; composite: a complex highway system.

characterized by a very complicated or involved arrangement of parts, units, etc.: complex machinery.

I would say you could have a software system with many interconnected parts, and/or with a complicated or involved arrangement of parts, without having duplicate code or duplicated functionality.


Duplication is just one of many potential measures. The end goal though is converting many to few.


No one uses the word complex to mean many.

Further, sometimes 1 thing is overloaded in difficult-to-understand ways, and so there should be more things, not fewer. Sometimes there are many things that should be 1 thing.

There isn't just one good measurement of complexity, as complexity isn't inherent to things or systems in themselves; rather, complexity is a feature of perception, which gets confused in all sorts of irreducible ways.


This popular language author invalidates your comment: https://www.infoq.com/presentations/Simple-Made-Easy/


The usage in that case is a term of art. So silly. Again, no one uses the word complex to mean many (except in very narrow circumstances when the new definition that no one uses has to be indicated explicitly in order to prevent confusion given the fact that it is a novel usage).



