There are some web apps still in production that I wrote almost a decade ago in Node+Express in the simplest, dumbest style imaginable. The only dependencies are Express and some third-party API connectors. The database is an append-only file of JSON objects separated by newlines. When the app restarts, it reads the file and rebuilds its memory image. All data is in RAM.
I figured these toys would be replaced pretty quickly, but turns out they do the job for these small businesses and need very little maintenance. Moving the app to a new server instance is dead simple because there's basically just the script and the data file to copy over, so you can do OS updates and RAM increases that way. Nobody cares about a few minutes of downtime once a year when that happens.
There are good reasons why we have containers and orchestration and stuff, but it's interesting to see how well this dumb single-process style works for apps that are genuinely simple.
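For readers who want to picture the append-only file described above, here is a minimal sketch. It is written in Python rather than Node, since the original code isn't shown in the thread. The `JsonlStore` class, the `id`/`data`/`deleted` entry shape, and the fsync-on-every-write policy are all assumptions made for illustration.

```python
import json
import os


class JsonlStore:
    """In-memory dict backed by an append-only file of JSON lines (hypothetical design)."""

    def __init__(self, path):
        self.path = path
        self.records = {}
        # On startup, rebuild the memory image by replaying every line in order.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self._apply(json.loads(line))

    def _apply(self, entry):
        # Later entries for the same id win; a "deleted" entry removes the record.
        if entry.get("deleted"):
            self.records.pop(entry["id"], None)
        else:
            self.records[entry["id"]] = entry["data"]

    def _append(self, entry):
        # Persist first, then update RAM, so memory never runs ahead of the log.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(entry)

    def put(self, record_id, data):
        self._append({"id": record_id, "data": data})

    def delete(self, record_id):
        self._append({"id": record_id, "deleted": True})
```

Moving such an app to a new server then amounts to copying the script and the one data file, which matches the maintenance story told above.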
Built-in first-class concurrency (ala node, golang, rust, etc.) is a huge win for simple architectures, since it lets you avoid adding a background queue, or at least delay it for a very long time.
I think people are also too quick to add secondary data stores and caches. If you can do everything with a transactional SQL database + app process memory instead, that is generally going to save you tons of trouble on ops, consistency, and versioning issues, and it can perform about as well with the right table design and indexes.
For example: instead of memcache/redis, set aside ~100 MB of memory in your app process for an LRU cache. When an object is requested, hit the DB with an indexed query for just the 'updatedAt' timestamp (should be a sub-10ms query). If it hasn't been modified, return the cached object from memory, otherwise fetch the full object from the DB and update the local cache. For bonus points, send an internal invalidation request to any other app instances you have running when an object gets updated. Now you have a fast, scalable, consistent, distributed cache with minimal ops complexity. It's also quite economical, since the RAM it uses is likely already over-provisioned.
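To make the timestamp-check idea concrete, here is a rough sketch. It uses Python's built-in `sqlite3` as a stand-in for the real database, bounds the cache by item count instead of a ~100 MB byte budget, and assumes a hypothetical `objects(id, updated_at, body)` table. None of these details come from EnvKey or from the comment above.

```python
import sqlite3
from collections import OrderedDict


class TimestampCheckedCache:
    """LRU cache that validates each hit against the row's updated_at value."""

    def __init__(self, conn, max_items=1000):
        self.conn = conn
        self.max_items = max_items      # simplification: item count, not bytes
        self.items = OrderedDict()      # id -> (updated_at, body)
        self.full_fetches = 0           # counts how often the full object was loaded

    def get(self, obj_id):
        # Cheap query: only the timestamp, looked up by primary key.
        row = self.conn.execute(
            "SELECT updated_at FROM objects WHERE id = ?", (obj_id,)
        ).fetchone()
        if row is None:
            self.items.pop(obj_id, None)
            return None
        cached = self.items.get(obj_id)
        if cached is not None and cached[0] == row[0]:
            self.items.move_to_end(obj_id)   # mark as most recently used
            return cached[1]
        # Miss or stale entry: fetch the full object and refresh the cache.
        full = self.conn.execute(
            "SELECT updated_at, body FROM objects WHERE id = ?", (obj_id,)
        ).fetchone()
        if full is None:                     # row vanished between the two queries
            self.items.pop(obj_id, None)
            return None
        self.full_fetches += 1
        self.items[obj_id] = (full[0], full[1])
        self.items.move_to_end(obj_id)
        while len(self.items) > self.max_items:
            self.items.popitem(last=False)   # evict the least recently used entry
        return full[1]
```

Because every read still checks the timestamp in the database, a cross-instance invalidation message would only save the occasional stale fetch in this sketch; correctness does not depend on it.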
This is exactly the approach that EnvKey v2[1] is using, and it's a huge breath of fresh air compared to our previous architecture. Just MySQL, Node/TypeScript, and eventually consistent replication to S3 for failover. We also moved to Fargate from EKS (AWS kubernetes product), and that's been a lot simpler to manage as well.
> For example: instead of memcache/redis, set aside ~100 MB of memory in your app process for an LRU cache. When an object is requested, hit the DB with an indexed query for just the 'updatedAt' timestamp (should be a sub-10ms query). If it hasn't been modified, return the cached object from memory, otherwise fetch the full object from the DB and update the local cache.
I've never built something with this type of mechanism for a DB query, but it's interesting. I don't think I've ever timed a query like this, but I feel like it's going to be an "it depends" situation based on what fields you're pulling back, if you're using a covering index, just how expensive the index seek operation is, and how frequently data changes. I've mainly always treated it as "avoid round trips to the database" -- zero queries is better than one, and one is better than five.
I also guess it depends on how frequently it's updated: if 100% of the time the timestamp is changed, you might as well just fetch (no caching). Based on all the other variables above, the inflection point where it makes sense to do this is going to change.
Interesting idea though, thanks.
> For bonus points, send an internal invalidation request to any other app instances you have running when an object gets updated. Now you have a fast, scalable, consistent, distributed cache with minimal ops complexity.
Now you have to track what other app servers exist, handle failures/timeouts/etc in the invalidation call, as well as have your app's logic able to work properly if this invalidation doesn't happen for any reason (classic cache invalidation problem). My inclination is at this point you're on the path of replicating a proper cache service anyways, and using Redis/Memcache/whatever would ultimately be simpler.
It definitely does depend on various factors, but if your query is indexed, both the SQL DB request and the Redis/Memcache lookup of the full object are likely to be dominated by internal network latency. If your object is large, the DB single-field lookup could easily be faster since you're sending less back over the wire.
In other words, a single-field indexed DB lookup can be treated more like a cache request. Though for heavier/un-indexed queries, your "avoid round trips to the database" advice certainly applies.
With this architecture, the internal invalidation request is just an optimization. It isn't necessary and it doesn't matter if it fails, since you always check the timestamp with a strongly consistent DB read before returning a cached object.
> Built-in first-class concurrency (ala node, golang, rust, etc.) is a huge win for simple architectures, since it lets you avoid adding a background queue, or at least delay it for a very long time.
> For example: instead of memcache/redis, set aside ~100 MB of memory in your app process for an LRU cache.
Erlang/Elixir for the win with (almost transparent multi-core) concurrency and ETS ;)
> database is an append-only file of JSON objects separated by newlines. When the app restarts, it reads the file and rebuilds its memory image. All data is in RAM
Apps like this tend to perform like an absolute whippet too (or if they don't, getting them to perform well is often a 5-line change). It's really freeing to be able to write scans and filters with simple loops that still return results faster than a network roundtrip to a database.
The problem is always growth, either GC jank from a massive heap, running out of RAM, or those loops eventually catching up with you. Fixing any one of these eventually involves either serialization or IO, at which point the balance is destroyed and a real database wins again.
Another issue with "just a JSON file" as a database is that you need to be a bit careful to avoid race conditions and the like, e.g. if two web pages try to write the same database at the same time. It's not an issue for all applications, and not that hard to get right, but does require some effort. This is a huge reason I prefer SQLite for simple file storage needs.
It can definitely be a problem in Node.js. Assuming the workflow is read from disk -> modify -> write to disk, and that you're using the async fs functions, two async code paths running at the same time will have last-write-wins semantics and will lose data.
That's the naive scenario. If all code paths write out a global data structure, then it'd be fine. Or if the file is written append-only instead of as a single, atomic data structure, then it could be fine.
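The lost-update scenario is easy to reproduce on any single-threaded event loop. Below is a sketch that uses Python's `asyncio` as a stand-in for Node, with an in-memory dict playing the role of the file on disk. All names are hypothetical, and `asyncio.sleep(0)` merely simulates an async I/O call yielding to the loop.

```python
import asyncio

disk = {"counter": 0}  # stands in for the JSON file on disk


async def read_file():
    await asyncio.sleep(0)      # simulated async read: yields to the event loop
    return dict(disk)


async def write_file(data):
    await asyncio.sleep(0)      # simulated async write
    disk.clear()
    disk.update(data)


async def racy_increment():
    # read from disk -> modify -> write to disk, with no coordination
    data = await read_file()
    data["counter"] += 1
    await write_file(data)


async def locked_increment(lock):
    # Same workflow, but the whole read-modify-write cycle is serialized.
    async with lock:
        await racy_increment()


async def demo():
    disk["counter"] = 0
    await asyncio.gather(racy_increment(), racy_increment())
    racy_result = disk["counter"]

    disk["counter"] = 0
    lock = asyncio.Lock()
    await asyncio.gather(locked_increment(lock), locked_increment(lock))
    return racy_result, disk["counter"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the unlocked run both tasks read the old value before either writes, so one increment is lost. The locked run keeps both, which is the same effect you get when all code paths write out one shared in-memory structure.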
Hmm. I wouldn't think so, but I don't actually know
Still, given the strategy at hand, the in-memory JS object (exclusively single-threaded) is the source of truth, and just gets mirrored in the file system (and doesn't get read again until the next startup). So you should have an eventual-consistency situation in the worst case (any racing issue between file-writes would just put the file in a stale state, and the next file-write would bring it back up to consistency)
Your write will be fine; that is, it's not as if data from one write will be interspersed with the data from another write. It's just that the order might be wrong, or opening the file multiple times (possibly from multiple processes) could be fun too. The program or computer crashing mid-write can also cause problems. Things like that.
Again, may not be an issue at all for loads of applications. But I used a lot of "flat file databases" in the past, and found it's not an issue right up to the point that it is. Overall, I found SQLite simple, fast, and ubiquitous enough to serve as a good fopen() replacement. In some cases it can even be faster!
> Your write will be fine; that is, it's not as if data from one write will be interspersed with the data from another write.
Are you sure? I thought it could be if the first write had more data than the size of the kernel/fs-driver buffer, not all of it would be written, and then it could be interrupted when another thread calls write() with a small buffer that gets written in one go.
No, I'm not sure haha; but in my experience it usually works like that, but no doubt there could be edge cases there, too. Another good reason to use SQLite.
Although not a POSIX requirement, in practice for unix-like systems, file writes are atomic across concurrent writers.
You may be thinking of stdio buffering, where calls to printf etc. get split into multiple write calls. In those cases, it's possible to get errant interleaved writes.
It eliminates them if they're smaller than PIPE_BUF (IIRC, Beltalowda, dmoy, and stevenhuang are wrong about this), but the thing that prevents data races with regard to writes is running the application in Node, which is completely single-threaded.
> The problem is always growth, either GC jank from a massive heap, running out of RAM, or those loops eventually catching up with you
Absolutely. The challenge is having enough faith that it will take long enough to catch up to you.
Statistically speaking, it won't catch up to you and if it does, it will take so long you should have seen it coming from miles away and had time to prepare.
In my systems that use an in-memory/append-only technique, I try to keep only the pointers and basic indexes in memory. With modern PCIe flash storage, there is no good justification for keeping big fat blobs around in memory anymore.
Pointers are tuples of (Id, LogOffset) and are used to map logical identities to positions of those objects in the append-only log.
Indexes are usually a tuple of (Some64BitKey, Id) and are used to map physical business keys to logical object identities. These entries are only candidates in the case where the key material needs to be hashed and inspected for actual equivalence.
One big advantage with this approach is that you can stream big blobs directly out of the log to a caller-supplied buffer or stream. No intermediate allocations required aside from some small buffers.
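A sketch of how the (Id, LogOffset) pointers and (Some64BitKey, Id) index entries might fit together. The record layout, the `BlobLog` class, and the use of BLAKE2b for the 64-bit key are assumptions made for illustration; the commenter's actual format is not described in the thread.

```python
import hashlib
import io
import struct

# Hypothetical record layout: object id, key length, payload length, then key and payload.
HEADER = struct.Struct("<QHI")


def key64(key: bytes) -> int:
    """Hash a business key down to 64 bits; collisions are possible by design."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")


class BlobLog:
    def __init__(self, f):
        self.f = f            # seekable binary file opened for reading and writing
        self.pointers = {}    # Id -> LogOffset of the latest record for that id
        self.index = {}       # 64-bit key -> candidate Ids

    def append(self, obj_id, key, payload):
        offset = self.f.seek(0, io.SEEK_END)
        self.f.write(HEADER.pack(obj_id, len(key), len(payload)))
        self.f.write(key)
        self.f.write(payload)
        self.pointers[obj_id] = offset
        self.index.setdefault(key64(key), []).append(obj_id)
        return offset

    def _read_header(self, obj_id):
        # Leaves the file positioned at the first payload byte.
        self.f.seek(self.pointers[obj_id])
        _, key_len, payload_len = HEADER.unpack(self.f.read(HEADER.size))
        return self.f.read(key_len), payload_len

    def find_id(self, key):
        # Index hits are only candidates: confirm against the key stored in the log.
        for obj_id in self.index.get(key64(key), []):
            stored_key, _ = self._read_header(obj_id)
            if stored_key == key:
                return obj_id
        return None

    def stream_to(self, obj_id, out, chunk_size=64 * 1024):
        # Copy the blob to a caller-supplied stream using one small buffer at a time.
        _, remaining = self._read_header(obj_id)
        total = remaining
        while remaining > 0:
            chunk = self.f.read(min(chunk_size, remaining))
            if not chunk:
                raise EOFError("log truncated inside a payload")
            out.write(chunk)
            remaining -= len(chunk)
        return total
```

Only the two dictionaries live in RAM here; the blobs themselves stay in the log and are read on demand.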
Yes, you need to be sure that you understand the growth pattern if you want to YOLO in RAM. If your product aims to be the next Instagram, this is clearly not the architecture.
But a lot of small businesses are genuinely small. They may not sign up new customers that often. When they do, the impact to the service is often very predictable ("Amy at customer X uses this every other day, she's very happy, it generates 100 requests / week"). If growth picks up, there would be signs well in advance of the toy service becoming an actual problem.
An application I have written recently for personal use is a double-entry accounting system, after GnuCash hosed itself and gave me a headache. This is based on Go and SQLite. The entire thing is one file (go embed rocks) and serves a simple HTTP interface with a few JS functions like it is 2002 again. The back end is a proper relational model that is stored in one .db file. It is fully transactional with integrity checks. To run it you just start the program and open a browser. To back up you just copy the .db file. You can run reports straight out of SQLite in a terminal if you want.
This whole concept could scale to tens of users fine for LOB applications and consume little memory or resources.
>> This whole concept could scale to tens of users
I strongly suspect this approach scales to tens of thousands of users. Maybe 30-40k users would be my guess on a garden variety intel i5 desktop from the past 3 years or so.
I say this because that hardware (assuming NVMe storage) will do north of 100k connect + select per second (connect is super cheap in sqlite, you're just opening a local file), assuming 2-3 selects per page serve gets me to the 30-40k number. The http server side won't be the bottleneck unless there's some seriously intensive logic being run.
SQLite really does very well with reads, but not as well with locking writes. I'm not saying it couldn't scale to many users, but I think the other person is a bit optimistic about a double-entry accounting app being only reads. I would imagine it could easily serve a few hundred, though, if not a few thousand.
Check out alpinejs or stimulusjs and combine it with htmx to get to a SPA-like experience with very little additional complexity! Htmx lets you serve partials over the wire instead of a page load so you can update the page incrementally, and alpine and stimulus are both tools to add JS sprinkles like you’ve described in a way that is unobtrusive.
I appreciate the notion but my objective was to do the exact opposite of this and keep away from external dependencies and scripts where possible, apart from the solitary go-sqlite3.
The result is about 30K of source (including Go, CSS, HTML templates) which is less than minified alpinejs!
This. Not that I'm all about janky, but my road is littered with stuff I didn't think would make it through summer, and everything I check is still ticking 5, 7, 10 years later.
LONG ago I was amused by a Sun box in a closet that nobody knew anything about. I heard about the serial label printer that stopped working eight months ago, which was eight months after I shut off the Sun. I brought it back up again late one Friday, and the old/broken label printer magically worked again.
How do you feel about SQLite? Because when I read this architecture description, it mostly vibes with me, until I think about what happens to the data in the event of a power cut.
Is there logic in your app to potentially throw the last line away (incl. truncating it from the file) if it's invalid due to being the result of a non-atomic write? If so, seems a bit Not-Invented-Here compared to just using a (runtime-embedded) library that does that for you :)
> SQLite responds gracefully to memory allocation failures and disk I/O errors. Transactions are ACID even if interrupted by system crashes or power failures. All of this is verified by the automated tests using special test harnesses which simulate system failures.
From the SQLite about page. It's one of the most bulletproof and hardened pieces of software out there, I think. It was made basically for exactly the ops use case of a file on disk. But who knows what the use case is for them; maybe writes to the DB are few and far between, so it's a fairly moot point.
The AOF reader will discard and emit a warning about lines that can’t be parsed, but that’s the extent of it.
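As a sketch of what such a reader could look like, the `load_jsonl` function below skips unparsable lines with a warning, as described. The `truncate_torn_tail` helper goes one step further than what the commenter says their reader does: it cuts off a partial final line, so that the next append does not get glued onto the garbage. Both function names are hypothetical.

```python
import json
import warnings


def load_jsonl(path):
    """Replay the file, discarding (with a warning) any line that is not valid JSON."""
    records = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if not raw.strip():
                continue
            try:
                records.append(json.loads(raw))
            except ValueError:  # covers JSONDecodeError and invalid UTF-8
                warnings.warn(f"{path}:{lineno}: discarding unparsable line")
    return records


def truncate_torn_tail(path):
    """Drop a final line that has no trailing newline; return the bytes removed.

    Reads the whole file for simplicity, which is acceptable here only because
    the app loads the whole file at startup anyway.
    """
    with open(path, "r+b") as f:
        data = f.read()
        if not data or data.endswith(b"\n"):
            return 0
        keep = data.rfind(b"\n") + 1  # 0 when the file holds a single torn line
        f.truncate(keep)
        return len(data) - keep
```

Note that this treats any final line without a newline as torn, even if it happens to parse, which is a deliberate simplification.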
These apps are on Digital Ocean, and I don’t remember ever having unplanned downtime with them. They do sometimes migrate instances with advance notice, but that’s a clean shutdown.
I’m sure SQLite is a better choice for almost any app. My reason for not using it was to try avoiding dependencies out of curiosity, and also that I honestly really don’t like writing SQL — it just feels boring and error-prone. (Like eating celery, I know objectively it’s good for me.)
> My reason for not using it was to try avoiding dependencies out of curiosity, and also that I honestly really don’t like writing SQL — it just feels boring and error-prone.
Well, sure, but you can just read and write JSON blobs to a single-column table in SQLite. See also, "SQLite makes a better BLOB store than the filesystem does" (
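A minimal sketch of that single-column approach, using Python's built-in `sqlite3` module. The table name `log` and the helper names are made up for illustration; the only SQL involved is one CREATE, one INSERT, and one SELECT.

```python
import json
import sqlite3


def open_log(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS log (entry TEXT NOT NULL)")
    return conn


def append(conn, obj):
    # The connection context manager commits on success and rolls back on error,
    # so a crash mid-write leaves either the whole entry or nothing.
    with conn:
        conn.execute("INSERT INTO log (entry) VALUES (?)", (json.dumps(obj),))


def replay(conn):
    # rowid order is insertion order for this append-only usage.
    for (entry,) in conn.execute("SELECT entry FROM log ORDER BY rowid"):
        yield json.loads(entry)
```

The application code can keep treating the data as a stream of JSON objects replayed at startup, while SQLite takes care of atomicity.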
SQLite doesn't do "lines" like a text file, or truncate anything (or it probably does internally, but that's not how users think of it).
It's a real SQL DB with real records and transactions, and it is one of the most trusted and reliable pieces of software ever made. Like, check out the change log to get a sense of how they do stuff.
I think you misread what I said, or maybe didn't read the post I was replying to. I was pointing out that flat files have a problem with truncation; and that SQLite, despite being a very different thing than an append-only text file, can be effectively used as "an append only text file, but ACID." My question on how the GP poster feels about SQLite, is down to SQLite being the obvious solution to a problem I wasn't sure they realized they had — but also potentially still being "too many dependencies" for them.
If you're not writing data very often, this isn't a concern. For example, if you assume 5 microseconds to write a line, and 1 line written per hour on average, then the file is mid-write for only about 10^-9 of the time, so the chance that a given power cut lands in the middle of a line is roughly 10^-9, i.e. it will effectively never happen.
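The arithmetic behind that estimate, as a small sketch. It assumes the power cut arrives at a uniformly random moment and that writes never overlap; the function name is made up.

```python
def torn_write_probability(write_seconds, writes_per_hour):
    """Fraction of wall-clock time spent mid-write, which equals the chance that
    a power cut at a uniformly random instant interrupts a write."""
    return write_seconds * writes_per_hour / 3600.0


# 5 microseconds per line, one line per hour: about 1.4 in a billion.
p = torn_write_probability(5e-6, 1)
print(f"{p:.2e}")
```

The estimate scales linearly with write rate, so at a thousand lines per hour it is still only on the order of one in a million per power cut.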
Well done on building an easy-to-maintain single node app with few dependencies. You would be the SWE I would send prayers of thanks to after onboarding (and for not making me crawl through a massive Helm chart/CloudFormation template hell).
I have literal dozens of these kinds (Node+Express under PM2) of small apps running around everywhere in production (almost 8 years = almost a decade ;). Using SQLite (when you need an actual DB) makes things a lot easier in this regard as well.
I've tried doing some things in Python (my initial programming love) over the years, but every time I'm forced to read up on the state of the ecosystem (choosing versions and package managers and whatnot) and it drives me insane. I just install Node+Express and can get to work immediately (and finish quickly).
I suspect that 95% of business applications could be implemented just fine with that architecture. However I would use SQLite instead of a plain file. Just for added commit safety.
I did a very similar setup to this as a boy using a PHP file that re-wrote itself as a sort of key-based data store. This was fantastic on free shared web hosting sites since no free db options were available. This was circa 2005.
I do almost this exact thing for all my personal stuff. I have 5 or 6 going in a vm for simple things like my bookmarks, etc... works great. I could definitely see it solving many small business use-cases.
I was interviewing for software jobs recently, and while I was studying up on the "system design" portion I kept circling around the same insight that Dan Luu writes about so well here.
I would sit down at an interview and try to create these "proper" system designs with boxes and arrows and failovers and caches and well tuned databases. But in the back of my mind I kept thinking, "didn't Facebook scale to a billion users with PHP, MySQL, and Memcache?"
Before 'eventual consistency' was coined as a phrase, there was an old, powerful and deeply unsexy form of eventual consistency called "batch processing".
For small batches you do an interval at a convenient time, such as a time of day where the hardware is undersubscribed for some other task. As the batches grow then you have an online system that continues to work until you get rather far down the list of consequences in queuing theory. Once you get 24 hours and 1 minute of tasks per day you never catch up (it never ceases to amaze me how often I can find someone who will fight me on this point), and you must be aware that substantially before that breaking point, you can experience rather long average queuing delays.
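To put rough numbers on "long average queuing delays well before the breaking point", here is a sketch using the textbook M/M/1 formula for mean time spent waiting in the queue, Wq = ρ·S / (1 − ρ). Treating a batch system as M/M/1 (random arrivals, exponentially distributed service times, one worker) is an assumption made here, not something the comment claims.

```python
def mm1_mean_wait(utilization, service_time):
    """Mean time a task waits in the queue before service starts (M/M/1 model)."""
    if not 0 <= utilization < 1:
        raise ValueError("at utilization >= 1 the backlog grows without bound")
    return utilization * service_time / (1 - utilization)


if __name__ == "__main__":
    # With one-minute tasks, show how the average wait grows as the system fills up.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.0%}: mean wait {mm1_mean_wait(rho, 1.0):.1f} min")
```

Under these assumptions the average wait is one service time at 50% utilization and roughly 99 service times at 99%, while anything at or above 100% (the "24 hours and 1 minute of tasks per day" case) never catches up.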
But if your workload is spiky, you can smear 250 minutes of peak traffic out over 6-18 hours with no problems at all. You need a safe place to stash the queue and a little sophistication around recovering from failures/upgrades. Those aren't necessarily Simple tools, but if that's the most complex part of your system you're doing pretty okay.
I think that’s a large oversimplification of Facebook. While it’s true a lot of FB storage is MySQL backed they also created many complex systems such as:
- Cassandra (based on Dynamo/Bigtable)
- wrote a custom KV store named RocksDB that is open source/now a company
- wrote a custom photos storage system that replaced an NFS based design
- wrote another custom binary object store
- wrote a custom geo-distributed graph DB (TAO)
- wrote an in house distributed FS replacement for HDFS
Cassandra was released in 2008. Facebook hit 1 billion users ~ 2012, and 2 billion ~ 2017. Back when Cassandra was released, they had a 'mere' 100 million users.
Thanks for the correction! So the more correct statement would be that Facebook scaled to 100 million users with PHP, MySQL, and Memcache, and then to a billion users with Cassandra and Haystack, and then built all the other stuff after their first billion?
Yeah, there’s also a lag sometimes, i.e. they write the paper 1-2 years after building the system. They’ve published some really interesting papers over time. If you filter by year you can see them all!
Some people say that Leetcode is nothing more than rote learning. Some disagree, and so do I. There's a minimal amount of rote learning that helps, but you need more than that. On the other hand, "system design" interviews are... strange? You, someone who has never built something at the scale of these giant tech companies, are being asked to come up on the spot with a design for a system that would scale to billions of users. There's a 100% chance that if you attempted to build such a system without having done it before, you'd discover holes in your original design left and right. Leetcode tests your logic, how you think, your IQ maybe. But system design interviews consist of regurgitating knowledge that you've crammed into your head by studying books/articles/videos on system design. It's closer to reciting poetry than problem solving. In my experience it could be replaced with multiple choice questions and you'd get the same result.
I ask a lot of system design interviews based on Facebook's products. What I look for is the ability to propose something simple, proposing the right metrics and data collection to understand scaling needs, then making reasonable guesses about which parts of the system need to scale. PHP + MySQL + Memcache is great until you also need to do ML inference (high CPU/GPU load), need to store user-uploaded video, or want to stream new content to users in near realtime (live comments).
The key is to add the minimum amount of "stuff" to a simple design to convincingly scale for some new hypothetical need.
Yeah, I just did an interview where my design was a database, a few lambdas and a webserver, and afterwards I was thinking they must think I don't know much and that I should have beefed it up a bit.
It's important to justify the design you come up with. Explain why the design is simple, pros and cons and when you'd opt for a more complicated one to solve which particular issue.
I think the biggest problem for most developers is not understanding what one computer can actually do and how reliable they are in practice.
Additionally, understanding how tolerant 99% of businesses are of real-world problems that could hypothetically arise can help one not get frustrated over insane edge-case circumstances. I suspect a non-zero number of us have spent time thinking about how we could provide deterministic guarantees of uptime that even unstoppable cosmic radiation or regional nuclear war couldn't interrupt.
I genuinely hope that the recent reliability issues with cloud & SaaS providers have really driven home the point that a little bit of downtime is almost never a fatal issue for a business.
"Failover requires manual intervention" is a feature, not a caveat.
Some people don't even realise how much traffic a simple web app with server side rendering (decently written), hosted on an average dedicated server can hold... They don't need cloud, autoscaling, microservices, kafka, event driven architectures, etc.
We've lost our way in the masked marketing the cloud providers are creating to help us solve problems we will never encounter, unless we are building the next Netflix or Facebook.
If you just need plaintext services, something like ~7 million requests per second is feasible at the moment.
By being clever with threading primitives, you can preserve that HTTP framework performance down through your business logic and persistence layers too.
Thus, in my case those numbers might be closer to the following:
- plaintext: up to 2'500'000 requests per second, most technologies go up to around 500'000
- data updates: up to 14'000 requests per second (20 updates per request, so 280'000 updates per second)
- fortunes: up to 300'000 requests per second (full CRUD and sorting)
- multiple queries: up to 32'000 requests per second (20 queries per request, so 640'000 queries)
- single query: up to 530'000 requests per second, most technologies go up to around 100'000
- JSON serialization: up to 970'000 requests per second, most technologies go up to around 200'000
Of course, their setup also plays a part, since the VPSes that I'd go for probably wouldn't be comparable to a Dell R440 Xeon Gold.
It's really nice to have this data, but the code that's written also plays a really big factor: I've seen people who write code with N+1 problems in it and call ORMs in loops and adamantly defend that choice because "such code is easier to reason about" instead of a simple DB view that would be 20-100x faster. With such code, it'd be closer to the "multiple queries" test.
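For anyone unfamiliar with the N+1 pattern, here is a small sketch against an in-memory SQLite database. The `authors`/`posts` schema and the `CountingConn` wrapper are invented for the example; in practice the loop is usually hidden behind an ORM's lazy loading.

```python
import sqlite3


class CountingConn:
    """Thin wrapper that counts how many statements get sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
    """)
    return CountingConn(conn)


def titles_n_plus_one(db):
    # One query for the authors, then one more query per author: N+1 round trips.
    result = {}
    authors = db.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = db.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(db):
    # Same result from a single JOIN (this is what a DB view would wrap).
    result = {}
    rows = db.execute("""
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result
```

With three authors the loop version issues four statements and the JOIN issues one; against a networked database each extra statement is another round trip, which is where most of the slowdown tends to come from.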
Then again, these tests basically tell you that in 90% of the cases you should go for Java or .NET, abandoning Python, PHP and Ruby for them (though one could also introduce Rust into the mix and say the same), which realistically won't happen and people will use whatever technologies and practices that they feel comfortable with.
I've seen applications that work fine with hundreds of thousands of page loads per minute (multiple requests per load) and I've seen systems that roll over and die with 100 concurrent users; there's lots of variety out there.
Also, those complicated architectures are often quite unreliable anyway, just in ways that don't show in metrics. Slack comes to mind: not only is its functionality poor compared to e.g. IRC, but it also fails in hilarious ways, e.g. showing duplicated messages, or not showing them at all. Another example is YouTube: the iOS app gets confused when displaying an ad, which results in starting the playback at a wrong time offset. I guess it's because companies like those don't care about actual reliability; what they do care about is availability.
Wtf are you doing with it? My slack instance (on linux) is resting around 300 MB resident set size and 0% cpu. 300 MB is still a lot for a chat app, but it is definitely not gigabytes.
If you just add up memory usage for subprocesses you are likely to over count due to shared memory. The number you typically want to add up in Linux is ‘proportional set size’ which is, I think, the sum over every page of the process’s memory of page_size / number of processes which can access the page. I don’t know what happens if you mmap some physical memory twice (I think some newish Java GC does this).
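A toy illustration of why summing resident set sizes overcounts while summing proportional set sizes does not. The page sets below are made up; on a real Linux system the kernel reports a `Pss` figure per mapping in `/proc/<pid>/smaps`, so you would read it rather than compute it.

```python
from collections import Counter


def resident_set_size(pages_by_process, page_size=4096):
    """RSS: every resident page counts in full for each process that maps it."""
    return {pid: len(pages) * page_size for pid, pages in pages_by_process.items()}


def proportional_set_size(pages_by_process, page_size=4096):
    """PSS: each page's size is split evenly among the processes sharing it."""
    sharers = Counter(page for pages in pages_by_process.values() for page in pages)
    return {
        pid: sum(page_size / sharers[page] for page in pages)
        for pid, pages in pages_by_process.items()
    }


# Hypothetical example: two processes, physical page 3 is shared between them.
example = {"main": {1, 2, 3}, "renderer": {3, 4}}
rss_total = sum(resident_set_size(example).values())      # counts page 3 twice
pss_total = sum(proportional_set_size(example).values())  # counts page 3 once
```

Here four distinct pages are in use, and only the PSS total matches four pages' worth of memory.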
I did not know that. Maybe it would have been better to spend more time trying to find the original comic (I guess maybe an image search. Doomscrolling @bruised_blood is a lost cause).
Why yes, Slack does seem to be written poorly - just as poorly as every other web app that I use including VS Code, Discord, MS Teams, etc..
Maybe the Slack developers are just stupid, uneducated, malicious, poorly managed, or ambivalent (or all of the above) but the platform does seem to be conducive to creating clunky and bloated software.
While I agree with your overall point, I think that VS Code is one of the better (only?) examples of really good web technology based software. It's snappy, reliable and has very few bugs.
When you type something into IRC, that message shows up in the log and in every online user's client pretty reliably. Furthermore, the high degree of diversity among clients provides a pretty extreme amount of client-side functionality that Slack completely lacks (scripting is a huge one).
The versatility of clients is indeed a huge benefit of IRC. I used to use IRC at work and always had my Weechat window split with a small pane up top showing either my highlights or a channel I needed to monitor at the time. With Slack, you can’t do that, which means you have to repeatedly click between channels if you need to pay attention to multiple at a time.
Slack has much better history because you don't need to have been online when messages are sent to log them. Slack is absolutely more reliable in this regard.
IRC is easy to script because the protocol is so simple. But you leave so much on the table for that cost.
Obviously if your use case is text only, you don't care about persistence, and you lean heavily on scripting to get things done, then IRC will do the trick. Otherwise it's such a crutch to do anything beyond that.
Slack is not instantly "better" than IRC, it's just a different approach to the chat problem and it's arguably more approachable for people that don't want to learn about the chat space.
Logging is just different between the two.
For IRC, logging is outside the scope of the IRC protocol. Anyone can log anything anytime anywhere with whatever policies and procedures they want. This usually leads to each channel/project having some "official" log of the channel somewhere, using whatever they feel is good for them.
Slack, on the other hand, centralizes the logs, which moves a lot of control to the administrators/Slack developers.
So Slack's logs are likely easier to find, but that doesn't necessarily make them easier to use.
Persistence is also just different. IRC makes it your problem, but it's a solved problem if you care about it. Both offer persistence in different ways, as two differing answers to the problem.
Slack of course centralizes the problem and removes some control.
I personally think Slack and approaches like it (I prefer MatterMost) are great for internal things where administrators need central control of stuff for various reasons. For public things, I think Slack is a bad solution, and something like IRC or Matrix is a better solution to the problem of public chat.
> For IRC, logging is outside the scope of the IRC protocol.
Nope, the community has understood that server-side logging (and making it available to clients who missed stuff happening) is a useful thing.
IRC has logs for history, they're fast and you can run your own logger to control the retention policy if you want. These heavy weight IM tools have extremely short log retention (months) and searching through the logs is extremely slow and frustrating IME.
Past the proof of concept, "developers" should frankly not be making these decisions. People who understand systems and failure analysis should be. You might have devs with that experience, but they're comparatively rare.
As far as complexity... if you get big enough, you can't avoid it. My meta-rule is to only accept additional complexity if solving the issue some other way is impractical.
It is almost always far, far easier to add additional moving parts to your production environment than it is to remove them after they're in use.
These requirements don't come out of nowhere. Normally they come from:
1. CEOs/whoever that don't listen to how much additional complexity it is to build a system with extremely high uptime and demand it anyway.
2. Developers with past experience that systems going down means they get called in the middle of the night.
3. Industry expectations. Even if you're a small finance company where all your clients are 9-5 and you could go down for hours without any adverse impacts, regulators will still want to see your triple redundant, automated monitoring, high uptime, geographically distributed, tested fault tolerant systems. Clients will want to see it. Investors will check for it when they do due diligence.
Look at how developers build things for their own personal projects and you'll see that quite often they're just held together with duct tape running on a single DO instance. The difference is, if something goes wrong, nobody is going to be breathing down their neck about it and nobody is getting fired.
If the additional complexity is just "Use this premade thing" and it only adds a half hour here and there of work, while also giving you essentially a premade and pre-documented workflow that new people will instantly know (whatever your "bloated" tool tells you to do), then it might be a net win anyway.
If the extra complexity is microservices and containers you might have an issue, but microservices are kind of a UNIX philosophy derivative. I'm not sure the complexity is really intentionally added (like when someone uses an SPA framework or something); it just kind of shows up by itself when you pile on thousands of separate simple things without really realizing the big picture is a nightmare.
> "Failover requires manual intervention" is a feature, not a caveat.
I was scarred by the DDoS of Linode on Christmas Day 2015 (as a Linode customer at the time). I believe that was the only time my Christmas was ever interrupted by work. Of course, one might respond that being the one perpetually on-call sysadmin isn't ideal.
The vast, vast, vast majority of organizations don't need micro services, don't need half of the products they bought and now have to integrate into their stack, and are simply looking to shave their yak to meet the bullet list of "best practices" for year 202X. Service oriented architectures and micro services solve a particular problem for companies that are operating on a massive scale and can invest (read: waste money) on teams devoted to tooling. What most companies should do is build a monolith that makes money, but hire good software engineers that can write packages/modules/whatever with high levels of cohesion and loose coupling, so that one day when you become the next Google, it will be less of a pain to break it into services. But in the end it really doesn't matter if it's painful anyway, because you'll have the money to hire an army of people to do it while the original engineers take their stock and head off to early retirement.
I'd never worked with micro-services before this latest freelance project. I start working with this platform that is basically "note taking but with a bit of AI/ML". So okay, a bit of complexity with the ML stuff, but otherwise a standard CRUD app.
The application itself is a total of 3 pages, encompassing maybe 20 endpoints at the most, with about 100 daily active users. For the backend, some genius decided to build a massive kubernetes stack with 74 unique services, which has been costing said company over $1K/month just in infra costs. It took me literally weeks to get comfortable working on the backend, and so much stuff has broken that I have no idea how to fix.
Not only that, but the company has never had more than 1 engineer working on it at a time (they're very small even though they've been around a bit). If there were such a thing as developer malpractice, I'd sue whoever built it.
This is so common, sadly. I've seen this happen a lot. And those geniuses advocating for this insane infrastructure usually leave the company after they've scratched their itch with kubernetes or whatever they were interested in playing with.
I'm a strong believer that your first few engineers are very, very important and you need to hire very experienced people.
In the cases where I've seen this, honestly I think the ones to blame were the stakeholders, for hiring very young people for the cheapest rate they could and giving them full responsibility and the Senior Architecture Something Something title when those people don't have more than a couple of years' experience and are just building what they read in a blog two weeks ago.
In all honesty I'm very angry about it. It was built a while ago, and the dev is no longer here, but I almost want to track him down and make him help fix it. This founder isn't technical, so he's been leaning on developers for guidance, and this guy basically built him a skyscraper when what he really needed was a shed. It hurts to think about all the time and money that he's poured into just maintaining it. Crazy.
Ok I feel very validated now - I'm not used to microservices so didn't know what was typical. It felt crazy, so good to know based on this comment's responses that it is indeed crazy.
For example, in order to sign up a user...the client hits the /signup endpoint, which first lands on the server-gateway service. Then that is passed along to an account-service which creates the user. Then the accounts-service hits a NATS messaging service twice - one message to send a verification email, and another to create a subscription. The messaging service passes the first message along to the verification-service, which sends out a sendgrid email. Then the second message gets passed along to a subscription-service-worker. The subscription-service-worker adds a job to a queue, which when it gets processed, hits the actual subscription-service, which sends along a request to Stripe to create the customer record and trial subscription.
6 services in order to sign up a user, in what could have been done with about 100-300 lines of Node.
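To make the comparison concrete, here is a rough single-process sketch of that same signup flow, in Python rather than Node. Everything in it is an assumption for illustration: the `users` table, the `signup` function, and the `send_verification_email` / `create_trial_subscription` stubs are hypothetical stand-ins for the real SendGrid and Stripe calls, not anything from the actual codebase.

```python
import sqlite3
import uuid


def send_verification_email(email, token):
    # Hypothetical stand-in for a call to an email provider.
    return {"to": email, "token": token}


def create_trial_subscription(email):
    # Hypothetical stand-in for a call to a payment provider.
    return {"customer": email, "plan": "trial"}


def make_db():
    # In-memory database so the sketch is self-contained.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users (email TEXT PRIMARY KEY, verify_token TEXT NOT NULL)"
    )
    return db


def signup(db, email,
           send_email=send_verification_email,
           create_subscription=create_trial_subscription):
    token = uuid.uuid4().hex
    # Step 1: create the user; "with db" commits on success and
    # rolls back if the insert raises (e.g. duplicate email).
    with db:
        db.execute(
            "INSERT INTO users (email, verify_token) VALUES (?, ?)",
            (email, token),
        )
    # Step 2: send the verification email (plain function call, no message bus).
    send_email(email, token)
    # Step 3: create the customer record and trial subscription.
    subscription = create_subscription(email)
    return {"email": email, "subscription": subscription}
```

The three steps are ordinary function calls in one process. If the email or payment call ever needs retries, that can be added later without changing the overall shape.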
Even the most fanatical microservices proponent will tell you that's just bonkers.
At my very first programming job many years ago I was given a bunch of code written by a string of "previous guys" (mostly interns) over a period of 10 years or more and was told "good luck with it". I was the only developer, with no real technical oversight. It was my first "real" programming job, but I had been programming for many years already (mostly stuff for myself, open source stuff, etc, but never "real" production stuff).
In hindsight, I did some things that were clearly overcomplicated. I had plenty of time, could work on what I wanted, and it was fun to see if I could get the response speed of the webpage down from 100ms to 50ms, so I added a bunch of caching and such that really wasn't needed. Varnish had just been released and I was eager to try it, so I added that too. It was nowhere near the craziness you're describing though, and considering the state of the system when I took things over, things were still massively improved, but I'd definitely do things differently now because none of that was really needed.
Maybe if it had been today instead of 15 years ago I would have gone full microservice, too.
Ha, yeah I created my fair share of complex stuff when I first started.
For some reason, even today, my personal projects tend to get very complex. But I think that’s just because I’m working on hard problems since they’re passion projects.
I think especially for small teams starting out, complex architecture can be a huge trap.
Our architecture is extremely simple and boring - it would probably be more-or-less recognizable to someone from 2010 - a single Rails MVC app, 95+% server-rendered HTML, really only a smattering of JavaScript (some past devs did some stuff with Redshift for certain data that was a bad call - we're in the process of ripping that out and going back to good old Postgres).
Our users seem to like it though, and talk about how easy it is to get set up. Looking at the site, the interactions aren't all that different from what we would build if we were using a SPA. But we're just 2 developers at the moment, and we can move faster than much larger teams just because there's less stuff to contend with.
That doesn't sound like it's really any simpler than a json API server (written in node, python, go, or anything else), and a SPA. Maybe the lesson is "build with what you know if you want to go fast".
In my experience SPAs bring a lot of headaches that you just don't really need to think about with traditional HTML. Browser navigation, form handling, a lot of accessibility stuff comes out of the box for free, and there's one source of truth about what makes a particular object valid or how business logic works (which is solvable in the SPA world but brings a lot of complexity when you need to share logic between the client and the server, especially when they're in different languages).
Frankly out of all the things that make our architecture simple and efficient, I would say server rendered HTML is by far the biggest one.
Probably depends on the requirements. If the product should basically feel like a static web page, and you are OK making design and product decisions that work easily in that paradigm, then a server side framework built to make static web pages is going to be simpler.
If you have product or design requirements that it should feel more dynamic like a native app, then trying to patch that on top of a static webpage might get messy.
IMHO the important thing is where your data is. If it can all be client side then write a SPA. If it's on the server then the more you do on the server the better.
Returning HTML and doing a simple element replace with the new content is 99.9% indistinguishable from a SPA.
Every app will have a bunch of serverside data so IMO that's not really a consideration.
The great thing about SPAs is that your server now just returns a simple JSON API. It greatly simplifies the server side. It also tends to make things a lot easier to unit test on the server side. Also, it cleanly segments your staffing requirements. If your app is in some way non-trivial, you can have domain experts working on the backend, and just dumping json into http responses, and have a frontend engineer who doesn't understand any of the magic work on the frontend.
> Returning HTML and doing a simple element replace with the new content is 99.9% indistinguishable from a SPA.
Once you're doing this, your backend engineers are dealing with html in addition to whatever their real job is. If you want to have something other than your website consume your backend, you're rewriting stuff to output json anyway.
If you have no domain-specific computation happening and your service will only ever be consumed as a more or less static website, serverside web frameworks can be faster. For example, if you are building a blogging site, or maybe a CRM.
>Every app will have a bunch of serverside data so IMO that's not really a consideration.
Not every app has so much data that it can't all be sent to the client. 99% of them do, which is why 99% of web apps should be server side. The other 1% is when you want a heavy client side app.
>Also, it cleanly segments your staffing requirements. If your app is in some way non-trivial, you can have domain experts working on the backend, and just dumping json into http responses, and have a frontend engineer who doesn't understand any of the magic work on the frontend.
You can separate frontend and backend code without introducing a client server call in the middle.
>Once you're doing this, your backend engineers are dealing with html in addition to whatever their real job is.
Why? You can still have front end engineers. They just write server side HTML templates which is hardly any different than writing JS based HTML templates.
In my experience, adding SPAness doubles the complexity of your application. Now you're maintaining and synchronizing the same state in two places and adding extra code in a different language (if you're not using JS on the backend).
Yep. Failed an interview because I used EJS (SSR) and Node to build a simple Twitter in 30mins. The interviewer saw that it was three files and did not seem impressed.
I guess they wanted me to use lots of little components in an SPA, which I did in my day job, but it didn't seem necessary for the task...
We use one of these "aggressively simple" architectures too. At this point, I would quit my job instantaneously if I had to even look at k8s or whatever the cool kids are using these days.
> look at k8s or whatever the cool kids are using these days.
I'm fine with complex architecture and would actually welcome someone choosing something complex but the issue is that we have perverse incentives at work to introduce stuff just to pad our resume.
Kubernetes was designed for companies deploying thousands of small APIs/applications where management is a burden. I've seen companies that deploy 3 APIs running Kubernetes and having issues...
Man, kubernetes is so much easier than the smattering of crap that you had to juggle together before it. Puppet and co? No thanks. Terraform? It's fine, but only a part of a CI/CD picture. If you think the alternatives are better, I really have to wonder how much of the trenches crap that people in your org deal with regularly that you're insulated from. That, or you're a release-quarterly kinda company?
Nomad is pretty great for a lot of things, especially self hosted. The only reason I prefer k8s is the ecosystem. Even though there are standardized specs like CSI, they were written with k8s in mind, so some drivers are completely broken on Nomad. Also, most cloud providers offer managed k8s, but very few offer managed Nomad.
We wrote our own tools for most things. Our build is a single dotnet publish command, followed by copying the output to an S3 bucket for final consumption.
That output is 100% of what you need to run our entire product stack on a blank vm.
Monolithic pays for itself in so many ways. Sqlite and other in-process database solutions are a major factor in our strategy.
In terms of the choices they're unsure about; I'd say it's best to stay away from Celery / RabbitMQ if you don't really need it. For us just using RQ (Redis backed queue) has been a lot less hassle. Obviously it's all going to depend on your scale, but it's a lot simpler.
RE the sqlalchemy concern; you do need to decide on where your transactions are going to be managed from and have a strict rule about not allowing functions to commit / rollback themselves. Personally I think that sqla is a great tool, it saves a lot of boilerplate code (and data modelling and migrations are a breeze).
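A minimal sketch of that rule, using the stdlib `sqlite3` module instead of SQLAlchemy so it runs without extra dependencies. The `accounts` table and the `transaction`, `debit`, `credit` and `transfer` names are made up for illustration. The point is only that commit and rollback happen in exactly one place, and the helper functions never call them.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    # The single place where commit/rollback is allowed to happen.
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def credit(conn, account, amount):
    # Helper: does its work on the connection, never commits or rolls back.
    conn.execute(
        "UPDATE accounts SET balance = balance + ? WHERE name = ?",
        (amount, account),
    )


def debit(conn, account, amount):
    # Helper: signals failure by raising, and leaves rollback to the caller.
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE name = ?",
        (amount, account),
    )
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (account,)
    ).fetchone()
    if balance < 0:
        raise ValueError("insufficient funds")


def transfer(conn, src, dst, amount):
    # The transaction boundary lives at the top-level operation.
    with transaction(conn):
        credit(conn, dst, amount)
        debit(conn, src, amount)


def make_accounts():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
    )
    conn.commit()
    return conn
```

If `debit` raises, the earlier `credit` is rolled back with it, because neither helper committed on its own. The same shape should carry over to a SQLAlchemy session, with the session passed in where `conn` is here.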
But overall the sentiments in this article resonate with my experience.
for starters i wouldnt use kubernetes. love the system, but boy is it complicated. i'd use a few cloud functions or stick them in VMs behind a load balancer and call it good.
> As for Kubernetes, we use Kubernetes because knew that, if the business was successful (which it has been) and we kept expanding, we’d eventually expand to countries that require us operate our services in country. The exact regulations vary by country, but we’re already expanding into one major African market that requires we operate our “primary datacenter” in the country and there are others with regulations that, e.g., require us to be able to fail over to a datacenter in the country.
Nah, I don't much like the tone of this article. Not at all.
The engineering message should be: keep your architecture as simple as possible. And here are some ways (to follow) on how to find that minimal and complete size 2 outfit foundation in your size 10 hoarder-track-suite-eye-sore.
Do we really need to be preached at with a warmed over redo of `X' cut it for me as a kid so I really don't know why all the kids think their new fangled Y is better? No we don't.
If you have stateless share nothing events your architecture should be simple. Should or could you have stateless share nothing even if that's not what you have today? That's where we need to be weighing in.
Summary: less old guy whining/showing-off and more education. Thanks. From the Breakfast club kids.
Because he says things that are true and less well known than they should be, and then gives a clearly written argument that shows you that they are true using logic and empirical evidence.
I don't disagree with you, but other people write on these topics more compellingly, and do not include off-putting Wolfram/Doctorow-style self regard. I chalk it up to his being astoundingly prolific.
These two things can be true at once: Intelligent good people with good taste and good will find an author's writing compelling, interesting and valuable. Simultaneously, other intelligent good people with good taste and good will do not. It happens. Neither is wrong.
Thanks for sharing. Personally I also add
background-color: #edd1b0;
for any site I plan on reading for more than five minutes. For me it is more pleasant to read than a white background.
I find it's a lot easier to add a simple CSS to an understyled website, than remove huge fixed banners, weird low-contrast thin fonts, etc. from overstyled websites.
Just boils down to not optimising until you need to. Start with a 3 tier web app (unless your requirements lead you to another solution), then move on to read replicas, load balancing, sharding, redis/RabbitMQ, etc.
In almost all performance areas -- gaming, PCs, autos, etc -- there are usually whole publications dedicated to performing benchmarks and publishing those results.
Are there any publications or sites which implement a few basic applications against various new-this-season "full stacks" or whatnot, and document performance numbers and limit-thresholds on different hardware?
Likewise, there must be stress-test frameworks out there. Are there stress-test and scalability-test third-party services?
Fossil SCM is a great example of a sqlite application that has stood the test of time. I don't know what its traffic is like, but it's not tiny and it runs on a tiny VPS without issue (and has for years now).
True, but it's great that they have a certain comparable hello-world set, so you can just take the 2-n benchmarks for the tech you're interested in and either ungamify it or just implement your own prototype and measure. The benefit of already having a non-random-tutorial-from-google app is a huge win to me personally.
I understand his point but I actually think micro-services can be simpler than monoliths.
Even for his architecture, it sounds like they have an API service, a queue and some worker processes. And they already have kubernetes which means they must be wrapping all of that in docker. It seems like a no-brainer to me to at least separate out the code for the API service from the workers so that they can scale independently. And depending on the kind of work the workers are doing you might separate those out into a few separate code bases. Or not, I've had success on multiple projects where all jobs are handled by a set of workers that have a massive `switch` statement on a `jobType` field.
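For what it's worth, the "one set of workers, one big switch on `jobType`" approach can be sketched roughly like this. It uses a dict of handlers as the Python equivalent of a `switch`. The job types, payload fields and the in-memory `deque` standing in for the real queue are all invented for the example.

```python
from collections import deque


def send_email(payload):
    # Hypothetical job handler.
    return f"email to {payload['to']}"


def resize_image(payload):
    # Hypothetical job handler.
    return f"resized {payload['path']} to {payload['width']}px"


# The "massive switch statement": one entry per jobType.
HANDLERS = {
    "send_email": send_email,
    "resize_image": resize_image,
}


def handle_job(job):
    handler = HANDLERS.get(job["jobType"])
    if handler is None:
        # Fail loudly on unknown job types instead of silently dropping them.
        raise ValueError(f"unknown jobType: {job['jobType']!r}")
    return handler(job["payload"])


def run_worker(queue):
    # Drain the queue in order; a real worker would block on a broker instead.
    results = []
    while queue:
        results.append(handle_job(queue.popleft()))
    return results
```

Adding a new kind of job means adding one function and one dict entry, and every worker can handle every job type, so there is only one thing to scale.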
I think there is some middle ground between micro-services and monoliths where the vast majority of us live. And in our minds we're creating these straw-man arguments against architectures that rarely exist. Like a literal single app running on a single machine vs. a hundred independent micro-services stitched together with ad-hoc protocols. Micro-services vs. monoliths is actually a gradient where we rarely exist at either ludicrous extreme.
It is pretty much impossible for a micro-service architecture to be simpler than a well-designed monolith. To create a micro-service architecture from a well-designed monolith you need to take the N libraries the monolith is built from and add protocols/serialisation/deployment etc. to each library. Each of which adds new distributed failure scenarios you now have to test/handle.
So, I definitely agree with this. Most of us don't have to do anything at FAANG scale. But what counts as simple?
It's quite easy these days to deploy an app using AWS Lambda, DynamoDB, SNS, etc., all with a single CloudFormation template. Is that simple? In one sense I've abstracted away a lot of the operational work that comes with self-hosted, but now I've intertwined (Rich Hickey might say complected) myself into Amazon's ecosystem.
Also, is a document store like DynamoDB, MongoDB, etc., simpler than a relational database like Postgres? On the one hand, a document database's interface is very simple compared to the complexity of SQL. On the other, that simplicity is generally considered a necessary sacrifice to scale. If you don't need to scale, why make the sacrifice?
Also, there can be simple things that are better at scaling. Elixir is a very nice scripting language like Ruby or Python, but it also has much better performance scaling (comparable with NodeJS or Go).
> GraphQL libraries weren’t great when we adopted GraphQL (the base Python library was a port of the Javascript one so not Pythonic, Graphene required a lot of boilerplate, Apollo-Android produced very poorly optimized code)
What do people use instead of Graphene? Strawberry?
Simple architectures work well, until they don't. A good example is ye olde ruby on rails monolith. Dead simple to set up and iterate quickly, but once you reach a certain organization and/or codebase size, velocity starts to degrade exponentially.
Complex architectures work well, until they don’t. Fixing complex architectures is much harder than fixing simple architectures. So would you prefer a simple architecture or a complex one? The answer should be obvious.
How far can you get with a single Postgres instance on a single machine? I know things like Cockroach and Citus exist but generally Postgres isn’t sharded as far as I know.
You can scale up that one machine a lot. If you start with a normal sized machine you have a lot of headroom for increasing ram/cpu on that machine (eg you could start with say 16 cores and 100G ram or less and scale up to like 2TB ram and 64/128 cores). There’s also runway for scaling things by eg shooting down certain long-running queries that cause performance problems or setting up read replicas.
So even if you’re a bit worried about scaling it, you can at least feel the problems are far away enough that you shouldn’t care until later.
We are serving a several tens of TB database with tens of thousands of daily users doing very heavy queries on a single machine that cost us $15k a few years ago (we have fallbacks and replication and whatnot, don't worry). The same machine also has java services. You can really do a lot on today's machines.
That's complicated based on workload, etc. A single PG node will obviously never scale to Google or Facebook levels.
Attend a PG conference and you will run into plenty of people running PG with similar use cases (and maybe similar loads) to you.
I can say we run a few hundred concurrent users backed by PG on a small to medium sized VPS without issues. Our DB is in the 3 digit GB range on disk, but not yet TB range.
The "previous" article at the bottom is the most recent article in his archive, which was apparently published in March 2022. So I'm guessing this year, and either this month or last month. But the archive doesn't seem to have been updated yet with this article.
This doesn't sound very simple at all. It's a single codebase that handles all your mobile API interactions, authentication, account management, presumably usage tracking and notifications, and all your offline processing, all interacting with a single database and queue infrastructure? And that same codebase marshalls all that through a GraphQL API and implements a custom data protocol?
And you're calling that simple?
I've worked on monolithic codebases, and the one thing none of them have ever been is simple. They have complex interdependencies (oh hey, like database transaction scopes); they have that 'one weird way of doing things' that affects every part of the system (like, 'everything has to be available over GraphQL')...
I have worked on massive well-designed monoliths and they were so much easier to maintain than equivalent micro-service implementations would have been. Monoliths and micro-services will be easy/hard/impossible to maintain depending on how well they are designed, not depending on whether it is a monolith or micro-service architecture.