How Instagram scaled to 14 million users with only 3 engineers (engineercodex.substack.com)
315 points by thunderbong on Sept 16, 2023 | hide | past | favorite | 188 comments



> we can assume that Instagram was written using Objective-C and a combination of other things like UIKit.

How can they know the internal infrastructure but have to assume the app language?

Edit: So the entire piece is taken almost verbatim from a couple of old articles on the Instagram engineering blog.

It might as well just redirect to: https://instagram-engineering.com/what-powers-instagram-hund...

This is against the guidelines:

> Please submit the original source. If a post reports on something found on another site, submit the latter.


Instagram engineers themselves wrote a bit about their backend infrastructure. One of the more important topics was how they shard the data [0], and this is also linked to in this blog post.

[0]: https://instagram-engineering.com/sharding-ids-at-instagram-...
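
For anyone who doesn't want to click through, the gist of that scheme is a 64-bit ID: roughly 41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID (derived from the user), and 10 bits of a per-shard sequence. A rough Python sketch of the idea - the epoch value and helper names here are illustrative, not Instagram's actual code, which did this inside a Postgres function:

    import time

    EPOCH_MS = 1_293_840_000_000      # assumed custom epoch (2011-01-01 UTC), not Instagram's real one
    NUM_LOGICAL_SHARDS = 2000         # thousands of logical shards, far fewer physical servers

    def make_id(user_id: int, per_shard_sequence: int) -> int:
        """Pack a 64-bit ID: 41 bits of ms since epoch | 13 bits shard | 10 bits sequence."""
        ms = int(time.time() * 1000) - EPOCH_MS
        shard_id = user_id % NUM_LOGICAL_SHARDS        # keeps all of a user's data on one shard
        sequence = per_shard_sequence % 1024           # per-shard counter, modulo 2^10
        return (ms << 23) | (shard_id << 10) | sequence

    def shard_of(item_id: int) -> int:
        """Recover the logical shard from any ID, so routing needs no extra lookup table."""
        return (item_id >> 10) & 0x1FFF                # the 13 shard bits above the sequence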


If it wasn't just an image site the potential for hotspotting would be insane!

Size isn't a bad thing anymore, since storage prices have dropped dramatically since the inception of Instagram.

I'm positive they would have used other, more modern technology if it had existed at the time.

Fantastic read though.


The potential for hotspotting drops as the insert rate drops. If you only did 1 insert per second and timed it right, you could land all of those inserts on one server, and even that would likely not overload the server.

It's virtually impossible for anyone to hotspot in a meaningful way with this system.


Nah, this will totally be an issue. You can be on the extreme end of replication or the extreme end of sharding and experience performance problems. Sharding is more likely to hotspot, depending on where the hot data is concentrated.

The solution in most cases is a simple database that acts as a pointer: user -> that user's database. It is populated on the creation of a user.

From there you create some simple cold-storage models (if a user isn't active) and some warm models that scale out the database: if a user's database grows, it shards and replicates for more read capacity. The last thing you want is slow replication, or one DB that can't be moved to balance resource utilization. There is some newer database tech that does this without even sweating the details.
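
Roughly what that pointer/directory layer could look like, as a minimal sketch (all names and the placement policy are made up for illustration):

    # Hypothetical user -> database directory, written once when the user signs up.
    DATABASES = ["db-1", "db-2", "db-3"]
    USER_DIRECTORY = {}   # user_id -> {"dsn": ..., "tier": "hot" or "cold"}

    def register_user(user_id: int) -> None:
        # Simplest placement policy: spread new users round-robin across databases.
        USER_DIRECTORY[user_id] = {"dsn": DATABASES[user_id % len(DATABASES)], "tier": "hot"}

    def database_for(user_id: int) -> str:
        entry = USER_DIRECTORY[user_id]
        # Inactive users can be demoted to cheaper cold storage without touching callers.
        return "cold-archive" if entry["tier"] == "cold" else entry["dsn"]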


It IS an original source. It's not just a repost or a report on another older article. It's a reworking of those articles.


I suspect they mean it’s a secondary source not a primary one


I wonder how they horizontally scaled shards. If they had 2k logical shards, they probably had far fewer real/physical shards, so a single database server was holding many sharding keys. So when a new physical shard gets added, the data needs to somehow be replicated. That is only true if read load on existing data is the problem, though. If only a relatively short time period of data is hot, you can probably just move some logical shard IDs over to the new physical shard without moving existing data. This requires keeping track of when each physical shard became active.
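
If they went that "don't move old data" route, the bookkeeping could be as simple as a per-logical-shard history of which physical server was active when, and routing by the timestamp embedded in the ID. A hypothetical sketch - the bit layout follows the ID scheme from their sharding post, everything else is invented:

    EPOCH_MS = 1_293_840_000_000   # assumed custom epoch, must match the ID generator

    # logical shard -> [(active_from_ms, physical_server), ...], oldest first
    SHARD_HISTORY = {
        7: [(0, "pg-1"), (1_600_000_000_000, "pg-5")],   # shard 7 remapped to pg-5 later on
    }

    def server_for(item_id: int) -> str:
        logical = (item_id >> 10) & 0x1FFF           # 13 shard bits above the 10 sequence bits
        created_ms = (item_id >> 23) + EPOCH_MS      # 41-bit millisecond timestamp
        server = None
        for active_from_ms, physical in SHARD_HISTORY[logical]:
            if created_ms >= active_from_ms:
                server = physical                    # newest mapping that predates this row wins
        return server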


> How can they know the internal infrastructure but have to assume the app language?

At that time that was the only solution that would have made sense given the achieved behaviour, performance AND development effort.

I did spend a lot of time with Cordova (PhoneGap) and all the other HTML5 app thingies for iOS at the time.

Not sure why that particular, in my opinion pretty obvious, choice bothers you that much. That is very much the reason why they didn't even bother releasing it for Android until almost two years later.


They had a different person/team working on the front end and/or they don't remember?


[flagged]


Enough with the "rules" just because you don't find it novel enough.

The author is on HN and says "Just my own brain reading through old talks and articles from Instagram engineering and Excalidraw for the diagrams. I did my best to put together all the info I learned from them into a comprehensive and simple manner".

You can take it up with them.


If it was compiled from more than one article, it becomes an original article. "The post should have only novel ideas and info" is an arbitrary requirement and not the meaning of that rule.

Though I would add that if the author took anything verbatim, it should be highlighted as a quote with the original source. (Edit: reading more, the author has already done that.)


[flagged]


Not OP. No anger in his words. Please don't make OP feel inadequate for expressing themselves clearly.


They sure seem bothered a lot by something trivial (god forbid the post which was NOT made for HN anyway quoted some original sources and didn't go into the detail they'd like it to).

Somebody took the effort of compiling an article on several sources, and we're throwing the rulebook on them.


The author is not the submitter. The rulebook is thrown for submitting, not for writing.


Unless you're a mind reader you are way out of line with this comment.


Looks like those wordpress bots have finally figured out HN as well.


[flagged]


Ah, but that's just the sort of thing an AI Wordpress bot would say.


Looks like someone may have been using ChatGPT to produce that post.


Author here. No ChatGPT was used.

Just my own brain reading through old talks and articles from Instagram engineering and Excalidraw for the diagrams.

I did my best to put together all the info I learned from them into a comprehensive and simple manner.


Actually, ChatGPT is very useful for fixing your writing.

Sometimes when a paragraph I write reads a little too harsh to the ear, I ask ChatGPT to rewrite it - it's still my original thought.

It's really effective, but I tend to tone it down a bit to sound like myself since the output can be too formal, dry, and "academic".


I find the simplicity of the stack brilliant. It makes me wonder to what extent the industry nowadays simply suffers from a mix of lack of knowledge, CV-driven design, and big players in the game trying to sell you overkill solutions and approaches for their own economic benefit. Have we perhaps fallen under a big enchantment when, at the end of the day, 99% of companies out there could just use a classic LAMP-ish kind of stack managed by 3 or 4 dudes?


I just imagined trying to sell this architecture for a new product in an imaginary company, an amalgamation of every place I have ever worked:

You have to change to Azure, because we are Microsoft partners and we have free credit.

The credit is not too much though and we have to spend the same money on useless trainings so we keep being partners.

4 cores and 8 GB should be plenty for your dev VMs, that’s the largest we can run on free MSDN accounts.

Have you tried the managed API gateway? Why not?

You should use managed caching on the edge.

We already have an on-prem SQL Server, use that to cut costs. Yeah, it runs on VMware and network storage.

Do we have support contracts for Nginx, Ubuntu? We have RHEL licenses, so you should use that.

Can we run this in our OpenShift cluster instead? To cut license costs it will be co-deployed with the developer envs of other products, but just set resource limits. Yeah, we only have NFS storage.

S3? Just use a PVC, it’s the same.

We decided on Datadog for unrelated product A and B, so you must use that. The license is expensive, so only log errors please.

We use Kafka for the workqueue, but limited to 2 cpu cores to make it cheap, so please make sure not to send too many notifications.

Python is not for production things, we will assign an offshore team to rewrite it in Spring Boot. We target Java 11, because our productivity increasing libraries are not yet updated.

Minor change needed for deploy: every service should build its own RPM package in its dedicated git repository.

You need to submit the architecture diagrams and service documents next week, thanks for the meeting.

What did I miss?


> What did I miss?

The security team is getting weird reports from their internal nmap scanner that is constantly scanning this network. No, they don't know which of their vendor automated scanners are doing it, no they can't turn it off, so please add code to specifically ignore those scans.

Our VMware cluster has plenty of compute left, but it is out of storage space because there are hundreds of VMs from other teams and nobody knows what is still being used, so you can't get any more dev instances until we get more storage added to the NAS (in the next quarter's budget).

An unrelated director somehow ended up seeing this diagram and had some "suggestions". Please implement these changes asap.


> An unrelated director somehow ended up seeing this diagram and had some "suggestions". Please implement these changes asap.

this user gets it


The bear is sticky with honey.

https://piped.video/watch?v=i92Ws7qPTRg


> What did I miss?

Off the top of my head...

* A bunch of opinionated devs pushing for microservices because they heard that's what the cool kids do now. No technical reason behind it at all, just pure bias fueled by a couple of blog posts/YouTube talks they watched one evening.

* A bunch of devops engineers with intentions of implementing "industry best practices and modern tooling ©" who will end up creating a house of cards in Terraform that nobody, themselves included, will dare to touch six months down the road.

Am I sounding too cynical?


Funnily enough, I've never worked with a dev who thought microservices were actually a good idea as a standard solution. My last brush with it is by proxy: one of my former teammates has a client contract where the client insisted they do microservices for their app, which fits in a t2.small with 90% of resources to spare and will for the foreseeable future. So far it's slowed down development to probably 25% or so of the speed it was before, and 10% of what it should be.


Experience and cynicism are closely related.


> What did I miss?

    Seven red lines, two with red ink, two with green ink and the rest – with transparent. One in the shape of a kitten.


That'd be easy but I believe you're looking for orthogonal lines! :)


There is no way in hell I'd be able to work for such a company for more than 6 months (well, maybe if it's 7 figures, but even then I don't think I'd last longer than one year). I feel sorry for whoever has to go through this.


I run a small startup. We use a monolith and docker and no kubernetes. The tech team is me and two young software engineers. I.e. I'm the only experienced person in the team. My engineers are super productive though and quick learners.

The thing is that with a small team, you need to focus your engineering efforts on things that add value, i.e. mostly functional improvements of your product. Everything else is a distraction. Any time we hit a problem, everything grinds to a halt until we fix it.

Non-value-adding activities like devops are things that we do on a need-to-have basis. I have used things like Terraform in the past, but I opted not to on this product. Reason: I'm not planning to take down my servers and recreate them. And when I do that anyway, it takes about half an hour. Not worth spending days/weeks automating. There's a reason companies employ full-time devops engineers; it's actually a lot of work. I do most of the devops in my team, when I have time and when it adds value. Which is not often. I do it well and I keep it simple.

We deploy multiple times per day though; that's worth automating. Automating things you only do once just isn't. We might get around to getting some automation for setting up our infrastructure eventually. But it doesn't really solve a problem I have right now.

Like Instagram, we can scale if we have to. We're in Google Cloud and we use a load balancer and a scaling group of simple VMs that run the Docker container. One nice thing in Google Cloud is that you can create a VM and parametrize it with the Docker container image and it will run it. No need to install anything. The default OS on the VM is Docker-ready. So that's one less thing for me to worry about: installing shit on VMs and making sure it is updated. I simply build my Docker containers, push them to the registry and tell the scaling group to apply a new instance template with the new container. It does the rolling restart. That's just 2 simple gcloud commands in a GitHub action.

Simple is essential. Easy to understand. Easy to fix when it misbehaves. Easy to explain to others. Simple ensures you don't waste time.


> One nice thing in Google cloud is that you can create a vm and parametrize it with the docker container image and it will run it. No need to install anything.

Never knew about this feature!


Agree. The state of Kubernetes increasingly reminds me of the complexity hell that is Java J2EE. It’s an ecosystem of vendors with a self-interest in selling complex solutions so they can sell high-margin consulting services on top. So was J2EE.


Give me a cpu, some memory and an S3 bucket, and I'll build you (almost) anything you want.


"When you choose the right people you'll need only a few. When you don't, you'll need them all"


I think this idea hurts the industry. Don't get me wrong, I think it is important to hire the right people, but if you don't hire enough people, this 3-man job is 80-100 hours a week per person for months to years.

Look how X has diminished in quality as Elon started slashing team sizes.

Then when you design more features, security and other various systems to serve the customers it will creep in complexity. You can not escape that no matter what you do.


> Look how X has diminished in quality as Elon started slashing team sizes.

He had a point there though: He said that there seem to be "3 managers 'managing' one engineer", and I believe this is a common problem in the industry. VC-funded startups are terribly overstaffed and over-inflated.


When you own a company and fire someone to cut costs nobody is gonna say anything, of course. It's the sad reality.

When you cut the company to 1/3 and keep foreigners because of their visa status, nobody is gonna say anything of course, but that says a lot about you!

I do not believe that half the company was just "overstaffing". I have been in situations where 1 manager had 1 direct report, but those were isolated cases - it can't have spread across the entire company with nobody doing anything about it.


> When you cut the company to 1/3 and keep foreigners because of their visa status, ... that says a lot about you!

That... you care about people regardless of whether they are foreigners, and that you try to help those who would have the most problems if they were let go, especially as these problems are a consequence of your hiring them?

I'm not familiar with the story, but from the way you presented it it sounds like a proper thing to do.

EDIT: not taking Musk's side, just pointing out the issue with the parent's argument.


Or… hear me out, maybe it’s because Elon knows they have the most to lose and are thus more likely to do whatever he wants in order not to get kicked out of the country. Elon made some statements that made his disdain for the “laptop class” quite clear. He is no saint, or genius. He’s mostly just an extremely ruthless, smart man.


Sure, that is also a possible (even likely) explanation. As I said, I don't defend Musk - my opinion of him is quite low and is getting lower and lower. Just wanted to point out the trouble with the argument. Anyway.


He has finally shown the world his true self. I am happy about that - I don't want another capitalist sanctified for being a genius, which he is not. The clearer this becomes, the better for everyone.


Sorry, you're right. I was trying to point in a different direction: you keep foreigners because you can exploit their visa status. :)


Musk's actions have certainly damaged Twitter and its engineering. When you set the house on fire of course some termites will get killed but then you might also kill some babies and the family dog.

A lot of tech companies have bloat in the form of AI ethics people, DEI people and so on. They need to go. But Musk probably hurt twitter a lot in short term by firing a lot of engineers and making it a place that made people unhappy.


Our company has a position like a DEI director, and honestly, it baffles me. All they seem to do is push what some might call 'leftist woke propaganda' in their never-ending meetings, and I question whether our company's resources should be invested in that.


Could you give some concrete examples from this DEI director of the kind of things you're calling 'leftist woke propaganda'?


I am not OP but one of the things I have seen these people do is to insist on removing words like "nitty gritty" from our developer documentation by calling it racist.


TIL: DEI == Diversity, Equity, & Inclusion (if anyone else didn't know).


Ignorance is bliss. If you do not know what DEI is, it is a bit like not knowing what it is to be overweight. You are better off not knowing.


It's just 'brand.' Companies don't generally give two stuffs about their staff, as you find out when anything like unions or HR complaints come up.


> AI ethics people

These do actually have a proper job. They do ethics laundering for the tech companies and are very valuable.


He is squeezing everything as much as he can.

Introduced failing ideas that made him look ridiculous worldwide, ... to finally hire a CEO.

And now he is using the platform to influence elections and events - free for all. Sure.


There is not a single VC-funded company that has three managers to one engineer.


"I won't give names but trust me", VC funded tech companies with 3-10 bros 'who have or not a vague technical background but more see themselves as high level thinkers' between founders and their first hires for 'key strategic roles' followed by an engineer and one to three interns to fill production role are a common reality, at least on startup scenes I attended. In the startup process 'marketing' is also glorified (not that i disagree it's importance) so once you have one or two guy who can make demos, hiring marketing people is often the next priority.


Team Lead, Scrum Master and Director.

Not to mention HR managers.

I have seen situations where there are 10% engineers to 50% "assorted management" in tech companies. (the remaining 40% being a mix of sales and support staff such as office management).


I have noticed absolutely no difference with Twitter/X. I am a casual user, sure, but for me it seems to work well.


There was a bit of downtime and some bugs did make it into their prod env after the restructure. These bugs were rectified fairly quickly too.

I generally agree with you on your comment. I am a casual user of X these days and the site seems to be humming along without any user facing engineering issues.

With his products such as Tesla, X and Starlink touching millions of people daily, Elon is an easy-to-reach punching bag.

Even if he is somehow instrumental in solving the massive feat of putting humans safely on Mars, there will always be people on the sidelines taking shots at him.


We don't know how that relates to the traffic. If the load caved, you need less capacity to handle it. The DeSantis launch campaign was a complete disaster, and only recovered mid-stream when 2/3 of the users left.

There is a case to be made that Twitter would have been profitable if it didn't torch money on unnecessary complexity, but the crashing ad revenue suggests Musk is not the business genius that you might expect.


Funny enough, where I've noticed the most bugs is in the ad buying process. They want an active account (which many corporate accounts wouldn't qualify as), then they want you to be a Twitter Blue member (which requires a verified phone number), then my company's phone number wasn't accepted. I gave up after that.


Anecdata for sure, but I use Twitter every day and features such as video upload just stopped working for me (failing silently), not to mention the site crashing or becoming completely unresponsive every week or so.


The entire multi-media infrastructure at Twitter is visibly creaking.


I'm not even a casual user, I just go there when a friend sends me a funny tweet or an article points to one, and even I have noticed major bugs like comments not loading or other functionality not working.


That's not a bug. If you're not logged in or your account looks like a bot, you can't see comments


> your account looks like a bot

How can it look like a bot if I am not a bot? I know it's 'normal', but I still see it as a bug. AI classifying me as something I'm not is a bug.


That’s exactly what a bot would say.


Doesn’t make blocking innocent accounts automatically less of a bug.


Depends on the account. E.g. if you look at unfiltered YouTube comments in YouTube Studio, you'll find many humans that behave like bots and post links to their videos in unrelated discussions.


Media loading is dog slow these days. 10+ second waits for images in tweets to load.

That's the most directly obvious thing that has happened post takeover.


> You can not escape that no matter what you do.

The point is you can.

Choosing not to do is probably more important than being able to do.

Organizations are reflected in the products they create. We shape our teams and thereafter they shape us.

This doesn't mean brain teasers or other arbitrary metrics with standard bell curve distribution so you can pick the statistical outliers and claim you've done this. That's totally wrong because that's not what you're fitting.

Those are filters that produce stochastic results with merely the statistical properties of these rules of thumb.

If you're looking for a programmer, here's a better test: think if some famous programmer walked in and sat down to do your process. Could they pass? If the answer is "dice roll", meaning you'd say, turn down Rob Pike or Larry Wall, then you're doing it wrong.

As far as X, Musk is insane and drunk with power, that is not this.


Having only 3 people forces you to be very judicious about what you do. For most companies that’s a good thing.


> 3-man job is 80-100 hours a week per person for months to years.

Is it though? Small, lean teams have fewer processes, fewer distractions, better communication, and more flexibility in what they can do. I've been in such teams and built such scalable systems, and there was nothing 80-100 hours about it. It turned worse once the company was acquired and management and "specialised" workers were brought in.


> Look how X has diminished in quality as Elon started slashing team sizes.

what happened?


Back when he acquired Twitter he fired 2/3rds of the company.


I think the interesting "what happened" is about the consequences - has there been some objective drop in uptime/performance/errors or some other technical metric?


Yes actually.

And one massive security breach.


Even before he acquired the company they had security issues. Someone even managed to tweet from his account with a BTC scam.


Could you be more specific?


This response is about what I expected


@falcor84

We hit the maximum reply depth, but yes, there has been. And a massive security breach.


Which security breach are you referring to? I'm not aware of any major ones that have happened since the acquisition, but this one seems to be the one people point to even though it happened prior. https://privacy.twitter.com/en/blog/2023/update-about-an-all...


No, what is the consequence, how has quality of Twitter diminished? As a “normal” user I can not observe any degradation of Twitter service.


Very famously he had Ron DeSantis on to announce his presidential campaign, but the site didn't work at all.


Experiences vary wildly between IP addresses and browser profiles. The variance increased drastically post-takeover.


Sure, but I feel like this was because of feature creep more than anything else.

Most common thing people use it for is posting and reading. If you fake feeds as if they are real-time then the viewer will never know there was a system outage.

I am almost positive there was a massive security breach too.


> I am almost positive there was a massive security breach too.

That's a very serious allegation, as it would be illegal to not report it. Can you add some details?


>Look how X has diminished in quality

Ideologically or from a service perspective? I don't see any noticeable drop in the latter. Some disruption is expected as huge parts of the team are fired or leave, but otherwise I see it going on as business as usual.


Considering the hostile takeover of Twitter, the layoffs, the lack of leadership and Elon pulling cables out of servers, the platform has been working exceptionally well, at least in my experience. Either it was built like a tank or some of the people working there are extremely competent. To be fair, it's not like Twitter has never been prone to issues. I'm an old enough user to remember the fail whale and many of the great outages of Twitter's first decade. It didn't become stable until maybe 5 years ago, and even recently, just prior to Elon coming in, they had another hours-long downtime.


> Look how X has diminished in quality as Elon started slashing team sizes.

True but different: even ignoring how many of them were customer support and moderation, once the tech stack gets complex, you can't just snap your fingers and act like it's a simpler stack.

Even if a fresh 3-good-graduates team can reach feature parity with your now-1000-person team, when they've had 6 months from `git init` and you've been at it since the new team were in Kindergarten, the only way for the big corp to do the same is to buy out the new team and then leave them alone.


> Look how X has diminished in quality as Elon started slashing team sizes.

Has it? I've had the exact same experience.


this is what work is.

Efficiency shrinks as the team grows; you also get other people. If you are working on something, you stay alive until it's done.

≠ a job


Twitter most likely hasn't been able to retain the "right" people.


Only people that don't want to get deported during a tech hiring slowdown


This is true.


Well, it doesn't look like they were doing much billing/payments, which eats so much time. And 14M users on a sporadically used social network with a couple of curated photos per user is not that much. It's a great achievement nonetheless.


Also, the fastest way to get rich is to win the lottery. Doesn’t mean it’s sound investment advice for everyone.


The right idea too.


The article says nothing about how they instantaneously updated millions of user feeds. It was the most challenging task, as it's way easier to scale reads than writes in distributed systems. Rumor has it early Twitter had a target of 5 seconds to update every one of 50M fans' feeds when Justin Bieber touched a screen. I would love to hear some technical details on how they did it.


I remember reading a case study from their engineering blog about this a few years ago - I couldn't quickly find it but maybe it's still out there. It was something about optimizing for read speed, because one write from a celebrity would cause thousands or millions of reads.
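
The usual way to optimize for that read-heavy ratio is fan-out on write: push each new post ID onto a per-follower list (e.g. in Redis) at write time so a feed read is one cheap lookup, and special-case the celebrity accounts whose fan-out would be too expensive. A purely illustrative sketch, not Instagram's or Twitter's actual code; the threshold and key names are invented:

    import redis

    r = redis.Redis()
    FEED_LEN = 500                    # keep only the newest items per follower
    CELEBRITY_THRESHOLD = 1_000_000   # past this, fanning out on write costs too much

    def publish_post(author_id: int, post_id: int, follower_ids: list[int]) -> None:
        if len(follower_ids) > CELEBRITY_THRESHOLD:
            # Huge accounts are skipped here; readers merge these posts in at read time.
            r.lpush(f"celebrity_posts:{author_id}", post_id)
            return
        pipe = r.pipeline()
        for follower_id in follower_ids:
            pipe.lpush(f"feed:{follower_id}", post_id)         # one cheap list read later
            pipe.ltrim(f"feed:{follower_id}", 0, FEED_LEN - 1)
        pipe.execute()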


I wonder how far they could scale with modern technology with the same architecture and 3 engineers. Not only are Django, Postgres, Redis so much better now. The hardware per server is also at least 3-5 times faster.

100 million?


Stories like this kind of make sense to me. 3 people is very few, but I guess they really knew what they were doing.

Meanwhile, all these orgs with essentially a CRUD app, with 1,000s of engineers..? That I never understood.


These CRUD apps need complex business rules, requiring expertise in the domain and making them configurable at the application level for customers, all while trying to keep the app from bloating.

Scaling is not the only challenge engineers face, but somehow it's the one that is mostly praised.


They also need to respond to customer requirements, which IG never needed to do while they had no actual customers. And as soon as the fun was over and IG had actual customers (spoiler alert: advertisers), what a surprise, 3 devs were not enough.

They also need to respond quickly to downtime, because unlike IG, if some of those CRUD apps go down in the B2B world you are often losing customers' actual money, not just ad views.


Ad views translate to actual money??


Sure. If you are a full-on marketer you can say "not seeing an ad for the service" is as bad as "the service does not work".

But anyway there was never a period where Instagram had a tiny 3 dev team and handled ads at the same time. 3 devs only worked back when there were no customers, no ads, no profits and no real responsibilities.


They stayed true to a single product purpose, did not add frivolous features and experiments, and didn't deal with payments/billing, eventually being "rescued" by VC money? You seem to gloss over that AWS also already helps them out a lot here, by being the load balancer entry point and letting them order VMs from an API.


There's no glossing, I just don't understand. It's also far from my corner of technology.

Still, between 3 devs and thousands, that's 3 orders of magnitude. Let's take the core team and 30x them for a broader project. Add 50 (?) for billing. Another 50 for web apps. Another 30 for general devops? That takes us to a "mere" 220 devs.

Then I read that Uber had around 2,000 SEs in 2020, and Airbnb at some point was 1,200. I don't know how to grok that.

You mention AWS; my impression was that even many very large orgs run their hardware in the cloud (surely not all, but still).


There seems to be a pretty big difference in business logic and requirements between a photo sharing app and one that moves or houses people.

Dealing with payments globally to hosts/drivers and specifically compliance is complex in itself.


Not engineer-driven orgs, _and_ their product people don't know how to drive the product or set requirements; or engineer-driven, but by the type of engineers who love to tinker without making any real progress business-wise.



Interesting. I wonder if it seems easy because it's explained simply or if it really is simple to put in place. I want to make a clone now, just to try. If only for this inspiration, the article was well worth the read. Thanks!


It looks fairly standard tbh. A lot of the heavy lifting is done by AWS. They likely didn’t start out with this exact architecture in mind, but adapted along the way. I.e. something like: the timestamped key format addressed poor performance; Redis was introduced when Postgres was getting swamped; they didn’t replace the existing memcached with Redis because it worked as it should. And of course there surely were a ton of oddities/issues with their AWS setup that they spent time fixing. The nginx description is a bit vague, but could be some “hack” to work around some ELB scaling behavior that wasn’t to their liking.


It's pretty simple if your goal is using boring, stable technologies. Django/Postgres/Memcached/Redis/Nginx are all really stable.

I've built quite a few projects using almost exactly the same stack over the last 15 years. Almost all are still running, those that aren't are for business reasons not technical.

The problem is, they're not the exciting Next Thing real-time JavaScript somethings, so a lot of devs won't want to use them.


Author here. Comments like this make my day, so thank you!

I’m trying to find old software engineering gems and explain them as simply as possible, so I’m glad you found it simple to understand.

Also, it’s definitely possible to make a clone, but the hard part is getting the users :)


Of course, I do not envision getting users. Only for fun and training - I think having examples of simple solutions to seemingly hard problems in mind makes it easier to come up with one in the future.


Do you have a link to Fabric? My Google Fu is failing me, looks like it might be dead?



Fabric is still an option but nowadays Ansible is probably the closest popular choice for things like that.


I was wrong; thruflo pasted the correct link.

You are probably right. The Instagram Engineering blog [0] points to the Fabric documentation on Read the Docs [1] which is empty.

"Read the docs" links to the GitHub repo [2] but it's 404 and even the GitHub organization zwsyff888 is gone.

[0] https://instagram-engineering.com/what-powers-instagram-hund...

[1] https://fabric.readthedocs.io

[2] https://github.com/zwsyff888/fabric


Instagram engineers found some remarkably simple solutions to some hard problems. It's not easy to come up with these solutions. Designing the IDs, for example, is no small feat, but since this is now common knowledge it's probably not too hard to build a similar system.

To get traction from users is the real challenge.


Which makes things like Vercel so depressing to me.

You've got a company paying off influential people in a space where people are looking for guidance, convincing them that they too have hard problems that cannot settle for simple solutions.

Selling the narrative that developers need to be all in on the most irrelevant aspects of building a product, and ignoring the fact that if you instead focus on building simple, easy to maintain software, the fact your LCP isn't hyper optimized by some newly invented mental model for app development won't matter: Google (or any search engine for that matter) will not ignore the fact people just actually want your content.

They do not care how great your web core vitals are if you waste a bunch of time bending over for some irrelevant bullshit problem instead of talking to users and iterating.


The fun part is often all that bullshit hyper optimized complexity can still be beaten on all the web core vitals by a humble LAMP server

So not only are you wasting your time on all the wrong things, you're also gaining nearly no benefit from it

I'm pretty convinced that all these shiny new hotness tools and frameworks of the past decade are actually just meant to sabotage competition and small companies


I agree - for most of them. One exception would be React/Vue/other frontend frameworks that split updating the state from visualizing it. It sounds like a small thing, but it makes a world of difference in non-trivial projects, compared to native JS/jQuery. Then again, it's the idea itself that matters; overoptimizing the implementation is not beneficial. I.e., React classes are just as good as (or better than) hooks as far as I'm concerned; it would have been better if they had left it at that stage.


Vercel and co managed to mangle the one benefit you mentioned.

They introduced the concept of a "server component": one that is rendered once and cannot update state.

And instead of making that opt-in, they made it the default.

That is to say, the default of React is to no longer allow updating state in components.

No value proposition is safe from the forces of financially incentivized thought leaders.


The Meta team introduced Server Components: https://legacy.reactjs.org/blog/2020/12/21/data-fetching-wit...


The RFC was driven by Vercel. Vercel announced RSC landing in experimental on stage before the React team even updated their docs.

I'm not even going to get into how ridiculous it is that the default config chosen was to break people's builds for a feature that isn't mature enough to have anything more than a single reference implementation from Vercel.


Yeah, we took some of their ideas about ID design and have applied it in a way that we could improve our storage.


> To get traction from users is the real challenge

Unfortunately yes. Scaling is a problem I would love to have :D.


I'm more interested in how they got 14 million users with only 4-5 people. What marketing tactics did they use? I assume those have been milked to death by now.


WhatsApp used to refresh its app by tweaking its name in the early days on the App Store so that it appeared new to the recommendation algorithm. I guess that practice is now prohibited on both stores.


Luck.


luck is being milked to death, but it only favors the living


Here's a recent insider's experience of developing Meta's Threads app on top of Instagram infra: https://newsletter.pragmaticengineer.com/p/building-metas-th...

> In July of this year, Meta launched its latest mobile app, Threads, a microblogging service and new rival to X, formerly Twitter. In the first five days following its launch, the app achieved 100M downloads – a new record for the company by some margin. Meta’s previous record for new app installs in the first 5 days after launch was 1M.

Built with a slightly larger team, though still considered an agile one: 3 product managers, 3 designers, about 60 engineers.


I think the reality of modern development, especially for an app plugged into the Meta ecosystem, is that there is probably a tonne of integration work to be done.

- monetization

- finance

- analytics

- ad placement

- ad bidding

- android client

- apple client

- browser client

Just off the top of my head in 2 mins that's a few of the extra concerns I can come up with...

I remember meeting someone who was responsible for writing an ultra performant JS WhatsApp client for firefox OS. When you have 2B users, the long tail is long...

OG Instagram was pre-monetisation. So yeah maybe a simple image hosting service needs fewer engineers but maybe a profitable business that can monetise the service needs more...


You hit the nail on the head. I'd add moderation to that list. Also, when you're in revenue mode and scaling from 14M users to billions it's no longer sustainable to outsource most of your infrastructure - OS, storage, other basic services - to a cloud provider. So you're going to need people to develop and manage those, even if you tell yourself you don't have to count them because they're elsewhere in a large company (as I was once). Then you have things like observability tools, non-trivial deployment tools, etc. that are also Other People's Effort.

Don't get me wrong, I think it's great that we live in a world where there's infrastructure lying around that makes it possible for a three-person team to achieve so much so quickly, and that team deserves a ton of credit for doing it. But it's not the "these companies are so bloated" lesson that some people are going to try to take away from it. That's just Tall Poppy Syndrome.


Remember Instagram was iOS-only for a good long while. So no need for Android developers or front-end web developers.


Didn't know that, I still don't use Instagram lol. So yeah exactly - it's easy to run a simple app with a few devs.

Iirc in Singapore, Netflix has a reasonably sized team just dealing with payments.

Presumably for a global business, payments is a pretty gnarly long tail of integrations (never worked with payments so pretty ignorant on the subject) not to mention I wonder whether they also are onboarding content distributors onto some kind of automated payment platform so they can very quickly churn content libraries or attribute some kind of viewing related payment...

I guess tldr doing business is complex and needs lots of engineers vs making a small and focused app.


I have chills thinking about the oncall schedule of these 3 colleagues...


What on-call? Was there even anything needing 24h support back then?


The beauty of a free app with no advertisers is that there are no SLAs.


Worth keeping in mind that during this period, there was only one frontend: the iOS app. It also offered a fraction of the functionality that modern IG/Twitter/FB/TikTok have now.


> 30 million monthly active users when it was acquired...[they] had just six generalist developers [1]

So they got to 30mm users with 6 engineers...scaling linearly with 5mm customers/engineer. Incredible.

[1] https://review.firstround.com/how-instagram-co-founder-mike-...


14 million doing not so complex things is an easy achievement. When you get into a lot of microservices providing tons of features your teams will balloon.


That's why it's important to keep the number of devs low, it makes it less likely that one starts talking about microservices.


We use the microservices architecture as a single team and don’t have any issues with this for many years. The key is to have a monorepo and stay consistent by following strict coding guidelines.

In my opinion it makes the backend way more resilient than a monolith.

Don’t kill me for this opinion please ;)


How does it make the backend more resilient than a monolith? Do you not realize you have multiple instances of a monolith or something?


There are some reasons it may lead to resiliency: another team's feature's slow DB queries not being on your DB, teams not mutating data in a shared DB, memory leaks in someone else's feature not taking your app down. Being able to choose languages/libraries and tune the runtime to your requirements.

Of course when you replace function calls with network calls, make everything asynchronous and eventually consistent, there is a lot of work to do to not end up with a less reliable system.


A monolith doesn't force a single database.

A monolith doesn't force a single process. IPC is still simpler and cheaper than network calls.

A monolith doesn't force never having an external service for a specialized use case, or FFI.


> IPC is still simpler and cheaper than network calls.

I specifically called out the extra complexity of network calls in microservices, not sure if you read the full comment.

> A monolith doesn't force a single process

I'm not convinced; if my small/specific code has its own process, I would say it's a microservice. Sure, we can have replicas for redundancy; that doesn't mean I won't have reliability issues when my process crashes.

> A monolith doesn't force a single database.

> A monolith doesn't force never having an external service for a specialized use case, or FFI

True, sadly it doesn't usually work this way. People take the path of least resistance.

Also once you add multiple DBs you start to get into eventual consistency; which is one of the harder parts of microservices.


> I specifically called out the extra complexity of network calls in microservices, not sure if you read the full comment.

Calling out networking doesn't preclude me from mentioning IPC. IPC isn't limited to network calls, it can be as simple as shared memory and hit millions of OPS: github.com/OpenHFT/Chronicle-Map

> I'm not convinced; if my small/specific code has its own process, I would say it's a microservice.

And you'd be wrong. A core tenet of microservices is being able to individually deploy your microservices. If I spin up a new process for some high-risk, highly memory-intensive task, I've introduced a fraction of the operational complexity of a separate server and retained the core value proposition of reducing its blast radius if things go south.

Of course, again, if you're having so much trouble writing reliable software that you begin to consider isolating instability as a top benefit of your IPC setup instead of a tiny value-add... it might be a sign you're not ready for microservices.

_

> True, sadly it doesn't usually work this way. People take the path of least resistance.

> Also once you add multiple DBs you start to get into eventual consistency; which is one of the harder parts of microservices.

You're making my point: If you don't have the engineering chops as a team to make a robust monolith, you definitely don't have the skills and resources to start looking at microservices.

Eventual consistency is not inherent to having multiple databases. If I have an oft changing ephemeral set of data that only affects one feature and it's creating an impedance mismatch with our main datastore, nothing is stopping us from pulling in Redis for all the queries we were previously sending to Postgres, and as far as anything relying on that feature is concerned, nothing at all changed.

With even half decent engineering, Redis going down doesn't break any differently than it would have for a microservice: you define the same error boundaries as before and the failure case ends up the same.
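
To make that concrete, the swap can hide behind a small module boundary, so callers never know which store is underneath. A minimal sketch with invented names:

    import redis

    # Hypothetical: this feature's ephemeral counters used to live in Postgres;
    # moving them to Redis changes only this module, not any of its callers.
    _redis = redis.Redis()

    def record_view(item_id: int) -> None:
        _redis.incr(f"views:{item_id}")

    def view_count(item_id: int) -> int:
        try:
            value = _redis.get(f"views:{item_id}")
            return int(value) if value is not None else 0
        except redis.RedisError:
            # Same error boundary a microservice would need: degrade, don't crash.
            return 0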

I mean seriously, if your team can't handle having a second data store, imagine the bedlam when you're trying to handle multiple languages across multiple data sources in a non-centralized manner?

_

Microservices are a pattern for companies where a "microservice" gets the kind of development and devops support that would justify spinning off a new mid-sized enterprise.

When you're Netflix your `api/movies/[movieId]/subtitles` endpoint is serving the kind of traffic most companies will never see in their lifetime and needs optimizations that maybe 100 companies in the world will ever need.

For the rest of us, EC2 has 224C/448T CPU, 24,000 GB RAM machines with 38 Gbps of I/O bandwidth. If your business ever scales so far that you outgrow that, throw some of that X-billion-dollar-valuation money at the problem and build your microservices.


> Calling out networking doesn't preclude me from mentioning IPC.

You made the same point I made as though it was in contradiction to what I said. Adding a network call adds complexity, yes.

> A core tenant of microservices is being able to individually deploy your microservices.

And why would you not want this to be independently deployable?

> You're making my point: If you don't have the engineering chops as a team to make a robust monolith, you definitely don't have the skills and resources to start looking at microservices.

Firstly, you never made that point. Also, I never argued against it, in fact I agree completely.

> Microservices are a pattern for companies where a "microservice" gets the kind of development and devops support that would justify spinning off a new mid-sized enterprise.

Disagree, Netflix has >1000 microservices.


Ah sorry, I guess replying to people supporting microservices by calling out the gaps in technical knowledge they're using to justify microservices is not the same as saying ..."you definitely don't have the skills and resources to start looking at microservices"

Ah, wait it is.

> And why would you not want this to be independently deployable?

Because FAANG has more engineers devoted to managing deployment/observability/version skew/DX/scaling/security than you have engineers. Simplifying your needs in those realms helps you greatly.

And to top that off, it 100% can be independently deployable if it's a big enough separate concern: that's just SOA without the 90's XML/SOAP/RPC spin that was ESB: https://aws.amazon.com/compare/the-difference-between-soa-mi...

_

> Disagree, netflix has >1000 microservices.

That says exactly nothing. At Netflix scale their most random "trivial" endpoints are easily doing scale that entire SMEs won't ever deal with.

When FAANG is your case study in any technical discussion in a public forum, you're default wrong. I work at an AV company, I'm not about to start telling people the insane architecture we need to support ingesting petabytes of data is something that anyone else needs.

Any useful technical discussion needs to be grounded in what the 99% need, and microservices are not it.


> Ah sorry, I guess replying to people supporting microservices

Again, at no point did I make an argument for microservices.

> that's just SOA

Absolutely not, from your own reference:

> Each service provides a business capability.

Spinning a high-memory task off into its own process is not a business capability. Microservices are more granular than SOA services; you're describing a microservice.

> That says exactly nothing. At Netflix scale their most random "trivial" endpoints are easily doing scale that entire SMEs won't ever deal with.

You said microservices are for when a microservice would have support equivalent to a medium enterprise; this is not true even at Netflix scale. They absolutely have services owned by very small teams, or else they wouldn't have more than 1000.

> When FAANG is your case study in any technical discussion in a public forum, you're default wrong.

Well who do we use as a case study on microservices then?

> Any useful technical discussion needs...

A technical discussion requires nuance, not turning into a black and white one side versus the other.

Yes, you can have multiple DBs in a monolith, but you tend not to. In microservices you are basically forced to.

It's a crude and expensive way to force modularisation. However, that is still what it often achieves, it gives you infra that you can keep other people away from and lets you be in charge.


Bad input crashes the app, the monolith fails over, the other instance crashes too. Full outage. Assuming proper vertical separation, this risk can be reduced by microservices.


It might just as well be increased by microservices due to fragile dependency chains and just different people working async and what not.


Maybe read my comment again.


Care to enlighten me? Your comment reads like "assuming the best case scenario for X and worst case scenario for Y, X can reduce the risk". Well, you don't say.


This is not what microservices solve.

There'll always be critical microservices that keep your app running. It doesn't matter if all your other services are running if the one serving up core functionality goes down.

If your engineering rigor is so poor that you can't get reliable failovers with a monolith, god help you keeping microservices running.


Hence “assuming proper vertical separation”. The same applies to monoliths, so not really an argument.


It's more important to keep the number of features low. Good devs talk about aligning the architecture to requirements, and that sometimes includes microservices.


This is true. I completely agree with the minimized scope. However a dev needs to be careful to not let their service just sit unmaintained. New teams will always need engineers to maintain and improve.


You talk as if monolithic apps are vastly superior. To be frank, it depends entirely on the purpose and life of the application. It is about whichever shoe fits the design.


Sure. It’s just that 99% of applications work fine (or better) as a monolithic design.


Depends on the purpose of the application though. Monolithic is a good architecture when you have a few purposeful features and functions.

But when your design relies on many services to provide a wide variety of features you need to break out this design to allow teams to operate independently.

Mini-monoliths are more popular today than the traditional monoliths of old.


You can split things up; you don't have to, though. Teams operating independently is fairly orthogonal to this.


Yeah no, I get you. You just want a monolith to be purposeful when you design one, not multi-purposeful. This is also a limitation of the programming language. I am kind of a Kubernetes guy, but I am dying because it relies heavily on a virtualized, distributed network. It would drastically increase performance if Kubernetes clusters were built like monoliths and each Kubernetes node handled traffic independently. Sort of like keeping it all in the same rack, only leaving the rack if needed. But I keep seeing bad technology decisions repeated over and over. I stopped pushing because some person with a bigger title would say this is good design. Big Kubernetes clusters eventually fail. Multiple small clusters survive.


Replacing a function call with a network call does not really solve any org issues. There is pretty much zero difference between teams shipping "modules" for a monolith vs. a microservice, outside of the much simpler CI/CD setup in the case of a monolith. You can gain some scaling efficiencies from scaling services independently, but it's a minor advantage for most projects.


> You can gain some scaling efficiencies from scaling services independently

Not necessarily, because if you scale only one of your services all the other services do not benefit at all.

Having microservices would only be better in that instance if they actually consume the resources they are given.


And you of course have data to prove it, right?


Experience.


In two decades of development I've never once seen a monolithic architecture with some form of shared database that wasn't terrible for the business it powered. I certainly understand why it's very common - it's what's still being taught to most CS students in my country, after all, and it's frankly a lot easier to implement. The result, however, is always the same. It ends up being a mess where nobody can do anything because the data structures are so intertwined (and undocumented) that nobody knows how they're actually used. What happens is that monoliths become magnets for business logic, and then you bottleneck every change into requiring a select few members of your organisation. As time goes by, you end up with a giant turd that stagnates and directly hurts your business. Not by intention, but because that's exactly what happens when you make things complicated.

It's important to keep in mind that this isn't a technical problem. It's an organisational problem. In fact, there is no technical reason why monoliths would be an anti-pattern, which is likely why they are still being taught as though they weren't at many universities, where professors still naively think that the MBAs aren't going to cost-cut IT at every opportunity even though their entire organisation is made up of employees who spend 100% of their working time on IT devices of some form. Similarly, microservices aren't really the "technical" response to this. They're how IT and digitalisation had to evolve to keep up with business demands and better generate value. The simpler and more decoupled you keep things, the better you'll be able to respond to business needs. Sure, there are a gazillion different ways to do microservices wrong, and if you do it wrong, then you'll likely be in the same mess that you would be with a monolith, only so much worse, because now you have 9 million tiny monoliths and shared databases.

Luckily we still live in a world where everyone is somehow still OK with IT not working. We went to an appointment the other day (what for isn't relevant), and they had a tablet where you could register your license plate to avoid getting a parking ticket. It didn't work, so we talked with the receptionist, who was like "yeah, it does that all the time, don't worry, if the systems are down then they can't give out tickets"... Fine for us, but think about that... It turned out the system was down in my entire city, which means that all those hundreds of employees who are out handing out tickets had nothing to do while their IT system was being fixed; hell, the entire company wasn't generating income for my city while their IT was down, and this was a regular occurrence? My point is that you can do things really wrong and still be a "successful" company; it's just that the companies who manage to generate value better (which is frankly always microservices of some form) tend to simply do better. But like I said, you can do "microservices" in a million different ways. Running two different Django backends to handle different parts of Instagram could be considered having two microservices, after all. What matters is how you deal with the needs of your organisation in a rapid fashion.


Agree with most everything you've said. Just want to point out that IT stuff not working properly is only grudgingly accepted when users are captive. Could be a government service (as in your example), corporate monopoly, or a work mandated application. If it doesn't work properly, users are stuck with it no matter what.

But for anything where there's healthy competition, this completely changes. Errors, bugs, conceptual problems, etc absolutely will have an extremely negative impact.

As an example I once worked for a company selling tickets online, but there were numerous bugs, and the system would often crash under load. Long story short, we lost many users to competitors, that company is no longer independent, and all that code is now legacy.

Compare with the monopoly situation of Ticketmaster, they are far worse than this company ever was, and are quite successful, with a large user base. That hates them ;-)


The problem with most companies and monoliths is they broke the first rule of engineering: keep it simple. A tool or service should have one purpose in mind. Multi-tools are fine if they are used infrequently, but no single tool should carry every burden, or it loses efficiency.

The same thing happens in microservices too. You just need good planning and organizing.


Honest question: what justifies microservices in the popular companies compared to instagram?

Twitter: just sharing a bunch of text.

Netflix: platform with static content, where sharing isn't even possible.

TikTok: Instagram for video, so basically just bigger files.

WhatsApp: they also pulled it off with a small team.

If you talk about Roblox, now there's a challenge! And they pulled it off with way less engineers than Twitter.


For Twitter and TikTok, I bet that the reasons are B2B offerings and not their B2C offerings.


Well, it's not like you have to get into microservices.


Can someone confirm this story who has inside knowledge?


Sources are linked at the bottom of the article!



Terrible pop up on that site. Designed by an a hole.


Threads

I find it interesting that Meta chose largely this same tech stack for their newly created Threads service.


It’s not that they made the same choices again so much as they just re-used Instagram code and infrastructure.


They had the choice of using the FB/Hack stack or the IG/Python stack, and they chose IG/Python … which is the interesting thing.


The product is basically Instagram but with text instead of images. I would expect that would be the primary motivating factor in choosing Instagram as a starting point rather than the language.


Also, this was in the days before it was acquired by FB. Hence why only 3 were needed.


I presume this architecture would look very different today? With supabase, spanner, cloudflare, etc? When scaling, the database bit is the least clear to me. How do you create an "isolated" database per customer?


I can't find where it says it's per customer. It says they have "a few physical shards". From what I understand, they had a few physical DB servers which stored their own subsets of data. When a request came in, the appropriate DB server was selected based on photo or user ID.

P.S. Here's more information: https://instagram-engineering.tumblr.com/post/10853187575/sh...
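
As that post describes it (if I'm reading it right), each logical shard is just a Postgres schema living on one of a few physical servers, so request routing is roughly this; the connection strings, placement rule, and table names below are made up:

    import psycopg2

    # Hypothetical: a couple of physical Postgres servers, each hosting many
    # logical shards, where a logical shard is a schema (shard0000, shard0001, ...).
    PHYSICAL_SERVERS = {0: "dbname=insta host=pg-a", 1: "dbname=insta host=pg-b"}

    def fetch_photo(photo_id: int):
        logical = (photo_id >> 10) & 0x1FFF                  # shard bits embedded in the ID
        dsn = PHYSICAL_SERVERS[logical % len(PHYSICAL_SERVERS)]
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(f"SELECT * FROM shard{logical:04d}.photos WHERE id = %s", (photo_id,))
            return cur.fetchone()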



