A lot of startups engage in a sort of cargo-cult architecture. Their reasoning goes something like this:
1. Amazon/Facebook/Google have a lot of traffic.
2. Amazon/Facebook/Google use X to scale horizontally.
ergo:
3. My little startup should use X and scale horizontally.
What they fail to realize is that most of these companies would be ecstatic if they could scale machines vertically, if they could focus on great user features instead of having to figure out how to shard in the application layer. You should never forget that Amazon, Facebook, and Twitter all started out as pretty basic LAMP stacks and built the tools only when it was obvious that no other tool would do. I think Google's an exception because their MVP was in fact a web-scale application. So by all means, vet your idea, get some customers, get traction, and scale the cheap way by buying more RAM for as long as you possibly can.
Before Google came along and showed the business people the benefits of horizontal scaling, any software engineer would be automatically considered crazy if they suggested an architecture that wasn't built on a central RDBMS.
So you have to weigh it against the other cargo cult. How many startups along the way have failed due to the inability to scale horizontally?
How many have failed due to too much cost and complexity associated with re-engineering an architecture in which the assumption of fully ACID transactions permeates the entire codebase? (While the phone is ringing off the hook because production systems are falling over under load.)
I realize this may not be a popular opinion on HN, but there's something to be said for planning ahead. I've seen this story before and already know how it ends: you wait until the last possible moment to switch to a more horizontally scalable system and next thing you know you're spending more time and money maintaining the "cheap" solution than it would have taken to switch to something like Cassandra beforehand. To make matters worse, your service is crashing and the short term fix takes a day and requires two or three people to do the replica switch shuffle. The long term fix will take a couple of weeks, if you have the time for it of course.
Long story short, I have to grant that you shouldn't worry about scaling too soon or too quickly. But don't go to the opposite extreme by putting it off until the last possible moment.
Granted, I'm speaking from second-hand experience, but for the last couple of years I worked closely with some former and current Amazonians who worked on their frontend and order-workflow teams. I do know for a fact that much of their frontend is/was in Perl, and that Oracle databases have their place there. But I guess that wasn't what they started with.
He mentions Fusion-io drives getting faster. What's interesting about ioDrives is that you don't have to buy a new one to get more speed: just upgrading your CPU or RAM will speed up an ioDrive.
Even among startups, web scale data requirements are the exception, not the rule. Facebook and Google are ginormous. There are many, many very impressive applications whose database wouldn't tax a single commodity server. (Similarly, there are applications that make terrible businesses but which consume computing resources like losing a byte of information would doom humanity.)
I mean, go through a list of YC companies or other startups you respect, winnow it down to the ones that exited or otherwise achieved some level of success, and play guess-the-size. How many terabytes of storage do you think e.g. Airbnb needs?
There are a ton of high-traffic websites out there that don't need an architecture any more complex than a standalone DB server + PHP + varnish (or the equivalent).
Moreover, if devs spent as much time tuning the performance of their apps as they do fantasizing about "web scale" architectural pivots, they would typically be farther ahead. StackOverflow.com is a perfect example of this. They run on a tiny handful of Windows machines, support gobs and gobs of traffic, and have absolutely fantastic performance. And as much of that is due to paying attention to performance and finding and removing bottlenecks where they exist as it is to using cutting-edge architectures like database sharding, map/reduce, eventual consistency models, etc.
One thing I've found is that scaling rarely means solving difficult problems. Rather, it means putting more time into finding optimal solutions to problems that are trivial at smaller scale. For example, should your startup use Apache, nginx, or HAProxy as a load balancer? If you're just launching, the answer is "Who cares, just ship the fucking thing!". If you reach the point where you start measuring page views in the billions (and yes, there are startups at this point), it matters a great deal. Or should you use Postgres, MySQL, or some shiny NoSQL thing? Again, it probably doesn't matter for small websites. But for larger services, it matters.
Also, don't underestimate how large log files can grow in a data-driven business (like AirBNB seems to be). I could easily believe that they have many terabytes of data just from logging actions their customers have taken.
> Also, don't underestimate how large log files can grow in a data-driven business (like AirBNB seems to be). I could easily believe that they have many terabytes of data just from logging actions their customers have taken.
Logs don't have remotely the same access requirements as the databases used to serve a product.
Indeed, but it's worth pointing out that in this case "different" doesn't necessarily imply "easier". Instead of having to access the data across many concurrent connections, you have to store it efficiently so that it doesn't take up too much space, and you have to be able to run jobs on it that don't take three weeks to complete. And let's not get into how you collect and merge the logs together. There are open source tools to do these things, but you're still looking at a decent amount of infrastructure to make it work.
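To make just the "merge" step concrete, here's a toy sketch in Python. It assumes timestamp-prefixed log lines, so a lexicographic merge is also a chronological one; real pipelines use dedicated tools, but the shape of the problem is the same:

    import gzip
    import heapq

    def merged_logs(paths):
        # Lazily k-way merge many gzip-compressed, time-sorted log files.
        # Assumes each line starts with a sortable timestamp.
        files = [gzip.open(p, "rt") for p in paths]
        return heapq.merge(*files)

And even this ignores the collection and storage sides, which is where most of the real infrastructure effort goes.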
Perhaps for the applications of yesterday (like Basecamp) this is the case, but the real innovation taking place is around collecting massive amounts of data and processing it in interesting ways. These systems are used every day to make quantified business decisions rather than best guesses based on someone's hunch. 37signals builds questionably good UIs on top of a database, something people have been doing for decades now. The future is in augmenting intelligence by gathering massive amounts of data and reducing it for human consumption.
I don't disagree with anything you've stated. Explosive growth and requiring massive amounts of data storage are surely the exception not the rule.
That said, the blog post talks about enormous growth, and it still fits inside Moore's Law's growth. I guess my gut is just saying it's not really that enormous in terms of startup scaling if it's still within those limits. Not to take anything away from 37signals' success, but it feels like nothing of value was really added by this post. As supplementary evidence, I present the post with a picture of 864GB of RAM that is near the top of HN right now.
I think most "other startups" would LOVE to be in the ballpark with Basecamp in terms of usage. No, Basecamp is not big like Facebook, but you've heard of them, right? Most "ordinary" business people who do project planning work have probably heard of Basecamp too. Most people never hear about most startups, fewer try their services, and fewer than that become regular users.
The point of the post is that in many/most cases, it's still easier and cheaper to throw hardware at a performance problem than to devote scarce engineering effort to optimization. And it's only getting more and more so. If you are a startup and you can throw $10,000 of hardware at a problem you can then keep your $100K engineers working on things that hardware alone can't solve.
Most "ordinary" business people who do project planning work have probably heard of BaseCamp too.
I think this might be the tech bubble showing. I've never heard of any non-tech, non-startup companies using Basecamp. I'm not saying that they don't, of course, but I'd be interested to hear some case studies in its use outside of tech-savvy crowds.
Last I checked, I believe something like 20% of our customers were from the tech/startup scene. The vast majority of our customers are regular businesses.
But yes, we're still tiny compared to behemoths like SharePoint. All the more reason to be excited about the next 20 years!
And there was a great press quote for you in the comments: "$50 is peanuts for what you get in Basecamp. Any business doing $1000 a month should find huge leverage from it."
Exactly. As horrible as it is, SharePoint rules this area and probably has an install base that dwarfs Basecamp's. When the average corporate person thinks project and file management, they think SharePoint.
Most programmers probably work at either BigCo enterprises (banks, insurance companies, telcos, etc.) or at "startup"-type tech businesses or freelance agencies.
There are other industries that contain a lot of small businesses and probably employ very few programmers: think restaurants, local shops, small law or accountancy practices, etc.
These guys probably aren't using SharePoint; most of them are probably using Excel spreadsheets combined with paper.
I'm not sure how many of these guys are using things like Basecamp, but a lot of them probably should be.
I thought 37signals' website has videos of "satisfied" users, and some of them don't seem to be "tech" or "startup". Probably small businesses, but I don't think we can lump all "small businesses" in with "startups".
> Most "ordinary" business people who do project planning work have probably heard of BaseCamp too.
I don't think so. Maybe we just have a different opinion on "ordinary", but I come across people all the time who think Excel is the normal thing to use for this (and pretty much any other task...). Using web applications for day-to-day workflow is still alien to a lot of "ordinary" business types.
I did a presentation on a game I made at Geoloqi that talks about the Fusion IO drives and why I think local hardware with SSD is the best solution for persistent data stores right now: http://www.slideshare.net/KyleDrake/building-mapattack
The one warning I want to provide though is that not all SSDs are created equal: Make sure you get one that writes its cache to the disk on power failure, or you're going to be in a world of hurt.
That's what I thought too. Fusion-io drives all flush their write buffer to the NAND flash on a power-cut event; I wasn't aware that some SSDs didn't. People use ioDrives as a caching layer in conjunction with traditional spinning hard drives, so I wondered if he was possibly referring to that kind of setup.
To guarantee on-disk consistency, programs like database servers call fsync for every transaction. By definition, an fsync involves events at the hardware level, and even on an SSD this causes a slowdown. The SSD can use a write cache to speed this up. However, should you lose power when the write cache is not backed by a small battery, you lose your most recent write(s) even though the latest fsync call guaranteed that the changes were on disk. The battery allows the SSD to successfully flush the write cache to flash in the event that the computer shuts down.
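A minimal sketch of that contract, in Python for brevity (illustrative only; a real database server does this in C around its commit path, and "journal.log" is just a hypothetical file name):

    import os

    # Append a commit record and force it to stable storage.
    fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.write(fd, b"COMMIT txn 42\n")
    os.fsync(fd)  # must not return until the bytes are durable; a volatile,
                  # non-battery-backed write cache silently breaks this promise
    os.close(fd)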
Pedantry -- that isn't Moore's Law related, aside from the loosest interpretation that computers as a whole get faster.
This is so true, though. Much of the need for horizontal scalability comes from the land of extremely underpowered VPS machines on platforms like AWS. Yet the mind-boggling scale of performance you can inexpensively* (*term used relatively) acquire for database servers is astonishing. SSDs (and plug-in flash drives) and boundless memory have changed everything.
Actually, RAM would be the perfect application of the exact meaning of Moore's law: more RAM is an almost direct function of more transistors, and Moore's law is:
> the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years
Moore's law precisely predicts you can double your amount of RAM every two years for the exact same price. Which is pretty much what TFA is about.
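A rough way to formalize that (my framing, not TFA's): at constant price, capacity after t years is

    C(t) = C_0 * 2^(t/2)

so the RAM you buy today costs the same as twice as much in two years and four times as much in four.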
Given all the problems they've had with AWS, I wonder what the overall cost/benefit would be for someone like Reddit moving from the flexed-EC2-servers model to just one high-powered db server and some caching webservers in front of it.
Reddit's problem was probably trying to use EBS as a backend for a transactional database. The issue is that EBS makes no guarantees on latency, only that they'll never lose your data.
Amazon is not saying it, but EBS is probably lying when doing an fsync().
In this post [1] explaining why Reddit was down, one problem was that the DB slave got ahead of the master, the most probable explanation being that the master flagged the data as safe for replication before committing it to disk. And I trust PostgreSQL more than I trust EBS.
In the forum answer Amazon gives about this potential problem [2], they dodge the question by saying that fsync() guarantees durability for instance failures, but not for volume failures, with the annual failure rate (AFR) given as 0.1% - 0.5% for volumes (how accurate that is remains to be seen).
So EBS is probably lying about fsync's success, especially since the behavior of fsync in virtual environments is always a surprise. You can definitely lose your data more frequently than if you had your own hardware.
> Amazon is not saying it, but EBS is probably lying when doing an fsync().
Given that POSIX says "the nature of the transfer is implementation-defined" and they've defined what fsync does on their implementation: No, they're not lying.
I wouldn't say that they're being misleading, either. On a local physical disk, once fsync returns your data is safely stored unless/until the disk dies. According to the forum post you linked to, semantics on EBS are exactly the same.
fsync does not mean "has been written to spinning magnetic media".
Reddit could easily cut their spend on hosting by like 50-75% just by moving to a dedicated hosting provider. They'd also get bare metal speeds in the process.
Who knows why they've stuck with AWS, though. It's possible Amazon is giving them a discount so they can be used as an example.
What never seems to get mentioned is energy cost. Sure, you can keep throwing hardware at the problem, but eventually that will lead to a substantial power bill.
I agree with what both of you have said. It's possible that, in the not-too-distant future, "green programming" will become a new field. If you can design an architecture (both hardware and software) that demonstrably reduces power costs, that's a valuable skill.
And, I'd argue, on two fronts: lower costs for the company, plus good marketing. Every forward-thinking company (at least from what I've observed) is eager to brand itself as "carbon neutral", "eco-friendly", etc.
This is nothing more than paying the interest on your technical debt instead of paying off the balance. The longer you wait, the more it grows and the bigger the problem becomes down the road. Right now it's $20K for a new system, and for your trouble you still get no redundancy. Yes, RAM is cheap and you can just buy a bigger system next year, but eventually you will hit a limit (growth is like that), and then you will need to scale your now large and complex system. Code will need to be reworked. Interfaces will need to be re-architected. The guy who wrote that bit of code that runs your whole business may not be with you any more; even if he is, he doesn't really remember why he put in that hack at 3:00 AM one morning when stuff got out of hand, back when you had the horsepower to spare. Now you have downtime and huge development costs, because you ignored a problem and instead threw hardware at it.
A similar thing I meet regularly is people grossly underestimating how many bytes per second you can really get from basic hardware such as hard drives, gigabit Ethernet cards, etc. People often still remember that in 1998 the fastest FC 10k RPM drives hardly topped 12 MB/s, and that you needed an 8-CPU Origin2000 to push 80 MB/s through a GigE link (given that you had a terrific RAID array at both ends).
Nowadays, even the most basic PC can saturate a GigE link (115 MB/s), the slowest hard drives do 100 MB/s, any SSD sustains several thousand IOPS, and so on.
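For reference, the 115 MB/s figure is just the line rate minus framing overhead, roughly:

    1 Gbit/s = 125 MB/s raw
    minus Ethernet/IP/TCP headers (~5-8%) ~= 115-118 MB/s of payload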
One of the most unfortunate results is that people often buy hugely powerful hardware when very basic stuff would have done the job just fine. How many people have I seen running puny workloads on $100k 3Par or EMC arrays, where a handful of SSDs in a server would have done as well or better? Pulling 40Gb fibre across the room when Cat 6 cable would have sufficed?
For those with custom hardware setups like this, how much time do you estimate you've spent designing, building, and maintaining custom servers?
I can see how it would be less effort than sharding, but it seems like throwing hardware at the problem is not a completely "free lunch" when you get to the scale of having to custom design a server supercomputer and source specially made SSDs and terabytes of RAM.
You don't need to "custom design a server supercomputer". This stuff is as mainstream as Dell. You can go to dell.com and order a PowerEdge R815 with 64 processor cores and 512GB of RAM for $16,000. If you want SSDs, you can purchase SATA or PCIe ones from Dell as well.
Gah, stop abusing the word punt. The headline is nonsensical.
Punt does not mean avoidance. It has much closer associations with "attempt" (punter, plus the rugby usage).
I totally agree with the rest of the article: why shard or otherwise distribute your database before it's required? Although I think I would build my DB as shard 1 of 1 to allow for the future case.
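By "shard 1 of 1" I mean something like the following (Python for brevity, names hypothetical): route every lookup through a shard function from day one, so that scaling out later means bumping NUM_SHARDS and migrating data rather than touching every call site.

    # Hypothetical sketch: all DB access goes through shard_for(),
    # even while there is only a single shard.
    NUM_SHARDS = 1
    SHARD_DSNS = {0: "postgresql://localhost/app"}

    def shard_for(user_id):
        return SHARD_DSNS[user_id % NUM_SHARDS]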
You're thinking of British English. In America, punt means avoid, and most people here probably think of an American football punt--which is the voluntary end of a failed offensive possession--so it suggests you're giving up for now and will try again later. We probably bastardized it, sure, but that ship has sailed.