You Are Not Google (2017) (bradfieldcs.com)
1623 points by gerbilly on April 4, 2019 | hide | past | favorite | 572 comments



The big issue I think we miss when people say "why are you using Dynamo, just use SQL" or "why are you using hadoop, a bash shell would be faster" or "why are you using containers and kubernetes, just host a raspberry pi in your closet":

The former examples are all managed! That's amazing for scaling teams.

(SQL can be managed with, say, RDS. Sure. But it's not the same level of managed as Dynamo (or Firebase or something like that). It still requires maintenance and tuning and upkeep. Maybe that's fine for you (remember: the point of this article was to tell you to THINK, not to just ignore any big tech products that come out). But don't discount the advantage of true serverless.)

My goal is to be totally unable to SSH into everything that powers my app. I'm not saying that I want a stack where I don't have to. I'm saying that I literally cannot, even if I wanted to real bad. That's why serverless is the future; not because of the massive scale it enables, but because fuck maintenance, fuck operations, fuck worrying about buffer overflow bugs in OpenSSL, I'll pay Amazon $N/month to do that for me, all that matters is the product.


For anyone who's not used these "managed" services before, I want to add that it's still a fuck ton of work. The work shifts from "keeping X server running" to "how do I begin to configure and tune this service". You will run into performance issues, config gotchas, voodoo tuning, and maintenance concerns with any of AWS's managed databases or k8s.

> I'll pay Amazon $N/month to do that for me

Until you pay Amazon $N/month to provide the service, and then another $M/month to a human to manage it for you.


Exactly. There's no silver bullet, only trade offs.

In this case you're only shifting the complexity from "maintaining" to "orchestrating". "Maintaining" means you build (in a semi-automated way) once and most of your work is spent keeping the services running. In the latter, you spend most of your time building the "orchestration" and little time maintaining.

If your product is still small, it makes sense to keep most of your infrastructure in "maintaining" since the number of services is small. As the product grows (and your company starts hiring ops people), you can slowly migrate to "orchestrating".


> There's no silver bullet, only trade offs.

I see this a lot and it bugs me, because it implies that it's all zero sum and there's nothing that's ever unconditionally better or worse than anything else. But that's clearly ridiculous. A hammer is unconditionally better than a rock for putting nails in things. The beauty of progress is that it enables us to have our cake and eat it too. There is no law that necessitates a dichotomy between powerful and easy.


I think you are misunderstanding the phrase trade-off. A trade-off means that something is good for one thing, but not as good for something else. Your examples are exactly an example of a trade-off. A hammer is good at hammering nails, but not good for screwing in screws, because there has been a trade-off to design the tool to make it better for hammering nails than for screwing in screws.

All tools are better at one thing than another. The point of examining trade-offs is to decide which advantages are more appropriate to which circumstance.


Why did you rephrase his analogy as a comparison between a hammer and a screwdriver, when the whole point of the comparison was that a hammer is strictly better than a rock?


A hammer isn't strictly better than a rock, it costs a lot more money/resources to obtain :)


> it implies that it's all zero sum and there's nothing that's ever unconditionally better or worse than anything else

Not necessarily - to start with, a good trade-off is unconditionally better than a bad trade-off.

Also, progress brings with it increasing complexity. Recognizing the best path includes assessing many more parameters and is far more difficult than deciding whether to use a hammer or a rock to nail things. The puzzling tendency of many people to over-complicate things instead of simplifying them makes the challenge even more difficult.

By the way, the closest thing I've ever found to a silver bullet in software development (and basically any endeavor) is "Keep it simple". While this is a cliche already, it is still too often overlooked. I think this is because it isn't related to the ability to be a master at manipulating code and logic, but to the ability to focus on what's really important - to know how to discard the trendy for the practical, the adventurous methods for the focused and solid - basically passionately applying Occam's razor on every abstraction layer. If this were more common, I think articles like "You are not Google" would be less common.


Ever wonder how the Go team seems to get stuff done more efficiently than other groups? It's not Go, it's that they simplify (perhaps oversimplify).


> Not necessarily - to start with, a good trade-off is unconditionally better than a bad trade-off.

Are you better off with a simple dokku server, or a k8s cluster?

There's lots of good/bad on both sides... it really depends on your needs. You could use one locally or through dev layers and another for production.


You can still analyze the trade-offs of your hammer / rock choice, e.g.

Hammer pros: very efficient

Hammer cons: costs money, only available in industrialized societies

Rock pros: ubiquitous and free

Rock cons: inefficient

The point is that in tech, there is no choice which has only pros, and no choice which has only cons.


True, but nobody is using a rock still to hammer nails, or advocates for it.

Given any important problem field, it's quite likely that among the extant top solutions there is no product that is strictly dominant (better in every dimension) - most likely there will be trade offs.


I used rocks a few times as a kid to build some forts from scrap wood. You could nail the framing together for any modern house with a rock.

However, if you bend any nails you can't pull the nail out like with a normal framing hammer. Plus you can drive a nail faster with a proper hammer. So just use a hammer.

Problem is, with software the tool also becomes part of what you're working on. So it's never quite like this hammer and stone analogy.

Really a better analogy would be fasteners. So for instance we just have screws, nails, and joints. A simplified comparison would look like this, and they all become part of what you're building.

  Nails
    Pros
      Fast to drive
      Still pretty strong
      Easier to remove mistakes or to disassemble.
    Cons
      Not as strong as other methods of fastening
      May crack wood if driven at the end of a board.

  Screws
    Pros
      Almost as fast as nails with a screw gun.
      Stronger joint than nails.
    Cons
      Can crack wood like nails without pre-drilling.
      Slower to remove.

  Joints
    Pros
      Strong as the wood used.
      Lasts as long as the wood.
    Cons
      Very slow; requires chiseling and cutting wood into tight interlocking shapes.
      If it's any type of joint with adhesive it can't be taken apart.


Nails are pretty forgiving to the type/size of hammer used.

Screws can be pretty finicky, for example using a too-small Phillips screwdriver can strip the screw-head, and make it very difficult to tighten further or to remove.

I'm pretty fond of those screws that can take a flathead or Phillips screwdriver, though.

How about nuts+bolts?


You can have options that are definitely worse than the rest without having a silver bullet, i.e. an option that's definitely better than the rest. For nail driving consider a hammer and a pneumatic nail gun. Both are better than a rock, but you couldn't call either one a silver bullet. You still have to think about which one is best for your usage.


Funny enough, I've experienced the largest benefits from "scaling down" with Amazon's managed databases.

For instance I made an email newsletter system which handles subscriptions, verifications, unsubscribes, removing bounces, etc. based on Lambda, DynamoDB, and SES. What's nice about it is that I don't need to have a whole VM running all the time when I just have to process a few subscriptions a day and an occasional burst of work when the newsletter goes out.


I have a db.example.com $10 a month VM on digital ocean.

It is strictly for dev and staging. Not actual production use because prod doesn't exist yet anyways.

My question is what kind of maintenance should I be doing? I don't see any maintenance. I run apt upgrade about once a month. I'm sure you'd probably want to go with something like Google Cloud or Amazon or ElephantSQL for production purely to CYA but other than that if you don't anticipate heavy load, why not just get a cheap VM and run postgresql yourself? I mean I think ssh login is pretty safe if you disable password login, right? What maintenance am I missing? Assuming you don't care about the data and are willing to do some work to recreate your databases when something goes wrong, I think a virtual machine with linode or digital ocean or even your own data center is not bad?


Amazon are pushing very hard to convince the next generation of devs who grew up writing front-end JS that databases, servers etc. are some kind of technical wizardry best outsourced, when a few weeks of reading and playing around would be enough to get them up to speed.


Hate to be super paranoid, but isn't it rather convenient the top comment on this section expresses exactly this sentiment? If anything, it proves that this perspective is working, or at least, a huge host of devs online really are front-end JS'ers who have this opinion already.


From an engineering/development sense, this is a good thing, because it means that devs are cheaper. Most devs can't even handle being a good client of things like databases. They barely comprehend what the underlying theories are behind SQL (eg sets etc).

Just like early electricity, people ran their own generators. That got outsourced, so the standard "sparky" wouldn't have the faintest idea of the requirements of the generation side, only the demand side.

The same is happening to programming.


1. Programmers are supposed to be professionals. They certainly want to be paid like professionals.

2. The effects of the back end have a nasty habit of poking through in a way that the differences between windmill generators and pumped hydraulic storage don't.


The difference is that in the case of programming, for the cost of some side-study you can get yourself a competitive advantage.


Very much this. For most use cases, the out-of-the-box configuration is fine until you hit ridiculous scale, and it's not really all that complicated to keep a service running if you take time to read the docs.


Jason Fried's Getting Real chapter titled "Scale Later" just came to mind. Page 44.

"For example, we ran Basecamp on a single server for the first year. Because we went with such a simple setup, it only took a week to implement. We didn’t start with a cluster of 15 boxes or spend months worrying about scaling. Did we experience any problems? A few. But we also realized that most of the problems we feared, like a brief slowdown, really weren’t that big of a deal to customers. As long as you keep people in the loop, and are honest about the situation, they’ll understand."


It’s not about “getting up to speed”. It’s about not having to manage it on an ongoing basis.

I wouldn’t work for a company that expects devs to both develop and manage resources that a cloud provider can manage.

How well can you “manage” a MySQL database with storage redundancy across three availability zones and synchronous autoscaling read replicas?

How well can you manage an autoscaling database that costs basically nothing when you’re not using it but scales to handle spiky traffic when you do?


Right.. so now your developers don't need to understand how to configure and tune open source directory services and RDBMS's and in-memory caches... they just need to understand how to configure and tune a cloud-provider's implementation of a directory service and RDBMS and in-memory cache..... ?

If you think using a cloud "service" out-of-the-box will "just work" your scale is probably small enough that a single server with the equivalent package installed with the default settings is going to "just work" too.


You did just read our use case didn’t you? Yes we could overprovision a single server with 5x the resources for the once a week indexing.

We could also have 4 other servers running all of the time even when we weren’t demoing anything in our UAT environment.

We could also not have any redundancy and separate out the reads and writes.

No one said the developers didn’t need to understand how to do it. I said we didn’t have to worry about maintaining infrastructure and overprovisioning.

We also have bulk processors that run messages at a trickle based on incoming instances during the day, but at night and especially at the end of the week, we need 8 times the resources to meet our SLAs. Should we also overprovision that and run 8 servers all of the time?


> Yes we could overprovision a single server with 5x the resources for the once a week indexing.

You assume this is bad, but why? Your alternative seems to be locking yourself into expensive and highly proprietary vendor solutions that have their own learning curve and a wide variety of tradeoffs and complications.

The point being made is that you are still worrying about maintaining infrastructure and overprovisioning, because you're now spending time specializing your system and perhaps entire company to a specific vendor's serverless solutions.

To be clear, I don't have anything against cloud infrastructure really, but I do think some folks really don't seem to understand how powerful simple, battle-tested tools like PostgreSQL are. "Overprovisioning" may be way less of an issue than you imply (especially if you seriously only need 8 machines), and replication for PostgreSQL is a long solved problem.


> You assume this is bad, but why? Your alternative seems to be locking yourself into expensive and highly proprietary vendor solutions that have their own learning curve and a wide variety of tradeoffs and complications.

So having 4-8 times as many servers, with (unlike on AWS) five times as much storage (1 master and 8 slaves), is better than the theoretical “lock-in”? You realize that with the read replicas, you’re only paying once for storage since they all use the same (redundant) storage?

Where is the “lock-in”? It is MySQL. You use the same tools to transfer data from Aurora/MySQL that you would use to transfer data from any other MySQL installation.

But we should host our entire enterprise on digital ocean or linode just in case one day we want to move our entire infrastructure to another provider?

Out of all of the business risks that most companies face, lock-in to AWS is the least of them.

> The point being made is that you are still worrying about maintaining infrastructure and overprovisioning, because you're now spending time specializing your system and perhaps entire company to a specific vendor's serverless solutions.

How are we “specializing our system”? We have one connection string for read/writes and one for just reads. I’ve been doing the same thing since the mid 2000s with MySQL on prem. AWS simply load balances the readers and adds more as needed.

And you don’t see any issue with spending 5 to 9 times as much on both storage and CPU? You do realize that while we are just talking about production databases, just like any other company we have multiple environments, some of which are only used sporadically. Those environments are mostly shut down - including the actual database server - until we need them, and they scale up with reads and writes with Aurora Serverless when we do need them.

You have no idea how easy it is to set up autoscaling read replicas, do you?


Usually you pay much more and have more servers when you use AWS than a basic server provider


I’m making up numbers just to make the math easier.

If for production we need 5x of our baseline capacity to handle peak load are you saying that we could get our server from a basic server provider for 4 * 0.20 ( 1/5 of the time we need to scale our read replicas up) + 1?
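
Spelling out the arithmetic behind those made-up numbers, as a rough sketch: one baseline server runs all the time, and the four extra servers needed to reach 5x run only a fifth of the time. The function and parameter names are illustrative, and a "unit" is just one baseline server's worth of capacity, not a price.

```python
def average_capacity(baseline, peak_multiple, peak_fraction):
    """Average capacity paid for when the extra servers run only at peak.

    baseline:      capacity that is always on (1 = one baseline server)
    peak_multiple: peak load as a multiple of baseline (5 means 5x)
    peak_fraction: share of the time spent at peak (0.20 means 1/5)
    """
    extra = (peak_multiple - 1) * baseline
    return baseline + extra * peak_fraction


autoscaled = average_capacity(1, 5, 0.20)  # 1 + 4 * 0.20
always_on = 1 * 5                          # provisioned for peak around the clock

print(f"autoscaled: {autoscaled:.2f} units, always on: {always_on} units")
```

Under these assumptions the autoscaled setup averages 1.8 units against 5 for peak provisioning, so a basic provider would have to charge less than about 36% of the per-unit price to come out ahead.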

Are you saying that we could get non production servers at 25% of the cost if they had to run all of the time, compared to Aurora Serverless where we aren’t being charged at all for CPU/Memory until a request is made and the servers are brought up? Yes there is latency for the first request - but these are our non production/non staging environments.

Can we get point in time recovery?

And this is just databases.

We also have an autoscaling group of VMs based on messages in a queue. We have one relatively small instance that handles the trickle of messages that come during the day in production that can scale up to 10 at night when we do bulk processing. This is just in production. We have no instances running when the queue is empty in non production environments. Should we also have enough servers to have 30-40 VMs running with only 20% utilization?
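
The queue-driven scaling rule described here can be sketched roughly as below. This is not AWS's actual policy engine, just the shape of the decision; `messages_per_instance` and the function name are hypothetical, while the floor of 1 (production), the floor of 0 (non production) and the ceiling of 10 come from the paragraph above.

```python
import math


def desired_instances(queue_depth, messages_per_instance,
                      min_instances, max_instances):
    """Pick an instance count from the number of queued messages.

    messages_per_instance is a hypothetical tuning knob: how many queued
    messages one instance is expected to work through per scaling interval.
    """
    if queue_depth <= 0:
        # Empty queue: fall back to the floor (0 in non production, 1 in prod).
        return min_instances
    needed = math.ceil(queue_depth / messages_per_instance)
    # Clamp between the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


# Production: one small instance by day, up to 10 for the nightly bulk run.
print(desired_instances(40, 100, min_instances=1, max_instances=10))    # daytime trickle
print(desired_instances(5000, 100, min_instances=1, max_instances=10))  # nightly bulk load
# Non production: nothing running while the queue is empty.
print(desired_instances(0, 100, min_instances=0, max_instances=10))
```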

Should we also set up our own servers for object storage across multiple data centers?

What about our data center overseas close to our offshore developers?

If you have more servers on AWS you’re doing it wrong.

We don’t even manage build servers. When we push our code to git, CodeBuild spins up either prebuilt or custom Docker containers (on servers that we don’t manage) to build and run unit tests on our code based on a yaml file with a list of Shell commands.

It deploys code as Lambda functions to servers we don’t manage. The amount of Lambda usage AWS gives you in the always free tier is ridiculously high. No, our lambdas don’t “lock us in”. I deploy standard NodeJS/Express, C#/WebAPI, and Python/Django code that can be deployed to either Lambda or a VM just by changing a single step in our deployment pipeline.


Basic replication is maybe close to what you could call solved, but I'd say that there are still complications: georeplicating multi-master write machines is still quite complicated and needs a dedicated person to manage. Hiring being what it is, it might just be easier to let Amazon hire that person for you and pay Amazon directly.

I see cloud services as a proxy to hire talented devops/dba people, and efficiently multiplex their time across several companies, rather than each company hiring mediocre devops/dba engineers. That said, I agree that for quite a few smaller companies, in house infrastructure will do the job almost as well as managed services, at much cheaper numbers. Either way, this is not an engineering decision, it's a managerial one, and the tradeoffs are around developer time, hiring and cost.


> I see cloud services as a proxy to hire talented devops/dba people, and efficiently multiplex their time across several companies

It's only a proxy in the sense that it hides them (the ops/dbas) behind a wall, and you can't actually talk directly to them about what you want to do, or what's wrong.

If you don't want to hire staff directly, consulting companies like Percona will give you direct, specific advice and support.


If something goes wrong, we can submit a ticket to support and chat/call a support person immediately at AWS. We have a real business that actually charges our (multi million dollar business) customers enough to pay for a business level support.

But in your experience, what has “gone wrong” with AWS that you could have fixed yourself if you were hosting on prem?


> Basic replication is maybe close to what you could call solved,

Locally hosted basic synchronous read replicas are a solved problem?


"No one said the developers didn’t need to understand how to do it. I said we didn’t have to worry about maintaining infrastructure and overprovisioning."

If you do not do something regularly, you tend to lose the ability to do it at all. Personally, and especially organizationally.


There is a difference between maintaining MySQL servers and the underlying operating system on the one hand, and writing efficient queries, optimizing indexes, knowing how to design a normalized table and when to denormalize, looking at the logs to see which queries are performing slowly, etc. on the other. Using AWS doesn’t absolve you from knowing how to use AWS.

There is no value add in the “undifferentiated heavy lifting”. It is not a company's competitive advantage to know how to do the grunt work of server administration - unless it is. Of course Dropbox or Backblaze have to optimize their low profit margin storage business.


Why not run eight servers all the time? If you are running at a scale where that is a cost you notice at all, you are not only in a very early stage, you're actually not even a company.


There are many, MANY software companies whose infrastructure needs are on the order of single digits of normal-strength servers and who are profitable to the tune of millions of dollars a year. These aren’t companies staffed with penny-pinching optimization savants; some software, even at scale, just doesn’t need that kind of gear.


A multi-million-dollar tech company with tens of employees cannot run a sane infrastructure with a single-digit number of servers.

For a trivial website in production: 2 web servers + 1 database + 1 replica.

For internal tooling: 1 CI and build server + 2 development and testing servers + 1 storage, file share, ftp server + 1 backup server.

For desktop support: At least 1 server for DHCP, DNS, Active Directory + firewall + router.

That's already 10 servers and not counting networking equipment. Less than that and you're cutting corners.


Web servers - lambda

Build server - CodeBuild you either run with prebuilt Docker containers or you use a custom built Docker container that automatically gets launched when you push your code to GitHub/CodeCommit. No server involved.

Fileshare - a lot of companies just use Dropbox or OneDrive. No server involved

FTP - managed AWS SFTP Service. No server involved.

DHCP - Managed VPN Service by AWS. No server involved.

DNS - Route 53 and with Amazon Certificate Manager it will manage SSL certificates attached to your load balancer and CDN and auto renew. No servers involved.

Active Directory - Managed by AWS no server involved.

Firewall and router - no server to Manage. You create security groups and attach them to your EC2 instances, databases, etc.

You set your routing table up and attach it to your VMs.

Networking equipment and routers - again that’s a CloudFormation template, or go old school and just do the configuration on a website.


[flagged]


Yes I realize SFTP is not FTP. But I also realize that no one in their right mind is going to deliver data over something as insecure as FTP in 2019.

We weren’t allowed to use regular old FTP in the early 2000s when I was working for a bill processor. We definitely couldn’t use one now and be compliant with anything.

I was trying to give you the benefit of the doubt.

> Amateur mistake that proves you have no experience running any this.

In case it doesn’t give you a clue, the 74 in my name is the year I was born. I’ve been around for a while. My first internet enabled app was over the gopher protocol.

How else do you think I got shareware from the info-Mac archives over a 7 bit line using the Kermit protocol if not via ftp? Is that proof enough for you or do I need to start droning on about how to optimize 65C02 assembly language programs by trying to store as much data in the first page of memory because reading from the first page took two clock cycles on 8 bit Apple //e machines instead of 3?

We don’t “share” large files. We share a bunch of Office docs and PDFs as do most companies.

Yes, you do have to run DNS, Active Directory + VPN. You said you couldn’t do it without running “servers”.

No we don’t have servers called

SFTP-01

ADFS-01

Etc.

either on prem or in the cloud.

Even most companies that I’ve worked for that don’t use a cloud provider have their servers at a colo.

We would still be using shares hosted somewhere not on prem. How is that different from using one of the AWS storage gateway products?


9 servers (8 readers and 1 writer) running all of the time with asynchronous replication (as opposed to synchronous replication) with duplicate data - yes the storage is shared between all of the replicas.

Not to mention the four lower environments some of which the databases are automatically spun up from 0 and scaled up as needed (Aurora Serverless)

Should we also maintain those same read replicas servers in our other environments when we want to do performance testing?

Should we maintain servers overseas for our outsourced workers?

Here we are just talking about Aurora/MySQL databases. I haven’t even gotten into our VMs, load balancer, object store (S3), queueing server (or lack thereof since we use SQS/SNS), our OLAP database (Redshift - no we are not “locked in”, it uses standard Postgres drivers), etc.

AWS is not about saving money on like for like resources compared to bare metal, but in the case of databases where your load is spiky you do. It’s about provisioning resources as needed when needed and not having to pay as many infrastructure folks. Heck, before my manager who hired me and one other person came in, the company had no one onsite that had any formal AWS expertise. They completely relied on a managed service provider - who they pay much less than they would pay for one dedicated infrastructure guy.

I’m first and foremost a developer/lead/software architect (depending on which way the wind is blowing at any given point in my career), but yes I have managed infrastructure on prem as part of my job years ago, including replicated MySQL servers. There is absolutely no way that I could spin up and manage all of the resources I need for a project and develop as efficiently at a colo as I can with just a CloudFormation template with AWS.

I’ve worked at a company that rented stacks of servers that sat idle most of the time but that we used to simulate thousands of mobile connections to our backend servers - we did large B2B field services deployments. Today, it would be running a Python script that spun up an autoscaling group of VMs to whatever number we needed.


The question is how often is that necessary? Once again the point goes back to the article title. You are not Google. Unless your product is actually large, you probably don't need all of that and even if you do, you can probably just do part of it in the cloud for significantly cheaper and get close to the same result.

This obsession with making something completely bulletproof and scalable is the exact problem they are discussing. You probably don't need it in most cases but just want it. I am guilty of this as well and it is very difficult to avoid doing.


You think only Google needs to protect against data loss?

We have a process that reads a lot of data from the database on a periodic basis and sends it to ElasticSearch. We would either have to spend more and overprovision it to handle peak load or we can just turn on autoscaling for read replicas. Since the read replicas use the same storage as the reader/writer it’s much faster.

Yes we need “bulletproof” and scalability or our clients we have six and seven figure contracts with won’t be happy and will be up in arms.


> You think only Google needs to protect against data loss?

Just have a cron job taking backups on a server and sending them somewhere else?

It has been working for the last 40 years...


So a cron job can give me point in time recovery?

Can a cron job give me automatic failover to another server that is in sync with the master? Can it give me autoscaling read replicas?

Who is going to get up in the middle of the night when the cron job fails?

Yes I am sure my company that has six and seven figure contracts would be just as well served self hosting MySQL on Linode.


OTOH the last "cloud" solution I've seen was a database that allowed a team to schedule itself. It was backed on Google cloud, had autoscaling backend. It was "serverless" (app engine), and had backups etc configured.

Cost to run it for 4 years: $70000.

QPS ? 5 was the highest I ever found.

You don't need this capacity. You just don't.

But it's got to be a good business to be in ...


Instead of just assuming pricing based on Google pricing (which we weren’t talking about), you could always just find AWS pricing.

https://aws.amazon.com/rds/aurora/pricing/

Storage cost of 500 GB of data for four years - $2400

Backing up is free up to the amount of online storage.

I/O requests are $0.12 per million.

Transfer to/from VMs in the same availability zone is free.

An r5.large reserved instance is $6600 a year.

Of course if your needs are less, you could get much cheaper cpu and memory.
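
A rough sketch of how those line items add up over four years, using only the figures quoted in this comment (2019 numbers, not current pricing). The $0.10 per GB-month rate is what the $2400 storage figure implies for 500 GB over 48 months; the I/O volume is a made-up placeholder, and the function names are illustrative.

```python
MONTHS_PER_YEAR = 12


def storage_cost(gb, price_per_gb_month, years):
    """Storage is billed per GB per month."""
    return gb * price_per_gb_month * MONTHS_PER_YEAR * years


def io_cost(requests, price_per_million):
    """I/O is billed per million requests."""
    return requests / 1_000_000 * price_per_million


YEARS = 4
storage = storage_cost(500, 0.10, YEARS)   # 500 GB at the implied $0.10/GB-month
instance = 6600 * YEARS                    # r5.large reserved, as quoted above
io = io_cost(100_000_000 * YEARS, 0.12)    # hypothetical 100M requests per year

total = storage + instance + io
print(f"storage ${storage:,.0f} + instance ${instance:,.0f} + I/O ${io:,.0f}"
      f" = ${total:,.0f} over {YEARS} years")
```

With these inputs the instance dominates the bill, which is why a smaller instance class changes the total far more than storage or I/O does.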


The issue is, what would the charge be for a typical application for some basic business function, say scheduling attendance?

So QPS really low, <5. But very spread out, as it's used during the workday by both team leader and team members. Every query results in a database query or maybe even 2 or 3. So every, say 20 minutes or so there's 4-5 database queries. Backend database is a few gigs, growing slowly. Let's say a schedule is ~10kb of data, and that has to be transferred on almost all queries, because that's what people are working on. That makes ~800 gig transfer per year.

This would be equivalent to something you could easily store on say a linode or digital ocean 2 machine system, for $20 month or $240/year + having backup on your local machine. This would have the advantage that you can have 10 such apps on that same hardware with no extra cost.

And if you really want to cheap out, you could easily host this with a PHP hoster for $10/year.

So how do you calculate the AWS costs here ?
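
A back-of-the-envelope sketch for turning a query rate into yearly transfer, assuming decimal units (1 KB = 1,000 bytes) and a load averaged over the whole year. The function name and the 2.5 queries/second rate are assumptions, picked to see which rate lands near the ~800 GB mentioned above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def yearly_transfer_gb(queries_per_second, kb_per_query):
    """Yearly transfer in GB for a constant average query rate."""
    bytes_per_year = queries_per_second * kb_per_query * 1000 * SECONDS_PER_YEAR
    return bytes_per_year / 1e9


# ~10 KB per query, as in the comment above.
sustained = yearly_transfer_gb(2.5, 10)            # 2.5 queries/second, all year
bursty = yearly_transfer_gb(5 / (20 * 60), 10)     # 5 queries every 20 minutes

print(f"sustained: {sustained:.1f} GB/year, bursty: {bursty:.3f} GB/year")
```

Under these assumptions a sustained 2.5 queries per second comes to about 788 GB a year, while 5 queries every 20 minutes comes to roughly 1.3 GB, so the ~800 GB figure corresponds to the sustained rate rather than the bursty one.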


If you have a really low, spread out load, you could use Aurora Serverless (MySQL/Postgres) and spend even less if latency for the first request isn’t a big deal.

And storage for AWS/RDS is $0.10/GB per month and that includes redundant online storage and backup.

I’m sure I can tell my CTO we should ditch our AWS infrastructure that hosts our clients who each give us six figures each year to host on shared PHP provider and Linode....

And now we have to still manage local servers at a colo for backups....

We also wouldn’t have any trouble with any compliance using a shared PHP host.


You keep mentioning six-figure contracts as if it's something big. Most Fortune 500 companies have their own data center and still read database manuals somehow...


No I’m mentioning our six figure customers because too often small companies are seen as being low margin B2C customers where you control for cost and you don’t have to be reliable, compliant, redundant or scalable from day one. One “customer” can equate to thousands of users.

We are a B2B company with a sales team that ensures we charge more than enough to cover our infrastructure.

Yes infrastructure costs scale with volume, but with a much lower slope.

But it’s kind of the point: we save money by not having a dedicated infrastructure team, and save time and move faster because it’s not a month long process to provision resources, and we don’t request more than we need like in large corporations because the turnaround time is long.

At most I send a message to my manager if it is going to cost more than I feel comfortable with and create resources using either a Python script or CloudFormation.

How long do you think it would take for me to spin up a different combination of processing servers and database resources to see which one best meets our cost/performance tradeoff on prem?


Depends on how you architect it. Theoretically, spinning up on prem, on bare metal, or in vSphere should be an Ansible script with its roles and some Dockerfile regardless.

Just for reference, we "devops" around 2 thousand VMs and 120 bare metal servers and a little cloud stuff through the same scripts and workflows.

We don't really leverage locked in cloud things because we need the flexibility of on prems.

In my business hardware is essentially a drop in the bucket of our costs.

P.S.: I totally think there are legit use cases for cloud; it's just another tool you can leverage depending on the situation.


Spinning up on prem means you have to already have the servers in your colo ready and have to pay for spare capacity. Depending on your tolerance for latency (production vs non-production environments or asynchronous batch processing), you can operate at almost 100% capacity all of the time:

Lambda vs VMs (yes you can deploy standard web apps to Lambda using a lambda proxy)

Serverless Aurora vs Regular Aurora.

Tx instances vs regular instances

Autoscaling VMs.

Autoscaling DynamoDB

Etc.


Redundant online storage sucks, doesn't protect you against

  DELETE FROM orders WHERE $condition_from_website;
So that's just useless. You need offline backups. Is there something like, euhm, perhaps checkpoints, that you can use ?
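To make that concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in. The `orders` table, the in-memory "replica", and the unconditional DELETE are all assumptions for illustration; none of this is how Aurora or RDS work internally. It only shows that a copy which faithfully repeats every statement also repeats the mistake, while a point-in-time copy taken earlier does not.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, made-up orders table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    db.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",), ("c",)])
    db.commit()
    return db


def count_orders(db):
    """Return the number of rows currently in orders."""
    return db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


primary = make_db()
replica = make_db()  # stands in for redundant storage that mirrors every write

# Point-in-time copy taken before anything goes wrong.
snapshot = sqlite3.connect(":memory:")
primary.backup(snapshot)

# The bad statement is applied everywhere the writes are mirrored.
for db in (primary, replica):
    db.execute("DELETE FROM orders")
    db.commit()

print("primary:", count_orders(primary))
print("replica:", count_orders(replica))
print("snapshot:", count_orders(snapshot))
```

Only the snapshot still has the rows, which is why the question about checkpoints matters more than the number of redundant copies.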


Yes, you can schedule automatic snapshots where it will take snapshots on a schedule you choose. You get as much space for your backups as you have in your database for free. Anything above that costs more.

You also get point in time recovery with BackTrack.

https://aws.amazon.com/blogs/aws/amazon-aurora-backtrack-tur...

The redundant storage is what gives you the synchronous read replicas that all use the same storage and the capability of having autoscaling read replicas that are already in sync.


> automatic snapshots where it will take snapshots on a schedule you choose

So... A cronjob?


Does your cron job allow recovery with a minute granularity? Is your cron job also redundant? How long would it take you to restore from backup?


You say that as if filesystems with point-in-time and/or checkpoint recovery are anything hard.


So if it “isn’t hard” then show me an implementation that does that with MySQL that can meet the same recovery time objectives as AWS with Aurora?

Why do you have a specific need to read the database periodically polling it instead of just pushing the data to elastic search at the same time that it reaches the database? I don’t know anything about your architecture, but unless you’re handling data on a very big scale probably rationalising the architecture would give you much more performance and maintainability than putting everything in the cloud.


Without going into specifics. We have large external data feeds that are used and correlated with other customer specific data (multitenant business customers) and it also needs to be searchable. There are times we get new data relevant to customers that cause us to reindex.

We are just focusing on databases here. There is more to infrastructure than just databases. Should we also maintain our own load balancers, queueing/messaging systems, CDN, object store, OLAP database, CI/CD servers, patch management system, alerting monitoring system, web application firewall, ADFS servers, OATH servers, key/value store, key management server, ElasticSearch cluster etc? Except for the OLAP database, all of this is set up in some form in multiple isolated environments with different accounts in one Organizational Account that manages all of the other sub accounts.

What about our infrastructure overseas so our off shore developers don’t have the latency of connecting back to the US?

For some projects we even use lambda where we don’t maintain any web servers and get scalability from 0 to $a_lot - and no there is no lock-in boogeyman there either. I can deploy the same NodeJS/Express, C#/WebAPI, Python/Django code to both lambda and a regular old VM just by changing my deployment pipeline.


Did you read the article? You are not Google. If you ever do really need that kind of redundancy and scale you will have the team to support it, with all the benefits of doing it in-house. No uptime guarantee or web dashboard will ever substitute for simply having people who know what they're doing on staff.

How a company which seems entirely driven by vertical integration is able to convince other companies that outsourcing is the way to go is an absolute mystery to me.


No, we are not Google. We do need to be able to handle spiky loads - see the other reply. No we don’t “need a team” to support.

Yes “the people in the know” are at AWS. They handle failover, autoscaling, etc.

We also use Serverless Aurora/MySQL for non production environments with production like size of data. When we don’t need to access the database, we only pay for storage. When we do need it, it’s there.


I agree for production because you want to be able to blame someone when autoscaling fails but we trust developers to run applications locally, right? Then why can't we trust them with dev and staging?

By the way, what is autoscaling and why are we autoscaling databases? I'm guessing the only resource that autoscales is the bandwidth? Why can't we all get shared access to a fat pipe in production? I was under the impression that products like Google Cloud Spanner have this figured out. What exactly needs to autoscale? Isn't there just one database server in production?

In dev (which is the use case I'm talking about) you should be able to just reset the vm whenever you want, no?


The big innovation of Aurora is that they decoupled the compute from the storage [1]. Both compute and storage need to autoscale, but compute is the really important one and the hardest. Aurora Serverless scales the compute automatically by keeping warm pools of db capacity around [2]. This is great for spiky traffic without degraded performance.

1. https://www.allthingsdistributed.com/files/p1041-verbitski.p...

2. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...


No, it’s not just the bandwidth. It’s also the CPU for read and writes in the case of Serverless Aurora.

For regular autoscaling of read replicas, it brings additional servers online using the same storage array. It isn’t just “one database”.

That’s just it. There isn’t “just one database”.


Cloud spanner seems awesome but has a base price of $740/mo, doesn’t seem to autoscale, and doesn’t support a mysql or postgres interface.


I've had servers with identical configurations, identical setup, identical programs installed the exact same way (via docker images), set up the same day

One worked, one didn't ¯\_(ツ)_/¯

That's when I gave up on servers, virtual or not


It stands to reason that they weren't identical because if they were they would both have worked.

The thing separating you (in this case) and Amazon/Google/* is that they make sure their processes produce predictable outcomes that if they were to fail can be shutdown and the process can be redone until it yields the correct outcome.


Unless they were identical running in the same directory and listening to the same port. First one to start won.
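A quick way to see the "first one to start won" effect, as a minimal sketch: two identical listeners asking for the same TCP port on the same host. The helper name `try_bind` and the use of 127.0.0.1 are just for illustration.

```python
import socket
from typing import Optional


def try_bind(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Return a listening socket on (host, port), or None if the bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        # Typically "address already in use" when another listener got there first.
        sock.close()
        return None


first = try_bind(0)                # port 0 asks the OS for any free port
port = first.getsockname()[1]
second = try_bind(port)            # the identical twin asks for the same port

print("first started:", first is not None)
print("second started:", second is not None)
first.close()
```

Same image, same config, same day, and one of them still fails, purely because of what was already running next to it.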


That’s a whole different problem in a whole other place. It’s not either of them not working.


I spent this last week finally being forced to deal with Hibernate.

Devs are more than capable of pushing each other to the same conclusions.


I have _more_ maintenance, _because_ of AWS. Most libraries/services reach some sort of a stable version where they are backwards compatible. AWS (and other big providers) changes libraries all the time and things break, then you have to figure out what they did instead of having a stable interface to work with.


Curious about where you would put your DO data. A cheap VPS from DO doesn't come with much storage. You can either buy more storage for $$ or use DO Spaces for fewer $, but do Spaces talk to PostgreSQL? I apologize for my ignorance; I'm just beginning to explore this stuff.



Block storage is $.10/GB; Spaces is $.02/GB. I was hoping there was a glue layer that would allow PostgreSQL to use Spaces, but such a thing might not exist or be performant enough to be worth building.


Spaces is for object storage, stuff like videos, images, documents, etc. You can store references (links) to these objects in postgres if you want (very common thing to do).

Block storage is for adding volumes to your droplets (like adding a second hard drive to a computer). One advantage is that they are separated from your droplet and replicated, if your droplet disappears or dies, you can reattach the volume to another droplet. So one common use is to place the data dir of PG for example, in the external volume. You should still have proper backups. DO also has a managed PG service.


No need to apologize. I have very little data so the 50GB droplet is enough. But in my use case, the data is expendable. The data has no value in a development environment. You probably shouldn't do in production what I'm doing.


I have some hobby projects that are very infrequently accessed by me and my friends and family. Relative to their usefulness, $10/mo would be way too expensive. With S3 and lambdas, my AWS bill has been more like $1 to $3, depending on traffic and number of active side projects.

Edit: those don't include a proper DB, though, I have been basically using S3 as the DB. Not sure how much an RDS instance or something would add to the bill.


Why would you need "a whole VM"?

Just run it as a systemd service or a Docker/Kubernetes-managed container on one of the VMs that run the rest of the application.

Also no need to use all those AWS-proprietary services, just use an existing task queue system based on PostgreSQL/RabbitMQ/etc.


You have a database that only needs to be accessible part of the time. Why not keep going to an embedded database?


If you are hunting vampires, there is the choice of the Silver Bullet available.

Sometimes the benefits of one thing compared to another are so great that there are no downsides. If the rasp in the closet doesn't suffice you can always add another one.


The costs associated with "maintaining" usually involve the possibility of a 3am call for whoever is in charge of maintaining. Orchestrating can be done ahead of time, during your 9-5, and that's super valuable. It's still a lot of work, but it's work that can be done on my time, at my pace.


Managed services still have plenty of unexplained goings-on, and 3AM pages.


We moved some stuff from AWS back to on prems because it broke less often and in more obvious ways.


Said like a person who hasn't actually ever converted. It's nothing like that at all.

12 outages since 2011, and none of them are anything like what you're describing: https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...


We've moved from on-prem to AWS fully and we see random issues all the time while their status page shows all green, so I feel you probably have a small amount of resources in use with them or something, because what you're saying doesn't jive with what we see daily. I see you've also copy-pasted your response to other comments too.


Can you actually quantify any of this or are you asking me to trust you? What I gave is an objective standard, what you've given so far is "trust me I'm right".


Are you just ignoring that the cloud isn't the savior for everyone because it furthers your own agenda, either publicly or personally? We spend the GDP of some nations with AWS yearly, so I guarantee you're not at our scale to see these sorts of issues that most definitely are not caused by us and are indeed issues with AWS as confirmed by our account rep.


There are nations that have very tiny GDPs (and AWS is very expensive) so that's not saying much, and I didn't say AWS was flawless, I said that AWS was a hell of a lot better than anything you could come up with unless you're literally at Amazon scale (and we both know you aren't), and nearly none of your software's problems are AWS's fault.

Using AWS has been a legitimate excuse for an outage or service delivery issue a dozen times since 2011. The end. Write your apps to be more resilient if you've had more than that many outages at those times (and honestly even those outages weren't complete).


> AWS was a hell of a lot better than anything you could come up with unless you're literally at Amazon scale

AWS is a hell of a lot better than most people here can come up with for solving AWS specific set of problems and priorities. But those problems and priorities generally do not match the problem space of the people using it exactly.

For example, AWS has a lot of complexity for dealing with services and deployment because it needs to be versatile enough to meet wildly different scenarios. In designing a solution for one company and their needs over the next few years, you wouldn't attempt to match what AWS does, because the majority of companies use only a small fraction of the available services, so that wouldn't be a good use of time and resources.

AWS is good enough for very many companies so they appeal to large chunks of the market, but let's not act like that makes them a perfect or "best" solution. They're just a very scalable and versatile solution so it's unlikely they can't handle your problem. You'll often pay more for that scalability and versatility though, and if you knew perfectly (or within good bounds) what to expect when designing a system, it's really not that hard to beat AWS on a cost and growth level with a well designed plan.

Edit: Whoops, s/dell/well/, but it probably works either way... ;)


I'm starting to think this person refuses to be logical, not directed at you kbenson. PlayStation Network ran fine for years in our own datacenters, we decided to move to AWS instead of dealing with acquiring and maintaining our own hardware. Trust me, Sony isn't lacking some very bright people, we just don't want to deal with on-prem anymore, so yeah we'll throw ridiculous money at them for that pleasure.

diminoten -- relax, take a breath. Your rude and condescending tone is unnecessary. We don't see eye-to-eye, but I'm not discounting your experience. I can't say I'm getting the same vibe from you.


[flagged]


You do realize we can scroll up half a page and see you were the first person to levy a personal attack, right?


I think the only thing you wrote here I would point out is that you seem to vastly underestimate the operational cost of running your own in-house infrastructure. You're comparing AWS costs to hardware costs, but that's not what AWS gives you, it lets you restructure how you task your entire SA team. You still need them, but they can now work on very different issues, and you don't need to hire new people to work on the things they are now freed up to work on.

And again, I cannot over emphasize how much more control over your own time AWS gives you. It's night/day, and discounting that is a mistake.


Without breaking my employer confidentiality agreement? No. But you basing off reported outages is you trusting Amazon in the same way you don't trust me, that is, on their word.


I'd rather trust a corporation providing some information over an individual providing absolutely nothing, especially when that corporation's information matches with my own internal information.

The reality is, if you're having problems with AWS, it's you, and not AWS, for 99.9999999% of your problems. Continuing to pretend it's AWS is a face-saving, ego protecting activity that no rational person plays part in.


Which is funny because I've had an instance go down. I don't have ridiculously high volume, or distributed:

-One instance -No changes to the config -I'm the only person with access -No major outages at the time

It went down, and the AWS monitor didn't actually see it go down until 45 minutes later, which it needs to be down for a certain amount of time before their techs will take a look.

It was my first time using AWS, and I didn't want to risk waiting for tech so I rebooted the instance and it started back up again. I have no idea why, but it failed without reason, and on reboot worked like it always did.

My point is that AWS has been solid, but they are like anything else, there are tradeoffs in using their service, and they aren't perfect.


I don't know the last time I had an instance go down. Not because it doesn't happen, but because it's sufficiently unimportant that we don't alert on it. Our ASG just brings up another to replace it.

Many applications won't be as resilient. That's the trade-off. We don't have a single stateful application. RDS/Redis/Dynamo/SQS are all managed by someone else. We had to build things differently to accommodate that constraint, but as a result we have ~35M active users and only 2 ops engineers, who spend their time building automation rather than babysitting systems.

If you lean in completely, it's a great ecosystem. Otherwise it's a terrible co-lo experience.


Funny enough, that exact scenario is covered in the certification exams too, and the correct answer is to do what you did. An ASG will fix it too, like another poster said.


Yeah, you just demonstrated why AWS is keen on you having backups to your services on their platform. You failed to do that (follow their guidance) and suffered an outage because of it. How exactly is that AWS's fault?

MY point is that AWS is very solid, and while there are plenty of trade offs, to be sure, the tradeoff is "operational" vs. "orchestration", and operational doesn't let you decide when to work on it whereas orchestration does.


While I want to avoid getting into this argument, what you are saying is the same as "well it works on my machine" and "there can't be anything wrong with Oracle Database because Oracle says there are no bugs."


No, what I'm saying is, "None of your problems are consistent across use cases, therefore they're your problems not the system being used."

I haven't actually said anything about my own experience, so it's funny you claim I have...


"I'd rather trust a corporation providing some information over an individual providing absolutely nothing, especially when that corporation's information matches with my own internal information."

Ah, I had assumed that your internal information was from experience, either your own or your organisation's. Since that is not the case I am curious where your "internal information" comes from considering you completely disregarded @bsagdiyev's personal experience.


What part of what I said makes you think my internal information isn't from experience? Just to be clear; it is. I haven't described my specific experience, so claiming I said "it works perfectly for me" is not something I've claimed.

What you tried to say was I was claiming that since it worked for me it was therefore good. That isn't the case. I'm saying because it worked for me AND EVERYBODY ELSE, it's therefore good.


It's bad enough that a Chrome extension exists to tease out real info from the lies that is the AWS status page:

https://chrome.google.com/webstore/detail/real-aws-status/ka...


How about network partitions across availability zones? Happens all the time for us, so much in fact that we had to build a tool to test connectivity across AZs just to correlate outages and find the smoking gun.
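For anyone wondering what such a tool looks like at its simplest, here is a hedged sketch (not the commenter's actual tool): a TCP connect probe with a timeout, which you would run on a schedule from a host in each AZ against peers in the others and log with timestamps. The function name `probe` and the local listener used in the demo are assumptions for illustration; real peers would be hosts in other AZs.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Attempt a TCP connect; return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        reachable = False
    return reachable, time.monotonic() - start


# Demo: a local listener stands in for a peer in another AZ.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
peer_host, peer_port = listener.getsockname()

up_ok, up_latency = probe(peer_host, peer_port)
listener.close()  # simulate the peer becoming unreachable
down_ok, down_latency = probe(peer_host, peer_port)

print("peer up:", up_ok)
print("peer down:", down_ok)
```

The useful part is not the probe itself but the timestamped history, which is what lets you line up your own errors against a status page that shows all green.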


AWS status page is notorious for not indicating there is an issue either at all, or until well after the event began.


When your Redshift instance locks up, it doesn't end up on Wikipedia.


"For anyone who's not used these "managed" services before, I want to add that it's still a fuck ton of work. The work shifts from "keeping X server running" to "how do I begin to configure and tune this service"."

I have noticed that too. With some managed services you are trading a set of generally understood problems with a lot of quirky behavior of the service that's very hard to debug.


Yes, but I think very broadly speaking the quirky behavior is stuff you bump into, learn about, fix, and then can walk away from.

The daily/monthly maintenance cycle on a self hosted SQL server is “generally understood” but you still have to wake up, check your security patches, and monitor your redeployments.

You can do some of that in an automated fashion with public security updates for your containers and such. But if monitoring detects an anomaly, it’s YOU, not Heroku who gets paged.

It’s a little like owning a house vs renting. Yes if you rent you have to work around the existing building, and getting small things fixed is a process. But if the pipes explode, you make a phone call and it’s someone else’s problem. You didn’t eliminate your workload, but you shrunk the domain you need to personally be on call for.


The problem is that if I run my own servers I can fix problems (maybe with a lot of effort but at least it can be done) but with managed services I may not be able to do so. There is a lot of value in managed services but you have to be careful not to allow them to eat up your project with their bugs/quirks.


So what “problems” were you unable to fix with AWS?


Exactly this


The point is with a managed service, none of your problems will be with the service. That's what the managed service is selling.


I just finished a 2+ week support ticket w/ AWS. We were unable to connect over TLS to several of our instances, because the instance's hostname was not listed on the certificate. This is a niche bug that's trivially fixable if you own the service, but with AWS, it's a lot harder: you're going to need a technical rep who understands x509 — and nobody understands x509.

I've found & reported a bug in RDS whereby spatial indexes just didn't work; merely hinting the server to not use the spatial index would return results, but hinting it to use the spatial index would get nothing. (Spatial indexes were, admittedly, brand new at the time.)

I've had bugs w/ S3: technically the service is up, but trivial GETs from the bucket take 140 seconds to complete, rendering our service effectively down.

I've found & worked w/ AWS to fix a bug in ELB's HTTP handling.

All of these were problems with the service, since in each case it's failing to correctly implement some well-understood protocol. AWS is not perfect. (Still, it is worth it, IMO. But the parent is right: you are trading one set of issues for another, and it's worth knowing that and thinking about it and what is right for you.)


Okay, I'm sorry you thought I said AWS was perfect and bug free. I didn't, however, say that. I said (implied, really) it's better than anything you could possibly home brew. Nothing you've said here changes that.

Further, didn't I say that it's trading one set of issues for another? Or at least, I explicitly agreed with that.

I feel like you didn't read what I wrote honestly, and kind of came in with your own agenda. All I ever said was that the issues you trade off are orchestration issues vs. operational issues, and operational issues are 10x harder than orchestration issues because you don't get to decide when to work on operational issues, you tend to have to deal with them when they happen.


You wrote “The point is with a managed service, none of your problems will be with the service.”

What deathanatos wrote sounds awfully like problems with the service to me.

I don’t think S3 taking 100+ seconds to respond to a GET request can be solved by orchestration alone.


It definitely can. Reasonable timeouts and redundant systems.
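As a rough sketch of what "reasonable timeouts and redundant systems" means on the client side: give each source a deadline and fall through to the next one when it is missed. Everything here is hypothetical (the function names, the 0.1 second deadline, the sleeping "primary" that stands in for a slow GET), and it only helps if a second copy of the data actually exists somewhere.

```python
import concurrent.futures as cf
import time


def fetch_with_fallback(sources, timeout_s):
    """Try each (name, callable) in order; skip any that is slow or raises."""
    pool = cf.ThreadPoolExecutor(max_workers=len(sources))
    try:
        for name, fn in sources:
            future = pool.submit(fn)
            try:
                return name, future.result(timeout=timeout_s)
            except cf.TimeoutError:
                continue  # too slow, move on to the next source
            except Exception:
                continue  # the source itself failed, move on as well
        raise RuntimeError("all sources failed or timed out")
    finally:
        # Do not block on abandoned calls; they finish in the background.
        pool.shutdown(wait=False)


def slow_primary():
    time.sleep(0.5)  # stands in for a GET that takes far too long
    return "from primary"


def fast_replica():
    return "from replica"


winner, payload = fetch_with_fallback(
    [("primary", slow_primary), ("replica", fast_replica)],
    timeout_s=0.1,
)
print(winner, payload)
```

Note the abandoned call keeps running until it returns on its own, so this hides the latency from the caller rather than fixing whatever made the primary slow.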


It's amazing the length some people are willing to go to to defend AWS marketing slogans as a source of truth. I've seen vendor lock-in before, but AWS seems to be unique in that people actually enjoy working with a vendor whose services go down randomly to the point where they blame themselves for not being "fault-tolerant".

Guess what, if your service is not required to be up because the consuming service is super tolerant to it timing out after 140 seconds, self-hosting it becomes even more of a no-brainer. After all, you clearly need none of the redundancy AWS features.


If it makes you feel better, everything I'm saying about AWS can be said about GCP as well.

Sorry, but AWS/GCP is infinitely better at managing infrastructure than you or your company will ever be.


That's the promise but in reality every software has bugs, including managed services.


Not really, not anything like what you're describing.

12 outages since 2011, and none of them are anything like what you're describing: https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...


We've moved from on-prem to AWS fully and we see random issues all the time while their status page shows all green, so I feel you probably have a small amount of resources in use with them or something, because what you're saying doesn't jive with what we see daily. I see you've also copy-pasted your response to other comments too, so I'll do the same with my response.


I don't feel like copy/pasting all of our comments to each other, so I'd appreciate it if you didn't do that, thanks.


Then don't do it yourself. You're dead set on ignoring people whose experience is different than yours, wrapping yourself in an echo chamber of sorts and telling others they are wrong.


I'm not dead set on anything, I'm trying to have conversations with multiple people, not create an immutable record.

And I don't think you know what an echo chamber is if you think one person can create one alone...


This is not about outages. There are many more things that can go wrong besides outages.


[flagged]


How long have you been working in tech? Just curious.

You sound like someone who hasn't had much real world experience and thinks AWS or whatever is the best thing because it's the only thing you know.


You may want to ask the OP how much time she/he has been working at Amazon instead.


Long enough to know that some dinosaurs refuse to learn anything new (read: AWS) and will bend over backwards to try and keep themselves relevant.


I guess it can't be proved that this guy is a shill for AWS.

But this kind of toxic fanaticism (yet trying to sound logical) is just harmful for the HN community.

dang: Can this kind of behavior be punished?


Bro, what're you so upset about in this thread? That people had different experiences than you with AWS..?


I'm not upset, I'm simply pointing out that AWS isn't the problem in any of these examples, it's the various commenters' lack of understanding about how to work in AWS that's caused these problems.

I don't think anyone is actually upset, do you? I certainly hope I haven't upset anyone... :/


When your hammer snaps in half, you don't blame yourself for not using two hammers.

When the tool breaks under correct use, criticize the tool. Maybe the user should also have redundancy. The tool is still failing!


This analogy is what snapped in half, not the hammer. It's more like if your hammer says right on it, "YOU NEED A SECOND HAMMER" and this is true of all hammers, it's still not the hammer's fault you didn't bring a second hammer.


And you're in other threads complaining that the people that had five hammers were still doing it wrong, that all the outages they report are fake somehow...

Even when you're supposed to have redundancy, there are still certain failure rates that are acceptable and some that are not. And redundancy doesn't solve every problem either.


What? No I'm not. Literally nowhere has anyone said they've built a system with redundancies as recommended by AWS and still had problems.

Of course there are unacceptable failure rates. AWS doesn't have them, and pretending like they do is simply lying to yourself to protect your own ego.


> The point is with a managed service, none of your problems will be with the service. That's what the managed service is selling.

Until the managed service simply goes away, of course, taking your data with it.


It's someone else's problem unless it prevents you from living here, in which case it's still your problem too. I think the analogy works quite well :)


So I've worked with AWS and with our internal clusters as a dev. My experience has been that I have to make work-arounds for both, but at least with AWS, I don't have to spell out commands explicitly to the junior PEs.

EDIT: I should be clear, our PEs are generally pretty good, but because their product isn't seen by upper management as the thing which makes money, they're perpetually understaffed.


Also Amazon documents their stuff in a nice public website, internal teams documented the n-2 iteration of the system and have change notes hidden in a Google drive somewhere that if you ask the right person on the other side of the world they might be able to share you a link to.


This. So. Much.

I can't explain just how much developing on GCP has helped me simply by having such amazing documentation. I don't think I appreciated how little I knew: every company where we worked with on premise/ internal services, we would have to use custom services built by others. With GCP, you have complete freedom, not just to design your application architecture from scratch, but to understand how others (coworkers mostly) have designed _their_ applications too! And as a company, it allows the sharing of a common set of best practices, automatically, since it's "recommended by Google".

It's kinda like Google/Amazon are now the System/Operations engineers for our company. Which they're good at. And it's awesome.


You were able to find documentation? Where do you work?


But you’re talking about a reduction in the number of types of specialized people to the number of specializations per type of person. That makes this more scalable.


The general fact of reality is that if you are building anything technical, then knowing and managing the details, whatever the details are, will get you a lot more bang for your buck. Reality isn't just a garden variety one-size fits all kind of thing, so creating something usually isn't either. If you just want a blog like everyone else's, then that comes packaged, but if you want something special, you will always have to put in the expertise.


> will get you a lot more bang for your buck

And it _really_ is a lot. A company I work with switched from running two servers with failover to AWS, bills went from ~€120/m to ~€2.2k/m for a similar work load. Granted, nobody has to manage those servers any more, but if that price tag continues to rise that way, it's going to be much cheaper to have somebody dedicated to manage those servers vs use AWS.
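Back-of-the-envelope version of that trade-off, as a sketch. The two bills are the figures above; the ops costs passed in at the bottom are made-up placeholders, since I'm not stating what a part-time admin actually costs.

```python
SELF_HOSTED_EUR_PER_MONTH = 120   # two servers with failover (figure above)
AWS_EUR_PER_MONTH = 2200          # similar workload on AWS (figure above)


def monthly_premium(cloud, self_hosted):
    """Extra spend per month after the move, ignoring people costs."""
    return cloud - self_hosted


def self_hosting_is_cheaper(cloud, self_hosted, ops_cost):
    """True if servers plus whoever looks after them cost less than the cloud bill."""
    return self_hosted + ops_cost < cloud


premium = monthly_premium(AWS_EUR_PER_MONTH, SELF_HOSTED_EUR_PER_MONTH)
print("premium per month:", premium)

# Hypothetical ops budgets, only to show where the break-even sits.
for ops_cost in (1000, 3000):
    cheaper = self_hosting_is_cheaper(
        AWS_EUR_PER_MONTH, SELF_HOSTED_EUR_PER_MONTH, ops_cost
    )
    print("ops cost", ops_cost, "-> self-hosting cheaper:", cheaper)
```

So the premium is the budget you have for server management before AWS wins on price alone, and it grows with every increase in the bill.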

Also, maybe that's just me, but I prefer to have the knowledge in my team. If everything runs on AWS, I'm at the mercy of Amazon.


If the bill went from ~€120/m to ~€2.2k/m and it was a surprise it is really a lack of proper planning. AWS pricing calculator is there for a reason...

However, often teams will use or attempt to use _a_lot_ more resources than what's needed, or they simply don't optimise for cost when using AWS.

My anecdote is that a company I worked for had an in-house app that ran jobs on 8 m4.16xlarge instances costing around $20k/m and they were complaining it took hours to run said jobs. The actual tasks those servers were running were easily parallelizable and servers would run at capacity only for around 10% of the time as those jobs were submitted by users every few-up-to-24 hours. The app was basically a lift-and-shift from on-prem to AWS. The worst way one can use cloud. I created a demo for them where the exact jobs they were running on those m4.16xlarge would run on lambda using their existing code modified slightly. The time a job took to run went from many hours to a few minutes with around 2k lambda functions running at the same time. The projected cost went from $20k/m to $1-5k/m depending on workload. I was quite happy with the result; unfortunately they ended up not using lambda and migrating off the app that in its entirety cost around $50k/m for the infrastructure. The point I'm trying to make here is that properly used cloud can save you a lot of money if you have a spiky workload.
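Roughly, the arithmetic behind that, as a sketch using only the numbers above ($20k/m for the always-on fleet, busy about 10% of the time, $1-5k/m projected on lambda). The function names are mine and the model is deliberately crude: it treats every idle hour as wasted spend.

```python
def idle_spend(monthly_cost, utilization_pct):
    """Portion of an always-on bill that pays for machines doing nothing."""
    return monthly_cost * (100 - utilization_pct) / 100


def monthly_savings(always_on_cost, pay_per_use_cost):
    """Difference between the fixed fleet and a pay-per-use projection."""
    return always_on_cost - pay_per_use_cost


fleet_cost = 20_000        # 8 x m4.16xlarge, per month
utilization_pct = 10       # servers at capacity about 10% of the time

wasted = idle_spend(fleet_cost, utilization_pct)
best_case = monthly_savings(fleet_cost, 1_000)
worst_case = monthly_savings(fleet_cost, 5_000)

print("idle spend per month:", wasted)
print("savings range per month:", worst_case, "to", best_case)
```

The flatter and more constant your load, the smaller the idle share gets and the weaker this argument becomes.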

Also, for internal apps that are used sparingly one can't beat a mix of client side JS code served from S3, using AWS Cognito tied to AD federation for auth with DynamoDB for data storage and lambda for typical server-side stuff. Such apps are easy to write, cost almost nothing if they are not used a lot and don't come with another server to manage. The only downside is that instead of managing servers, now you have to manage changes in AWS's api..., but nothing's perfect.


Proponents of the cloud love to ignore that for 'small' places, it's often perfectly fine to have a 'part-time +emergencies' ops/sysadmin person/team to manage infra.

Yes, some places will need a full time team of multiple people, but a lot of places don't, and can get a tailored solution to suit their needs perfectly, rather than just trying to fit into however it seems to work best using the soup of rube-goldberg like interdependent AWS services.


Absolutely, and unless you're riding a roller coaster, you can always grow your team with demand. When your operation has grown enough, you can shift from consultants to employees to save money and get the knowledge into the company, you don't have to start with a team of three to manage one server.


Exactly - and good consultants/contractors will still provide you with the in-house docs/guides/configuration management that will help you either transition existing, or hire new staff when the time comes.


Holy cow, that's nuts. A pair of perfectly good webservers (m5.xlarge) comes in at ~€250/m. And cheaper if you get reserved instances. ~€2.2k/m for a pair of instances and maybe a load balancer is incredible!


If you can use servers _without_disks_ and without bandwidth, then yeah that's the price. Add some EBS, especially provisioned IOPS.

Not even getting into the fact that an m5.xlarge has roughly as much muscle as an old laptop.


Don't use PIOPS, it's a scam.

Get normal disks of a larger size, it comes with 3 IOPS per GB.
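
For sizing, a minimal sketch of the "3 IOPS per GB" rule for general purpose (gp2) volumes. The 100 IOPS floor and the 16,000 IOPS ceiling are my assumptions about the documented gp2 limits and may differ by date or region, so check the current EBS docs before relying on them.

```python
GP2_IOPS_PER_GIB = 3      # the rule quoted above
GP2_MIN_IOPS = 100        # assumed floor for small volumes
GP2_MAX_IOPS = 16_000     # assumed ceiling; verify against current docs


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size, under the assumptions above."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


for size in (20, 1000, 10_000):
    print(size, "GiB ->", gp2_baseline_iops(size), "IOPS")
```

So a 1,000 GiB volume gets a 3,000 IOPS baseline without paying for provisioned IOPS, which is the trade being suggested.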


It is, especially when they are running a lot of base load, so it's not about firing up 20 nodes for an hour and then idling along on one node for the rest of the week.

"We'd rather not manage the database ourselves" was the answer from the lead dev, and I do understand that: it's not the developers' job to manage servers. In this company, management has removed themselves from anything technical and the devs don't like managing servers, so they say "let's use AWS, we'll write some code once to create our environment and Amazon takes care of the rest" - and by this point they are committed, with months having been spent on getting things running on AWS, changing the code where necessary etc, so reversing it would be a hard thing to sell to investors and the team.


What really bugs me about this is that it reminds me of the dotcom days. Microsoft ASP stacks made it easier to find less experienced developers to quickly develop sites, with no one having any awareness of optimisation; instead they would throw hardware at it. Large clusters of DEC Alpha servers would handle the same amount of traffic as a single FreeBSD install on a PC, but this cost difference wasn't a problem when investor cash flowed easily, and shifting Microsoft and Cisco products fed into the revenues of the system integrators, which was fine since revenue growth mattered more than profits.

I've seen this with many AWS deployments, which on pure hardware cost are 3-5 times more expensive, but because AWS makes it 'easy' to scale instead of optimise, they end up costing 10-20 times more. When the investor cash starts drying up and the focus is going to be on competing for profits in a market that is much smaller in general, many organisations are going to find themselves locked into AWS and for them it's going to feel like IE6 and asp.net all over again.


Same goes for J2EE, Flash, and a bunch of other technologies. Your focus on ASP.NET seems rather biased.


I don't think you could describe J2EE as "made it easier to find less experienced developers to quickly develop sites". If anything, it's diametrically opposed to that goal.

And back in dotcom days, it was ASP, not ASP.NET. Which is to say, much like PHP, except with VBScript as a language, and COM as a library API/ABI.


Bang on. Realistically the value of Amazon's managed side is in the early stages. At later stages with people, it's significantly lower cost to tune real resources, and you get added performance benefits.

We make a decent business out of doing just this, at scale for clients today.


Agree. AWS and the likes is an awesome tool to get access to a lot of compute power quickly which is great for unexpectedly large workloads or experimenting during early stages. For established businesses/processes the cost of running on premises is often significantly lower.

We manage about 150T of data for a company on a relatively inexpensive RAID array + one replica + offline backup. It is accessible on our 10Gbps intranet for the processing cluster, for users to pull to their workstations, etc. The whole thing has been running on stock HP servers for many years and never had outages beyond disks (which are in RAID) predicting failure and maybe one redundant PSU failure. We did the math of replicating what we have with Amazon or Google and it would cost a fortune.


Would love to hear more about that business — do you help people go from cloud back to on-prem?


We're all about mixed mode. Let's never pretend that we ought to focus on a specific one and only business. Clients like Amazon, clients like Azure - clients then get forced by (say the BC Health Authority) to run on physical machines in a British Columbia hosted datacenter.

We help make that happen, and help folks manage the lifecycle associated with it.


Companies just really suck at managing hardware or resources. The bigger and the more consultants they get, the more terrible they get, that's what you come after.

Chances are you will find tens of instances without any naming or tagging, of the more expensive types, created by previous dudes who left long ago. Thanks to AWS tooling it's easy to identify them, see if they use any resources and either delete them or scale them down dramatically.


It's not a size thing - it's all companies, and it's about significantly more than just 'finding instances'. The 'finding instances' side of AWS is something we just do for free (AWS, Azure, etc) for our existing customers.

Our goal is to provide actual value and do real DevOps work, regardless of whether you're AWS, Azure, GCP, etc. This includes physical, mixed-mode, cloud bursting, and changing constantly. <----- That's what keeps it interesting.


If that were universally true, then why did Netflix go all in on AWS?


hype maybe? Got a backroom deal for the exposure? Oh, and you probably aren't netflix either. But they have a pretty severe case of vendor lock in now, will be interesting to see how it plays out. As of 2018 they spend 40 million on cloud services, 23 million of that on aws.

Do you think you are gonna get AWS's attention with that $10 s3 instance when something goes wrong?!? You will have negative leverage after signing up for vendor lock in.

I'll take linux servers any day, thanks.


So you think Netflix was suckered into using AWS and if they had just listened to random people on HN they would have made different choices?

I’m sure with all of your developers using the repository pattern to “abstract their database access”, you can change your database at the drop of a dime.

Companies rarely change infrastructure wholesale no matter how many levels of abstraction you put in front of your resources.

While at the same time you’re spending money maintaining hardware instead of focusing on your business’s competitive advantage.


so, you might be dealing with some half-truths here.

Netflix does NOT use aws for their meat and potatoes streaming, just the housekeeping. They use OWS for the heavy lifting.

https://www.networkworld.com/article/3037428/netflix-is-not-...

But re: maintaining hardware, I only maintain my dev box these days. We are fine with hosted linux services, but backup/monitoring/updating is just too trivial, and the hosting so affordable (and predictable) w/linux it would have to be a hype/marketing/nepotic decision to switch to aws in our case. The internet is still built on a backbone and run by linux, any programmer would be foolhardy to ignore that bit of reality for very long.


So what do you think AWS services are running on if not Linux?

Netflix is by far AWS's largest customer.

From the horse's mouth on why they decided to move to AWS:

https://www.se-radio.net/2014/12/episode-216-adrian-cockcrof...

You could also go to YouTube and watch any of the dozens of talks Netflix has done at re:Invent.

Of course Netflix caches its videos in the ISPs' data centers. But caching is not the “heavy lifting”.


Please do not spread falsehoods.


If your "managed services" are a ton of work, then they're not really managed.

I built a system selling and fulfilling 15k tshirts/day on Google App Engine using (what is now called) the Cloud Datastore. The programming limitations are annoying (it's useless for analytics) but it was rock solid reliable, autoscaled, and completely automated. Nobody wore a pager, and sometimes the entire team would go camping. You simply cannot get that kind of stress-free life out of a traditional RDBMS stack.


If anything I had much more trouble with the datastore than with any other DB ever. We are migrating away and the day it's over will be the biggest professional relief I've experienced. I guess you could consider that you don't need to manage it, but the trade off is that you have to go around all the limitations through nasty software hacks instead of just a simple configuration.


I also am curious to know what sorts of problems you've had.

I have noticed that people tend to get into trouble when they force the datastore to be something it's not. For example, I mentioned that it's terrible for analytics - there are no aggregation queries.

In the case of the tshirt retailer, I replicated a portion of our data into a database that was good for analytics (postgres, actually). We could afford to lose the analytics dashboard for a few hours due to a poorly tested migration (and did, a few times), but the sales flow never went down.

The datastore is not appropriate for every project (which echoes the original article) but it's a good tool when used appropriately.


It definitely sounds like they shouldn't have used datastore in the first place if it's giving them that much trouble. A common pattern we use is to replicate the data in bigquery, either by stream or batch job that flows over pubsub - perfectly scalable and resilient to failure.


Regarding Google Cloud Datastore, what kind of limitations have you met?


Yeah, I think it's a trade off. Certain services can be a no-brainer, but others will cause pain if your particular use case doesn't align precisely with the service's strengths and limitations.

DynamoDB vs RDS is a perfect example. Most of that boils down to the typical document store and lack of transactions challenges. God forbid you start with DynamoDB and then discover you really need a transactional unit of work, or you got your indexes wrong the first time around. If you didn't need the benefits of DynamoDB in the first place, you will be wishing you just went with a traditional RDBMS on RDS to start with.

Lambda can be another mixed bag. It can be a PITA to troubleshoot, and there are a lot of gotchas like execution time limits, storage limits, cold start latency, and so on. But once you've invested all the time getting everything set up, wired up...

In for a penny, in for a pound.



When you buy a service from a big company, and it doesn't work, you get to debug the service.


Which is exactly why everything runs Linux instead of Windows.


I guess that's his point: you are not doing less work, all you did was move your debugging from one place to another.


And let's not forget! Got a support contract so we can all BLAME someone not in the room and feel good.


>You will run into performance issues, config gotchas, voodoo tuning, and maintenance concerns with any of AWS's managed databases or k8s.

The default config for a LAMP stack will easily handle 100 requests per second. 10 if your app isn't optimized.

Run apt upgrade once a month and enable automatic security updates on ubuntu.

That is neither hard nor "voodoo".

I've used managed services and I don't see the point until you hit massive scale, at which point you can afford to hire your engineers to do it.


That doesn't sound like a shift of work. It sounds like work I already would have done - performance tuning doesn't go away by bringing things in house.

Now that person I pay to operate the service can focus on tuning, not backups and other work that's been automated away.

Sounds like a massive win to me.


It does make performance tuning harder since you likely don't have access to the codebase of the managed service, requiring more trial-and-error or just asking someone on the support team ($$)


Then you pay $O/month for AWS Enterprise Support (who are actually quite good and helpful) to help augment your $M/month employees and $N/month direct spend.


Support - in the long run - is pretty cheap (I think I pay around 300) and 100 percent worth the tradeoff. The web chat offers the fastest resolution times in my experience once you are out of the queue


To be fair, this is only a valid shift for folks moving. If you are creating something, you have both "how do I configure" and "how do I keep it moving?"

That is, the shift to managed services does largely remove a large portion, and just changes another.


Yes, but it's still pretty much a straight cost offset. If you hold your own metal, you have to do all of that and still administer the database. Sure, there could be a little overlap in storage design, but most of the managed systems have typical operational concerns at a button click: backup, restore, HA... Unless your fleet is huge and your workload is special, you're going to win with managed services.


> If you hold your own metal

That's going too much to the other extreme, ec2, droplets, etc.. are fine.


define huge?


Bigger than Netflix, assuming that you don't know something that Netflix doesn't.


And a lot of the time keeping X running is simpler than configuring and tuning this service.


He specifically mentioned Dynamo. There is nothing to configure except indexes, and read and write capacity units.
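
For anyone who hasn't sized those before, a minimal sketch of the capacity-unit arithmetic as I understand it from the DynamoDB docs: one read unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write unit covers one write per second of an item up to 1 KB. The function names are mine, not an AWS API.

```python
import math


def read_capacity_units(item_kb, reads_per_sec, strongly_consistent=True):
    """Provisioned read units needed for a steady read rate."""
    units_per_read = math.ceil(item_kb / 4)      # reads are billed in 4 KB steps
    rcu = units_per_read * reads_per_sec
    # Eventually consistent reads cost half as much.
    return rcu if strongly_consistent else math.ceil(rcu / 2)


def write_capacity_units(item_kb, writes_per_sec):
    """Provisioned write units needed for a steady write rate."""
    return math.ceil(item_kb) * writes_per_sec   # writes are billed in 1 KB steps


print(read_capacity_units(6, 100))           # 6 KB items, 100 reads/s
print(read_capacity_units(6, 100, False))    # same, eventually consistent
print(write_capacity_units(2.5, 10))         # 2.5 KB items, 10 writes/s
```

Getting these numbers (and the indexes) wrong is where most of the "tuning" for this particular service ends up.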


That's no joke. I have a decent software background and it was far from trivial to get going with aws services. Their documentation doesn't always quite tell you everything you need to know and half the time there are conflicting docs both saying to do something that's wrong. Still has been less work than a production server at my last engineering job but then again that project had a lot of issues related to age and shitty code bases. Hard to say which would have been less work honestly.


> but because fuck maintenance, fuck operations, fuck worrying about buffer overflow bugs in OpenSSL, I'll pay Amazon $N/month to do that for me, all that matters is the product.

That's nice in theory, but my experience with paying big companies for promises like that is I still end up having to debug their broken software; and it's a lot easier to do that if I can ssh into the box. I've got patches into OpenSSL that fixed DHE negotiation from certain versions of windows (especially windows mobile), that I really don't think anyone's support team would have been able to fix for me / my users [1], unless I had some really detailed explanation -- at that point, I may as well be running it myself.

[1] And as proof that nobody would fix it, I offer that nobody fixed this, even though there were public posts about it for a year before my patches went in; so some people had noticed it was broken.


This.

I've been informally tracking time spent on systems my team manages and managed tools we use. (I manage infra where I work.)

There is very little difference.

And workarounds and troubleshooting spent on someone else's system mean we only learn about some proprietary system. That's bad for institutional knowledge and flexibility, and for individual careers.

Our needs don't mesh well with typical cloud offerings, so we don't use them for much. When we have, there has yet to be a cost savings - thus far they've always cost us more than we pay owning everything.

I mean, I personally like not dealing with hardware or having to spend time in data centers. But I can't justify it from a cost or a service perspective.


After Amazon switched to per-second EC2 billing in 2017, I would say cloud is pretty good for prototyping and running CI. As to anything else, I would say it depends.


absolutely. I am _right_now_ fighting ELB because we see traffic going to it but not coming out of it. If it were MY load balancer, I would just log in and tcpdump.


And you couldn’t use VPC logs?


Finally figured it out. Non-http/2 support.


Your experience with companies or your specific experience with AWS?


With companies. I don't have a lot of specific experience with AWS, except for the time when nobody mentioned that the default firewall rules are stateful, and there's a connection cap that's kind of low but scaled to instance size, and there's no indication about it, and the sales droid they assigned to my account because they smelled money didn't mention it either; but I'm not bitter. :)

Based on all the articles I see from time to time, I fully expect if I worked for a company that was built on AWS, I would spend a lot of time debugging AWS, because I end up debugging all the stuff that nobody else can, and it's a lot easier to do when I have access to everything.


Everything you mentioned is covered in a Pluralsight video.

But you criticize AWS based on a few articles and hardly any real world experience?


I didn't start off criticizing AWS; just large companies in general, based on my experience with them, and extrapolating it to AWS based on my experience with the one issue, combined with reading things that sound like the same pattern of people having to debug big company's software, and it's hard when you can't see anything.

On this specific issue; I worked with sales support, and also looked on their forums; and found things like this: https://forums.aws.amazon.com/message.jspa?messageID=721247

Which says 'yes, we know there's a limit; no we won't tell you what it is' and ignores the elephant in the room, that apparently if you reconfigure the network rules, the problem evaporates.

Excuse me for not banging my head against the wall or watching random videos from third parties. Actually, I only know there's a way to avoid this because I complained about it for a long time in HN threads, and finally, some lovely person told me the answer. In the mean time, I had continued my life in sensible hosting environments where there wasn't an invisible, unknowable connection limit hovering over my head.


>>> I didn't start off criticizing AWS; just large companies in general, based on my experience with them

I will reply on that because I have experience with both.

If you worked in big companies, you do have to constantly debug half baked internal tools written by other people, with little or no documentation. It's horrible when you change company later, the institutional knowledge is lost when you leave and it's worthless for you to get a new job.

AWS is absolutely nothing like that. The tooling is working and polished and documented. It's much much better than any internal tools you will find in big companies. When you're out of your depth, you can just google for help. When changing company you can continue to apply the experience and never worry about the tooling in place.


So you should be excused for not doing research on something your entire infrastructure is based on?

And just “extrapolating”?


No, my entire infrastructure is not based on AWS; this was a small side project, that didn't go into production. In part, because the network did weird things, and the support system wasn't very helpful. Mostly, because it turned out to be unnecessary.

What am I supposed to do to satisfy you here? After finding the network does weird things, and not being able to get timely help with, I should conclude that AWS will solve my needs, if only I paid them more money? If only I had wasted more time watching videos that don't come up against reasonable search terms, maybe I wouldn't have come to the conclusion that having an amorphous, uninspectable, blob of a network between my servers and the internet is a bad thing?

P.S. This may be a little of pot and kettle, but what makes it so important to you that I'm convinced that AWS is unicorns and rainbows and having more control and visibility is a bad thing?

I'm sure AWS has its perks; it's great if it works for you, and if you can take advantage of running drastically different number of instances at peak and off-peak, I think there's cost savings to be had. But when it comes down to it, I'm not willing to hitch my availability to complex systems outside my control, when I don't have to, because I don't like the feeling of not being able to do anything to fix things; and I get enough of that from dealing with the things that are broken that I can't control and don't have any choice about depending on.


> P.S. This may be a little of pot and kettle, but what makes it so important to you that I'm convinced that AWS is unicorns and rainbows and having more control and visibility is a bad thing?

Because I was developing and maintaining web servers, database servers, queueing systems, load balancers, mail servers and doing development at small companies before AWS in its current form existed. I’ve done both.

The problem most posters here have isn’t lack of visibility, it’s not understanding the tools - it’s a poor carpenter who doesn’t understand their tools.

> What am I supposed to do to satisfy you here? After finding the network does weird things, and not being able to get timely help with, I should conclude that AWS will solve my needs, if only I paid them more money? If only I had wasted more time watching videos that don't come up against reasonable search terms, maybe I wouldn't have come to the conclusion that having an amorphous, uninspectable, blob of a network between my servers and the internet is a bad thing?

Do you usually jump in on a new technology or framework and get frustrated because you don’t know how something works when you didn’t take the time to learn the technology you use?

I’m sure you’ve never had a problem with any piece of open source software that was “out of your control” or did you just download the software and patch it yourself?

The truth is that yes any piece of technology is complex and you thought you could just jump in without any research and start using it. I don’t do that for any technology. When I found out that we were going to be using ElasticSearch for a project, I spent a month learning it and still ran into some issues because of what I didn’t know. That didn’t mean there was something wrong with ES.


> Do you usually jump in on a new technology or framework and get frustrated because you don’t know how something works when you didn’t take the time to learn the technology you use?

I usually jump in and try to solve my problem with the tools that seem right for the job, reviewing the documentation as needed. I get frustrated when the documentation doesn't match what I have observed. In this case, the connection limits of the default stateful firewall on EC2 were not mentioned anywhere I could find (knowing exactly what to look for and where to look, I think I could find a vague mention of the general limit today). The specific limits per instance type certainly still aren't. I found forum posts about the connection limit with useless responses from employees. I reached out to my support contact and got useless information (they did get me access to bigger instances though, which just have a larger, unspecified, connection limit, that was still small enough that I could hit it). I reached the conclusion that AWS must be great, because people love it, but it has very non-transparent networking, and useless support for non-trivial issues. I'm sure if we were spending more, we would get better support people and they'd share insights into their networks -- I've seen that with our other hosting providers, although network insights there weren't needed for everyday things, just more details were nice when the network broke, so we could help them detect future issues and plan for failures.

If the technology mostly works, but I need deeper knowledge, maybe for optimizing, I'll seek out deeper references, but third party references during discovery is very dangerous. When there is a conflict between what is observable, what is documented in first-party sources and what is documented in third-party sources, I would have no context to know which documentation indicates intended behavior and which is most out of date.

> I’m sure you’ve never had a problem with any piece of open source software that was “out of your control” or did you just download the software and patch it yourself?

Download and patch myself, and upstream the patches if I have the time and patience. Isn't that the point of open source? Everything is broken, but at least I can inspect and fix parts of the system where I can see the code. Hell, binary patching is a thing, although I wouldn't want to do that on the regular. I've had patches accepted in the FreeBSD kernel, OpenSSL, Haproxy to fix issues, some longstanding I ran into.

At the end of the day, I'm responsible for the whole stack, because if it's not working, my users can't use my product. It doesn't matter if it's the network, software I wrote, open source software, managed software; even software on the client devices is my problem if it doesn't work. The more I can inspect, the better.


> The specific limits per instance type certainly still aren't

In the official documentation they do mention that different instance sizes have different networking capabilities. This is stuff you learn early on when learning AWS - from the official books.

> In this case, the connection limits of the default stateful firewall on EC2 were not mentioned anywhere I could find

The fact that security groups are stateful and NACLs are stateless is something I ask junior AWS folks about when interviewing. That’s like one of the level 1 questions you ask to know whether to actually bring them in for an on-site.

> and useless support for non-trivial issues. I'm sure if we were spending more, we would get better support people and they'd share insights into their networks

So what did you find when you turned on VPC logging for the network interface of the EC2 instance? Again this level of troubleshooting is a question I would ask junior admins during an interview process.

I’ve had 100% success rate with our business support using live chat. With things that were a lot hairier and with my own PEBKAC issues.

> Download and patch myself, and upstream the patches if I have the time and patience. Isn't that the point of open source? Everything is broken, but at least I can inspect and fix parts of the system where I can see the code. Hell, binary patching is a thing, although I wouldn't want to do that on the regular. I've had patches accepted in the FreeBSD kernel, OpenSSL, Haproxy to fix issues, some longstanding I ran into.

I’m sure my company wouldn’t have any trouble with approving our running our own patched version of our production database and OpenSSL....


> So what did you find when you turned on VPC logging for the network interface of the EC2 instance? Again this level of troubleshooting is a question I would ask junior admins during an interview process.

Hey --- this sounds like something my support tech should have asked me, or should have been mentioned by the employee response in the forum. I didn't look at VPC logs; I did see that SYN packets (or maybe SYN+ACK responses, I don't remember) were mysteriously missing in tcpdump between the ec2 instance and the server.

Either way, I'm not interviewing for an AWS position; I was just trying to use a managed off the shelf service. Looking through the docs now, I did find the mention of connection tracking [1], but even there, there's no mention of a limit (of course, as someone familiar with firewalls, I know there's always a limit with connection tracking, which is why I only rarely write stateful rules, and wouldn't have assumed default rules were stateful). I had read the bit about the default security group, which says:

> A default security group is named default, and it has an ID assigned by AWS. The following are the default rules for each default security group:

> Allows all inbound traffic from other instances associated with the default security group (the security group specifies itself as a source security group in its inbound rules)

> Allows all outbound traffic from the instance.

> You can add or remove inbound and outbound rules for any default security group.

Unfortunately, "allows all outbound traffic" was misleading, because it really means "allows all outbound traffic, subject to connection limits".
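
To illustrate why any stateful rule implies a cap, here is a toy model of a connection-tracking table. It is not how AWS implements security groups; the class, the flow tuples, and the limit of 2 are all hypothetical and only show the general mechanism.

```python
class StatefulFirewall:
    """Toy stateful firewall: every allowed flow occupies a slot in a finite table."""

    def __init__(self, max_tracked):
        self.max_tracked = max_tracked   # hypothetical cap; real limits are not published
        self.tracked = set()

    def allow_outbound(self, flow):
        if flow in self.tracked:
            return True                  # existing connection, already tracked
        if len(self.tracked) >= self.max_tracked:
            return False                 # table full: new connection silently dropped
        self.tracked.add(flow)
        return True

    def allow_inbound_reply(self, flow):
        # Replies are only allowed for flows the table remembers.
        return flow in self.tracked


fw = StatefulFirewall(max_tracked=2)
results = [fw.allow_outbound(("10.0.0.5", port, "203.0.113.7", 443))
           for port in (40001, 40002, 40003)]
print(results)   # the third flow does not fit in the table
```

A stateless rule has no table to fill, which is why rewriting the rules so that traffic is not tracked makes this kind of limit go away.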

> I’m sure my company wouldn’t have any trouble with approving our running our own patched version of our production database and OpenSSL....

I'm assuming that's sarcastic, and you're actually saying you don't think you would be able to get approval to run a patched version of software. Are you saying that if you find a problem in OpenSSL (or whatever enabling technology) that causes your system to be unreliable, your employer will not let you fix it; you'll need to wait for a fixed release from upstream? From experience, upstream releases often take weeks and some upstreams are less than diligent about providing clean updates; not to pick on OpenSSL, but a lot of their updates will fix important bugs and break backwards compatibility, occasionally breaking important bits of the API that were useful. I guess, if you're in an environment where you have no ability to fix broken stuff in a timely fashion, it really doesn't matter whose responsibility it is to fix it, since it won't be fixed.

I really hope you didn't need my patch for your database systems; but maybe you want/wanted it for your https frontends if you were doing RSA_DHE with windows 8.1 era Internet Explorer or windows mobile 8.1 so that your clients could actually connect reliably. Anyway, if you're running OpenSSL 1.0.2k or later, or 1.1.0d or later (might have been 1.1.0c), you've got my patch, so you're welcome. It fixes the issue well described here [2]

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-ne...

[2] https://security.stackexchange.com/questions/104845/dhe-rsa-...


You specifically mentioned OpenSSL. Our auditors would have been up in arms at my previous company where we did do everything on prem if we ran a custom version of OpenSSL. Do you really think we could pass either HIPAA compliance or PCI compliance with our own unvetted version of it?


I work on a team where our products are all in Fargate containers. I understand the appeal of serverless -- you never need to exec into the container, but half the time when we're debugging an issue in prod that we can't reproduce locally, we'll say, "wouldn't this be easier if we could just exec into the container and find out exactly what's going on?"


Not being able to ssh into a container sounds like a missing feature of that particular container solution? I would expect that I can ssh into a docker container hosted in a kubernetes cluster. Hmm, pretty sure I must've done this dozens of times.


Removing SSH should be the goal though. If you follow the old Visible Ops book you also "Electrify the Fence" and introduce accountability, etc. If your goal is to see what a process is doing introduce tracing. If you need to add a "println()" then push that out as a change because the environment is changing from your altering of it. Because the tool doesn't exist yet that you need to SSH into a box doesn't mean it shouldn't - you have to instrument the tooling to prevent you from needing this adhoc ability. Admittedly it scares me still but ideally the end game is to never need to or have the ability to do so through a tool which has all the things you are looking for without allowing a human to be too human and miss a semi-colon.


> "println()" then push that out as a change

No, when you actually need to debug in production that's usually not what you want. Changing or restarting the software you are debugging might well make the behaviour you want to understand go away.

> introduce tracing

Yeah, well, that's basically "logging in". Just over a less mature and likely less secure protocol than SSH.

You don't need ptrace and tcpdump to debug software. It's just that it can shave a few weeks off your time when you need to reproduce something in the more tricky cases.

These discussions tend to surface in the context of containers but that's all very irrelevant. Your need to debug software isn't affected by the way you package it.


You need to be able to troubleshoot things in production though.

Perhaps whenever a developer wants to troubleshoot the orchestrator could make a clone of the container. The clone continuously receives the same input as the original and gets to believe that it affects the back end in the same way. That way the developer can dissect the container without impacting real data.


> My goal is to be totally unable to SSH into everything that powers my app

That sounds like a nightmare to me, and I'm not even a server guy, I'm a backend guy only managing my personal projects.

> I'll pay Amazon $N/month to do that for me, all that matters is the product.

I don't want to pay Amazon anything, I want to pay Linode/Digital Ocean 5 or 10 or 20 dollars per month and I can do the rest myself pretty well. My personal projects are never ever going to reach Google's scale and seeing that they don't bring me anything (they're not intended to) I'm not that eager to pay an order of magnitude more to a company like Amazon or Google in order to host those projects.


it's astounding how much a $20 linode can handle (as a web host) with a proper software stack and performance-minded coding.

but companies throw money, hardware and vendor lock-in at problems way before committing to sound engineering practices.

i have a friend who works at an online sub-prime lender, says their ruby backend does hundreds of queries per request and takes seconds to load. but they have money to burn so they just throw hardware at it. they spend mid-five figures per month on their infrastructure and third party devops stuff.

meanwhile, we run a 20k/day pageview ecommerce site on a $40/mo linode at 25% peak load. it's bonkers how much waste there is.

i think offloading this stuff to "someone else" and micro-services just makes you care even less about how big a pile of shit your code is.


And we just thought the frontend of websites and services was bad.

Seems the insanely bad coding and ignorance of speed and performance goes all the way to the backend too!

Disappointing.


One of the things I'm most grateful for as a programmer was the ability to learn a lot of performance-minded backend design principles within the Salesforce ecosystem. Salesforce was always brutally unforgiving in terms of the amount of I/O it would let you do per transaction, which led me to become much more aware of how those things worked under the hood, and had the effect of teaching me how to better design my code to fit within the limits.

When you can't use more than a couple seconds of CPU time per transaction, or you have a fixed and spartan heap size for a big algorithm, you learn good habits real quick.

I imagine a lot of engineers today haven't had the experience of working within hardware limits, hence all the waste everywhere.


> When you can't use more than a couple seconds of CPU time per transaction

hmm, i guess that explains why salesforce is universally known for being slow; 2 seconds is an eternity.

we deliver every full page on the site within 250ms from initial request to page fully loaded (with js and css fully executed and settled). images/videos finish later depending on sizes and qtys.


I don’t disagree that Salesforce is slow but this is a different workload, think db/business logic to perform a task, processing records, etc, not loading a typical page/assets.


i just signed up for a trial of salesforce and at least when there's not much data in the account it doesn't seem too slow. i was going off of what some coworkers related to me from past experience.

i think it's expected that if you're doing some editing/reporting/exporting en masse, then a couple seconds could be okay (assuming it's thousands or tens of thousands of records, not a hundred). but not for single-record actions / interactions.


Astonishingly it turns out that bad programmers are everywhere and aren't made any better by handling a different domain of problems. Truly shocking.


We process up to 15 million events per second on a $15K HP server.


a pretty meaningless metric without knowing what an "event" is.


The event is a market data update. On one hand it is in fact pretty simple. On the other hand, it is not much simpler than something like “user X liked Y” or even “product Z added to cart”. I think it shows what well designed, optimized code can do on normal hardware. Remember there are 3-4 billion instructions that a modern cpu is pumping out per second.

One large bank I used to work for had a payment processing system doing about 40 thousand transactions per day. It ran on about $8 million worth of hardware - mainly IBM mainframe. I was impressed until I found out each transaction was essentially a 60-100 byte text message. I can confidently say a well designed system can do 1000 times that load on an iPhone.


Not that it isn't probably still terrible code but the banking system is likely logging the full transaction details every time it is modified / passed to a different part of the system. I.e. fully auditable - if something goes wrong with the transaction it can be fully traced back, corrected and the underlying issue investigated.


Cloud providers have generous free tiers, so many personal projects pay $0. However, I'm in the same boat. I'd rather pay $20/month to DO and risk my site going down in a DDoS attack than having a REST endpoint basically tied to my credit card and hoping Amazon blocks all the malicious traffic.


Yep, keep your stuff portable and you can always migrate to AWS/Azure/etc. later if your personal project should turn into something bigger.


As someone tasked with protecting infrastructure, devs not being able to SSH into production is a godsend.


There's a difference between devs not being able to ssh in due to access restrictions and devs not being able to ssh in because it's not a feature of the service. I actually agree that not having unilateral access to the production system is a good thing. But nobody being able to be granted access, even for a limited time, is neither productive nor safe.


so, what is to stop them from running code from the app?!? Who do you call when something breaks? Do you expect the dev to not be able to log in and investigate at 3am or whenever?

I mean if you can't trust your dev team, or the procedures surrounding them, you are kinda screwed.


To be clear, the issue isn't about trusting developers in prod. It's about limiting access to prod as an avenue for attackers who will be looking to get in there and exfil sensitive data.

Nothing to do with trust of devs.


Enjoy the many (many, many) connection reset by peer, TCP idle timeouts, and 500 Internal Server Errors on Dynamo's end.

Between that and the dogshit documentation, it's truly thrilling to be paying them a ridiculous amount of money for the privilege of having a managed service go undebuggably AWOL on the regular, instead of being able to resolve the issues locally.


Yeah and if you’re on gcp you also get dogshit support whose main strategy is to sandbag you with endless (and useless) logs collection until you give up and move on or take too long so they can close the ticket.


I can see my code handling all the exceptions corresponding to the points you raised, with exponential backoff. Which part of Dynamo's documentation is dogshit?
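
For anyone who hasn't written that kind of handling: a minimal sketch of retry with exponential backoff and jitter, in Python. The names here (TransientError, call_with_backoff) are made up for illustration and are not any AWS SDK's API; a real client would catch whatever throttling, reset, and 5xx exceptions its SDK actually raises.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a throttle, connection reset, or HTTP 500."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": sleep a random amount up to the ceiling so that
            # many clients retrying at once do not stay in lockstep.
            sleep(random.uniform(0, cap))
```

The sleep parameter is only there so the delays can be observed or skipped in a test; in normal use the default time.sleep applies.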


To me this comment is not a response to the article.

The article isn’t asking why use Dynamo vs SQL in the context of the traditional NoSQL vs SQL way

It’s why Dynamo vs QLDB.

Or why Cloud Firestore vs Cloud Spanner.

Or Firehose + some lambda functions vs EMR

It’s wholly separate of the managed angle and focuses on the actual complexity of the products involved and the trade offs therein.

On the surface the two sound similar, but there’s much more nuance to this article's point (take the example of Dynamo: the author isn’t opposed to it for being NoSQL, they’re opposed to it for low level technical design choices that are a poor fit for the problem their client had).


Yeah, "not my problem" sounds like a winning strategy, until you realize that it also means "nothing I can do about it" when things inevitably go wrong.


Not to mention that you're the lowest priority when something does go wrong.


But these services do break, and when they do break, often times you don't even get a useful error or log file to grep. Sometimes I just really wish I could SSH into a box to figure out what the hell is going on.

You're gaining an expertise, but instead of it being with the underlying technology, it's with the management services for that technology. I think the premise of your argument is correct that understanding these services is simpler than the technology itself, but I seem to inevitably find myself in some edge use case of the service that is poorly documented or does not work as described, and trolling stack overflow or AWS developer forums is the only solution other than trial and (no) error.

Let me tell you: it's not fun when you're entrenched into a proprietary service that is now outdated and kept only for legacy reasons, and the version you are on doesn't allow you to do things like update your programming language to a more recent version. Now you get to move to another managed service and lose all that expert knowledge you had with the old one.


> My goal is to be totally unable to SSH into everything that powers my app.

If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred Assembly, neither understanding what their programming languages compile to, nor asking to see how their data is stored or transmitted in the true bits of the wire. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our APIs, in our proofs, and in our memory models and need not burden ourselves with confusing engineering detail that has no value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.

Thus preacheth Pastor M. Laphroaig.


If you go serverless because fuck accountability, shouldn't the extension be jobless because someone else will provide the service you did?

In that way, cloud/serverless migration is really the whole worldwide IT workforce progressively willingly migrating IT (design, development, operations & applications) out of companies, to just a few big providers.

Just like energy.

And who gets to decide destinies in the world? Those who control, have and provide energy.

What a time to be alive.

End of XIX century was the rush to oil-based energy to control the world. End of XX was the rush to data-based capture and control.


Which Dynamo are you talking about? The Dynamo paper, or Dynamo, the AWS managed database product? Dynamo-the-product makes sense for all sorts of random tasks. Full-on implementations of Dynamo-the-paper (for instance: Cassandra) fall into the bucket of "understand the problem first" that the author is talking about.


Security and privacy yet again traded for convenience. We need strict data privacy laws and then I would have a little more trust in large centralized entities that would otherwise have every incentive to abuse people's data.

Such centralized systems also become significant points of failure. What happens to your small business if one of the many services you rely on disappears or changes their API?

"All that matters is the product" sounds very much like "all that matters is the bottom line" and we've seen the travesties that occur when profit is put above all else.


It's not clear that having your own datacenter is more secure than using AWS/GCP/Azure services. In both cases there are good monitoring solutions, however I'd say that cloud-based solutions have easier immediate access to most things because they just integrate with the provider's API whereas on prem you're installing agents and whatnot.

Also having granular IAM for services and data is very helpful for security. You have a single source of truth for all your devs, and extremely granular permissions that can be changed in a central location. Contrast to building out all of that auditing and automation on your own. Granted, IAM permissions give us tons of headaches on the regular, but on balance I still think it's better when done well.

If you're concerned about AWS/GCP/Azure looking at your private business's data, I think 1.) that's explicitly not allowed in the contracts you sign 2.) many services are HIPAA compliant, hence again by law they can't 3.) They'd for sure suffer massively both legally and through loss of business if they ever did that.


It is important to understand the following:

Depending on the problem, choose the right tools. For example so far everything I've seen about React and GraphQL tells me that maybe GQL isn't necessarily the best solution for a 5-person team, but a 100 person team may have a hard time living without it.

Kubernetes / Docker is significant work. And when we had 4 developers making that effort was wasteful and we literally didn't have the time. Now at 20 we're much more staffed and the things Kubernetes / Docker solves are useful.

Meanwhile we have a single PostgreSQL server. No nosql caching layer. We're at a point where we aren't sure how far we gotta go before we hit limits of a PSQL server, but at least 3-4 years before we hit it.

Point is look at the tools. See where these tools win it for you. Don't just blindly pick because it can scale super well. Pick it because it gives you things you need and a simplicity you need for the current and near-term context, and have a long-term strategy how to move away from it when it is no longer useful.


> My goal is to be totally unable to SSH into everything that powers my app.

In reality that means that when something doesn't work as expected, instead of logging in and solving the problem, you'll be spending your time writing tickets and fighting through the tiers of support and their templated replies, waiting for them to eventually escalate your issue to some actual engineer. Wish you a lot of luck with that, I prefer being able to fix my own stuff (or at least give it a try first).


I'm not sure I follow, how do managed services help with choosing the right database technology for the problem set?


He doesn't care about that. JUST THE PRODUCT. That's how amazing his fantasies are.


Yeah, I can't quite square what the OP is saying with how to deal with something like Amazon Dynamo's lack of large scale transactions. (Even basic ACID was only introduced late last year, if I read correctly.)


If you choose Dynamo, then you get to blame Dynamo for not having ACID when you lose data!

This is an incredibly common fallacy in our industry. Remember, you are responsible for the appropriateness of the dependencies you choose. If you build your fancy product on a bad foundation, and the foundation crumbles, your product collapses with it and it's unprofessional to say "Well, I didn't pour the foundation, so I can't be blamed." Maybe not, but you decided to build on top of it.


A SQL database, bash, and a Raspberry Pi in your closet all are managed too. They're built on top of software that's been battle-tested over decades to keep itself running. MySQL/Postgres/whatever will manage running multiple queries for you and keeping the data consistent as well as Dynamo will. bash will manage running multiple processes for you as well as Hadoop will. A Raspberry Pi with cron will manage starting a service for you every time it powers on as well as Lambda will. The reasons to use Dynamo/Hadoop/Lambda are when you've got different problems from the ones that MySQL/bash/cron solve.

If you don't believe that this counts as "managed," then for all your words at the end, you believe at the end of the day that making a solution more complex so that failures are expected and humans are required is superior to doing a simple and all-in-code thing. For all that you claim to not like operations, you think that a system requiring operational work makes it better. You are living the philosophy of the sysadmins who run everything out of .bash_history and Perl in their homedir and think "infrastructure as code" is a fad.


With this definition unless you write your own database system, everything is managed. That makes it pretty useless to qualify anything as managed.

The Raspberry Pi is managed by you, as opposed to managed by a different company / different people so you can focus on your business problem.


Here I am still using SQLite (had to add an index yesterday!) Maybe one day I will grow up :(


SQLite is good quality software, and very heavily tested. There's nothing wrong with using it so long as you don't outright misuse it, and they are clear about when you shouldn't be using it.


I very rarely run into a situation where I need more than SQLite. The page on the subject is great: https://www.sqlite.org/whentouse.html


It's not so good for multi user systems as I understand it, as it has a lock on the whole database when writing.


Queuing works most of the time, does it not?


Queuing and retry works most of the time. Only time I truly get "database is locked" is when I open the shell while other processes are accessing the DB
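
A rough sketch of what "queuing and retry" can look like with Python's built-in sqlite3 module, assuming a single-file database. The timeout argument makes SQLite itself wait for the write lock; the loop around it is an extra, hypothetical safety net (write_with_retry is an illustrative name, not a library function).

```python
import sqlite3
import time


def write_with_retry(path, sql, params=(), attempts=5, busy_timeout=5.0):
    """Run one write statement, retrying if SQLite reports the database is locked."""
    for attempt in range(attempts):
        # timeout= lets SQLite wait up to busy_timeout seconds for the lock
        # before raising "database is locked".
        conn = sqlite3.connect(path, timeout=busy_timeout)
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            # Only retry lock contention; re-raise anything else, or the
            # last failure once the attempts are used up.
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(0.1 * (attempt + 1))  # brief pause before trying again
        finally:
            conn.close()
```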


Are all writes blocked during database migrations?

ALTER TABLE is quite limited in SQLite (for example there is no DROP COLUMN). Have you missed this?


All writes are blocked during database migrations, so there is potential offline time. For example, when I added the index, I needed to be offline. Fortunately, my application is only needed during business hours.

Yes I miss proper ALTER TABLE a lot when it would have been useful. But my experience shows me that I can design software pretty well following YAGNI, and the occasional instances when I need missing functionality are more than made up for by the ease of use otherwise.
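
For the missing DROP COLUMN, the usual workaround is to rebuild the table: create a new one without the column, copy the rows, drop the old table, rename. A minimal sketch in Python, assuming a hypothetical users table with a legacy_flag column to remove (neither is from this thread). Indexes, triggers and views on the old table would need to be recreated as well, and writers are blocked while the transaction runs. SQLite 3.35 and later did add ALTER TABLE ... DROP COLUMN, so this only matters on older versions.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so the
# transaction below is controlled explicitly with BEGIN / COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, legacy_flag INTEGER)")
conn.execute("INSERT INTO users (name, legacy_flag) VALUES ('ada', 1), ('bob', 0)")


def drop_legacy_flag(conn):
    """Emulate ALTER TABLE users DROP COLUMN legacy_flag by rebuilding the table."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO users_new (id, name) SELECT id, name FROM users")
        conn.execute("DROP TABLE users")
        conn.execute("ALTER TABLE users_new RENAME TO users")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # leave the original table untouched on failure
        raise


drop_legacy_flag(conn)
```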


Thanks for sharing your experience with SQLite :-)

What are the advantages, in your own experience, of using SQLite instead of something like PostgreSQL? With PostgreSQL, you could run database migrations during business hours and use a proper ALTER TABLE.


Main reason is that SQLite is embedded with my application. There isn't a second, third or fourth "microservice" to spin up. I know Terraform/Docker, love them, but the gymnastics I need to orchestrate for my particular situation is not worth it. So for the tradeoff of complexity for functionality, SQLite hits a very good sweet spot for me.


Where do you queue: in memory? on disk? in a dedicated queue server (Redis, Kafka, ActiveMQ, etc.)? in another SQLite database?


So far, haven't needed queuing outside of a single thread in the application (I know, what happens if it crashes!) Hasn't happened in ~8 years. Won't cry if it does.


I have an app that uses an in-memory queue too in one thread :-)

I'm curious: why do you queue instead of directly writing to the database?


At the moment, I queue to batch insert real time data at ~5 second intervals. It seems less wasteful of resources, but quite frankly is probably not needed.
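
Roughly what that pattern can look like, as a sketch: producers put rows on an in-memory queue, and one writer thread drains it every few seconds and inserts the whole batch in a single transaction. The readings table and the function names are assumptions for illustration only.

```python
import queue
import sqlite3


def flush_batch(conn, pending):
    """Drain everything currently queued and insert it in one transaction."""
    rows = []
    while True:
        try:
            rows.append(pending.get_nowait())
        except queue.Empty:
            break
    if rows:
        with conn:  # one commit for the whole batch instead of one per row
            conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)
    return len(rows)


def writer_loop(path, pending, stop, interval=5.0):
    """Run in a single writer thread; stop is a threading.Event."""
    conn = sqlite3.connect(path)  # created in the thread that uses it
    try:
        while not stop.wait(interval):  # wakes every interval until stop is set
            flush_batch(conn, pending)
        flush_batch(conn, pending)  # final flush on shutdown
    finally:
        conn.close()
```

Anything still sitting in the queue is lost if the process crashes, which is the tradeoff already mentioned above.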


To be fair I have only really used SQLite for early development stages so haven't investigated workarounds in much detail.


How do you access remotely, from your workstation, to the SQLite database on your server, while it is still serving requests?


  ssh user@server
  sqlite3 "file://$PWD/db.sqlite?mode=ro"


Yes, this will work of course :-)

I was thinking of something more "graphical", similar to pgAdmin or SequelPro, when you need to navigate in a dataset.


I use https://sqlitebrowser.org/ if I need to but primarily Jupyter w/ Pandas to explore and understand the data and prototype new queries.
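
For the Jupyter route, one way to poke at a live database without any risk of writing to it is the same mode=ro URI from Python. A small sketch; open_read_only is a made-up helper name, and the resulting connection can be handed to pandas.read_sql_query if pandas is installed.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file read-only via a file: URI."""
    # as_uri() needs an absolute path, hence resolve().
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    # uri=True tells sqlite3 to interpret the string as a URI, not a filename.
    return sqlite3.connect(uri, uri=True)
```

Any INSERT, UPDATE or DDL on that connection fails with an OperationalError instead of touching the file.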


Regarding SQLiteBrowser, I guess you have to download a copy of the database on your local workstation?


Correct. I use it rarely.


Just ran it using X remotely, so that's an option too.


There's https://sqlitebrowser.org,

But I suppose you'll have to string up something if you want it to access a remote db.


There is also https://www.phpliteadmin.org, but I'm not a fan of the user interface.


Whatever exists specifically for sqlite, sshfs is your friend too.


... Fuck debuggablity. Fuck performance. Fuck understanding what is actually going on....

(Sure, managed solutions can satisfy all of those. But not without understanding a few orders of magnitude more complexity than a simpler solution, and you'll never get there if your attitude is "I'll pay Amazon to do that for me." Amazon doesn't care whether you succeed.)


> Amazon doesn't care whether you succeed.

Wait, what? Why wouldn't they? Successful companies will spend money with them; defunct companies can't.


> My goal is to be totally unable to SSH into everything that powers my app. I'm not saying that I want a stack where I don't have to. I'm saying that I literally cannot, even if I wanted to real bad.

So you want technology that prevents you from doing something you might want to do, real badly?


If the main reason for your switch from sql to dynamo is that it is managed why not use aws aurora, aurora serverless, or any other more managed sql solution?

Now that many sql databases support json I don’t quite understand why you would use a documentdb for anything other than actual unstructured data, or key value pairs that need specific performance.

I used mongodb for a while, now I’m 100% back in sql. Document dbs really increased my love for sql. But I’m also far from being a nosql expert and I never have to deal with data above a TB.


Serverless is not a panacea; actually, most of the time the opposite is true. You will spend much more time dealing with configuration, deployments, managing cold starts, orchestrating everything. Sure, for some use cases it is perfect, but I can tell you almost certainly that the future won’t be on AWS lambda & co.


It's moving complexity from the application layer to the deployment layer.


which is awesome. Application complexity is harder to solve than deployment/infrastructure complexity. And more often the cloud providers would solve the deployment complexities over time.


To your point, when I often see the "shell script" solution it's often missing basic things you'd want like version control or any kind of bread crumb trail (I find myself sshing in and "grep -r" trying to find things). That works if you're cobbling something together, but doesn't scale beyond 1 person.

It's tough to get the correct level because you can ansible > docker > install an rpm or just change the file in place. Both have their place and "hacky" solutions can work just fine for many years.


> That's why serverless is the future;

Future of what exactly?


The future of the pit they will dump money into.


The future of maintenance problems, where the split between developers and deployments makes it a pain in the arse to work out what has gone wrong.


The future of making nouns out of adjectives, apparently.


Future of server ?


Future of server is serverless? :P


Serverless is just somebody else's server.


No, "cloud" is somebody else's server. Serverless is somebody else's server that scales your services seamlessly (and usually opaquely).


I was riffing off of the phrase, "The cloud is just someone else's computer".


Serverless is the cloud?


Oh, I didn't realize we were doing trick questions.

What's the safest way to go skiing? Don't ski.


Before we make a technology choice, we should be clear what those choices are. SQL is a query language and DynamoDB is a database. "NoSQL" technologies can be fronted with SQL interfaces. Relational databases can vary a lot too in their ACID compliance ("I" being the component with maximum variance).

The choice of technology should first be based on the problem and not whether it is serverless. Choosing DynamoDB when your application demands strict serializability would be totally wrong. I hope more people think about the semantic specifications of their applications (especially under concurrency).


Well, if all you want is managed, per request hosting of SQL, use Aurora serverless! It's a game changer. As for Dynamo vs SQL -- you really cannot substitute one for the other. They're different tools for different jobs. Dynamo is extremely low latency and great for that purpose, while SQL is great for being a store of record, and if you use it as long as you can get away with it for as many things as possible, it's usually not a bad idea.


Issue is, this works only for very simple systems. With complexity comes maintenance and operations. You also cannot build anything innovative in case of heavy load systems because you cannot embed your custom highly performant solutions (cloud provider simply does not support them ootb).

Serverless sounds great for "some" kind of systems but in other cases it's a complete misfire. Our job is to know which tool to use.


N gets to be a big number my friend. I love aws but I can’t get over the number of startups I’ve seen with $100k MRR and $35k MRC allocated to AWS.


>"why are you using Dynamo, just use SQL"

On the other hand, I find it frustrating how many developers either don't know basic SQL or try to get cute and use the flat side of an axe to smack down nails instead of just using the well-known, industry-standard beauty of a hammer that is SQL.


If everyone did this, in the limiting case, no one has the ability to investigate and repair actual bugs until the point where the application can only crash every 10 ms and does nothing but get continually restarted.


>SQL can be managed with, say, RDS. Sure. But it's not the same level of managed as Dynamo (or Firebase or something like that).

It's managed at every shared hosting in the world.


Not at the same level, hardly anywhere. Replication, auto-scaling, failover and point-in-time backups are typically not managed for you. AWS and GCP have decent solutions now, but they're not nearly as hands-off in terms of management.


What specific managed services do you use, if I may inquire?


>> My goal is to be totally unable to SSH

Great goal. I guess you never had any use case to login to a production server to investigate a business critical issue.


I have had to SSH into prod boxes many times to debug emergent issues. Now I'm using all serverless, and I get a happy fuzzy feeling whenever I think about how I never ever have to do that again.


"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair." -- Douglas Adams


It's more likely that parent has had to do that a lot of times, or perhaps has managed people who have. That would be a sign of organizational dysfunction, and fixing that is a good goal to have.


I recommend heroku and heroku postgres


Using Dynamo is the worst thing we've done for productivity and costs in our product. It introduces onboarding overhead, unnecessary implementation time overhead, costs a lot, and is always a bottleneck in every operation. Using DAX requires paying for VPC which already costs more than a server that could power many small to mid-sized products. Resizing table throughputs requires lots of manual work to keep it cost effective, because the default settings are designed to nickel and dime you to death. If we just used a simple SQL or MongoDB server attached as a container we'd save literally months of development time and thousands of dollars per month.


This sounds beyond painful. We have similar problems at work with an expensive system that I believe we could reimplement with traditional technology, but we don't.


If you don't care about cost and don't care about someone else having all of your data, that is probably a fine route.


I see that HN now steps into the gutter of reddit-like language and can damn all professional decorum.


> As of 2016, Stack Exchange served 200 million requests per day, backed by just four SQL servers: a primary for Stack Overflow, a primary for everything else, and two replicas.

This was the most enlightening piece of the article for me. Their alexa rank today is 48 (globally, 38 in U.S.), so whatever your site is, you are probably not dealing with as heavy a load as them. What techniques do you have to employ to serve this many requests from a single database server?


https://stackexchange.com/performance

https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

Lots of caching (redis, CloudFlare) and trying not to use the database unless absolutely necessary, I would expect.


That's the beauty of serving mostly read-only data - you can just cache the most frequented pages. I can't think of a time that I clicked on a link for Stack Exchange that wasn't served via a search engine. I've never purposefully clicked a link from within Stack Exchange, and I've never posted any answers. I can thank those who do for enabling my success in my career, but I'm willing to bet that most Stack Exchange users are just like me.


"I've never purposefully clicked a link from within Stack Exchange..."

How do you resist those crazy unrelated questions in the sidebar? Currently shown for me:

  Did Shadowfax go to Valinor?
  Is it legal for company to use my work email to pretend I still work there?
  Why can't we play rap on piano?
  Can a virus destroy the BIOS of a modern computer?


These became so distracting that I applied an Adblock filter to hide the “Hot Network Questions” div, lol!


Ha! I do this too. Just the other day I was pair programming with someone who asked "why don't you have those super stupid questions on the right nav bar?"


That sidebar is my biggest time waster in the day.


They changed it so you can hide hot network questions (without any karma requirement)


Thanks for pointing that out! Direct link for anyone else looking for it: https://stackoverflow.com/users/preferences


I guess it's finally time to create an account!


Well, thanks a lot, now I have no other choice but to google at least two of those.


That shadowfax one, quora is horrible that way too. You're interested in one bit of canon trivia one time, and the site immediately assumes that particular story is all you want to read about, every day for months.

I bet gwaihir carried shadowfax to valinor.


I don't think hot questions are tailored at all, are they? I often get questions from boards I never visited, so I assumed it's a global collection.


Also, caching at the correct granularity. I've seen too many systems that cache on a fine-grained level and performance drops because there are so many cache lookups, even with an in-memory cache.
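
A rough sketch of the granularity point, assuming a hypothetical page that shows 30 posts with four cached fields each (`FIELDS`, `CountingCache` and the key names are made up for illustration). It only counts lookups; it does not measure real latency.

```python
from typing import Dict, Iterable, Optional

FIELDS = ("title", "body", "score", "author")  # hypothetical post fields


class CountingCache:
    """In-memory cache that counts how many lookups it serves."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.lookups = 0

    def set(self, key: str, value: str) -> None:
        self._store[key] = value

    def get(self, key: str) -> Optional[str]:
        self.lookups += 1
        return self._store.get(key)


def render_fine_grained(cache: CountingCache, post_ids: Iterable[int]) -> str:
    # One cache lookup per field per post.
    parts = []
    for post_id in post_ids:
        for field in FIELDS:
            parts.append(str(cache.get(f"post:{post_id}:{field}")))
    return "\n".join(parts)


def render_coarse(cache: CountingCache, page_key: str) -> str:
    # One cache lookup for the whole rendered page.
    return cache.get(page_key) or ""


post_ids = range(30)

fine = CountingCache()
for pid in post_ids:
    for field in FIELDS:
        fine.set(f"post:{pid}:{field}", f"{field}-{pid}")

coarse = CountingCache()
coarse.set("page:/questions/123", "<html>...</html>")

render_fine_grained(fine, post_ids)
render_coarse(coarse, "page:/questions/123")
print(fine.lookups, coarse.lookups)  # 120 lookups versus 1
```

The trade-off is invalidation: the coarse entry has to be thrown away whenever anything on the page changes, so it fits read-heavy pages best.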


They have details here: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

Lots of caching, lots of RAM, SSD storage, and a low-level ORM for SQL.


well the "orm" is more like a database driver that can serialize to models than an orm.

but considering their stack there are better "starting" alternatives (they also used them before dapper), like Linq2Sql, EF Core, etc...

Also compared to other languages Linq and EF Core are way better than most alternatives (besides that some things are just not working, like lateral joins)


Caching and a whole lot of it. Most of their pages could be served by a CDN.

How much do you think stack overflow is reading vs writing? How often do you even click a link from a stack overflow page once there?


Upvotes are clicked a lot but I guess they are just an api call and can be tuned to be high performance.


I rarely click anything on stack overflow and neither do most people I know. The only ones I know who do that are stack overflow evangelists. I'd guess very few people have an account and use it.


I've taken to using upvotes for bookmarking, that way when I inevitably look the same question up again six months from now I'll know which of the answers I thought was most useful last time.


I’m in between. I have an SO score of just 1000, but I vote a lot. Mainly because I’m logged in via oauth anyway so it’s no hassle to do so.


Upvotes are actually ideal for an in-memory service like Redis that fsyncs less frequently. Even if you lose the last few writes (highly unlikely, with some care), it's not the end of the world.
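
A minimal sketch of that idea in plain Python rather than Redis: votes accumulate in memory and are written to durable storage in batches. `VoteBuffer`, the `durable` dict (a stand-in for the database) and `flush_every` are all hypothetical names, and this is not how Stack Overflow or Redis actually implement it.

```python
from collections import Counter
from typing import Dict


class VoteBuffer:
    """Accumulate upvotes in memory and flush them to durable storage in batches."""

    def __init__(self, durable: Dict[int, int], flush_every: int = 100) -> None:
        self._durable = durable            # stand-in for the database / on-disk state
        self._pending: Counter = Counter() # votes not yet persisted
        self._flush_every = flush_every
        self._since_flush = 0

    def upvote(self, post_id: int) -> None:
        self._pending[post_id] += 1
        self._since_flush += 1
        if self._since_flush >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        # One batched write instead of one write per vote.
        for post_id, delta in self._pending.items():
            self._durable[post_id] = self._durable.get(post_id, 0) + delta
        self._pending.clear()
        self._since_flush = 0

    def score(self, post_id: int) -> int:
        # Readers see persisted plus pending votes.
        return self._durable.get(post_id, 0) + self._pending[post_id]


durable: Dict[int, int] = {}
votes = VoteBuffer(durable, flush_every=100)
for _ in range(250):
    votes.upvote(42)

print(durable[42], votes.score(42))  # 200 persisted, 250 visible
```

If the process died right now, the 50 pending votes would be lost, which is the "not the end of the world" case described above.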


it's also not super critical to show correct numbers in real time for that so there's that.


Well it's easier than you think (if you know what you are doing). I am serving around 90-100 mil requests per day from one 4-core xeon machine with most of the CPU idle. With around 2500 queries/s to redis and ~2k queries per second to postgres. App is written mostly in Go. A lot of those requests are updates.


Where did you look for tuning Postgres? I am interested.


I would love to know as well.


Well, you have to remember that outside the cloud, with its unknown black-box vCPUs, generic networking setups and extremely underpowered storage options, there exists a different world where you can have a box with 200+ physical cores, a few TB of RAM and extremely high-performance storage.


Well, regardless of caching and special ORM, doing 11K peak sql qps on 15% cpu load, is remarkable. They did a great job optimizing the database and access.

They probably also save a lot on ops team - running a single sql server with a failover is well-understood, error-resistant and covered in vendor's guides/best practices.


Except that Stack Exchange is a very static site so caching and read ratio is probably very high. I guess 99% of the people just type a question on Google, land on SE, and that's it.


Simple: don't hit the database with every request. SO is mostly hosting static text content that updates infrequently and can be very heavily cached. You don't need to read the database to generate every pageview, quite the opposite.
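
For illustration, a minimal read-through cache with a TTL, assuming a hypothetical `render_question` function that stands in for "query the database and build the HTML". This is a sketch of the general technique, not Stack Overflow's actual setup.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Tiny read-through cache: render a page once, reuse it until it expires."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render          # the expensive path (hits the database)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.misses = 0                # how many times we actually rendered

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh: no database work
        self.misses += 1
        html = self._render(url)
        self._entries[url] = (now + self._ttl, html)
        return html


def render_question(url: str) -> str:
    # Hypothetical stand-in for the database query plus templating.
    return f"<html>question page for {url}</html>"


cache = PageCache(render_question, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("/questions/123")
print(cache.misses)  # the renderer ran once for 1000 requests
```

The TTL is the knob: a longer TTL means fewer database reads but staler pages, which is an easy trade for content that updates infrequently.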


While the 2 servers for StackOverflow is true, if I remember correctly those were two very, very, very beefy machines. I'm talking 64 cores and 256GB RAM beefy bare metal, not your average random 4core/4gb VPS.


and they're using that very un-hip microsoft .net


Which goes to show that the answer to "what technology should I use to start my startup in [current year]" is always "the one you know best" since Joel was a former Microsoft employee before starting Stack Overflow and his other projects.


C# is great. It's a high level programming language and it's also super-fast handling ~7.000.000 requests per second on a single server according to the following benchmark:

https://www.ageofascent.com/2019/02/04/asp-net-core-saturati...


C#/.NET can be super fun and productive. Use what you know.


C# was great in 2008, then it got kind of enterprisey and closed - not it's ridiculous and getting steam into all avenues of software engineering. It's arguably the most powerful language in the world.


How is it closed? It's now better than ever with the new cross-platform .NET Core with all kinds of runtime and language advancements.


s/not/now


in the past two years .NET has become (normally/probably) the best performing GC runtime out there. With some nice features as well, and a better cross platform story.

hipness has increased


It's the outcome of good architecture, application code, and the fact that single row lookups are already just as fast as a cache, especially in SQL Server running on highend hardware.

By the way, 200M is less than 2500 requests/second which is not much at all. In terms of actual throughput, they are not that big.


> By the way, 200M is less than 2500 requests/second which is easily doable.

They are probably not spread uniformly throughout the day...


They're also not all full page loads either. Their stats show only 66M of those 200M are pageviews.
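
The back-of-envelope numbers from this subthread, as a quick sketch. The 200M and 66M daily figures are the ones quoted above; `PEAK_FACTOR` is a made-up multiplier to illustrate the "not spread uniformly" objection, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second if traffic were perfectly uniform."""
    return requests_per_day / SECONDS_PER_DAY


total_rps = average_rps(200_000_000)
pageview_rps = average_rps(66_000_000)

PEAK_FACTOR = 3.0  # hypothetical: peak traffic at 3x the daily average

print(f"all requests: {total_rps:.1f}/s")        # 2314.8/s
print(f"pageviews:    {pageview_rps:.1f}/s")     # 763.9/s
print(f"assumed peak: {total_rps * PEAK_FACTOR:.1f}/s")
```

So "less than 2500 requests/second" holds for the daily average; the peak depends entirely on the traffic shape you assume.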


It is difficult to make a conclusion. We can have a simple architecture but we spend 200% effort on tuning every part of the system.


Good example for which I’d highlight a particular thing to note: The kinds of systems that minimize failure probability at a large scale are often not the kinds of systems that minimize failure at a small scale.

At a large scale (e.g. hundreds or especially millions of hardware nodes) the most common faults will be due to individual nodes / services / whatever failing, so you want a complex fault tolerant system to deal with those faults.

At a small scale (e.g. stuff that can fit on one or several servers) the most common faults are from the system itself, not from individual nodes. Here, using a complex system will drive up the likelihood of failures, especially when you don’t have a large team to manage the system.
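
A worked version of the large-scale half of this argument, assuming nodes fail independently with the same probability (a simplification) and using a hypothetical per-node daily failure probability `P_NODE`. The chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def p_any_node_fails(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes


P_NODE = 0.001  # hypothetical per-node daily failure probability

for n in (1, 3, 1000, 100_000):
    print(f"{n:>7} nodes: {p_any_node_fails(P_NODE, n):.4f}")
```

With these assumed numbers, a three-node setup sees a node failure on roughly 0.3% of days, while a 1000-node fleet sees one on about 63% of days. That is why node-level fault tolerance pays for itself at the large end. The sketch does not model the small-scale half of the claim, where the added complexity itself becomes the main source of faults.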


Our longest outage was our cloud provider (one of the big guns) turning off our entire account for 10 hours due to suspicious activity.


If I'm not mistaken, Azure went down recently in one region for a whopping 24 hours.

That's pretty scary for anyone having at least three nines SLA


Grandparent speaks about a different scenario. In a similar vein: imagine your credit card blocks AWS payments, you don't notice, and then AWS payment reminders land in your spam folder. Boom, services out.


Despite it never happening to me in a few years of using AWS, I'll explain the difference between this example and mine.

The payment example shows an issue on your end. Something you have control of. In the Azure example, companies could do nothing to prevent that outage. Sure you can run services in multiple regions but then you pay an absurd amount of money to transfer data between regions (at least in case of AWS).


If I'm not mistaken, in Azure's case the issue was with some authentication service that was hosted in only one region but was used by all of them. The service had no redundancy.


Yea, sure, but you should also consider getting a different credit card AND email provider.

Why would a credit card suddenly block charges from a well known company you've been paying on a recurring basis for X months or Y years?

Why would my email block one of the biggest service providers on the planet?


I think the point is that bank payments, email spam filters, the risk of getting marked as "suspicious activity" as in GP, these become risk factors with a cloud infrastructure as much as, say, a power outage would be if you were self hosting.

When you self host something and you have a power outage, that counts as an interruption of service; when a major cloud provider suspends your account because of some chain of mistakes on your or their part, this doesn't impact their SLAs, because the service is technically fine.


because banks are worse at tech than tech companies

if it’s critical, make sure it’s isolated


At the small scale, the most likely failures are that your product isn’t something users want to use or pay for.

That is, tech performance rarely killed a startup, but programmers not understanding product kills them all the time.


I would really love to read more about this, with some examples and antipatterns for small systems


What this article, and the comments on it at the moment, are missing is that developers choose many of these technologies because they are sexy and will help the developer get their next job.

Which of the following two developers has the better chance of breaking six figures next year:

"Used Hadoop, MapReduce, and GCP for fraud detection on..."

...or...

"Used MySQL and some straightforward statistics for fraud detection on..."

This is a big part of why all these things exist in places where they shouldn't. As a dev that always goes for the simplest solution first and has yet to break a hundred k at 40 ... I'm spending my evenings now trying to figure out how to deploy the latest technologies where they're totally not needed.


Maybe I’m just swimming against the tide, but I work at a “big N” company and I am more impressed by “saved X dollars”, “made process Y faster” or “built feature Z” than I am with a specific set of technologies.

I’ve interviewed a lot of incredibly bright people that didn’t know any technologies more modern than C++.


+1. It feels like most people making comments such as the OP are people actually not working in BigN companies and just making assumptions.

I've done a lot of interviews at BigCo and definitely not looking much at the technology.


Part of the problem is that many software developers work in organisations where that kind of information - dollars saved or made, time saved, steps reduced - isn't shared. In a siloed org, the development team writes code to satisfy bug reports or feature requests, which come from a business analyst or product owner, who is the sole conduit to "the business". Why these things are needed, their relative priorities, or their ultimate impact, are not regarded as important concerns for the development team.

tldr: software developers often can't measure the impact of their code, so they fall back on describing the technologies they use, which drives a counter-productive desire to employ "sexy" technologies.


> Part of the problem is that many software developers work in organisations where that kind of information - dollars saved or made, time saved, steps reduced - isn't shared.

It's not even that they are not shared. It's that the importance of projects is often measured in dollars spent, complexity, amount of people working on them. Which means that often the inefficient teams and solutions are considered more important- their managers and tech leads have more people under them and they are more visible to the rest of the business.


That's surely because C++ is the language To Rule Them All!

I jest partly, because every time I write something in C# or PHP (even with the half-baked "strong" typing they are introducing), I constantly curse and think how easy it'd be in C++


I do a lot of interviewing at a big company. It's easy to detect when people are cramming technologies on their resume, and it's a moderate negative signal. They aren't thinking of the customer, or of the problem; they're thinking about the tech. That gets in the way of good design and is a red flag, in my book. I'm not at all alone in this, so I'm not sure it's as clear as you make it out to be.


Same here, but I can tell you from experience with group interviews that while we're definitely not alone, we're probably in the minority.


There is a term for this: resumé driven development. I experienced it too first hand and it is maddening. Every problem becomes a microservice kubernetes problem.


The competent software teams and companies don't hire by keyword though, they look at actual output and results delivered.


I have this silent debate with my engineers from time to time when one of them gets an itch they feel the need to scratch with an industrial strength back-scratcher. I usually go "lawyer mode" and ask them question after question to justify their choices. They either forget their itch or realize that rubbing themselves against a wall will fix it. I understand their desire to put en vogue frameworks on their resume, but I can't have someone's flight of fancy fucking up our tech stack.


I actually think it's incredibly insulting when people assume that engineers are choosing frameworks just for their resume. Most in fact are looking to these tools e.g. React, Spark, Kafka etc because so many other engineers are using them with success. And so they think they will equally have success.

But then they didn't have the context as to why those tools were chosen and so often they aren't suited. But I've never met anyone in the last 20+ years with thousands of engineers who was doing it for their resume. In fact the best thing for your resume is for the project to be a success anyway.


I've followed people managing tech teams and I think, like anyone else, engineers spend more time doing things they like and are interested in and slow-walk things they aren't. Trying to align those incentives with the bigger project or the company is what managers do.

Looking at new tech is way more interesting than doing the same thing you've been doing for 10 years. It's good to have some mechanism in place to vet that since I know I don't trust myself to always make a good choice (and in my opinion I have excellent taste).


A perverse incentive I've noticed at almost every company I've worked for is that the guy who spends every day slugging it out in a bash terminal to keep a ten year old service running won't ever get promoted. However, the guy who tries to rewrite it in Apache Flamboozle and AWS Fuzzworks will get promoted long before his house of cards tumbles down.

The end result is that not only is the routine stuff boring but it's also career-limiting.


I think any support department (tech or not) has what I think I've heard others call the "janitor problem." If you're doing your job and the trash is taken out, nobody notices.

I got advice when switching jobs that I had to be my own advocate and nobody was going to do it for you. I think the advice was meant specifically for that company, but I think it applies to most behind the scenes roles.

Similarly, everyone is excited about a new UI while a bad backend can only drive you away. Massive improvements and scaling can only really be measured by new dollars brought in (but that was due to sales and marketing, right?)


Is the guy who is slugging it out doing ops to keep something from falling over or is he adding features to the "legacy" (i.e. was already deployed before we were hired) implementation?

I'm probably being a little naive, but if feature work isn't ongoing, I'd expect that guy heroically slugging it out to make incremental progress in stabilizing the thing so that he didn't need to spend all day at a terminal babysitting a computer program.


> I actually think it's incredibly insulting when people assume that engineers are choosing frameworks just for their resume

I don't find it insulting at all. Reality is that many, if not most, recruiters are looking for candidates that have experience in whatever technologies are hot at the time. Expertise in a sought after technology can be worth tens of thousands of dollars extra in annual compensation. It's natural that some people will try to use that technology to further their career if possible. And I think to a degree, a lot of people base their job decisions on what they'll get to work on. 'Is it a stack that is in demand and growing?' 'Is it a stack that might be difficult to learn but pays really well?'

Now the sensible thing to do is to not shoehorn in some tech if it doesn't make sense for a given application. And beyond that, a big reason people are tempted to chase the shiny new object is because the IT recruiting process is broken.

I just see it as people trying to do the best they can for their career, usually there's no malice involved.


> I actually think it's incredibly insulting when people assume that engineers are choosing frameworks just for their resume.

Maybe, or maybe it's just realistic in a world where common advice is not to stay at any company for more than 2 years.


My experience is engineers get heavily judged for not having the latest bro tech on their resume.


Mostly younger people are choosing to jump around jobs.

The common advice and wisdom is actually not to do this.


> The common advice and wisdom is actually not to do this.

This is a long thread unto itself, but in an environment where many companies have zero loyalty to their employees and would rather hire more experienced people than train and promote their existing employees who already know the ins and outs of the company...

Job hopping is often the fastest way to more money and more crucially, more responsibility which means more personal growth.


The old fashioned advice was to do this, at a company that paid you for long service.

Those days are mostly dead.

While you might pause for fear of burning your bridges by leaving a project in the lurch, staying at a company for over 2-3 years is not the way forward in a career anymore - and that is very much the common advice and wisdom nowadays.

This isn't because young people want to, this is because you tend to get huge jumps in wage jumping between companies, while simultaneously HR seems to be allergic to giving good raises.


The common advice and wisdom from what I've been hearing is that it's perfectly acceptable (especially if you are underpaid the market rate), but not to do it frequently or too fast. There was some evidence of within company promotions being poorer than just switching jobs.


It's not just for their resumes, it's for the pleasure and challenge of learning something new, the feeling of confidence and with-it-ness when you get to say "yeah, we use that, I set it up" when people are talking about the latest and greatest thing, the fear of missing something transformative and getting left behind. It all boils down to nurturing a person's confidence that they are good at what they do and their ability to sound that way to other people, so "resume" for short.

And even if it is just for the resume, it's smart and understandable. Consider the difference between, "Eh, I evaluated some NoSQL options here and there, but it seemed safer to stick with Postgresql since we didn't have a compelling reason to experiment. Postgresql works fine at the scale of our product and we're very familiar with it," versus, "Yeah, we used Mongo for a project and it was a shit show, switched to DynamoDB which has been solid. The product is still 80% on Postgresql, but we use Cassandra for analytics, and of course we run InfluxDB for Grafana, which we're going to replace with Prometheus for better scalability." These could be two people facing exactly the same series of engineering choices, with the first guy making the better decision every single time, but the second guy sounds like the kind of curious and hardworking person you'd want to hire, while the first guy sounds maybe... stuck? Counting the days to retirement? Maybe boring SQL guy has saved his company a ton of engineering work that they were then able to invest in product work instead of engineering, but when he and NoSQL dilettante guy are both interviewing at a new company that wants people who are "curious" and "passionate" and "dedicated to learning and growing their skills" and "ready to meet new challenges head-on" he has to be worried about sounding like a dud.

> But I've never met anyone in the last 20+ years with thousands of engineers who was doing it for their resume.

It's not something you can distinguish from being eager to learn and overenthusiastic about new tech, which we all are to some extent.


Unfortunately, people can and do indulge in exactly this behaviour, and there does need to be some pushback against it. Individual self-interest must be counterbalanced by what is in the interest of the business when the two are not aligned.

I have seen this happen before, and when the interests and ambitions of an individual are counterproductive to the direct needs of the company, that can be deeply problematic.


I doubt it's usually for the resume, per se. But I do think peer pressure is a very strong force, as are aspirations. Most of us would rather be able to tell our friends, "We're using this huge and complicated but massively scalable Kafka cluster", than, "Yeah, our message queue is a database table. No, it isn't web scale. No, we don't anticipate ever growing large enough for that to become a problem."

Also, we like playing with new toys. Solving the same old problems the same old way we were doing it 20 years ago just isn't sexy.
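
For anyone who hasn't seen the "message queue is a database table" version, a minimal single-worker sketch using Python's built-in `sqlite3`. The `jobs` table, its columns and the helper names are made up for illustration. With several concurrent workers on Postgres you would typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the plain select-then-update used here.

```python
import sqlite3
from typing import Optional, Tuple


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id      INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status  TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Take the oldest pending job and mark it running (single worker only)."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row


def complete(conn: sqlite3.Connection, job_id: int) -> None:
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


queue = open_queue()
enqueue(queue, "send welcome email")
enqueue(queue, "resize avatar")

job = claim_next(queue)
print(job)  # (1, 'send welcome email')
complete(queue, job[0])
```

It isn't web scale, but the jobs live in the same transactional store and the same backups as the rest of the data, which is usually the appeal.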


Oh come on, it is not just developers but non technical higher ups also read some articles and now I have to come up with something that has "Micro-services, Docker, Blockchain, AI" in name to get budget for simple web api. Hell it does not even matter if we actually do it that way, just have to fill in buzzword bingo correctly and you win budget money.


> I actually think it's incredibly insulting when people assume that engineers are choosing frameworks just for their resume.

Uh? That's just the norm. I've met very, very few people who actually think about introducing a shiny new technology as a last resort. Usually it's the complete opposite, they start with the shiny new thing they've read about and build systems and even features around it. I've often heard the CV mentioned explicitly as a reason, but there is also a lack of imagination- they just can't think of ways to adapt the current system to their requirements.

Ah, as for the project to be a success... "I've started as the lead developer in a team of two, and after X years the project has grown and I was leading a team of 20"- this is a common success story. Nobody cares about the fact that the team of 20 was needed because the tech choices made were so poor you needed to explode the size of the team.


>Most in fact are looking to these tools e.g. React, Spark, Kafka etc because so many other engineers are using them with success.

This is not a real criterion for choosing anything. You can say the exact same thing about ASP.NET WebForms, MySQL, MSMQ or pretty much anything that was popular at any point in time. If you're only looking for "success" stories, you're not doing due diligence as an engineer. Who defines "success" anyway? The same person who chose React, Spark and Kafka in the first place and whose salary is directly proportional to how hyped-up those technologies currently are?


"ask them question after question to justify their choices"

Asking questions is not assuming. I think you're letting yourself be triggered by a flippant remark at the end of an otherwise solid comment.


Honestly this is why I have a homelab and side projects. If there's an interesting tech that I think may be applicable (or it's just interesting) I'll spin it up at home and give it a run.

Keeps me sharp and gives me an opportunity to explore tech that isn't in my day-to-day wheelhouse.


Bad news is pretty much anybody smart, and talented wants to work on new, exciting and bleeding edge stuff.

If you are still running on Tomcat, all the best getting the market standard genius talent to work for you.

You would also like to learn about what happens to workplaces where smart people don't work, or don't even like to work.


I used to agree with this, but then I started to look for a new job without knowing en-vogue frameworks. Now I make sure I include some fashionable tech; it's important for people's careers.


or maybe you discover that it's an industrial size itch and the chosen backscratcher is needed.


Almost never.

Often there's an aspiration to reach Google-level operation. Invariably some director/VP insists on building for that future. Then it turns out sales can't break into the market and adoption is low.

Now you have a neglected minimum viable product because you're scaling that skeleton instead of adding features your existing customers want. Or you're delivering those features at a slow pace because you're integrating them into two versions of the software: the working one and the castle-in-the-sky one. Then there are all sorts of other things that throw a monkey wrench into your barely moving gears: new regulations, competitors nipping at your heels, pivots, demands from management for "transparency and visibility" into why it's taking you forever to deliver their desired magnum opus.

Products that reach Google scale probably get there without even noticing it because they're correctly iterating for the marginal growth.


I'm having flashbacks to when, at a startup in 2000, our CIO insisted we needed an EMC Storage Solution. "We're going to be _huge_!" (Over my written objections, mind you.) I spent precious time and huge amounts of money to build out the colo cages to fit his folly and then the company promptly went belly up. Good times.


yes and no. there is a time for building and a time for learning. I actually encourage people to think how they would build something “google scale”. it’s not meant to be what pays the bills - but it can have dramatic effects on motivation, productivity and making the thing that pays the bills better. you see, once really smart people are free to dream they understand the cage they’re in better. timebox it and share what you’ve learned.


You should optimize for the more likely scenario.


there is a time for optimization and a time for learning.


It's frankly bizarre to have a knee-jerk reaction that new technology is worse. New technology is probably better, given that it was created by people who had more context about both problems and solutions than was available when older technology was created. When I hear people talking about doing the same thing for 10 years I wonder what problem they are working on that could be so complex that in 10 years they have not managed lights-out automation for it. My suspicion is that attitudes like yours are not only wrong but also massively destructive of value.


Google's search engine is more than 10 years old, yet Google is still hiring engineers. Why is that? Should they not have managed lights-out automation for it?

More importantly, the world changes and software has to change with it. An accounting system will need to be fitted to changing regulatory frameworks in the future. That knowledge is the basis of good application design. Schemaamnesia databases with highly-scalable vendor-locked PaaS peripherals that are constantly changing also require constant developer attention, but do not rise to that challenge.


I completely disagree. All new technology goes through a pruning phase for the first few years where people realize whether it's worthwhile or not. It's like natural selection, only the fittest survive.

A technology that's been around for decades and is still considered a viable candidate will be much more reliable, documented, mature, bugfree... than some new library that is untested and hasn't been worn smooth over time.


Could not agree more with this. Whether it is data warehousing, a maze of micro-services, or machine learning for your basic crud app you should look into whether you truly actually need it and whether it helps solve a real problem you have. A basic stack with Rails/Django and Postgres can get you quite far. This is often as much as most companies/startups ever need.

Also personally love the callout to Joe who while being a professor is consistently practical on when approaches do or don't make sense and at what scale they do.


Not sure what reality people are living in.

But a basic stack can only be used for basic systems and for basic problems. And there just aren't many of these going around anymore as they've all been solved. Or more commonly nobody is interested (including from the business) in delivering something basic. They want to innovate for their users.


I'd argue the complete opposite: every company and their mother is online these days, and nearly always they're a) doing something that's been done before and b) are nowhere near the amount of throughput that would require some kind of overcomplicated setup.

No, your company isn't special, and "innovating for your users" is probably the worst thing you could do, compared to delivering a good product using established practices that are tried and tested to deliver results.


I think I'm on a different page here. If a business wants to sell an innovative product, that doesn't mean they have to have an innovative database for purchase orders.


Three worries for me:

1. I know the risk of only being able to peer through the fence at the distant piece of software that is running your business, unable to gain any insight while your production application is limping badly and customers are running away like water. On top of that, thrashing it out with Mr Clippy is better than average support. If my business depends on it I want experts at hand who have the tools and the access they need to do the job they do.

2. The insane pace of serverless is entirely fad driven and lacks quality engineering which is required for critical pieces of software. The tools are universally poor quality, unstable, unreliable, poorly documented and built on gaining mindshare, making IPO and selling conference seats. Best practices never materialise as the rate of change does not allow an ecosystem to settle and work the bugs out. The friction is absolutely terrible but no one speaks of this in fear of their cloud project being labelled a failure. Every person I have spoken to for months is hiding little pockets of pain under a banner of success. Some people clearly will never deliver and burn mountains of cash hiding this.

3. Once you enter an ecosystem, you are at the mercy of the ecosystem entirely, be that a service provider or a tool. Portability is always valuable. It has cost, scalability, redundancy and risk benefits far beyond the short term gain of a single vendor decision. I'm currently laughing at an AWS quote for a SQL Server instance with half the grunt, no redundancy, no insight possible for only 2x the capex and opex combined of dedicated hardware including staff. But can't move to Azure because everything is tied into SQS.

I can never be behind anything but IaaS myself. This is contrarian especially in this arena but I will put my 25 years of experience on the line every time and say that it is the right thing to do. IaaS is choice and flexibility; it allows you to gain deep insights, protects you from serfdom and fad technology, and lets you pick and choose mature products rather than what the vendor sees fit.

This is just another rehash of buying a mainframe. It's just bigger and you pay hourly to write COBOL.


1. You have to pay for those but the thinking goes that if you are at the complexity and scale of those problems you reach out to a TAM to sort them out for you (ie. $$$).

2. Give it time; this will evolve and it is still in its infancy. Like all things it is buggy in the beginning but as adoption hockey-sticks so too will the stability, documentation, etc.

3. Portability is traded for breadth of services and depth is gained through vendor lock-in and the one-size-fits-all package. Of your concerns I would say this one will be around for a long time or at least until a "conversion kit" is built to shoehorn all of your stuff into another provider allowing you to jump ship or test the waters elsewhere.

I am a veteran like you (20 years) and while you choose IaaS because you like the control most want to punt their problem to someone else and to pay for that.

If our goal is to get from LAX to JFK we could fly our own plane, charting our own course, looking up weather, doing engine checks, refueling, dealing with air traffic control, and we'll get there and be in full control the whole way. However most will pay to be shuttled there in a commercial airplane. Then of course there are some that are willing to pay more to have their own personal pilot get them there in a chartered aircraft where they are afforded a more tailored experience.

There are some really great pilots you can hire out there to do it all for you (or if you are in fact one yourself) but if the goal is to simply go from point A to point B in as little time as possible we don't have time to find, train, and rely on a single pilot or ourself to get there. We will take the hit and pay others to get us there.

That is what all this serverless nonsense is about if you ask me. The simplicity and the chance to hand the busywork off to someone else are more enticing than the control we have in the process. Also, isn't it nice to be able to say, when you arrive late, that it was the airline's fault? :)


Yup, experiencing that at work quite a bit. People are currently arguing about big data solutions for a data set of 300Gb or so, growth in the area of 100ish Gb / year. I'm still arguing that this could be an in-memory dataset for postgres, and we wouldn't need that with a good schema.

Or other people are wondering if we could replace all of our relational databases with kafka. While complaining about inconsistent data sets. Well let's talk about advantages of a good relational schema first.

Maybe I'm turning into a grumpy old admin/DBA... but mariadb or postgres used well just solve so many problems.


>> Maybe I'm turning into a grumpy old admin/DBA

No you're not; you're saving the company tens or hundreds of thousands of dollars over the coming years.


> Or other people are wondering if we could replace all of our relational databases with kafka.

Unless you're using your DB as a data transport, dear God why. And if you are, dear God why. I work with Kafka and it's very good for a certain subset of problems. Replacing RDBMSes is not one of them.


I once asked if Postgres would have any negative performance impact from holding 1.8m records and once the laughing stopped and they realized I was serious I got a good lesson on how robust Postgres can be.


I once worked for a company that maxed out a 32 bit autoincrement primary key on a mysql table. Relational databases can hold a lot more than people give them credit for.
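For a sense of scale, the ceiling being described can be worked out directly. This is a rough sketch, assuming a standard 4-byte MySQL INT key (signed or unsigned); the insert rate is purely hypothetical and only there to turn the limit into a time span.

```python
# Upper bounds of integer keys, derived from their width in bits.
SIGNED_INT_MAX = 2**31 - 1     # 4-byte signed INT
UNSIGNED_INT_MAX = 2**32 - 1   # 4-byte INT UNSIGNED
SIGNED_BIGINT_MAX = 2**63 - 1  # 8-byte signed BIGINT

SECONDS_PER_DAY = 86_400


def days_until_exhausted(max_id: int, inserts_per_second: float) -> float:
    """Days before an auto-increment key counting up from 1 reaches max_id."""
    return max_id / inserts_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = 500  # hypothetical sustained inserts per second
    for name, limit in [
        ("INT", SIGNED_INT_MAX),
        ("INT UNSIGNED", UNSIGNED_INT_MAX),
        ("BIGINT", SIGNED_BIGINT_MAX),
    ]:
        print(f"{name}: {limit:,} ids, {days_until_exhausted(limit, rate):,.0f} days at {rate}/s")
```

Even at that made-up rate the 32-bit limits are reached within months, while the 64-bit one is out of reach for practical purposes.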


Yep. I worked at a company that had to migrate to 64-bit IDs. This was with MySQL 5.0 on servers with HDs, not SSD. Some of those "alter table" commands took all night!


A colleague describes this problem as "running out of numbers".


Oh man, I hope they saw that one coming ahead of time.


I managed complex datasets of a few TB with standard Unix tools just fine. If you are lucky, you have a sustained read and processing speed of 200 MB/s (or more, if you put in some work), which allows you to sift through a TB in a good hour.
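The "good hour" figure is simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a single sequential pass at the quoted rate:

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def scan_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time for one sequential pass over the data at a sustained rate."""
    return size_bytes / bytes_per_second


seconds = scan_seconds(1 * TB, 200 * MB)
minutes = seconds / 60

if __name__ == "__main__":
    print(f"{seconds:.0f} s, about {minutes:.0f} minutes per TB")
```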

Sometimes I am wondering, how people will rationalize the layer bloat in 5-10 years, when hardware went through a few more upgrades.


This all boils down to "make cost benefit decisions".

The problem is that maybe cost benefit decisions are really hard. To understand the cost of map reduce, for example, I probably have to use it at least once. High cost just to understand something. And I have to understand the alternatives, so I'm trying those out too.

You know what's faster? Signaling. When large companies say "this approach works for us" it's really cheap and easy to say "Well that's probably good enough then".

Do you lose out on efficiencies by not fully understanding the problem? Duh. But probably not as much as understanding an entire domain for every technical decision you make.

The steps outlined, particularly 1,2,3,4, are really expensive for every single technical solution. Reading multiple technical papers for a db is probably more costly than just acquiring customers, hitting scaling limits for your uninformed choice, and moving to a new db based on a somewhat more in depth approach or hiring a consultant.

Let's assume that if you guessed a technical solution, the cost of 'guessing' is 0, and the cost of picking the wrong solution is 'low'. Should I waste time doing anything other than guessing? If the cost of 'do what google does' is 0, or near 0, and better average chance of working, why not pick it?

Unless the stakes are high, putting that kind of investment into every technical decision just seems needless. And they arguably have to be very high to offset the cost, and I would state that determining those costs upfront itself may be difficult and error prone.

For every mis-guess listed in that article, how many new customers did they acquire due to making the right 'guess' based on signaling from other companies? The companies clearly hadn't gone under from those bad decisions, so would it really have been the right call to have invested in such deep thinking about the problem early on?

Not sure I believe what I wrote entirely, just thinking out loud.


> Let's assume that if you guessed a technical solution, the cost of 'guessing' is 0, and the cost of picking the wrong solution is 'low'. Should I waste time doing anything other than guessing? If the cost of 'do what google does' is 0, or near 0, and better average chance of working, why not pick it?

That makes more sense if you s/cost/risk/. The answer to why you shouldn't just do whatever Google is doing is because trying to copycat the leader is the one thing that will guarantee your _failure_ in the market place.

Same with AWS. AWS Lambda was built to solve existing problems. Everybody will use it to solve the same problems it's designed for, but only a tiny number of people will actually succeed in the marketplace because it just becomes a numbers game. If you want to win you have to play a different game--look for problems where AWS Lambda is a poor fit.

If you're doing the same thing as everybody else you're by definition just reinventing the wheel. Why bother?


I mean cost in terms of 'time to make a decision', without considering 'cost of making the wrong decision', which is closer to risk.

> The answer to why you shouldn't just do whatever Google is doing is because trying to copycat the leader is the one thing that will guarantee your _failure_ in the market place.

The article seems to demonstrate the opposite - decisions following the leader, and paying for it later, but surviving in the interim.

> If you're doing the same thing as everybody else you're by definition just reinventing the wheel. Why bother?

I don't really agree. When it comes to something like a DB or architecture, that's probably not where you need to be inventing new solutions - products are rarely sold based on their technical implementation, and more about the results they drive.

If you were building a new Google search, and you used all of the same tech, that's probably a bad idea. But if I'm building a product for a completely isolated domain, who cares if it's solved using the same tools that Google solves search?

Regarding lambda, there's likely tons of room for competitors even within a domain to implement off of the same technology - again, it's just a tool, and not likely to be the differentiator.


This. Some of the biggest mistakes I've made have been too much "thinking". The article makes good points, but seems to argue all these newfangled systems have a very high cost if adopted in the wrong situation. I would argue the cost is probably not that high especially in a managed cloud and many decisions can be tweaked later.

In other words, if a Postgres based solution takes 1 month to build, and a CloudSpanner/Aurora/pick-cloud-managed-sql-DB solution takes 1 month to build, just pick one. Most of the time, if you make a mistake, you'll find out when you have more customers/funding/engineers/knowledge of what actually matters.


Yes, I think it's really easy to look at a project, after it has scaled, and go "wow they made such bad technical decisions! Why didn't they learn more about this?" and ignore that the system is running, they have customers, a business, and they're solving the problem when it has become important to do so.

I've seen lots of time wasted overthinking solutions only to find out that the requirements were changed a month before release, or customers would use the system completely differently than expected.


At some level this is one of the most "no shit, Sherlock" pieces I've ever read. On the other hand you see it happening absolutely everywhere so it very much needed to be said.

I don't know why it is but people seem much more interested in the tech than in the value they can create. Which is odd. Why? Because it's the second part where you get to exercise the most creativity and do the most interesting work. How do you think LinkedIn arrived at Kafka, for example?

What's possibly more baffling is it's not just the devs who buy into the technology hype: often it's the so-called business types, and those in management and leadership positions, advocating and egging them on.


> How do you think LinkedIn arrived at Kafka, for example?

Well, Franz Kafka wrote about incomprehensible, oppressive bureaucracy as a way of life, and LinkedIn did a decent job of porting that to Web 2.0.


>I don't know why it is but people seem much more interested in the tech than in the value they can create.

Well, for some engineers who don't have much say in product development, the amount of value they can create is bounded by the business specification/deliverables they have to produce (and maybe equity or lack thereof plays a role here). However, if the architecture is considered their responsibility, then there's a lot of fun in using _shiny new toys_ even if it's overkill and will incur unnecessary technical debt. It's fun to overengineer things.

Not saying it's a good thing or even professional behaviour, but I understand it.


I want a word for things that are simultaneously "no shit, Sherlock" and yet also profound things that we forget very easily. There's a lot of them. For engineering, I think "always consider costs/benefits" is probably #1. It sounds so stupid simple, and yet I feel like it's the exception when it gets deployed successfully.


"common sense"


"Whoa!!!!!!"


That seems to tilt a bit towards the profound.


If everyone just focused on business value and ignored the technology aspects then nothing would've been invented.

There would be no Internet, Web, VR, ML, AI etc and we would all be writing assembler or using punch cards.


> If everyone just focused on business value and ignored the technology aspects then nothing would've been invented.

> There would be no Internet, Web, VR, ML, AI etc and we would all be writing assembler or using punch cards.

The Internet, Web, ML, and AI at least were all originally invented using research funding, not business funding. The Internet (specifically the TCP/IP suite) was funded by DARPA, the research arm of the US Department of Defense. So you're right that some people need to focus on something other than business value, but usually that is someone paid to do research, not paid to implement a specific solution that needs to be widely deployed within 1-3 years.

But if you're picking a product to use for a particular task, you aren't doing that kind of research. You are instead doing engineering. That is, you are determining how to solve a particular problem using the knowledge and resources (including reusable code) at hand. So you need to think to determine what collection of resources will actually do the job well, at reasonable cost and time, etc. Different problem.


All of those I listed are completely different now to when they were first invented. To the point of being unrecognisable. And all of that is because of real world innovation from real world engineers solving real world problems.

This anti-innovation mindset is just as damaging overall as everyone using the latest tools and ignoring the more proven ones.


How is this article anti-innovation? It's saying, "Don't adopt a solution that doesn't actually fit your needs." It's not innovative to use MapReduce to do payroll reports at the end of a quarter, when a simple SQL query can produce the exact same thing. It's just silly. It's not innovative to select something that overfits your needs unless it actually prepares you for your (ideally known, but at least high confidence expectation) needs.

I shouldn't drop a bunch of money on an underutilized system unless it offers enough benefit.


> How is this article anti-innovation?

Quite. And, as you've said, combining a bunch of tech you don't need into a poorly fitting solution isn't particularly innovative. The innovation tends to come from using the right tools for the job and creating a great solution using them.


> It's not innovative to use MapReduce to do payroll reports at the end of a quarter, when a simple SQL query can produce the exact same thing.

I really, really hope this is just a random illustrative example.


Yes. That’s not a thing I’ve seen in my office or any office I’ve been in. It was just an example.


There is absolutely nothing "anti-innovation" about saying don't use technology designed for problems you don't have, for reasons you can't articulate.

Innovation is driven by addressing problems people actually have with new ideas. Chasing taillights is a very poor driver for it.


Did you actually read my comment? Right in the middle of it I made the following remark: "How do you think LinkedIn arrived at Kafka, for example?" That's just one example cited in TFA of something invented as a result of focussing on value.


Last time I checked, necessity was the mother of invention and not the other way around.


I use managed services a lot. And fortunately my data size is big enough for it to actually make sense to be using these big data tools.

The issue is the quirks. Every managed service has at least a dozen quirks that you're not going to know about when you visit the flashy sales page on the cloud provider's website. And for the vast majority of users, they're not going to have access to the source code to understand how these quirks work on the backend. So you end up in a situation where yes, it does take way less time to get 95% of the functionality done, but getting that last 5% can still take a considerable amount of work.

As an example, I am using Azure Event Hubs lately. It is supposed to provide something like a simple consumer api like Kafka does, but with consumer group load balancing across partitions. Awesome, there is a system that automatically handles leasing across partitions in a way that abstracts this all away from the client! Except, well actually the load balancing is accomplished via "stealing leases" (meaning, they are not true leases) so if you use the api you are meant to use, you will get double reads - potentially very many if you want to commit reads after doing more-than-light processing which can take time. So you need to use the much more poorly documented, barebones low level api and probably still end up writing a bunch of logic to dedupe.

Except, you use this kind of tool to begin with because you want to set up a distributed consumer group to read from a stream... so now you have a non-trivial engineering problem figuring out a way to get a distributed system to manage deduping in a light-weight way across hundreds of processes and machines...
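To make the dedupe problem concrete, here is a minimal single-process sketch of an idempotent consumer. It is not the Event Hubs API: `DedupingConsumer`, the `(event_id, payload)` event shape and the in-memory set are all assumptions made for illustration. Across hundreds of processes the set would have to live in some shared store, which is exactly the non-trivial part.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple


class DedupingConsumer:
    """Drops events whose id was already handled (at-least-once -> effectively-once)."""

    def __init__(self, handler: Callable[[object], None]) -> None:
        self._handler = handler
        self._seen: Set[Hashable] = set()  # in-memory only; hypothetical stand-in for a shared store

    def consume(self, events: Iterable[Tuple[Hashable, object]]) -> int:
        """Handle each not-yet-seen event once; return how many were handled."""
        handled = 0
        for event_id, payload in events:
            if event_id in self._seen:
                continue  # duplicate delivery, e.g. after a lease was stolen
            self._handler(payload)
            self._seen.add(event_id)  # record only after the handler succeeded
            handled += 1
        return handled


# Example: ids 1 and 2 are delivered twice, as they might be after a re-read.
results = []
consumer = DedupingConsumer(results.append)
count = consumer.consume([(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")])
```

Recording the id only after the handler returns means a crash mid-handler leads to a retry rather than a lost event, at the cost of the handler needing to tolerate a repeat.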


Are there any good "introduction to AWS" books or book series that actually mention those problems and how to work around them? All that I've seen just parrot the sales pitches, and it would be excellent to know about such problems beforehand.


Sounds like RabbitMQ would be a good fit.

Several good managed offerings exist if you do not want to run it yourself.


This is great. It reminds me of Adam Drake's 2014 article “Command-line Tools can be 235x Faster than your Hadoop Cluster” [1], which was the wake-up call I needed.

[1] https://adamdrake.com/command-line-tools-can-be-235x-faster-...


Reminds me of my first job processing radar data on a Masscomp mini-computer. The full processing was a bunch of simple c programs all unix piped together into a processing chain. Simple and elegant.

We appeared to have a persistent bug though, processing left running overnight always seemed to crash. So after a couple of nights I stayed in the lab to try and see what the cause was. At 7:30pm the lab door opened and in walked the cleaner who reached down and unplugged the computer so that he could plug in the vacuum cleaner. Problem solved.


Did that really happen to you or are you recounting a common tale?

https://www.reddit.com/r/talesfromtechsupport/comments/5yrs1...


It really happened, back in the early 90's. Though I imagine it was a common occurrence.


That article totally misrepresents the normal use case for a Hadoop cluster though. Hadoop clusters are meant for when you have multiple petabytes of data and network bandwidth becomes the bottleneck for doing batch processing jobs.

Let me know when your command line tools run multiple large scale processing jobs on petabyte datasets.

His article was him constructing a straw man about why people use Hadoop and attacking that.


I've worked with these "straw men" you say he constructed, they are absolutely out there. There was a time in 2014 when Hadoop/MapReduce was the hammer and every problem out there looked like a nail.

How many people have used Hadoop for a project? How many Petabyte+ datasets do you think are out there?

Unless you truly believe the answers to those 2 questions are the same, I think you can see why that article had to be written.


I feel it is in a similar spirit as the OP here. In both articles the important take away for me is focusing on an approach and tools that match the problems. It sounds like you are making a similar point.

I also appreciate the quality of the writing and the step by step journey.

When I first read the article, I found it timely as I felt bombarded by IaaS & PaaS offerings with “web scale” solutions for problems I couldn’t relate to, basic examples, and few case studies.


The only reason to use Hadoop is if you need a Shuffle phase, i.e. the intermediate data between Map and Reduce is too big to fit on one machine. If you have big input but small append-only output, use a work-queue (SQS or MySQL/Postgres will let you set this up in minutes), dump to files, and merge them with gzcat | uniq | gzip > output.txt.gz. If you have big input but a small volume of update-based processing (< ~1000 or so outputs/sec), have your workers update an RDBMS. If you have small input but big output, build up output file chunks and progressively write to S3 and delete them. If both your input & output fit on disk, use UNIX pipelines and command-line tools. If they fit in RAM, just load them with Pandas or equivalent and manipulate them in your favorite interactive programming language.
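Those rules of thumb can be written down as a small decision function. This is only a sketch of the comment's reasoning: the function name, the default RAM and disk sizes, and the reading of "big" as "does not fit on one machine's disk" are assumptions, not anything the comment specifies.

```python
GB = 10**9
TB = 10**12


def pick_tool(
    input_bytes: int,
    intermediate_bytes: int,
    output_bytes: int,
    *,
    append_only_output: bool = True,
    updates_per_sec: float = 0.0,
    ram_bytes: int = 64 * GB,   # hypothetical single-machine RAM
    disk_bytes: int = 4 * TB,   # hypothetical single-machine disk
) -> str:
    """Map data sizes to the approach suggested by the rules of thumb above."""
    if intermediate_bytes > disk_bytes:
        return "hadoop"  # the shuffle data does not fit on one machine
    if input_bytes <= ram_bytes and output_bytes <= ram_bytes:
        return "pandas"
    if input_bytes <= disk_bytes and output_bytes <= disk_bytes:
        return "unix-pipeline"
    if input_bytes > disk_bytes and output_bytes <= disk_bytes:
        if append_only_output:
            return "work-queue + merged files"
        if updates_per_sec < 1000:
            return "workers updating an RDBMS"
    if input_bytes <= disk_bytes and output_bytes > disk_bytes:
        return "chunked writes to S3"
    return "no rule above covers this case"


if __name__ == "__main__":
    print(pick_tool(500 * GB, 10 * GB, 100 * GB))
```

The checks run from the most demanding condition to the least so that a small job is never routed to the heavier tool just because an earlier rule happened to match.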


This is really a very good article, and I've seen this same behavior manifest on numerous client projects.

BUT. I have also grown a bit skeptical of the "Just use a RDBMS!" mantra. The same advice applies. Think about your use case. Even if your data isn't exoscale, a relational DB might not be the best choice.

My current project has data on the scale of hundreds of billions of rows. Nothing crazy, easily handled by a good Postgres box (by the numbers.) And for certain access patterns, it would be.

Unfortunately, it turns out that a lot of the analytics queries our business users need end up being joins and aggregations involving full or nearly-full table scans. PG is not particularly well optimized for this; it gets CPU and disk-access bound. Queries take tens of minutes, causing timeout errors in our BA tools.

Loading the data into a Presto (Facebook product) cluster instead dropped query times for the same aggregation into the tens of seconds range. Sure, our data doesn't even really count as "big" and Presto is probably overkill if you're only looking at size, but it is optimized and built for highly partitioned parallel aggregations.


I would object that your experience is actually a good example of how to choose a larger scale solution. Start with some RDBMS, and switch once it stops working.

I think that's the better way of saying "just use a RDBMS" - "If you're not sure, use an RDBMS to delay a decision until you have more experience and knowledge".


Exactly the same case here - we're using Athena (packaged Presto from AWS) and have a Postgres instance w/ identical data in a different data model. Presto is mindblowingly good at the low latency queries and scales well to anything nontrivial. It seems opening files for reading is the biggest source of delay in Presto! Postgres is not handling the scale well, but it was a nice experiment.


Your Athena data sets are immutable S3 blobs of data however, while Postgres reads data that could be modified while it's reading them, concurrently.

Postgres has to deal with a transaction modifying 100,000 rows in a table, needing that transaction to then see its changes but then aborting, leaving the data exactly as it was before. Athena/Presto don't have to worry about any transactions, table files bloated by changes that need vacuuming, etc. etc.
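The semantics described here (a transaction sees its own uncommitted changes, and an abort leaves the data exactly as it was) can be demonstrated in a few lines. The sketch below uses Python's built-in `sqlite3` as a stand-in because it runs anywhere; it says nothing about how Postgres implements this internally, and the `accounts` table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts (id, balance) VALUES (?, ?)",
    ((i, 100) for i in range(100_000)),
)
conn.commit()


def total_balance(connection: sqlite3.Connection) -> int:
    """Sum of all balances as seen by this connection right now."""
    return connection.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


before = total_balance(conn)

# The UPDATE implicitly opens a transaction touching all 100,000 rows.
conn.execute("UPDATE accounts SET balance = balance + 1")
inside = total_balance(conn)  # the transaction sees its own changes

conn.rollback()               # abort: every row must return to its old value
after = total_balance(conn)
```

An engine that only ever reads immutable files never has to keep the bookkeeping that makes `after == before` possible.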

So I don't think it's fair to claim PG does not scale, when you are comparing a transactional and analytical database.


That's rather the point though. The people saying "YAGNI, just use Postgres" aren't, a priori, any more correct than people saying "You definitely need Hadoop if you're gonna scale!"

You need to analyze your own use case critically, understand the tradeoffs of different tools and choose pragmatically.


There are multiple RDBMS data warehouse systems based on PostgreSQL that scale pretty well. I work for a company that develops one -- Greenplum. There are others like Citus.


Programmers often make the fallacy of assuming that only the asymptotic case should be considered (just look at Big-O notation). But when it comes to specialized tools and frameworks, unlike sorting algorithms, many use-cases never break past the scale where the asymptote is the primary thing driving the curve.

With that in mind, it's my experience that every tool or framework has a different window of scale in which it's ideal, which has both lower and upper bounds, and one simply needs to find one's own project along that axis when choosing a technology. Hadoop may be the best solution as N approaches infinity - and we as programmers love thinking in terms of infinity - but it may not start being the best solution until 10x your actual range for N.
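A toy cost model illustrates the lower bound of such a window. Everything here is an assumption chosen for illustration: a fixed startup overhead for the distributed system, an identical per-item cost on both sides, and perfect parallel speedup. Real systems also have an upper bound, which this sketch does not model.

```python
def single_machine_seconds(n_items: float, per_item: float = 1e-6) -> float:
    """Run time with no fixed overhead: cost grows linearly from zero."""
    return n_items * per_item


def cluster_seconds(
    n_items: float,
    per_item: float = 1e-6,
    workers: int = 100,
    startup: float = 60.0,
) -> float:
    """Run time with a fixed startup cost, then work split evenly across workers."""
    return startup + n_items * per_item / workers


def break_even_items(
    per_item: float = 1e-6, workers: int = 100, startup: float = 60.0
) -> float:
    """N at which both models take the same time (requires workers > 1).

    Solves n * per_item == startup + n * per_item / workers for n.
    """
    return startup / (per_item * (1 - 1 / workers))


if __name__ == "__main__":
    n = break_even_items()
    print(f"the cluster only wins above roughly {n:,.0f} items")
```

Below the break-even point the better asymptotic behaviour is irrelevant, because the constant term dominates the curve.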


The thing is, at least from my experience, the engineers working in startups know the solution they're implementing is overkill. It's just that everything ends up getting built assuming the startup is going to take off and become the next huge unicorn.

A lot of the time design meetings in a startup revolve around "Is this scalable?" or "What happens when we have 10,000 users or 1 million users?".

I think the problem isn't choosing overkill technologies, the problem is trying to solve a problem that doesn't exist yet and probably never will.


There's also sometimes a lack of discipline (or at least, I lacked discipline when I worked in startups) when it comes to focusing on the impactful stuff, not the fun stuff.


That's just bad engineering practice unless you can articulate why there isn't a transition path to more scalable tech when you actually need it. Usually in practice there is, and certainly for most companies. You can probably point to counterexamples, but to reuse the OP's term "you aren't them, either"


Unfortunately I've definitely seen cases where scalability wasn't taken into account and now it's impossible to bolt-on after the fact.


My rule of thumb is to assume something significant will have to change in the architecture when you get 10x bigger. At any given moment, you shouldn’t have made that change yet, but you should know what it will be and be ready to make that transition. But don’t gum up the current system with what you’ll have to do at 100x bigger.


I have participated in lots of startups. I have 3 successful companies myself. My background is CS. And what I can tell is that every single service like AWS or GCS is absolutely insanely priced. Not affordable unless you have a new Netflix. Even though it looks like you are saved from a lot of maintenance, you still have to deal with limitations you sometimes don't have a solution for. I have seen companies switching to AWS going from 400 dollars a month to about 5-10k per month. Silly amounts for big organisations, but for startups it does make a lot of difference.


We're still in a boom period, where many start-ups/VCs are operating with deep pocket books and instructions to scale fast at any cost. During the next bust, lean will come back in fashion.


It's an interesting question, because Fed rates are going to be held permanently low due to the US Government's debt / interest costs (and the next ~$18 trillion that will be stapled on over the coming ~15 years). Those Japan-style permanently low rates will always press upwards on risk capital markets like venture capital. It spurs capital to seek higher returns versus the weak yields everywhere else (whether treasuries or other). Start-ups will become more valuable persistently due to this effect, at least for the next 10-15 years prior to the end fallout once the US Government is formally drowning in debt interest cost (which is when the really bad consequences kick in, like full blown aggressive currency debasement to chop the debt down and stealth default; everyone will be taking real haircuts then, anyone with assets in dollars anyway).

I think it would take a very severe recession, in the 2009-2010 style, to hammer venture capital down considerably. Instead I expect Japan stagnation to continue to envelop the US economy, and for the exact same debt-laden reason. Ever slower growth, low traditional inflation (higher real inflation from currency debasement QE), debt taking up an ever larger share of capital available for investment, stagnant productivity, stagnant wages. With the enormous amount of wealth in the US, trillions of dollars will always be looking at the venture capital market in a given decade. ~2012-2035 will probably be the best years for venture capital in US tech history, and broadly the best context for start-ups, that we'll ever see. It's early in the loose money sloshing period from perma low rates, but not yet late such that you're eating an always-on QE real-value debasement hammer constantly (which sucks down your real value creation, as you run uphill against the Fed while it tries to debase government debt to keep the US Govt solvent).


In principle I agree with the author, but some of the patterns these big companies have introduced are valuable for companies with significantly lower amounts of traffic. For instance, in the article they reference Kafka. In one of our products, we use Kinesis, which has similar semantics, for data that is no more than 25k records per day. However, we find it useful because it enables us to have multiple consumers that operate independently, plus using Kinesis Firehose to automatically archive those records off to S3. We just use a single shard, which is more than enough throughput for us. We don't have any plans to scale to hundreds of shards, but find what it provides to be very useful in separating what each process does, and makes the code much simpler. And if we ever did need to scale, it wouldn't be much work to do so.


There was no section on "You are not Netflix", so I guess microservices are OK.

Just thought of a naming convention, how many types of microservices do you run:

  10? deciservices, 100? centiservices, 1000? milliservices.


From here in enterprise-land, the advantage of microservices isn't architecture, it's resource contention in development. Team A is much less likely to interfere with the work of Team B. Performance etc claims are nonsense, but keeping teams from breaking each other is real.

Assuming they aren't randomly changing unversioned APIs, of course...


Other direction, deca-, hecto-, and kiloservices.

https://en.wikipedia.org/wiki/Metric_prefix


I think OP's point was that if your company provides a service, and runs 10 containers to do it, then each container provides 1/10th of the total functionality of the service and is therefore a "deciservice". The joke being that you then can't say you use "microservices" until you've got 10^6 of them.


I meant division of functionality rather than horizontal scaling. For instance, all of Netflix consumer facing features making up the 'Netflix app' is a full 1.0. If we subdivide the functionality into pieces the number of functional pieces determines the fraction of the whole application it provides.

But.. if you have 10^6 total instances I don't think anyone would object to you calling them micro.


UNPHAT - these are great, but I'd add one more - ask yourself if you're an engineer before making a technology choice.

The number of projects I've had where the team has had one arm tied behind their back because they "have" to use Hadoop or NiFi or Lambdas or something else a manager has determined is the thing everyone's using is just lunacy. They're all tools which have their place, but you really have to know them to know when to use them. And as importantly when not to.

I mostly do consulting gigs where projects are 6 weeks to 6 months long and it's been years since I've seen that not hamper a team.


While I largely agree with the thesis of this article, I actually really like the entire "scalable microservice" way of designing things.

It's overkill for anything I need to do, but my home server is six 8-core ODroid devices, orchestrated with Docker Swarm, with most of my services on there being load-balanced, and glued together with Kafka if they need to talk to each other.

Do I need my internal video streaming server to be able to scale horizontally? Of course I don't, there's only ever three people max watching things from there at any point in time. However, I find that, overall, my brain thinks more-or-less in terms of these microservices, and it doesn't really hurt much to do it that way.

If I find it pleasant enough to do, why the hell not make it hyper-scalable and able to reach Google levels?

EDIT: Just a note, I know that Docker Swarm probably isn't quite capable of Google-size. Still, moving to Kubernetes or something wouldn't be terribly hard (the reason I didn't use it was because a few years ago I had some issues with Kubernetes on ARM and Swarm worked outta the box).


There are Swarm clusters in production with tens of thousands of hosts. There's nothing in Swarm that makes that impossible.


Fair enough; I didn’t know that Swarm was used for any large projects because apparently I don’t know how to read or use search engines; I stand corrected!

I guess that backs up my original point even more then; my stupid video streaming server might actually be able to scale to Google size some day :)


I can assure the author that those of us solving problems for small companies (in terms of employee count, not revenue or assets under management) do not jump on the typical bandwagons. We tend to be very critical and pragmatic.

I'm not even sure it is engineers at bigger companies who choose to jump on these bandwagons. I suspect it is often wannabe-technical managers who read that MegacorpX is using tech Y and sell the idea to upper management, along with some unfounded promises of benefits, who cause some of the trends we see.


Do you know how many engineers work at Google and Amazon? A lot. Chances are that many readers of this article actually are Google and Amazon.

My point is: these generalizations in either direction are not helpful. Many people are operating at Google scale, and many people aren't. Be aware of that when you read about potential technologies.


If you're Google, you know.

So if you're the audience of this article, you're not Google.


I'd dare to say that many groups/products at Google aren't even operating at Google scale. I'd also dare to say that some Google products operating at Google scale can be a bad thing (i.e. unprofitable). The latter statement, or both statements, might be supported by the number of products they kill off.


I have worked in many industries (e.g. telecommunications, insurance, finance) where our challenges are just as big and sophisticated as Google's.

If you're doing large scale ML, having to respond to requests in very low millisecond numbers or having to crunch through significant amounts of data (often more than Google produces) then you need to use similar approaches to them.

Even doing ML at all can be quite demanding.


I find the amount of traffic and comments that articles with this message generate quite amusing, given the article can be boiled down to: don't use tools like Cassandra, Kafka, etc. until you've thought through whether they are the right tool for your use case. That last part is often forgotten -- these tools may not always be the right tool for the job, but SOMETIMES they are, regardless of your scale.

Well, duh.

The corollary is: if you're using one of these tools, and you HAVE thought through your use case and reasons, don't get defensive about it. If challenged by those with lesser knowledge after reading an article like this, calmly and rationally explain your reasoning. And like any technology choice, be prepared for someone else to offer another option you weren't aware of, with better reasons.

Again, duh.


I’m not sure why people frame the use of certain technologies as orthogonal to non Google-scale use cases.

For example, one might like using MapReduce or Kafka Streams for their programming paradigms, not for the redundancy or scale they provide.

Another example, Kubernetes, which makes it trivial to run containers and attach storage to them.


I guess the idea is that for all the people using MapReduce, most would be better served by a different paradigm than MapReduce. Because in most cases, there's a better paradigm than MapReduce.


> Because in most cases, there's a better paradigm than MapReduce.

You'd think so, right? But what happens to me all the time is I try to use "small-scale" tools to handle some small problem, and then I find myself writing glue code that's already implemented in, say, Kafka Streams. So I might as well just run one Kafka broker locally and write a Kafka Streams application. Or the same with the Django ORM, which I reach for all the time just because I don't like writing database access code when I can write up some models and be done with it. Every time I reach for "small" tools I end up writing tons of code that's not actually solving my immediate problem or question.


We like to think that we’re hyper-rational

Don't we just. Which makes us vulnerable to every shiny gewgaw and every personal and cultural bias out there.


Great article, and one that I think a lot of people here could learn from. Everyone gets all excited about fancy architectures and algorithms because they're fun and cool, but pragmatically, you can buy a LOT of server for the same price as a couple of engineers messing around for a month.


I find it funny that all these cloud providers give huge free credits to startups (AWS gives away 15k, for example) as if they could use the whole amount. Reality? 10 visits per day.


Hype driven development is the real curse of the last decade. Not just big data, but also smaller techs like mongodb, or the countless javascript frameworks... choosing the right stack for the job could be a consulting job in itself, even for simple web development.


"The thing is there’s like 5 companies in the world that run jobs that big. For everybody else… you’re doing all this I/O for fault tolerance that you didn’t really need."

That's not necessarily entirely true. In the early 2000s, I worked for Cadence Design Systems, who at the time needed to build and test the plethora of tools they'd built or acquired on a large variety of systems. I worked on "GridMatrix", sort of like make but using a large, heterogeneous cluster of machines, built on top of the Condor and LFS batch scheduling systems---the same sort of thing underlying MapReduce.

On the other hand, I get where the author's going: cargo-cult design is rampant in enterprise software development. But it's not just cargo-culting Amazon or Google; it also involves fashion and resume padding.

And then, there's "eNumerate multiple candidate solutions. Don’t just start prodding at your favorite!"

Bwahahahahah. <- Unamused laughter.

Our first and only response, as an engineering discipline (if you want to call it that) is to pick the first idea that comes to mind and beat it to death with a stick.


How many times can we run this same exact type of article? If it didn't change anything the first time it ran, what are people expecting to happen now?

Maybe Hadoop isn't necessary for everyone who needs it. They'll learn. And if they don't, more job security for you.

Otoh there are some crazy advantages of running certain "large scale" software. Not everyone needs the "scale" Kubernetes offers, but managing tasks declaratively as containers that get scheduled on boxes in a hermetic fashion? Or maybe continuous delivery - not everyone needs CD, but it certainly offers many advantages to those who do it.

Most people could probably live off of shell scripts running under cron on a box with CGI scripts written in Perl without any issues. That does not imply there are no advantages to new technology, because not everything is about scale and fault tolerance.


Most people simply don't realize just how fast PostgreSQL is on a modern machine (and, by extension, just how bloody fast a modern machine is).

My old laptop could do something like 150K transactions per second in PostgreSQL. You can scale really far before your database becomes the issue.
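To put that figure in perspective, here is a rough headroom estimate. It is only a sketch: the traffic numbers (one million daily users, 50 requests each, 2 transactions per request) are made up for illustration, and the load is averaged over the day, so real peaks would be higher.

```python
SECONDS_PER_DAY = 86_400


def average_tps(daily_users: int, requests_per_user: int, tx_per_request: int) -> float:
    """Average transactions per second if the load were spread evenly over a day."""
    return daily_users * requests_per_user * tx_per_request / SECONDS_PER_DAY


def headroom(capacity_tps: float, load_tps: float) -> float:
    """How many times the current load fits into the available capacity."""
    return capacity_tps / load_tps


if __name__ == "__main__":
    # Hypothetical workload; 150_000 is the laptop figure quoted above.
    load = average_tps(daily_users=1_000_000, requests_per_user=50, tx_per_request=2)
    print(f"average load: {load:.0f} tps")
    print(f"headroom:     {headroom(150_000, load):.1f}x")
```

Even allowing a generous factor for peak traffic, a workload like this stays well under the quoted capacity.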


I don't realise how bloody fast a modern machine is because the software I use day to day takes seconds to do basic operations like move the cursor, or scroll a window, or load a list of ten items.


Hudson giveth and Nashua taketh away.

Or, for the youngsters: Portland giveth and Seattle taketh away.


People are told to do less with more, and machines are purchased to handle future growth.

Some things never change.


How many of these clever VCs say to themselves "None of our companies will be the next FAANG/MAGA"...

Oh wait. They don't say that. They fund, and hope and pray!


You should be telling "you are not Google" to people who try to build the next social network or any app that is basically an enormous chicken-and-egg problem.

Most programmers try to do that. And they keep trying. They're wasting their lives.


A lot of these decisions can be made with a bit of "back of the envelope calculations"; playing with orders of magnitudes based on latency times:

http://highscalability.com/blog/2011/1/26/google-pro-tip-use...

https://people.eecs.berkeley.edu/~rcs/research/interactive_l...
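As a sketch of what such an estimate can look like: the two constants below are rough, commonly quoted order-of-magnitude figures (about 10 ms per disk seek, about 30 MB/s sequential disk read), assumed here for illustration and not measured. The scenario (30 thumbnails of 256 KiB each) is likewise just an example.

```python
# Rough order-of-magnitude figures; assumptions, not measurements.
DISK_SEEK_S = 10e-3             # ~10 ms per disk seek
DISK_READ_BYTES_PER_S = 30e6    # ~30 MB/s sequential read from disk


def serial_read_time(n_items: int, item_bytes: int) -> float:
    """Seconds to read n items one after another from a single disk."""
    return n_items * (DISK_SEEK_S + item_bytes / DISK_READ_BYTES_PER_S)


def parallel_read_time(item_bytes: int) -> float:
    """Seconds when each item sits on its own disk and all reads run at once."""
    return DISK_SEEK_S + item_bytes / DISK_READ_BYTES_PER_S


if __name__ == "__main__":
    thumbnails, size = 30, 256 * 1024
    print(f"serial:   {serial_read_time(thumbnails, size) * 1000:.0f} ms")
    print(f"parallel: {parallel_read_time(size) * 1000:.0f} ms")
```

The exact numbers matter less than the gap between the two designs, which is more than an order of magnitude under these assumptions.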


My general test for most new technologies is that the desire to use it has to have arisen out of the pain of not using it. If I'm looking at a framework for any reason other than having shouted "there has to be a better way!" then I'm not in a position to evaluate its merits, and even if it is the right choice I'm not in a position to understand why, and will thus be prone to using it in a stupid way.

It might seem expensive to do things the "wrong" way first every time, but I really do think it's necessary.


I don't think a lot of people actually think in terms of "they are Google", but more so in terms of "we hope to one day be Google" so they build (more complex than needed) systems with that ambition of scale and everything. Almost as if they are over planning for the future. Not to say that is either right or wrong, but just something worth noting, in my opinion. I don't think anyone actually over engineers something thinking they ARE Google when they are clearly not.


> just, think for yourself. Is it the best solution to your problem? What is your problem exactly, and what are other ways you could solve it?

This is the crux of the issue, psychologically. When making an important decision in the face of uncertainty, having someone else with a lot of clout (or a group of people whose collective wisdom you trust) simply tell you what to do feels amazing. That feeling is hard to ignore in favor of constructing your own cost-benefit analysis.


I used to consult for a public service department. Their internal infrastructure was incredibly complex, with hot fail-over/redundant/scale-out hardware (running homegrown custom LoB applications for fewer than 200 employees). This paradoxically led to huge downtime, as no one was smart enough to configure this mess, and deploying a new service, which should have been trivial, always ran into worst-case scenarios.


I'm reminded of JWZ and his very practical advice on Backups:

https://www.jwz.org/doc/backups.html

In particular Addendum B:

RAID is a waste of your goddamned time and money. Is your personal computer a high-availability server with hot-swappable drives? No? Then you don't need RAID. RAID is not a backup solution. Even if you use RAID, you still need backups.


With modern filesystems, however, RAID-like implementations do have their purpose in ensuring data integrity, as is the case with btrfs and ZFS.

Side note: the linked URL redirects to a questionable image when navigated from HN. Interesting choice by the site owner.


That doesn't mean his point is not spot on.

I have more than one friend that has overengineered an industrial solution for basic computing needs. (and when things break, complexity goes exponential)

as to URL redirect: I don't see the image/redirect you refer to, even clicking on the link here on hacker news. Can you elaborate? I take the links I recommend seriously.


The URL will redirect to https://imgur.com/32R3qLv if navigated from HN, most likely caused by the Referer request header being present. Doesn't seem to happen all the time, but try it out in Private Browsing mode and it may redirect you there.


I think this article is an instance of a more general observation: most programmers should spend more time understanding the needs of the business and users, rather than learning new technologies. The examples given were large companies solving their business problems rather than developing technology for its own sake.

Or to put it another way, form follows function.


In my case, a multi-tenant application, I split the application into multiple services, one service per module.

It's not about scalability or about anything technical. It's about domain understanding and data isolation.

There's NO problem here to understand. It's just one way to manage our data. Or, to put it another way, design for failure.


>It's just one way to manage our data

Sure, but there are tradeoffs to any approach. More services == more devops == more complexity in fault tolerance, communication, data storage/consistency, etc.

I'm not saying it's a bad model _for you_, obviously I have no idea, but the point of the article is that some tend to jump toward complexity when they shouldn't.


It's not "more services == more devops". It's like this: only manage the necessary data.

If I need to change one part of the data, I don't migrate all of it. I only need to change the part that changed.


I can't imagine a microservice architecture that does not require more devops work than a comparable and sane monolith. Your second and third sentences haven't convinced me.


Developers and operations often love microservices.

If you make one small change you only have to test and deploy that one microservice, instead of having to redeploy the monolith, which (not sure if you've ever done that before) is often terrifying.


Not quite. With a monolith, migrating data carries overhead, because one small mistake will take your application down.

So, in the real world, it takes more overhead to manage a monolith than a microservice architecture.


With microservices, if you make a mistake and take part of your application down, aren't you worse off because only part of your application is running and is now in an undefined state?


You really need to think that comment through first.

If you're a bank then it's fine if your notification microservice goes down because at least you can still accept payments, handle deposits, transfer money etc.

In almost no situation is there a case where having no availability is preferred over partial availability.


It's fine to have only one part of the application down, instead of the whole thing going down.

Each part of the application is run by different roles/functionalities.


At one company I was at we moved to microservices to reduce ops work.

Scale independence was a huge value add for us.


[flagged]


That's pretty hyperbolic. Are you implying that monolithic designs are just inherently inferior? I don't think that even approaches a reasonable take on the subject.


It's a pretty good analogy actually. If you think back to the early Java days, when everyone was obsessed with design patterns and over-engineered code, people said exactly the same thing. Why do you have these interfaces if the implementation will never change?

So I'm not saying monolithic designs are bad but they simply don't scale well once you get to a certain size. Just like design patterns should only be used once your codebase gets to a certain size.


I wouldn't say that's quite the same thing, and people _absolutely do_ over-engineer/encapsulate their code today. In my own, admittedly limited experience, it seems to me that the biggest wins with microservices are realized when you have large teams working independently. There is a complexity overhead that, in my opinion, isn't justified for most apps being worked on by small teams.

That said, I don't hold a hard opinion here; I don't have enough experience with them yet. I'm open to having my mind changed.


You actually need more testing if you have more services. Integration tests especially are way harder to pull off in a microservices architecture.

Also, communication issues arise from bad API design, which of course can happen with microservices as well as in monoliths.


Please don't introduce new acronyms like UNPHAT when we have something like YAGNI that works just fine (and probably there might be something, a rule of thumb, before Fowler, or Ron Jeffries, or whoever, coined YAGNI).

Secondly, extreme YAGNI and extreme future-proofing are two ends of one of the considerations in how to approach writing software. Neither extreme is good. In reality, the best location on that spectrum depends on the situation at hand.

There is no silver bullet. And I'm sorry, but I didn't learn anything new from this blog post, nothing that is not already part of the vast body of software engineering knowledge.

edit: I know UNPHAT is not an "exact synonym" of YAGNI. But what you're describing is a problem solving approach. That's probably even less new. Please look at Polya 'How to Solve it' if interested.


Note that this was from 2017.


This, for Docker. Everybody is using it these days, and I understand there are valid use cases. But it's way overused and very often does not justify the added complexity. Keep it simple.


This is true but it's also true that small/medium businesses are riding the same hype train and they'll gladly pay you to take them along for the ride so in the end, meh-money.


This is one reason Snowflake is killing it. The real market for "big data" is data warehousing, and users want a data warehouse that's in the cloud and portable among vendors.


It's only portable between AWS and Azure.


We did an entire show on this with Ozan Onay back in 2017 ~> https://changelog.com/podcast/260

You can read through the transcript too ~> https://changelog.com/podcast/260#transcript


Great article - I'm kind of at a crossroads myself. I'm very productive with Rails and can scale it probably well beyond anything I'd ever need and can move away the big chunks to microservices later.

But there's also Elixir which can do away with most of these concerns.

So what do I do? Go with Rails to ship stuff? Or struggle a little bit more with Elixir to get that much more legroom?


I wrote an open response to this. Looking back I was a bit off-topic and I could write it better today but what the hell..... https://www.honestrepair.net/index.php/2017/06/08/re-you-are...


Assuming low-level techs finally get realistic about the complexity they sign up for... If an executive says "we will only use X technology for all X tasks", they probably picked the Google thing, and you have to deal with it, or they'll hire someone else who will.


Architects and engineers alike love increasing the complexity of systems just to use buzzword products.


This is a good article. I'd say it draws heavily on the same themes as "Hard Facts, Dangerous Half-Truths, and Total Nonsense" by Pfeffer and Sutton. Drinking Wild Turkey every morning isn't going to turn you into the Southwest Airlines CEO lol


This article has been written many times by many authors (even using the same cargo cult airplane image) and yet is no less true than the first time it was written.

That said, clamoring toward things like React has yielded a net positive result. So cults can be a double-edged sword.


A place I worked was building a video dating site, we spent thousands upon thousands ensuring that the site would 'scale' to millions of concurrent users.

I think we maxxed at somewhere around 60 concurrent before I left. They gave up on the idea a few months later.


Great advice, and to double click on it: google wasn't google either when they started and part of their value to investors was their bare-bones hardware infrastructure. That part of google is (philosophically) worth emulating.


Is this really a thing though? I would be surprised to find someone who is able to deploy a Spark cluster for a production application and at the same time doesn't know that it is not a good solution for their problem.


Really interesting article and good points. Although, I always had the opposite problem: people wanting to use technology that is too basic without caring about the implications at scale. "As long as it's free".


Funny how a blogger reminds us to get in touch with reality once in a while, and then every commenter on here bends over backwards pretending that they actually are Google.


Oh, I had to spend an hour "fixing" my incorrectly autofilled fields for Chrome to please stop playing games with my users' data.

I'm acutely aware I'm not Google.


Usually I implement things for future workloads. Sometimes I like to imagine 1 billion people could use each component, then it is truly ready.


What is meant by "“solve” the problem mostly within the problem domain, not the solution domain."?


To use a restaurant analogy, if you are a chef, and you are opening a new restaurant, you should focus more on the problem of devising the best recipes and experience for your customers, rather than on whether your kitchen has a high-end cooking stove.


You should do a show with this script. Educational and fun. Well done.


Resume Driven Development


You Are Not Google (2017)


You Are (Still) Not Google


How many people on this thread actually are google?


> I understand that Kafka is still useful for lower throughput workloads, but 10 orders of magnitude lower.

Should we stop using MySQL too? That handles more data than I have.


I think a good analogy might be using MySQL for storing a single key/value. You could do it, but why? Instead you could just use an environment variable or simple file.
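A minimal sketch of the "environment variable or simple file" option, for comparison with standing up a database for one value. The function name `read_setting` and the lookup order (environment first, then a one-line file, then a default) are illustrative choices, not a standard.

```python
import os
from pathlib import Path


def read_setting(name: str, file_path: str, default: str = "") -> str:
    """Return a setting from the environment, else from a one-line file, else a default."""
    value = os.environ.get(name)
    if value is not None:
        return value
    path = Path(file_path)
    if path.is_file():
        # Strip the trailing newline that editors usually add.
        return path.read_text(encoding="utf-8").strip()
    return default


if __name__ == "__main__":
    # "APP_GREETING" and "greeting.txt" are hypothetical names for this example.
    print(read_setting("APP_GREETING", "greeting.txt", default="hello"))
```

There is no server to run, back up, or upgrade, which is the point of the analogy.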


i work for google and now i'm confused


Been saying this for years. People build these ridiculous contraptions: Spark/Hadoop, key value stores, NoSQL, microservices, distributed filesystems all over the place, fault tolerance and so on. Then you ask "how much data and traffic do you have?" Nearly always you get an answer where a single machine and properly designed software would be more than enough.

I figure, in a way we should be thankful for this cluelessness. If people had any clue whatsoever, IT employment would shrink by a factor of at least 3, and salaries would drop pretty massively as well.


One common mistake I see often is in step 1, understand the problem. Assuming the problem is not understanding the problem.

I'm reminded of an issue my old boss had at his house. His power bill was rising at an alarming rate. He assumed it was the central HVAC and spent a few hundred getting it tuned up: no effect. He spent a bunch of weekends replacing weather stripping, adding sealant, etc.: minor effect.

I came in with an app that lets you estimate current power usage by looking at the power meter[1], and killed circuit breakers one by one while measuring current usage.

The cause turned out to be a malfunctioning swamp pump he forgot the house even had.

He assumed he knew the problem, by focusing on the common cause of such problems, he assumed wrong, and because of this, every solution was addressing a different problem than the one he was trying to solve.

-

Another example is the story[2] of the old as fuck IBM mainframe that was powering a website that would take on the order of tens of seconds to return data. Everybody assumed all of the delay was to be blamed on the "legacy hardware". Every solution pitched involved migrating off of it, a daunting task no manager wanted to green-light. Finally they brought a consultant in to figure out what was going on. It turned out the mainframe returned data in 6ms or less; the cause of the lag was the Java app that would read the output from the mainframe, transform it, and send it to the browser.

They could have solved the problem years ago if they had just actually tried to understand the problem.

-

[1] (power meters have a thing that happens every n watts, along with a decal that tells you what n is, so you can determine current power usage by measuring time between these n watts, there is a handy dandy app https://play.google.com/store/apps/details?id=com.sam.instan...)
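The arithmetic behind that trick, as a small sketch. It assumes the "n" on the decal is an amount of energy per pulse in watt-hours (the footnote says watts, so treat that reading as an assumption); the function name is made up for illustration.

```python
def watts_from_pulse_interval(wh_per_pulse: float, seconds_between_pulses: float) -> float:
    """Average power draw implied by the time between two consecutive meter pulses."""
    if seconds_between_pulses <= 0:
        raise ValueError("interval between pulses must be positive")
    # One pulse represents wh_per_pulse watt-hours; an hour has 3600 seconds.
    return wh_per_pulse * 3600.0 / seconds_between_pulses


if __name__ == "__main__":
    # Example values only: 1 Wh per pulse, pulses 2 seconds apart.
    print(watts_from_pulse_interval(1.0, 2.0), "W")
```

Flipping one breaker off and timing the pulses again gives the draw of that circuit as the difference between the two readings.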

[2] 7074 says Hello World


Not entirely related, but one problem-solving issue I see often is people not testing, and refusing to test, their assumptions. Often while trying to help someone with a problem I ask if they tried something and they say

"No, that can't be the issue because it shouldn't be affecting it"

Yes, under ideal conditions this wouldn't be an issue, but if the system is not working as expected, then how can you be so sure that this bit is working as expected?


We are a 50-person team here and the cool kids have created over 1,500 APIs that are served by over 6,000 container instances. The “system” only serves < 1 mm users. We’re definitely not cargo-culting:)


One gripe I have with this thinking- Until you are, and you have failure.

Things that don't scale or aren't AA quality will show up.

Go viral and fail to service. 15 minutes of fame squandered on an avoidable tech mishap.

One time a top comment on a thread was about a misspelling.


Whatttttt


This entire article kind of assumes that you're stuck with your current workload and will never scale, which is awfully pessimistic. Should we all just take the attitude that, oh well, we'll never be massive so why bother trying to reach any sort of throughput?


If you make that assumption you will be right most of the time.

Deal with problems when you have them.



