Amazon is pushing very hard to convince the next generation of devs who grew up writing front-end JS that databases, servers, etc. are some kind of technical wizardry best outsourced, when a few weeks of reading and playing around would be enough to get them up to speed.
Hate to be super paranoid, but isn't it rather convenient that the top comment on this section expresses exactly this sentiment? If anything, it proves that this push is working, or at least that a huge share of devs online really are front-end JS'ers who hold this opinion already.
From an engineering/development standpoint, this is a good thing, because it means that devs are cheaper. Most devs can't even manage to be a good client of things like databases; they barely comprehend the theory underlying SQL (e.g. sets).
It's just like early electricity: people ran their own generators. That got outsourced, so the standard "sparky" wouldn't have the faintest idea of the requirements of the generation side, only the demand side.
1. Programmers are supposed to be professionals. They certainly want to be paid like professionals.
2. The effects of the back end have a nasty habit of poking through in a way that the differences between windmill generators and pumped hydraulic storage don't.
Very much this. For most use cases, the out-of-the-box configuration is fine until you hit ridiculous scale, and it's not really all that complicated to keep a service running if you take time to read the docs.
Jason Fried's Getting Real chapter titled "Scale Later" (page 44) comes to mind.
"For example, we ran Basecamp on a single server for the first year. Because we went with such a simple setup, it only took a week to implement. We didn’t start with a cluster of 15 boxes or spend months worrying about scaling.
Did we experience any problems? A few. But we also realized
that most of the problems we feared, like a brief slowdown,
really weren’t that big of a deal to customers. As long as you keep people in the loop, and are honest about the situation, they’ll understand."
Right... so now your developers don't need to understand how to configure and tune open-source directory services, RDBMSes, and in-memory caches... they just need to understand how to configure and tune a cloud provider's implementation of a directory service, RDBMS, and in-memory cache?
If you think using a cloud "service" out-of-the-box will "just work", your scale is probably small enough that a single server with the equivalent package installed with the default settings is going to "just work" too.
You did just read our use case, didn’t you? Yes, we could overprovision a single server with 5x the resources for the once-a-week indexing.
We could also have 4 other servers running all of the time even when we weren’t demoing anything in our UAT environment.
We could also not have any redundancy, and not separate out the reads and writes.
No one said the developers didn’t need to understand how to do it. I said we didn’t have to worry about maintaining infrastructure and overprovisioning.
We also have bulk processors that run messages at a trickle based on incoming instances during the day, but at night, and especially at the end of the week, we need 8 times the resources to meet our SLAs. Should we also overprovision that and run 8 servers all of the time?
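For the curious, that kind of predictable end-of-week burst maps onto a scheduled scaling action; a rough boto3 sketch, with made-up group names, sizes, and cron schedules:

    # Rough sketch of scheduled scaling for a predictable weekly burst.
    # Group name, sizes, and cron expressions are made up for illustration.
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale the bulk-processing group up every Friday at 22:00 UTC...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="bulk-processors",
        ScheduledActionName="friday-night-burst",
        Recurrence="0 22 * * 5",  # Unix cron syntax, evaluated in UTC
        MinSize=1,
        MaxSize=8,
        DesiredCapacity=8,
    )

    # ...and back down Saturday morning once the SLA window has passed.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="bulk-processors",
        ScheduledActionName="saturday-scale-in",
        Recurrence="0 6 * * 6",
        MinSize=1,
        MaxSize=8,
        DesiredCapacity=1,
    )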
> Yes we could overprovision a single server with 5x the resources for the once a week indexing.
You assume this is bad, but why? Your alternative seems to be locking yourself into expensive and highly proprietary vendor solutions that have their own learning curve and a wide variety of tradeoffs and complications.
The point being made is that you are still worrying about maintaining infrastructure and overprovisioning, because you're now spending time specializing your system and perhaps entire company to a specific vendor's serverless solutions.
To be clear, I don't have anything against cloud infrastructure really, but I do think some folks really don't seem to understand how powerful simple, battle-tested tools like PostgreSQL are. "Overprovisioning" may be way less of an issue than you imply (especially if you seriously only need 8 machines), and replication for PostgreSQL is a long solved problem.
> You assume this is bad, but why? Your alternative seems to be locking yourself into expensive and highly proprietary vendor solutions that have their own learning curve and a wide variety of tradeoffs and complications.
So having 4-8 times as many servers, where unlike AWS we would also have five times as much storage (1 master and 8 slaves), is better than the theoretical “lock-in”? You realize that with the read replicas, you’re only paying once for storage, since they all use the same (redundant) storage?
Where is the “lock-in”? It is MySQL. You use the same tools to transfer data from Aurora/MySQL that you would use to transfer data from any other MySQL installation.
But we should host our entire enterprise on digital ocean or linode just in case one day we want to move our entire infrastructure to another provider?
Out of all of the business risks that most companies face, lock-in to AWS is the least of them.
> The point being made is that you are still worrying about maintaining infrastructure and overprovisioning, because you're now spending time specializing your system and perhaps entire company to a specific vendor's serverless solutions.
How are we “specializing our system”? We have one connection string for reads/writes and one for just reads. I’ve been doing the same thing since the mid-2000s with MySQL on prem. AWS simply load balances the readers and adds more as needed.
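For what it's worth, the reader/writer split really is just two endpoints from the application's point of view; a minimal sketch with placeholder hostnames and credentials, using PyMySQL:

    # Minimal sketch of the reader/writer split described above.
    # Endpoint names and credentials are placeholders, not real infrastructure.
    import pymysql

    WRITER_HOST = "myapp.cluster-xxxx.us-east-1.rds.amazonaws.com"     # cluster (writer) endpoint
    READER_HOST = "myapp.cluster-ro-xxxx.us-east-1.rds.amazonaws.com"  # load-balanced reader endpoint

    def connect(host):
        return pymysql.connect(host=host, user="app", password="...", db="app")

    # Writes go to the writer endpoint...
    writer = connect(WRITER_HOST)
    with writer.cursor() as cur:
        cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
    writer.commit()
    writer.close()

    # ...reads go to the reader endpoint, which is load balanced across however
    # many replicas happen to be running at the moment.
    reader = connect(READER_HOST)
    with reader.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM events")
        print(cur.fetchone())
    reader.close()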
And you don’t see any issue in spending 5 to 9 times as much on both storage and CPU? You do realize that while we are just talking about production databases, just like any other company we have multiple environments, some of which are only used sporadically. Those environments are mostly shut down - including the actual database server - until we need them, and with Aurora Serverless the database comes up and scales with reads and writes when we do need it.
You have no idea how easy it is to set up the autoscaling read replicas, do you?
I’m making up numbers just to make the math easier.
If for production we need 5x our baseline capacity to handle peak load, are you saying that we could get our servers from a basic server provider for 4 * 0.20 (the 1/5 of the time we need to scale our read replicas up) + 1?
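Spelling that back-of-the-envelope math out with the same made-up numbers:

    # Back-of-the-envelope comparison using the made-up numbers above:
    # 1 always-on writer, plus 4 extra read replicas needed ~20% of the time.
    always_on = 1
    extra_replicas = 4
    fraction_of_time_needed = 0.20

    autoscaled = always_on + extra_replicas * fraction_of_time_needed   # 1.8 server-equivalents
    overprovisioned = always_on + extra_replicas                        # 5 servers running 24/7

    print(autoscaled, overprovisioned)            # 1.8 vs 5
    print(f"{autoscaled / overprovisioned:.0%}")  # ~36% of the always-on capacity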
Are you saying that we could get non-production servers at 25% of the cost if they had to run all of the time, compared to Aurora Serverless, where we aren’t charged at all for CPU/memory until a request is made and the servers are brought up? Yes, there is latency for the first request - but these are our non-production/non-staging environments.
Can we get point in time recovery?
And this is just databases.
We also have an autoscaling group of VMs based on messages in a queue. We have one relatively small instance that handles the trickle of messages that come in during the day in production, and it can scale up to 10 at night when we do bulk processing. This is just production. We have no instances running when the queue is empty in non-production environments. Should we also have enough servers to keep 30-40 VMs running at only 20% utilization?
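To make the queue-driven part concrete, one way to express it is a target-tracking policy on queue depth; a rough boto3 sketch with invented names and target values (AWS's own guidance actually recommends a backlog-per-instance custom metric rather than raw depth):

    # Rough sketch: target-tracking scaling on SQS queue depth.
    # Group name, queue name, and target value are invented for illustration.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="bulk-processors",
        PolicyName="scale-on-queue-depth",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateNumberOfMessagesVisible",
                "Namespace": "AWS/SQS",
                "Dimensions": [{"Name": "QueueName", "Value": "bulk-jobs"}],
                "Statistic": "Average",
            },
            # Scale out when the backlog grows past this, back in when it drains.
            "TargetValue": 100.0,
        },
    )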
Should we also set up our own servers for object storage across multiple data centers?
What about our data center overseas close to our offshore developers?
If you have more servers on AWS you’re doing it wrong.
We don’t even manage build servers. When we push our code to git, CodeBuild spins up either prebuilt or custom Docker containers (on servers that we don’t manage) to build and run unit tests on our code, based on a yaml file with a list of shell commands.
It deploys code as Lambdas to servers we don’t manage. AWS gives you a ridiculously high amount of Lambda usage in the always-free tier. No, our Lambdas don’t “lock us in”. I deploy standard NodeJS/Express, C#/WebAPI, and Python/Django code that can be deployed to either Lambda or a VM just by changing a single step in our deployment pipeline.
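The "same code, two targets" point is easiest to see with a WSGI app: the app object stays the same, and only the entry point the pipeline wires up changes. A minimal sketch (Flask for brevity, and assuming the third-party apig-wsgi adapter for the Lambda side):

    # Minimal sketch: one WSGI app, two deployment targets.
    # Flask is used for brevity; the same idea applies to Django, Express, or WebAPI.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        return jsonify(status="ok")

    # On a VM, the pipeline runs something like `gunicorn myapp:app`.
    # On Lambda behind API Gateway, it instead points the function handler at a
    # wrapped version of the same app (apig-wsgi is one such adapter).
    try:
        from apig_wsgi import make_lambda_handler
        lambda_handler = make_lambda_handler(app)
    except ImportError:
        lambda_handler = None  # not on Lambda; gunicorn serves `app` directly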
Basic replication is maybe close to what you could call solved, but there are still complications: geo-replicated multi-master writes, for example, are still quite involved and need a dedicated person to manage. Hiring being what it is, it might just be easier to let Amazon hire that person for you and pay Amazon directly.
I see cloud services as a proxy to hire talented devops/dba people, and efficiently multiplex their time across several companies, rather than each company hiring mediocre devops/dba engineers. That said, I agree that for quite a few smaller companies, in house infrastructure will do the job almost as well as managed services, at much cheaper numbers. Either way, this is not an engineering decision, it's a managerial one, and the tradeoffs are around developer time, hiring and cost.
> I see cloud services as a proxy to hire talented devops/dba people, and efficiently multiplex their time across several companies
It's only a proxy in the sense that it hides them (the ops/dbas) behind a wall, and you can't actually talk directly to them about what you want to do, or what's wrong.
If you don't want to hire staff directly, consulting companies like Percona will give you direct, specific advice and support.
If something goes wrong, we can submit a ticket and chat with or call a support person immediately at AWS. We have a real business that actually charges our (multi-million-dollar) business customers enough to pay for business-level support.
But in your experience, what has “gone wrong” with AWS that you could have fixed yourself if you were hosting on prem?
"No one said the developers didn’t need to understand how to do it. I said we didn’t have to worry about maintaining infrastructure and overprovisioning."
If you do not do something regularly, you tend to lose the ability to do it at all. Personally, and especially organizationally.
There is a difference between maintaining MySQL servers and the underlying operating system, and writing efficient queries, optimizing indexes, knowing how to design a normalized table and when to denormalize, looking at the logs to see which queries are performing slowly, etc. Using AWS doesn’t absolve you from knowing how to use AWS.
There is no value add in the “undifferentiated heavy lifting”. Knowing how to do the grunt work of server administration is not a company’s competitive advantage - unless it is. Of course Dropbox or Backblaze have to optimize their low-profit-margin storage business.
Why not run eight servers all the time? If you are running at a scale where that is a cost you notice at all, you are not only in a very early stage, you're actually not even a company.
There are many, MANY software companies whose infrastructure needs are on the order of single digits of normal-strength servers and who are profitable to the tune of millions of dollars a year. These aren’t companies staffed with penny-pinching optimization savants; some software, even at scale, just doesn’t need that kind of gear.
Build server - with CodeBuild you either run prebuilt Docker containers or a custom-built Docker container that automatically gets launched when you push your code to GitHub/CodeCommit. No server involved.
Fileshare - a lot of companies just use Dropbox or OneDrive. No server involved
FTP - managed AWS SFTP Service. No server involved.
DHCP - Managed VPN Service by AWS. No server involved.
DNS - Route 53, and with Amazon Certificate Manager it will manage and auto-renew SSL certificates attached to your load balancer and CDN. No servers involved.
Active Directory - managed by AWS, no server involved.
Firewall and router - no server to manage. You create security groups and attach them to your EC2 instances, databases, etc., and you set your routing table up and attach it to your VMs (rough sketch after this list).
Networking equipment and routers - again, that’s a CloudFormation template, or go old school and just do the configuration in the web console.
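A rough sketch of the security-group item above, using boto3 with placeholder IDs, ports, and CIDR ranges:

    # Rough sketch: a "firewall" as a security group instead of an appliance.
    # VPC ID, ports, and CIDR ranges are placeholders.
    import boto3

    ec2 = boto3.client("ec2")

    sg = ec2.create_security_group(
        GroupName="web-tier",
        Description="Allow HTTPS in from anywhere",
        VpcId="vpc-0123456789abcdef0",
    )

    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )
    # The group is then attached to EC2 instances, load balancers, RDS instances,
    # etc., rather than maintaining firewall boxes by hand.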
Yes I realize SFTP is not FTP. But I also realize that no one in their right mind is going to deliver data over something as insecure as FTP in 2019.
We weren’t allowed to use regular old FTP in the early 2000s when I was working for a bill processor. We definitely couldn’t use one now and be compliant with anything.
I was trying to give you the benefit of the doubt.
Amateur mistake that proves you have no experience running any of this.
If it doesn’t give you a clue, the 74 in my name is the year I was born. I’ve been around for a while. My first internet-enabled app was over the Gopher protocol.
How else do you think I got shareware from the info-Mac archives over a 7 bit line using the Kermit protocol if not via ftp? Is that proof enough for you or do I need to start droning on about how to optimize 65C02 assembly language programs by trying to store as much data in the first page of memory because reading from the first page took two clock cycles on 8 bit Apple //e machines instead of 3?
We don’t “share” large files. We share a bunch of Office docs and PDF’s as do most companies.
Yes, you do have to run DNS, Active Directory + VPN. You said you couldn’t do it without running “servers”.
No, we don’t have servers called SFTP-01, ADFS-01, etc., either on prem or in the cloud.
Even most companies that I’ve worked for that don’t use a cloud provider have their servers at a colo.
We would still be using shares hosted somewhere not on prem. How is that different from using one of AWS’s Storage Gateway products?
9 servers (8 readers and 1 writer) running all of the time, with asynchronous replication (as opposed to synchronous replication) and duplicated data - whereas with Aurora, yes, the storage is shared between all of the replicas.
Not to mention the four lower environments, some of which have databases that are automatically spun up from 0 and scaled as needed (Aurora Serverless).
Should we also maintain those same read replica servers in our other environments when we want to do performance testing?
Should we maintain servers overseas for our outsourced workers?
Here we are just talking about Aurora/MySQL databases. I haven’t even gotten into our VMs, load balancers, object store (S3), queueing server (or lack thereof, since we use SQS/SNS), our OLAP database (Redshift - no, we are not “locked in”; it uses standard Postgres drivers), etc.
AWS is not about saving money on like-for-like resources compared to bare metal - although in the case of databases with spiky load, you do. It’s about provisioning resources as needed, when needed, and not having to pay as many infrastructure folks. Heck, before my manager came in and hired me and one other person, the company had no one onsite with any formal AWS expertise. They completely relied on a managed service provider - who they pay much less than they would pay for one dedicated infrastructure guy.
I’m first and foremost a developer/lead/software architect (depending on which way the wind is blowing at any given point in my career), but yes, I have managed infrastructure on prem as part of my job years ago, including replicated MySQL servers. There is absolutely no way that I could spin up and manage all of the resources I need for a project, and develop as effectively, at a colo as I can with just a CloudFormation template on AWS.
I’ve worked at a company that rented stacks of servers that sat idle most of the time, which we used to simulate thousands of mobile connections to our backend servers - we did large B2B field-services deployments. Today, that would be a Python script that spun up an autoscaling group of VMs to whatever number we needed.
The question is how often is that necessary? Once again the point goes back to the article title. You are not Google. Unless your product is actually large, you probably don't need all of that and even if you do, you can probably just do part of it in the cloud for significantly cheaper and get close to the same result.
This obsession with making something completely bulletproof and scalable is the exact problem they are discussing. You probably don't need it in most cases but just want it. I am guilty of this as well and it is very difficult to avoid doing.
You think only Google needs to protect against data loss?
We have a process that reads a lot of data from the database on a periodic basis and sends it to ElasticSearch. We would either have to spend more and overprovision to handle peak load, or just turn on autoscaling for read replicas. Since the read replicas use the same storage as the reader/writer, spinning them up is much faster.
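In outline that job is little more than "read from the reader endpoint, bulk-index the rows"; a rough sketch with placeholder hosts, table, and index names, using PyMySQL and the official Elasticsearch client:

    # Rough outline of the periodic reindex job described above.
    # Hostnames, credentials, table, and index names are placeholders.
    import pymysql
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    READER_HOST = "myapp.cluster-ro-xxxx.us-east-1.rds.amazonaws.com"  # autoscaled read replicas

    def fetch_rows():
        conn = pymysql.connect(host=READER_HOST, user="indexer", password="...", db="app")
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SELECT id, title, body FROM documents")
                for row in cur:
                    yield {"_index": "documents", "_id": row["id"], "_source": row}
        finally:
            conn.close()

    es = Elasticsearch(["https://search.internal:9200"])
    bulk(es, fetch_rows())  # streams rows into Elasticsearch in batches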
Yes we need “bulletproof” and scalability or our clients we have six and seven figure contracts with won’t be happy and will be up in arms.
OTOH, the last "cloud" solution I've seen was a database that allowed a team to schedule itself. It was backed by Google Cloud, had an autoscaling backend, was "serverless" (App Engine), and had backups etc. configured.
The issue is: what would the charge be for a typical application for some basic business function, say scheduling attendance?
So QPS is really low, <5, but very spread out, as it's used during the workday by both team leaders and team members. Every request results in a database query, or maybe even 2 or 3. So every 20 minutes or so there are 4-5 database queries. The backend database is a few gigs, growing slowly. Let's say a schedule is ~10kb of data, and that has to be transferred on almost all queries, because that's what people are working on. That makes ~800 gig of transfer per year.
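For what it's worth, the ~800 gig figure lines up with reading "<5 QPS" as a sustained average of roughly 2-3 queries per second (taking it as 4-5 queries per 20 minutes would give only a couple of GB per year):

    # Back-of-the-envelope transfer estimate for the workload described above.
    avg_qps = 2.5                  # "QPS really low, <5"
    payload_bytes = 10 * 1024      # ~10 KB schedule returned on almost every query
    seconds_per_year = 365 * 24 * 3600

    transfer_gb_per_year = avg_qps * payload_bytes * seconds_per_year / 1e9
    print(f"~{transfer_gb_per_year:.0f} GB/year")   # roughly 800 GB/year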
This would be equivalent to something you could easily host on, say, a Linode or DigitalOcean two-machine setup for $20/month or $240/year, plus having a backup on your local machine. This would have the advantage that you can have 10 such apps on that same hardware with no extra cost.
And if you really want to cheap out, you could easily host this with a PHP hoster for $10/year.
If you have a really low, spread out load, you could use Aurora Serverless (Mysql/Postgres) and spend even less if latency for the first request isn’t a big deal.
And storage for AWS/RDS is $0.10/GB per month, and that includes redundant online storage and backups.
I’m sure I can tell my CTO we should ditch our AWS infrastructure that hosts our clients who each give us six figures each year to host on shared PHP provider and Linode....
And now we have to still manage local servers at a colo for backups....
We also wouldn’t have any trouble with any compliance using a shared PHP host.
You keep mentioning six-figure contracts as if they're something big. Most Fortune 500 companies have their own data centers and still manage to read database manuals somehow...
No, I’m mentioning our six-figure customers because too often small companies are assumed to be serving low-margin B2C customers, where you control for cost and you don’t have to be reliable, compliant, redundant, or scalable from day one. One “customer” can equate to thousands of users.
We are a B2B company with a sales team that ensures we charge more than enough to cover our infrastructure.
Yes, infrastructure costs scale with volume, but with a much lower slope.
But that’s kind of the point: we save money by not having a dedicated infrastructure team, and we save time and move faster because it’s not a month-long process to provision resources, and we don’t have to request more than we need the way you do in large corporations where the turnaround time is long.
At most I send a message to my manager if it is going to cost more than I feel comfortable with, and create resources using either a Python script or CloudFormation.
How long do you think it would take for me to spin up a different combination of processing servers and database resources to see which one best meets our cost/performance tradeoff on prem?
It depends on how you architect it; theoretically, spinning up on prem, on bare metal, or in vSphere should be an Ansible playbook with its roles and some Dockerfiles regardless.
Just for reference, we "devops" around 2,000 VMs and 120 bare-metal servers and a little bit of cloud through the same scripts and workflows.
We don't really leverage cloud services that would lock us in, because we need the flexibility of on prem.
In my business hardware is essentially a drop in the bucket of our costs.
P.S.: I totally think there are legit use cases for cloud; it's just another tool you can leverage depending on the situation.
Spinning up on prem means you have to already have the servers in your colo ready, and you have to pay for spare capacity. Depending on your tolerance for latency (production vs. non-production environments, or asynchronous batch processing), you can operate at almost 100% capacity all of the time:
Lambda vs VMs (yes, you can deploy standard web apps to Lambda using a Lambda proxy).
Yes, you can schedule automatic snapshots that are taken on whatever schedule you choose. You get as much space for your backups as you have in your database for free; anything above that costs more.
You also get point-in-time recovery with Backtrack.
The redundant storage is what gives you the synchronous read replicas that all use the same storage, and the ability to have autoscaling read replicas that are already in sync.
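For reference, both of those are a couple of API calls; a rough boto3 sketch with placeholder cluster names and timestamps. (This shows the standard point-in-time restore, which creates a new cluster; Backtrack, which rewinds a cluster in place, is a separate API.)

    # Rough sketch: manual snapshot and point-in-time restore for an Aurora cluster.
    # Cluster identifiers and the restore time are placeholders.
    from datetime import datetime, timezone
    import boto3

    rds = boto3.client("rds")

    # Take a manual cluster snapshot (automated snapshots run on their own schedule).
    rds.create_db_cluster_snapshot(
        DBClusterSnapshotIdentifier="app-pre-migration",
        DBClusterIdentifier="app-cluster",
    )

    # Restore the cluster to a specific moment, e.g. just before a bad deploy.
    rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier="app-cluster-restored",
        SourceDBClusterIdentifier="app-cluster",
        RestoreToTime=datetime(2019, 6, 1, 3, 0, tzinfo=timezone.utc),
    )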
Actually, with ZFS and Btrfs you can stop MySQL, put the checkpoint (snapshot) back, and bring it back up at roughly the same speed as AWS can do it.
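The flow being described looks roughly like this; dataset and service names are placeholders, and it's wrapped in Python here only for consistency with the other sketches:

    # Rough sketch of the snapshot/rollback flow described above. Run as root.
    # The snapshot is crash-consistent, so InnoDB runs its normal crash
    # recovery when MySQL starts back up.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    run(["zfs", "snapshot", "tank/mysql@before-change"])   # cheap, near-instant snapshot

    # ...something goes wrong...

    run(["service", "mysql", "stop"])
    run(["zfs", "rollback", "tank/mysql@before-change"])   # discard everything after the snapshot
    run(["service", "mysql", "start"])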
I also have done enough recoveries to know that actually placing back a checkpoint is admitting defeat and stating that you
a) don't know what happened
b) don't have any way to fix it
And often it means
c) don't have any way to prevent it from recurring, potentially immediately.
It's a quick and direct action that often satisfies the call to action from above. It is exactly the wrong thing to do. It very rarely solves the issue.
Actually fixing things mostly means putting the checkpoint on another server, and going through a designed fix process.
Have you benchmarked the time it takes to bring up a ZFS-based MySQL restore?
> a) don't know what happened
> b) don't have any way to fix it
> And often it means
> c) don't have any way to prevent it from recurring, potentially immediately.
Can you ensure the filesystem consistency when you restore with ZFS?
> a) don't know what happened
> b) don't have any way to fix it
> And often it means
> c) don't have any way to prevent it from recurring, potentially immediately.
In the example that was given in the parent post, we know what happened:
Someone inadvertently (hopefully) did
DELETE FROM table
You restore the table to the previous state and you take measures to prevent human error.
Why do you have a specific need to read the database periodically, polling it, instead of just pushing the data to Elasticsearch at the same time that it reaches the database?
I don’t know anything about your architecture, but unless you’re handling data at a very big scale, rationalising the architecture would probably give you much more performance and maintainability than putting everything in the cloud.
Without going into specifics: we have large external data feeds that are used and correlated with other customer-specific data (multi-tenant business customers), and it all needs to be searchable. There are times we get new data relevant to customers that causes us to reindex.
We are just focusing on databases here. There is more to infrastructure than just databases. Should we also maintain our own load balancers, queueing/messaging systems, CDN, object store, OLAP database, CI/CD servers, patch management system, alerting/monitoring system, web application firewall, ADFS servers, OAuth servers, key/value store, key management server, ElasticSearch cluster, etc.? Except for the OLAP database, all of this is set up in some form in multiple isolated environments with different accounts under one Organization account that manages all of the other sub-accounts.
What about our infrastructure overseas so our off shore developers don’t have the latency of connecting back to the US?
For some projects we even use Lambda, where we don’t maintain any web servers and get scalability from 0 to $a_lot - and no, there is no lock-in boogeyman there either. I can deploy the same NodeJS/Express, C#/WebAPI, and Python/Django code to both Lambda and a regular old VM just by changing my deployment pipeline.
Did you read the article? You are not Google. If you ever do really need that kind of redundancy and scale you will have the team to support it, with all the benefits of doing it in-house. No uptime guarantee or web dashboard will ever substitute for simply having people who know what they're doing on staff.
How a company which seems entirely driven by vertical integration is able to convince other companies that outsourcing is the way to go is an absolute mystery to me.
No, we are not Google. We do need to be able to handle spiky loads - see the other reply. No, we don’t “need a team” to support it.
Yes, “the people in the know” are at AWS. They handle failover, autoscaling, etc.
We also use Serverless Aurora/MySQL for non-production environments with production-like data sizes. When we don’t need to access the database, we only pay for storage. When we do need it, it’s there.
I agree for production, because you want to be able to blame someone when autoscaling fails, but we trust developers to run applications locally, right? Then why can't we trust them with dev and staging?
By the way, what is autoscaling, and why are we autoscaling databases? I'm guessing the only resource that autoscales is the bandwidth? Why can't we all get shared access to a fat pipe in production? I was under the impression that products like Google Cloud Spanner had this figured out. What exactly needs to autoscale? Isn't there just one database server in production?
In dev (which is the use case I'm talking about) you should be able to just reset the vm whenever you want, no?
The big innovation of Aurora is that they decoupled the compute from the storage [1]. Both compute and storage need to autoscale, but compute is the really important one and the hardest. Aurora Serverless scales the compute automatically by keeping warm pools of DB capacity around [2]. This is great for spiky traffic without degraded performance.
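Those knobs are exposed directly when you create the cluster; a rough boto3 sketch for the v1-style "serverless" engine mode, with placeholder identifiers and capacity numbers:

    # Rough sketch: an Aurora Serverless (v1) cluster that scales between
    # capacity units and pauses entirely when idle. Identifiers are placeholders.
    import boto3

    rds = boto3.client("rds")

    rds.create_db_cluster(
        DBClusterIdentifier="uat-cluster",
        Engine="aurora-mysql",
        EngineMode="serverless",
        MasterUsername="admin",
        MasterUserPassword="...",
        ScalingConfiguration={
            "MinCapacity": 1,
            "MaxCapacity": 8,
            "AutoPause": True,              # pay only for storage while paused
            "SecondsUntilAutoPause": 300,   # pause after 5 idle minutes
        },
    )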
I've had servers with identical configurations, identical setup, identical programs installed the exact same way (via Docker images), set up the same day - and they still didn't behave the same.
It stands to reason that they weren't identical, because if they were, they would both have worked.
The thing separating you (in this case) from Amazon/Google/* is that they make sure their processes produce predictable outcomes: if a process fails, it can be shut down and redone until it yields the correct outcome.