I don't really think this is a Kubernetes-specific problem. If you have a million machines, want your database to run on one of them that is selected by some upstream orchestrator, and want the physical SSDs with that data on them to be in the same machine, you're going to have to do some work. But at the same time, you have to realize that you are doing this to get that tiny last bit of performance (most likely that 99.9%-ile latency) and that that last 0.1% is always the most expensive. Sometimes you need it, and I get that, but it's not a problem that everyone has.
Most applications are not so IOPS limited that they depend on the difference between PCIE request latency and going out over the network to get stuff that's actually stored on a nearby rack. And in that case what Kubernetes offers (with the help of cloud providers) is fine. You make a storage class. You make a persistent volume claim. Your pod mounts that. Not all that hard. If the performance isn't good enough, though, then you have to build something yourself.
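For the common case, here's a minimal sketch of that storage class / PVC flow using the official `kubernetes` Python client. The class name "fast-ssd", the namespace, and the 100Gi size are placeholders I made up, not anything specific to a provider:

```python
# Minimal sketch: create a PVC against an existing StorageClass using the
# official `kubernetes` Python client. Names and sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="db-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="fast-ssd",  # whatever the cloud provider exposes
        resources=client.V1ResourceRequirements(
            requests={"storage": "100Gi"}
        ),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
# The pod then mounts the claim via a persistentVolumeClaim volume source.
```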
I am used to a completely different model that we had at Google. You could not get durable storage in your job allocation. Everything went through some other controller that did not give you a block device or even POSIX semantics, and you designed your app around that. If you needed more IOPS you talked to more backends and duplicated more data.
Meanwhile in the public cloud world, you get to have a physical block device with an ext4 filesystem that can magically appear in any of your 5 availability zones, provisioned on the type of disk you specify with a guaranteed number of IOPS. It's honestly pretty good for 90% or even 99% of the things people are using disks for. (In my last production environment I actually ran stuff like InfluxDB against EFS, the fully-managed POSIX filesystem that Amazon provides. It did fine.)
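For what it's worth, the "type of disk plus guaranteed IOPS" part is just an API call. A rough boto3 sketch follows; the region, AZ, size, and IOPS numbers are made-up placeholders, and note that a single EBS volume still lives in one AZ:

```python
# Rough sketch of provisioning "a block device with guaranteed IOPS" on AWS.
# All numbers and names below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,           # GiB
    VolumeType="io1",   # provisioned-IOPS SSD
    Iops=10000,         # the "guaranteed number of IOPS"
)
print(volume["VolumeId"])
```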
For databases, local storage beats SAN storage massively. I can get better performance and IOPS from an Intel NUC with a decent PCIe SSD than I can get from an AWS RDS instance that costs as much per month as the NUC did to buy. If I optimize a proper rack-mount server for database use I can get an insane amount of storage and performance compared to even a few months of RDS. Even a pair of 40 Gbps SAN links would not compare.
Only if your (storage) network is slow (eg 1/10/25 GbE).
With a decent network (e.g. modern InfiniBand, 40+ GbE, etc.) for the storage, the latency and throughput to the storage shouldn't make a difference.
For example (years ago), I used to set up SSD arrays - SATA at the time, as M.2 wasn't a thing - and have them served over a 20 Gb/s InfiniBand network to hosts in the same data centre.
The access times from an OS perspective to hit that storage over the network were the same as for hitting local disk. But the networked storage was higher bandwidth (multiple SSDs instead of a single one per host).
I'm not sure this is necessarily true. 2 x 40 Gbps is bandwidth, which is typically not the limiting factor; if it were, you could go with something higher bandwidth like FC. RDS is a database service, not a SAN. Look at SAN boxes from storage companies: you can get things like 60 SSDs in a single box. To match the IOPS of that you'd need a lot of servers with local storage. I think in general disaggregating compute and storage is the optimal approach; whether some particular solution is better or price-effective is a different question. Having all your drives in a box means they're easier to share, easier to service, easier to replace the servers as well ... lots of wins there.
EBS is a compromise: it allows your data to run anywhere in a region, on any machine. That means lots of hops and lots of interconnects.
The latest SAS stuff runs at 12 gigs (most likely 4 lanes per cable, and dual-linked too), _but_ that's dedicated to local traffic. The performance difference between having a SAS disk inside the box or in the next rack is negligible. A decent SAN that exports over a 56-gig interconnect (InfiniBand et al.) will be >> a local PCIe device in terms of IOPS and sustained bandwidth (at the expense of latency).
Crucially, somewhere that runs its own datacenter can model the storage for a specific workload. In VFX land we had 32 file servers with 60 disks each, each capable of saturating two 10-gig Ethernet links. But it'd be appalling for mixed VM hosting.
EBS is a hedged bet: it has a _boatload_ of caching to get any sort of performance, and a huge amount of QoS to stop selfish loads stealing all the IOPS.
To get the best performance from EBS you need large volumes (5 TB+) and large latest-generation EC2 instances (c5, r5, m5 at 4xlarge or 8/9xlarge sizes). This costs money to run.
This sort of portability is usually handled by the DB routing layer in cloud-native ("webscale") database deployments, e.g. in Cassandra.
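To make the Cassandra example concrete, here's a rough sketch with the Python cassandra-driver: a token-aware load balancing policy routes each query to a replica that actually owns the data, which is what lets the application stop caring which node holds the disks. The host addresses and keyspace are placeholders:

```python
# Sketch: let the driver's routing layer pick a replica that owns the data.
# Contact points and keyspace name are placeholders.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")
    )
)
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")
# The driver sends this to a node that is a replica for "some-key".
row = session.execute("SELECT value FROM kv WHERE key = %s", ["some-key"]).one()
```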
Of course there are a lot of systems where one MySQL DB instance is needed and sufficient for the foreseeable future. That's a very static resource allocation: one VM on a hypervisor with local disk (and with backups every day) makes a lot of smaller sites very happy, and if they get locally provisioned performance instead of the EBS variant they will be happier for longer.
There's a wide range of performance with storage systems, but why are you comparing that to RDS? That's a managed database service that just runs on EC2 instances with EBS volumes, and EBS is designed for affordable, scalable persistence rather than extreme performance.
EBS is not your typical SAN; typical SANs offer much more throughput, IOPS, and reliability in exchange for more complexity and latency. But you probably won't even notice that latency if your SAN is close enough and using high-end links.
That has almost nothing to do with SAN and everything to do with Amazon's implementation. A NUC will never give you the resiliency or performance of a properly sized SAN.
Furthermore, you'll never get close to saturating a 40 Gbps link with a database workload: your limitation is IOPS, not throughput, and you don't have anywhere near enough CPU to max out those links in a loaded server, much less a NUC.
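A quick back-of-the-envelope supporting that, assuming 8 KiB database pages (the page size is my assumption, not anything from the thread):

```python
# How many small random reads it would take to fill a 40 Gbps link.
link_gbps = 40
link_bytes_per_sec = link_gbps * 1e9 / 8         # = 5 GB/s
io_size = 8 * 1024                               # assumed 8 KiB database page
iops_to_saturate = link_bytes_per_sec / io_size  # ~610,000 IOPS
print(f"{iops_to_saturate:,.0f} IOPS of 8 KiB reads to fill {link_gbps} Gbps")
```

That's roughly 610k IOPS of page-sized reads before bandwidth even becomes the bottleneck.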
The standing joke amongst my co-workers is that ”enterprise” in this context is the tier you reach when you have gotten ripped off enough by third-party vendors.
Your usual run-of-the-mill enterprise uses most of this horsepower to heat air and to have large clusters of Oracle DBs power even more expensive SAP modules.
There’s a reason no serious IaaS provider builds infra this way.
I have one example, actually an old customer of the vendor that used to employ me, that had probably well over 1k tenants, all with VMs and DBs backed by a humongous SAN (replicated ofc!).
Not nice when corrupt data was mirrored due to a software bug in said SAN system.
This went to a national level, really, due to the nature of many of the customers.
Sure, you can build a reasonable architecture using the products and vendors mentioned, but it doesn’t scale and it really locks you in.
I’ve seen and worked with this stuff at numerous enterprises.
This only makes it my experience and one datapoint, so please bear that in mind! =D
They don't build it this way because cloud is cheap and slow. I have a UCS farm with 2,000 VMs, connected to 3 PB of usable all-flash across 2 arrays, replicated active/active to another array, second-hopped to cloud. That's one site. About 100k users across 2 domains - 1k? I said enterprise. A SAN is not scalable but internal disk is? You need to look up what a SAN is.
The reason there is a standing joke with your coworkers is because you don't work on important things that run the world. When 1 minute of downtime costs you over $1mil, your "ripped off" Oracle costs, your SAN costs, etc., are lost in the rounding errors. I can have a SAN with hundreds of arrays across multiple datacenters, and dynamically grow and shrink my storage. It is the definition of scalable. I can have one VM farm vmotion to another VM farm 30 miles away transparently, which will sit on a different array attached to the SAN. What happens when the storage needs of your server outgrow the drives you can shove in there? There are servers and databases a petabyte in size, pushing a million IOPS. They're in charge of money. If there is corruption on the SAN, well, it infrequently happens, just like with servers and memory. Twice in my vast experience. You can roll back all writes with the replication software, and usually keep an undo journal a few days long, plus something like hourly snapshots. This also protects against cryptoviruses and other types of corruption.
This is why people like you work at small companies, fiddling with your cute little projects, while the world moves forward with you on the sidelines. I've been doing this for 20+ years, and have been at most fortune100 companies, in 80 countries. But yeah, your opinion, while being dismissive, is cute.
The CIO is the guy in charge of getting this stuff at enterprises. It's clear you don't understand the impact of design. I am guessing you are not at an architect level. There's a reason for that, and a reason other people - the ones you put down in your post - are making the big decisions. People like you would cost the company millions of dollars in losses per year.
I said tenants, not users. Tiny little detail that you overlooked there. =)
Last company I worked for had a €150,000,000 tech budget and about 90k employees.
Again, if you read what I said you'll notice that I pointed out that you actually can build decent architectures using enterprisey stuff; it will however never scale horizontally in a reasonable way.
What happens when you have filled all your shelves in the array box? Ah, you'll need a new $250K box...
Mentioning vmotion and all... yeah so... VSAN et al... brrr... I'll put my trust in the open source world rather than shoddy software from 3rd party vendors trying to appeal to the latest fad. Hyperconverge my *ss! =)
KISS is what rules, but vendors are busy feeding channels and partners with impossible-to-penetrate acronyms, leaving you in a mess sooner or later.
These vendors' focus is on keeping integration to a minimum, which means APIs are usually crappy and achieving a reasonable level of automation is often a chore.
Now there is probably a small percentage of "enterprises" that use tech in a sane way, but my bet is that the CIO you mention is a blockchain expert, as well as putting AI at the top of the strategic actions to "implement" this year.
He has probably commissioned a pre-study from Accenture that outlines these strategic imperatives.
This is of course a little rant, but also true for the 90% of the 90% mentioned.
Don't be defensive and close minded. That's what's leaving most enterprises in the dust.
There's just a lot of inertia inherent within certain businesses that will let them continue to burn through cash on pointless tech for yet a while.
again, cute. $250K. very cute. Try $5mil for a box, at least, after a 50% vendor discount. VSAN has nothing to do with SAN, and no one on a SAN uses VSAN. You are again showing your lack of enterprise experience, yet you are strongly trashing what you don't understand. A server sees many arrays, over a SAN. That is called horizontal scaling. You can't do that with internal disk.
Oh, 250 was for the box, no disks. 75% discount. :)
Keep it old-school!
VSAN, btw, is VMware's attempt to actually achieve _horizontal_ scalability.
I just don’t trust VMware with anything except for that bytecode VM. It used to rock, but time moved on, and even as they IPO’d I thought they would end up dead in the water eventually.
I had just been introduced to the wonderful zones in Solaris and it just made hypervisor VMs seem silly.
If your horizon extends to what vendors are prepared to sell you, take that 50% discount and run with it!
One final thought: is that disk not ”internal” to the array? This is what makes block storage notoriously difficult to scale horizontally.
You’ll need very clever software!
yeah, I know what VSAN is. It's you who does not if you think you run VSAN on top of a SAN-connected cluster. Yes, the disk is internal to the array. A server can see a hundred arrays on the same HBAs. It doesn't care what array the storage comes from. I am positive at this point you know nothing about what a SAN is.
What "box" - the 42U Rack? I don't think so. The rack is always free. You then have DAs connected to the disk, and FAs connected to the SAN, which are on a pair of directors. You literally cannot get those w/o disk.
You don't know what a SAN is, you've never seen an itemized quote for an array. Thanks for your link. It's like sending a hooked on phonics link to an English professor. Cute. Keep it cute! People like you are the reason people like me get paid a lot.
For free! Good one. ’Cause that’s what’s really happening... right? With a straight face?
I was certain I had nothing more for you, but this is too much fun!
You clearly don’t grasp the difference between vertical and horizontal scalability which means you have never been subjected to a bunch of scenarios requiring the latter.
Dinosaurs taking the p*ss are the reason many enterprises opt to off-shore and out-source.
yes idiot. I have worked for several vendors and sold this stuff. I have also been on the customer side buying this stuff. You're paying 250k for an empty 42U that you call "box" - you literally are lying.
There are, literally, zero enterprises off-shoring their hard-IO-hitting datacenters. In fact, having been in 80 different countries for these enterprises, they usually have many datacenters all over the world.
VSAN is for hyperconverged systems only. racks with a thousand 1u nodes that all have disk, connected to a fat ethernet backplane. Things on a SAN are not hyperconverged - they are on a SAN. VSAN is for tier 2 stuff that goes on hyperconverged - like web servers and DMZ things. The fastest processing is done on solid databases like Oracle or UDB on clusters of large servers, connected to a SAN. VSAN is even positioned by sales for tier 2 from all major vendors.
You literally picked up some technical words you heard around the office, googled a few things, and now consider yourself an expert, so you give authoritative opinions on here about things you have never worked with. I bet you are deskside support or a code monkey, and have never architected a solution. When someone gives you a budget of 20mil and says you can average 1 min of application unavailability per year or it impacts billions in the company's bottom line and your whole team gets fired, do let me know. I'm sure your suggestion would be to cluster together a bunch of old Dell laptops over wifi.
So, you’re saying VSAN for horizontal scalability and your regular SAN for a single point of vertical scalability.
Yup, sounds about right!
I’ve only consumed vsan as a dev, and that turned out... not so good.
Scaling performant storage horizontally is a science and no fun to troubleshoot.
Implying you get stuff for free from EMC is just a bad joke, or you have no clue how high the markups are.
It does not matter what the invoice tells you, bottom line counts.
Last time I was involved, admittedly many years ago, the enclosure actually came with a price tag.
I was there when the company I used to work for was first through the gates in EMEA to SAN-boot their physical x86 boxes. What a ride, and very little gain. Cost a fortune tho!
I’ve worked at a hardware/software vendor (499 of the 500) as an architect within professional services, which means I have way too much exposure to the madness going on at the 500s.
I’ve built all-flash boxes running ZFS as well as XFS for specific scalability needs for customers, usually on-prem clouds but other use cases as well.
Commodity hardware. Blazing performance.
If I need proper storage experts I turn to the ZFS dev mailing list, for example.
If you could lower your guard for a sec you’ll see that I have pointed out, no less than twice, that you can build decent infrastructure with precious enterprise gear.
You’ll pay through the nose tho, and personally I’d rather hire 5-10 super-generalist FTEs than a vendor-labeled expert and a bunch of hardware.
I know: “that’s not SAN”, but the point I’m making is the mark-up. If you are working at a place that allows you to play the lone ranger and blow everyone away with three letter vendor acronyms, good for ya’!
Lone rangers are the biggest blocker for most enterprises from a technical perspective, second only to politics.
Not a chance buddy. I've been here for a while, and this place has become reddit, with mods from /r/incels. I have two accounts. When I see someone with zero knowledge of subject matter, like OP, authoritatively and dismissively spew complete bs, I reply like I do.
People with a clear agenda spewing fake crap get upvoted here now. Disagree with a mod - your comments are shadow-removed, and possibly your account is shadow-banned, with no warning, so the insecure mod can feel better.
I give actual valuable information from 20 years of experience at almost all the Fortune 100 firms. Snarkiness? Yes. Much better and funnier than the guy I replied to.
I know you disagree. That's specifically why I include sentences like the last - to draw attention to the issue. comment karma has become useless here. Stuff is greyed out, stuff is shadow deleted - good things, filtered by idiots. I don't want my content filtered by insecure idiots.
Hey budz, many around here have 20+ years of experience.
I've fought all my career to get "enterprise" folks to understand that spending $100s of mill on hardware vendors and 3rd party software is madness, if what you are doing is actually mission critical.
Money should be spent on acquiring talent and adapting to actual business needs.
The business you support could not care less about your shiny toys; they want to transform, and preferably yesterday!
A digression here, from one of your other posts: if your business applications rely on live migration of VMs, you're doing it wrong.
You should take this up with your CIO or even better the CTO, but he's probably busy looking at Gartner magic quadrants and contemplating the dire need of a blockchain.
Keep an open mind, and keep it simple. Best advice ever in this crazy business!
plenty of talent, and for stability and performance all that talent spends millions of dollars on top of the line proven solutions. Applications rely on live migrations for load balancing a farm, and it happens automatically in the background around the clock to rebalance farms. Hardware maintenance is another use case. So is a bucket of water. So is change control - I want to load new HBA firmware? I'll evacuate the ESX to somewhere else first.
no one contemplates blockchain, but for someone CIO level, gartner quadrants present a quick high-level summary of where the industry is going. and $100mil+ for hardware is completely normal to run things that process billions of dollars. The US treasury department is a great example. Mastercard is another.
I agree that 90% of people do not need that, but there is a scale between 1 machine and 1 million machines.
It is not just IOPS but latency. Some of my production servers are connected one-to-one by crossover cable, in clusters with as many extra NICs as needed (say 3 machines: 2 extra NICs per machine), just to shave off that little extra overhead of going through a router, because yes, it matters in some applications!
Try using fiber then. I was surprised and pleased to learn that my 10G fiber has something like a third or quarter of the latency of my 10G copper in my new house.
You mean use (X)SFP(+), not “use fiber”. The SFP+ copper twinax cables I use in my lab at home have lower latency than the majority of MMF/SMF transceivers on the market.
Yes, thank you for the clarification/nomenclature. My SFP+ on MMF (multi-mode fiber) has significantly lower latency than the copper 10GbaseT I use. I haven't measured the direct-connect copper (coax? twinax?) SFP+ latency, though.
I would be curious if the SFP for 1G using fiber also has lower latency than copper 1000baseT.
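If anyone wants to reproduce these comparisons, a minimal TCP ping-pong is enough to see the per-hop and per-medium differences. A sketch in Python follows; the port, iteration count, and message size are arbitrary placeholders, and you'd run the server on one host and the client on the other end of whichever link you're testing:

```python
# Minimal TCP ping-pong to compare link latencies (crossover vs switched,
# copper vs fiber). Usage: "python pingpong.py server" on one host,
# "python pingpong.py client <host>" on the other.
import socket, sys, time

PORT, N, MSG = 5001, 10000, b"x" * 64

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(64):
                conn.sendall(data)  # echo straight back

def client(host):
    with socket.create_connection((host, PORT)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        samples = []
        for _ in range(N):
            t0 = time.perf_counter()
            s.sendall(MSG)
            s.recv(64)
            samples.append(time.perf_counter() - t0)
        samples.sort()
        print(f"median RTT: {samples[N // 2] * 1e6:.1f} us, "
              f"p99: {samples[int(N * 0.99)] * 1e6:.1f} us")

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])
```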
A large data center has 20-30k machines; Amazon disclosed that back in 2014. Most likely you want to partition the machines into chunks, so each application or database only uses some of them. That way you don't need to talk to many machines at once.
In general, a distributed database can be really fast. DynamoDB claimed 3 ms (AWS re:Invent 2014). Current technology can probably get close to 1 ms, which is comparable to memcache. With such performance, you don't really need local storage.
You can get much faster IO if you use many local SSDs. The downside is utilization: it is very rare for a single machine to have a workload that fully utilizes local disk, so you end up over-provisioning greatly fleet-wide to get high performance. A managed database over a network is more likely to fully utilize disk/SSD throughput.
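If you want to check the "few milliseconds over the network" claim yourself, timing point reads against a managed store is a quick sanity check. A rough boto3 sketch against DynamoDB; the table name, key schema, and region are placeholders, and you'd run it from an instance in the same region:

```python
# Time a batch of DynamoDB point reads and report p50/p99 latency.
# Table and key names are placeholders.
import time
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")
latencies = []
for i in range(1000):
    t0 = time.perf_counter()
    ddb.get_item(TableName="my-table", Key={"pk": {"S": f"key-{i}"}})
    latencies.append((time.perf_counter() - t0) * 1000)  # milliseconds

latencies.sort()
print(f"p50={latencies[500]:.1f} ms  p99={latencies[990]:.1f} ms")
```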
> and that that last 0.1% is always the most expensive
This is one of the things that gets me. If you get the last .1% that can ever be gotten, what do you do for an encore?
When your userbase grows another 5% or 10% or 20%, it won't be enough. You'd be better off trying to figure out ways to give your users something they want that doesn't require the most exotic thing that can be procured. It's expensive to begin with, and it's the end of the road. You don't want to be at the end of a road trying to figure out what to do next.
Nobody’s getting “the last .1% that can ever be gotten”, so it’s a moot point. Google and Facebook are still growing. There’s always room for more improvements.
That these kinds of micro-optimizations are worth it. They reduce bulk and tail latency, and probably let the system bear more load. Of course, since the system now depends on a "luckier" allocation of resources, it has more failure cases, and thus a tighter operational envelope.
You keep implying that Facebook and Google are making these micro-optimizations, without anything to support that claim. I don't think you can support it, which means your premise is flawed and your conclusions are wrong.
They have ridiculous numbers of servers. Rather than fiddling to reduce server count, they're improving their ability to scale out. It's cheaper and it's repeatable. They do crazy things like building custom hardware and power distribution buses (Facebook) to reduce heat in the data centers.
There are services where latency is important: they launch multiple requests to different replicas, and to optimize this, when one of the replicas starts serving the response, a cancel is sent to the other ones.
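That's the classic hedged-request pattern. A minimal asyncio sketch of the idea; `fetch_from_replica` is a stand-in I made up for the real RPC, with simulated latency:

```python
# Hedged request sketch: ask two replicas, keep the first answer,
# cancel the slower one. fetch_from_replica() stands in for a real RPC.
import asyncio
import random

async def fetch_from_replica(name: str, key: str) -> str:
    await asyncio.sleep(random.uniform(0.001, 0.050))  # simulated replica latency
    return f"{key}@{name}"

async def hedged_get(key: str) -> str:
    tasks = [asyncio.create_task(fetch_from_replica(r, key))
             for r in ("replica-a", "replica-b")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the "cancel" sent to the slower replica
    return done.pop().result()

print(asyncio.run(hedged_get("user:42")))
```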
Horizontal scaling and vertical scaling are both very important.
Both G and FB pour many engineering hours into maximizing per-server "ROI". Just think about how much they work on scheduling work/tasks so they can do more with the same number of servers.
It's more of an issue with the OS: *nix is showing its age as a useful OS as we move away from static and simple systems (PCs, servers, etc.) to more distributed computing.