
Honest question, if you're running at 10%, why have you gone with 12-core, 512 GB RAM servers? Couldn't you start with a more reasonable 4-core, 64 GB RAM until some threshold? What value do you get from such an early overallocation of resources?



Who rips the $1,000 processors and $200 voltage control modules out of their servers to upgrade to $2,000 processors and $250 voltage control modules, then re-plans their entire infrastructure, and possibly their code, around that?

Symmetry between servers has a value.

The main board, RAID controller, network, and usually the storage are going to be planned out meticulously ahead of time, based on the maximum load the server is going to see during its lifetime. Often, processor and memory come down to "What if we needed X feature and didn't have it?" or "If we took the server down for 1 hour to upgrade it, how expensive is that?"
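
Rough back-of-the-envelope, using the part prices from upthread plus made-up numbers for downtime and labor, just to show why buying big up front often wins:

    # Over-provision now vs. take the box down and upgrade later.
    # CPU/VRM deltas are from this thread; downtime and labor costs are assumptions.
    upfront_delta = (2000 - 1000) + (250 - 200)   # bigger CPU + bigger VRM, per server
    downtime_hours = 1                            # the "1 hour to upgrade it" above
    downtime_cost_per_hour = 5000                 # ASSUMPTION: what an hour offline costs
    labor_cost = 300                              # ASSUMPTION: tech time to swap parts

    upgrade_later = downtime_hours * downtime_cost_per_hour + labor_cost
    print(f"buy big now: ${upfront_delta}, upgrade later: ${upgrade_later}")
    # buy big now: $1050, upgrade later: $5300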


I misunderstood, sorry. I was talking in the context of virtual servers that can scale resources somewhat dynamically. (Side note: isn't it awesome that we see numbers like "512 GB RAM" and don't immediately assume it's not a single node in a deployment?)

I initially pictured someone picking the highest spec option they can afford when setting up a new service, rather than choosing based on actual demand of each node.

> "If we took the server down for 1 hour to upgrade it, how expensive is that?"

Putting my CI / DevOps hat on for a sec: who takes production servers down for upgrades without some level of HA to avoid downtime? ;)


Several reasons. We did some POC testing before we deployed this gear and knew that we would achieve high density. Honestly, I didn't expect this much. Hard to believe that we've basically deployed every non-database app we have on this cluster and we're only 10% utilized.

Every service gets run in two environments: test and prod. Both are co-located on the same Kube cluster in different namespaces. We also don't put any datastores in Kube. That stuff still lives in OpenStack for now. Ceph can make it possible but for fast disk I/O, it's tough to beat the local SAS bus.
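
For anyone curious what "same cluster, different namespaces" looks like in practice, a minimal sketch with the official Python client (the namespace names are just examples, not what we actually run):

    # Minimal sketch: test and prod as namespaces on one cluster,
    # using the official `kubernetes` Python client. Names are examples.
    from kubernetes import client, config

    config.load_kube_config()          # picks up your current kubeconfig context
    v1 = client.CoreV1Api()

    for env in ("myapp-test", "myapp-prod"):
        ns = client.V1Namespace(metadata=client.V1ObjectMeta(name=env))
        v1.create_namespace(body=ns)   # same control plane, isolated by namespace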


> Ceph can make it possible but for fast disk I/O, it's tough to beat the local SAS bus.

Personal anecdote: we have everything on-prem on a pretty standard vSphere setup, and I've got a couple of PostgreSQL databases that aren't even that heavily used (they top out around 20-40 tx/sec), backed by a hybrid Tegile array. Randomly throughout the day my IOWAIT starts spiking, because even over 8 Gb Fibre Channel our storage latency climbs when everything else is on the same array (well over 200 full VMs). Or sometimes I need to run a table scan over a 40 GB table for a one-off query, and the storage bottlenecks the crap out of us because only our small active set is cached on the SSDs.

You can run databases on remote storage; people obviously do it. But there's a reason why even on AWS the best practice is to use your instance disks instead of EBS volumes if you care about performance. Noisy neighbors suck.
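
If you want to catch those spikes in the act rather than hearing about them from users, here's a rough sketch that samples iowait from /proc/stat (Linux only; the 5s interval and 20% threshold are arbitrary):

    # Rough sketch: sample aggregate CPU iowait% from /proc/stat.
    # Linux only; interval and threshold are arbitrary choices.
    import time

    def cpu_times():
        fields = open("/proc/stat").readline().split()[1:]   # aggregate "cpu" line
        vals = list(map(int, fields))
        return sum(vals), vals[4]                            # total jiffies, iowait jiffies

    prev_total, prev_iowait = cpu_times()
    while True:
        time.sleep(5)
        total, iowait = cpu_times()
        pct = 100.0 * (iowait - prev_iowait) / max(total - prev_total, 1)
        if pct > 20:
            print(f"iowait spike: {pct:.1f}%")
        prev_total, prev_iowait = total, iowait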


I didn't consider latency spikes on slower resources like that, thanks for the examples.


Because the marginal cost of adding hardware once the chassis is racked is trivial compared to the public cloud, whose retail pricing is insanely marked up on exactly this point.


I have heard from a staffer on the GCE team that they deploy three CPU cores for every user-facing core. That might have something to do with the cost.


I wonder why it is that companies like DigitalOcean and RamNode are able to provide far better performance for a fraction of the price.

GCE seems poorly engineered if the profit margins are indeed low (normal, that is).


GCE provides significantly stronger reliability promises from the user's point of view. As far as I know, DigitalOcean and RamNode don't provide things like no-downtime host maintenance (hell, AWS doesn't provide that).

Interestingly, the "3 cores for 1 user-facing one" would explain the pricing of "preemptible" instances, which can cost as little as 1/3rd of a full instance. The main difference is that they get fully rescheduled every 24h or more often, and there's no live migration for maintenance...

… I am now wondering if GCE runs fault-tolerant VMs. Because if it does, holy moly
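
The back-of-the-envelope version of that pricing observation (the 3:1 ratio is hearsay from this thread, and the per-core cost is a placeholder):

    # Toy arithmetic only: the 3:1 ratio is hearsay, the unit cost is made up.
    physical_per_user_core = 3                 # spare capacity for live migration / failover
    cost_per_physical_core = 1.0               # placeholder unit cost
    standard = physical_per_user_core * cost_per_physical_core
    preemptible = 1 * cost_per_physical_core   # drops the headroom, gets rescheduled instead
    print(f"preemptible / standard = {preemptible / standard:.2f}")   # -> 0.33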


Except they don't. You give up reliability, uptime, and not being on oversold hosts to save a tiny bit of money? Do some real-world benchmarks between DO and GCE. Not the same ballpark.


I'm aware of one similar company that used to fit sixty $20/mo VPSes in 1U. They were working on a couple hundred in 2 or 3U when I left (harder than you'd think), but they also have a cheaper tier now because competitors keep pushing VPS prices down. Margins on VPS, even without oversubscribing RAM, are pretty decent once you get past the capital outlay, but they're shrinking in a race to the bottom, just like shared hosting before it.
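
Quick sanity check on why those margins work; the density and price are from the comment above, while the hardware and colo numbers are guesses:

    # VPS-per-U economics. 60 x $20/mo is from the comment above;
    # capex and colo figures are guesses, not anyone's actual numbers.
    vps_per_u = 60
    price_per_month = 20
    revenue = vps_per_u * price_per_month       # $1,200/mo per 1U

    server_capex = 8000                         # GUESS: 1U box dense enough for 60 small VPSes
    amortize_months = 36
    colo_and_bandwidth = 150                    # GUESS: monthly per-U colo + transit

    monthly_cost = server_capex / amortize_months + colo_and_bandwidth
    print(f"revenue ${revenue}/mo vs cost ~${monthly_cost:.0f}/mo")   # ~$1200 vs ~$372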


Resizing a Digital Ocean VM takes me several minutes.

A GCE VM spins up in less than a minute. I've seen them come up in 20 seconds.


I'm guessing having many instances helps them test their setup, and also allows for multiple-environment deploys.



