I had the pleasure of helping to build and manage these facilities, both hardware and software, for five years. It's nice to see some of Google's real innovations reach the public eye. Some of the smartest folks I ever worked with at the company built absolutely mind-blowing tech that the outside never has the opportunity to see or appreciate.
In fact, while much of the content in the article has been written about before, it's still probably 2-3 years or more behind where Google actually is. I left in 2010 and didn't read about anything I had not experienced.
Reminds me of when MSN Search spent a billion dollars (or whatever was reported) saying they had more pages than Google. Google simply updated the number of pages indexed after Microsoft was done huffing and puffing.
It was pretty funny at the time but the lesson wasn't lost on me: With competition, get ahead, stay ahead, and have things already done and implemented so you can announce big accomplishments when it's strategic for you.
Yes. You can get to the top by competing with others; you stay on top by competing with yourself.
There's a great graph in "Toyota Kata" that shows per-worker productivity of major car companies for the last several decades. They all rise together for the early part of the graph. In the 60s, the American car companies level off; Toyota keeps growing. They focused on continuous improvement, while American car companies floundered.
The really interesting part of this to me is that it's rooted in a philosophical difference. Toyota was started and run by engineers. The American car companies gave birth to the MBA approach to business. Engineers naturally seek improvement; MBAs seek profit.
Google is one of the few major companies with a philosophical background like Toyota's. It's run by nerds. Their goal isn't to increase shareholder value; it's to build great stuff and organize the world's information. Like Toyota, by following their vision, they have generated vast profits and dominated their industry.
One of the details I find highly relevant to software and, particularly, operations is rejecting the mindset of dealing with failure by finding someone to blame, and instead changing the system so that one person can't inadvertently cause a failure. I see this a lot with massive ops runbooks that require humans to repeatedly perform complex tasks without mistakes, rather than automating those tasks and regularly testing the automation.
If you ever get to be around someone who works at a car insurance company, ask them about failure rates for cars. Japanese cars (and Toyota notably) are among the lowest; that is, they are among the most reliable cars in the world by far.
I still wonder how they can manage to build affordable, reliable cars that last for years, while many expensive car makers have absurdly high failure rates.
One of the interesting reasons comes back to accounting.
Toyota's approach focuses on value from the customer perspective. So all defects are seen as waste, and are targeted for elimination.
The MBA approach just looks at P&L. Which is why they concealed the Pinto's tendency to explode; it was cheaper to pay the lawsuits than to fix people's fuel tanks. Never mind that many more people would die without the recall; that wasn't relevant to increasing shareholder value.
Another good example comes at the beginning of Bob Lutz's "Car Guys vs Bean Counters". Lutz, a car lover and an automotive exec for decades, once fixed a problem with transmission manufacturing. The problem was causing a lot of people's cars to die right after the warranty expired. He got yelled at because it blew a hole in their revenue projections; they were looking forward to a lot of highly profitable transmission repairs.
Toyota can make those reliable cars because they see every worker not just as a meat robot, but as a brain that should be engaged in eliminating waste. MBA thinking looks at slow order periods as a time to cut labor costs. Toyota, cognizant of how much they have invested in their workers, looks at them as a time for training, plant improvement, and other value-creating activities.
Even if this is taken as a fact, I don't see how it explains why American car companies level off. It's not like potential profit is bounded but potential improvement isn't.
The long-term value of a company is based on the amount of value they create for customers. The short-term value of the company depends on profit.
So, for example, an MBA can increase profits by cutting R&D. Or by cutting costs in a way that harms product quality. The company will do well for a while because it takes a while for things like reputation and mindshare to decline. You can hide the declines for longer by investing more in promotion.
The engineer-style approach, in contrast, is to focus on cutting waste rather than cost. This is a high art in the Toyota Production System.
I don't know that the leveling off is a necessary consequence of the MBA approach. But it's certainly what I've seen, and it makes some intuitive sense. Given that some ways to increase profit improve productivity and some harm it, it's plausible that all the cheap ways to improve productivity would be exhausted early in the MBA approach.
I also think the MBA approach can lead you into a local maximum that's pretty screwed up.
Thinking further, another factor may be that MBA thinking tends to be focused on external competition, while engineers tend to optimize regardless of competition. So the American car companies could have leveled off because their major competitors were doing equally well. By the time Toyota was an obvious threat, they were too far behind to even understand how they were being beaten.
Google is another good example. They didn't seek out data center "best practices". They radically bested the competition, proceeding one step at a time, with careful attention to what they needed. In an MBA analysis, that would be seen as spending a lot of money on risky R&D with no obvious ROI. In fact, they'd want to re-task all those expensive engineer brains to something more directly related to revenue. And probably drop the quality of the ops staff as a money-saving measure.
It's only when the years of patient engineer-style optimization add up to an insurmountable lead that it looks good in a B-school spreadsheet.
> In fact, while much of the content in the article has been written about before, it's still probably 2-3 years or more behind where Google actually is. I left in 2010 and didn't read about anything I had not experienced.
What if they just plateaued and didn't really go beyond what you had done when you were there, and this is totally accurate to today?
I can believe this in a heartbeat. I know that if Platforms and datacenter/cluster management innovation stopped, I'd see a mass exodus of my Googler friends (as well as a very noticeable change in Google's products).
There should be a guide of hacks for viewing articles on a single page. Yeah, Wired is easy, there's a link, but oftentimes it calls for a little PHP knowledge, or viewing in "print" mode. Or something random. Not always completely obvious, but a list of tricks to try might be handy...
It's a shame that heat is just dumped outside most of the time.
EDIT:
The article talks about Google's impressive technical achievements. But there's a lot of energy that's wasted in industry. I don't mean "used inefficiently" (although that's bad too); I mean actually wasted.
I used to work at a tiny electronic sub-contracting factory. The morning shift would arrive, turn on the air compressor (2 kW), the reflow ovens (10 kW and 12 kW), and the other machines (about 7 kW).
But they'd do that even if the machines were not going to be running. All those kilowatts were being used for no reason at all. And the machines are pretty inefficient anyway. (One of the owners thought powered machines looked more impressive. Energy costs were included in the rent, so there was no incentive to think about when the machines were on or off.)
Counting that waste across all the tiny factories in the world, and including all the waste in offices - it's quite a lot.
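For a rough sense of scale, here's a quick back-of-the-envelope estimate in Python, using the wattages above plus assumed idle hours and working days (my numbers, not the original poster's):

    # Order-of-magnitude estimate for one small factory, using the wattages
    # quoted above plus assumed idle time (4 idle hours per 8-hour shift,
    # 250 working days per year -- both assumptions).
    idle_kw = 2 + 10 + 12 + 7        # compressor + two reflow ovens + other machines
    idle_hours_per_day = 4           # assumption
    working_days_per_year = 250      # assumption
    kwh_wasted_per_year = idle_kw * idle_hours_per_day * working_days_per_year
    print(kwh_wasted_per_year)       # 31,000 kWh/year -- and that's a single tiny factory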
Low-grade heat (< 400 °C) is really difficult to do much with. If you happen to need it right at the spot where it is generated (basically heating buildings), great.
Otherwise you are pretty much out of luck. The efficiency of energy extraction from a heat engine is thermodynamically limited by the difference between the hot and cold sides. And trying to transport it any significant distance ends up being more trouble than it's worth, as pumping water gets energy intensive very quickly (and air has terrible heat capacity).
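To put rough numbers on that thermodynamic limit, here is a minimal Carnot-efficiency sketch in Python, with assumed temperatures (not figures from this thread):

    # Carnot limit: the best possible efficiency of any heat engine depends
    # only on the hot- and cold-side temperatures (in kelvin).
    def carnot_efficiency(t_hot_c, t_cold_c):
        t_hot = t_hot_c + 273.15
        t_cold = t_cold_c + 273.15
        return 1 - t_cold / t_hot

    # Assumed, illustrative temperatures:
    print(carnot_efficiency(400, 25))  # ~0.56 upper bound; real engines do far worse
    print(carnot_efficiency(45, 25))   # ~0.06 -- typical server exhaust is nearly useless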
It is a shame. However, it is surely much easier to reduce wasted heat by simply buying more modern, more efficient CPUs that will probably be more powerful.
Few office buildings require heat, even in cold-weather climates. The residual heat from lighting, office equipment, bodies, etc., generally has to be removed, not augmented.
Not always the case, but I can assure you that in California, office heating demands are very, very low.
I heard secondhand that this is even the case in Chicago in the winter. (In my apartment in a highrise there, I never ran my heat. The building was naturally 75 degrees. On some days, I even turned on the AC to get it down to a more comfortable 72!)
To me, broadly speaking, inefficiency in this context is a property of the core infrastructure: if you run an engine that is 20% efficient at converting gasoline to electrical power, then that is what you are stuck with until you replace or upgrade the equipment.
Waste is operating any equipment when not needed, regardless of its internal efficiency.
Another example: Using an incandescent light bulb instead of an LED is inefficient - but leaving either on when not needed is wasteful.
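Putting assumed wattages on that distinction (a sketch, not data from the thread):

    # Inefficiency: the extra power the worse device draws while doing useful work.
    # Waste: any power drawn when no work is needed at all.
    incandescent_w, led_w = 60, 9
    hours_needed, hours_left_on_unneeded = 4, 8    # both assumptions

    inefficiency_wh = (incandescent_w - led_w) * hours_needed   # 204 Wh: cost of the worse bulb
    waste_wh = led_w * hours_left_on_unneeded                   # 72 Wh: lost even with the better bulb
    print(inefficiency_wh, waste_wh)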
Understood, which is why Google, if it's going to all the effort of setting up open-door press days, is missing a huge opportunity by not employing highly visual metaphors and exhibits to demonstrate how truly amazing a modern data center is, not to mention reinforcing how critical it is for so much of what the average user does every day.
They should probably contract a couple of Disney Imagineers and do it right. They could benefit from the humanizing effects and even create a quirky, niche destination in the process. Hell, throw in an "Android Experience" showroom and it'd probably even have legitimate commercial value.
Just think how much wasted effort and embarrassment you could have saved Pixar, Disney/Disneyworld, Hollywood, Shakespeare, and J.K. Rowling if you'd been there to point this out to them.
I wonder why they've mirrored the image (the left side is quite clearly the right side flipped--take a look at the machine identifier labels). What's being hidden?
The blue LEDs in the picture you linked to indicate that the servers are running smoothly. [1] It's possible that some servers were faulty at the time the picture was taken. More likely, it's to make the image look perfect.
Hi Google Platform people. Very nice work. As you may know, Randall Munroe (of xkcd fame) has recently started a feature called "What If" on his site. I would like to pose a question to you along those lines:
What if Google was tasked with building an orbiting datacenter? How about a Dyson ring, or sphere? How would you do it?
If we were to use all matter in the solar system for commodity linux hardware, how much gmail storage would I get? How many flops? And what sorts of computation could you do on this monster?
Space would be a terrible environment for building a datacenter. The main goal of a datacenter is to make computation as cheap, as fast, and as reliable as possible. Having the datacenter orbit the Earth would not help us accomplish any of these goals.
First off, building a datacenter in space would not be cheap. It costs around $25,000 to send a kilogram of equipment into a geostationary orbit. [1] So let's assume we were to use a Dell PowerEdge C1100. Each server costs $14,000 and weighs 18 kg. [2] This means for each server sent into orbit, you could buy 32 extra ones on Earth.
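A quick check of that arithmetic, using only the figures quoted above:

    # Back-of-the-envelope launch economics.
    launch_cost_per_kg = 25_000     # USD per kg to geostationary orbit
    server_cost = 14_000            # USD, Dell PowerEdge C1100
    server_mass_kg = 18

    launch_cost = launch_cost_per_kg * server_mass_kg    # $450,000 just to lift one server
    extra_servers = launch_cost / server_cost            # ~32 servers for the same money
    print(launch_cost, round(extra_servers))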
Then, there is the issue of cooling. Although outer space is really cold, its vacuum prevents the heat generated by the machines from being dissipated quickly. Controlling the temperature of such a datacenter would be a very interesting engineering challenge.
And then how would you power this datacenter? Converting the excess heat back into electricity could be an interesting option. But most likely, it would need a lot of solar panels. This would make the datacenter cheap to run once built, but the upfront costs would be enormous.
And we haven't talked about speed and reliability yet. Since the signal would need to travel about 35,000 km from geostationary orbit to reach us, communications between Earth and the datacenter would have significant delays. Even at the speed of light, the minimum round-trip time would be about 250 milliseconds if we ignore all other possible sources of delay.
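That round-trip figure is easy to sanity-check (the exact geostationary altitude of 35,786 km is assumed; the comment above just says "about 35,000 km"):

    # Minimum round-trip light delay to geostationary orbit, ignoring routing,
    # queuing, and processing entirely.
    altitude_km = 35_786            # assumed geostationary altitude
    c_km_per_s = 299_792            # speed of light in vacuum
    round_trip_ms = 2 * altitude_km / c_km_per_s * 1000
    print(round(round_trip_ms))     # ~239 ms, before any other source of delay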
The hostile space weather would also make it pretty hard to run servers reliably. Radiation would destroy electronics, cause bits to flip randomly, and do all kinds of fun stuff to the equipment.
But... anyhow! Let's assume anyway that by some magical work of science and Google engineering, we figure out ways to manufacture a datacenter directly in space for almost nothing by mining the Moon, discover some amazing thermoelectric generators with near-100% efficiency, and build space shields that block almost all radiation.
So, back to our previous example: a high-performance PowerEdge gives us up to about 300 GFLOPS of computing power, 192 GB of RAM, and 12 TB of storage.
Now if we were to convert the total mass of the Moon (7.34767309 × 10²² kg) into one monstrous datacenter, this would give us about 4.0 × 10²¹ servers. That's a whopping 1.2 billion yottaFLOPS (or, put differently, 1.2 × 10³³ FLOPS) of compute madness, 0.8 billion yottabytes of RAM, and 49 billion yottabytes of storage. This monster would consume the equivalent of about 1% of the Sun's total power output.
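Those numbers reproduce from the per-server specs above; the only figure I had to assume is power draw per server (roughly 1 kW each, which is what makes the "1% of the Sun" line work out):

    # Reproducing the moon-datacenter estimate. Power per server is an
    # assumption (~1 kW each); everything else comes from the figures above.
    moon_mass_kg   = 7.34767309e22
    server_mass_kg = 18
    gflops, ram_gb, storage_tb = 300, 192, 12

    servers = moon_mass_kg / server_mass_kg           # ~4.1e21 servers
    flops   = servers * gflops * 1e9                  # ~1.2e33 FLOPS (1.2 billion yottaFLOPS)
    ram_yb  = servers * ram_gb  * 1e9  / 1e24         # ~0.8e9 yottabytes of RAM
    disk_yb = servers * storage_tb * 1e12 / 1e24      # ~49e9 yottabytes of storage

    watts_per_server = 1_000                          # assumption, not from the thread
    sun_output_w     = 3.8e26
    print(servers * watts_per_server / sun_output_w)  # ~0.011, i.e. about 1% of the Sun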
Thanks for playing along! But realize that one of the great reasons to put a datacenter in space is physical security. Another reason would be unparalleled data connectivity to the entire planet. But yes, it is a very harsh environment, and waste heat is difficult to dissipate. And of course launch costs are very high. The real reason I asked is because I think it's bloody good fun (and figured the Google folks would get a kick out of it).
Some follow-up questions: let's assume that we need to move 10^10 yottabytes from the MoonPC to the Earth. How do we do it? What's the fastest we could do it without transferring so much heat that it melts either end of the connection?
A heatsink works because there is some sort of medium that absorbs heat from the sink and carries it away. On Earth, we use air for this, sometimes with the help of a fan.
If you stick a heatsink on equipment in space, there's no air to move the heat away, since space is mostly empty. You'll bleed off some through infrared radiation, but that's not going to be enough.
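A rough Stefan-Boltzmann estimate shows why: radiating into vacuum moves surprisingly little heat per unit area. The panel size, emissivity, and temperature below are assumptions, and incoming solar radiation is ignored:

    # How much heat can a radiator panel dump in vacuum by radiation alone?
    SIGMA = 5.67e-8                               # Stefan-Boltzmann constant, W / (m^2 K^4)
    area_m2, emissivity, temp_k = 1.0, 0.9, 330   # assumed: 1 m^2 panel at ~57 C
    watts = emissivity * SIGMA * area_m2 * temp_k**4
    print(round(watts))                           # ~600 W -- so a 1 MW datacenter would need
                                                  # on the order of 1,700 m^2 of radiator area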
It's pretty amusing that the corporate firewall is making your system far less secure. There have most likely been hundreds of security issues fixed since Chrome 15.
Google should one-up Amazon and get into the Datacenter As A Service market. Service segments: normal cages (I'd rather lease cages from Colorful Pipes, Inc than Equinix), pay-n-go turnkey same-hardware in 3 georedundant locations, and lease-by-rack in multiples of 10 pre-populated racks (racks specified as compute-only or storage-only with 10G interconnects between racks).
It's doubtful that they could directly compete as is. Amazon has done well with their services because they eat their own dog food. From everything I read Bezos basically forced them to build this system and consume it for Amazon's own needs. Google has never taken this approach with their APIs and the difference shows very clearly when you consume these products.
I assume you're referencing stevey's rant, and if so you're conflating two issues. Steve was talking about the use of APIs between services at the application level, not the API for the datacenter/cloud as a platform.
I think cr is talking about how you get "baby bigtable" and "baby cloudscale" that are copies of internal services, but not what core platforms use themselves.
EC2 is relevant today, but it will become less and less relevant as the Platform-as-a-Service offerings (S3, SimpleDB, App Engine, Heroku, etc.) get better. There will come a point where fewer and fewer companies actually use VMs directly. Seems like a reasonable play for Google to sit this round out and focus on winning the next one.
How did Google do this time? Pretty well. Despite the outages in the corporate network, executive chair Eric Schmidt was able to run a scheduled global all-hands meeting. The imaginary demonstrators were placated by imaginary pizza.
How does one decide what will placate imaginary demonstrators? Who calls them off?
For the purposes of tests like that, they probably just wanted to see that "reasonable action was taken", which will (hand-waving) probably take care of most instances of that type. In the event of real demonstrators, it would just be the opening salvo of damage control, but it's too hard to predict how a crowd of angry people would react past the first move.
I'm starting to get annoyed with the "a power efficiency of 2 is the standard in datacenters" line. My servers are hosted in a datacenter with a global efficiency of 1.15, proven over more than a year in operation. Announcing that Google is doing 1.2 is simply announcing something wrong, and I suppose Google is very happy with this number being provided to the press. It means that some competitor will use it as "Google is the best, they do 1.2, we are at 1.3, we are not too bad", whereas I bet Google is now near 1.1 or less (they operate without cooling in Belgium, for example).
You're right -- those PUE numbers from the article were talking about their PUE at the time. Google's 2012 average PUE across all facilities was 1.12/1.13 with a minimum PUE of 1.09/1.10.
Also, Google puts enormous care into the process of calculating PUE, since it's something of a black art and if you aren't careful you'll leave out some aspect of your operation that will mislead you into thinking your PUE is lower than it is.
PUE 2 probably was the standard when Google started building their DCs. There's a wide range between enterprisey DCs that consider going from 2 to 1.8 an epic win and clouds/hosters who are getting below 1.3. Google's PUE data since 2008 is published at http://www.google.com/about/datacenters/efficiency/internal/...
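For anyone unfamiliar with the metric, PUE is just total facility power divided by the power that reaches the IT equipment; the numbers below are illustrative, not from the thread:

    # PUE (power usage effectiveness) = total facility power / IT equipment power.
    total_facility_mw = 10.0     # assumed: everything the site draws from the grid
    it_equipment_mw   = 8.0      # assumed: what actually reaches servers/network/storage
    pue = total_facility_mw / it_equipment_mw
    print(pue)                   # 1.25 -- 0.25 W of cooling/distribution overhead per watt of compute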
It is unfortunate (for the rest of us) that datacenter tech is such a competitive advantage for Google. If they were able to share their breakthroughs more readily with others, imagine how much less than the current "1.5% of all power globally" datacenters could be using.
Officially announcing things that "everybody knows" already can still make a difference. It means that you can ask Google executives about those things in public appearances and they'll at least acknowledge the question even if they refuse to answer substantially.