Lessons from a Google App Engine SRE on how to serve over 100B requests per day (googleblog.com)
168 points by rey12rey on April 6, 2016 | 25 comments



I used to be a scientist, and I went to work at Google to apply their technology to science problems. My first team was SRE, and I have to say, Google's SRE approach to computing completely changed how I thought about things and, more importantly, how I programmed systems that went to production. I've read the SRE book and can highly recommend learning from the principles it lays out.


I've skimmed it, but even the appendices alone are worth a look. Not always immediately practical in every situation, as with anything Google, but it's definitely a "handbook" I'll be studying closer over the next few weeks. :)

Oh, and while Google Play may be cheaper, it's also on Safari Books if folks have subscriptions to that, e.g. via libraries/ProQuest.


Can you please give a reference to that book?



The book they mention[1] is very good so far.

[1]https://play.google.com/store/books/details/Betsy_Beyer_Site...


Upvoting because this link is cheaper than the Amazon link.


If you are willing to wait, O'Reilly puts their ebooks on sale for 50%-60% off at several points during the year. The best time is Black Friday.


> Advance preparation, combined with extensive testing and contingency plans, meant that we were ready when things went [slightly wrong] and were able to minimize the impact on customers.

Following that link provided some interesting reading (for a mundane error report, at least): https://groups.google.com/forum/#!msg/google-appengine-downt...

TIL that even Google have datacenter fluctuations they can't figure out. It's nice that they quietly make this info publicly available, and also nice that I've now discovered where to find it :)


I love their solution.

They turned it off, then on again.


I've been seeing a lot of references to SRE recently. Is Google trying to market this position and acquire more engineers?

The SRE book, and Google in general, have mentioned that SREs are notoriously hard to hire, and I'm wondering if they are doing a marketing push.


> I've been seeing a lot of references to SRE recently. Is Google trying to market this position and acquire more engineers?

There is a bit of a gap (in terms of attitude and skill set) between what Google calls an SRE and what most other companies call an SRE.

I think Google is trying to steer the public usage of the word so that their term doesn't get diluted. One possible reason might be so that SREs at Google don't feel like they're making a bad career move by having the term "SRE" on their resume.

If you spent 5 years working for the state of California, designing safer future-proof and state-of-the-art treatment plans and plants for gray and black waste water in metropolitan and rural areas, improving life expectancy by 2% for people who live in California, but your title was "Sanitation Engineer" the whole time, you're going to be a bit put out if you learn that during that time all the high schools in the state changed the custodians' titles to the same thing.


SREs are very hard to hire, speaking from experience. At Google, SRE directors and VPs will often cherry-pick promising candidates from the mainline SWE hiring pipeline and give them a "hero call" to convert them to SREs. SREs at Google are also paid more, controlling for level and performance, as a way to hire and retain them.


Interesting. Can you expand on "hero call"? What does that entail?


Donning a cape and meeting destiny.

In all seriousness they make it out to be more than it is. From my experience going through their hiring pipeline there seem to be two tracks in SRE; software and sysadmin. If you score higher in algorithms and data-structures, presumably, you'll end up working more on tools and libraries whereas in the other you'll work more on infrastructure and automation. Either way both tracks work together on the same team towards the same goals.

If you want in, be prepared to solve simple-to-tough algorithms problems and be quizzed on TCP retransmission, Linux system calls, and memory pressure. It's a bit challenging because you not only have to know Big-O well enough to estimate the asymptotic complexity of an arbitrary algorithm, but you might also be asked what a sequence of TCP packets would look like if you sent some data and pulled the plug, or what the parameters are to a given system call on Linux. You quite literally have to know everything: how virtual memory works, how to implement a fast k-means, how the network stack works from top to bottom, etc.

If you've done any work in cloud development and supported a moderately large one, it's that but bigger. Make one a hero, it does not.


It's just a guy on the phone telling you how you won't be like those other chumps, you'll be a hero. The few and the proud. Standard recruitment techniques.


Anecdata here, but Google have been constantly hiring in SRE for as long as I can remember (in London).

As someone who's moving in that direction career-wise, I don't think they're any harder to hire than a good software engineer. The skillsets differ somewhat, but the culture and thought process are very similar, and that is what you hire for.


I'd imagine so. SRE has become such a buzzword title, with many companies doing it so differently, that hiring has become more difficult.


I'm actually really really glad that Google released this book because I think they are one of the few companies that is actually doing this SRE thing right. I think the hardest bit about the SRE paradigm (like DevOps) is having companies wholly adopt it, and I think that this book being out will help change that.


This got me wondering what the AWS services' workload per day was. The best numbers I could find were from this 2013 article about serving ≈95 billion requests per day for just S3. The size and scope of cloud providers is truly cool and fascinating engineering.

https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...


I don't know why this isn't on HN, but this is another interesting post from the Google Cloud Platform blog from today:

https://cloudplatform.googleblog.com/2016/04/Google-and-Rack...



s/lessons/lesson/

"If you put a human on a process that’s boring and repetitive, you’ll notice errors creeping up. Computers’ response times to failures are also much faster than ours. In the time it takes us to notice the error the computer has already moved the traffic to another data center, keeping the service up and running. It’s better to have people do things people are good at and computers do things computers are good at."


1,157,407 requests per second.
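
For anyone checking the arithmetic, a minimal sketch follows. It assumes exactly 100 billion requests spread evenly over a 24-hour day (real traffic is not flat, so peaks would be higher); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope conversion from requests per day to requests per second.
# Assumption: the load is perfectly uniform across the day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_second(requests_per_day: int) -> float:
    """Average request rate per second for a given daily total."""
    return requests_per_day / SECONDS_PER_DAY


rate = requests_per_second(100_000_000_000)
print(f"{rate:,.0f} requests per second")
```

This is only the daily average; it says nothing about burst capacity.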


One Node.js server can do 10x that! /s


Only 100 bytes? That's easy... Sheesh.



