I think that decoupling teams boils down to giving teams complete ownership. And Amazon got parts of it right. It means that your team owns everything it builds. You own the code, you own the testing and you own the operations: you own the product. Various tools are laid at your feet, and you are asked to build.
Clearly, a benefit is that you can move fast. You don't need permission from someone half a building away to do something. You don't need to touch code that needs another team's approval. There are no committees that decide on global rules. Your team decides on your team's rules.
Like a shared-nothing architecture, there's very little that is shared between teams. Teams are often connected only via their service interfaces; not much else beyond common tooling.
But even their tooling reflects decoupling. Every tool follows the self-service model ("YOU do what you WANT to do with YOUR stuff"). Their deployment system (named Apollo, mentioned in the slides), their build system, and their many other tools all reflect this model.
Cons. What happens is that you end up reinventing the wheel at Amazon. Often. Code reuse is very low across teams, so more often than not there's no shared cost of ownership at Amazon. It's the complete opposite at Google w.r.t. code reuse: there are many very high-quality libraries at Google that are designed to be shared. Guava (the Java library) is a great example.
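For a flavor of what that kind of shared library looks like, here's a tiny sketch using a couple of Guava utilities (the class names are Guava's; the surrounding example and values are mine):

```java
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableList;

import java.util.List;

public class GuavaExample {
    public static void main(String[] args) {
        // Parse a comma-separated config value without hand-rolling the edge cases.
        List<String> regions = Splitter.on(',')
                .trimResults()
                .omitEmptyStrings()
                .splitToList("us-east-1, us-west-2,, eu-west-1");

        // Immutable collections make shared data safe to hand between components.
        ImmutableList<String> frozen = ImmutableList.copyOf(regions);

        // Joiner is the symmetric counterpart to Splitter.
        System.out.println(Joiner.on(" | ").join(frozen));
    }
}
```

The point isn't the specific utilities; it's that these are the kind of small, well-tested building blocks that only get written once when code sharing is the norm.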
Another con. You may not know what you're doing. But as a team you will still cobble together a rickety solution that gets the job done. This is the result of giving a team complete ownership: they'll build what they know with what they have. Amazon is slowly correcting some of these problems by having dedicated teams own specific Hard Problems. Storage systems are a good example.
And a lack of consistency is a common issue across Amazon. Code quality and conventions fluctuate wildly across teams.
Overall, Amazon has figured out how to decouple things very well.
How do these services communicate with each other? How can a single page make hundreds of requests and yet get it all together in a fraction of a second?
There are several different communication methods between services, including REST, SOAP, message queues, and an internal service framework. It's a perfect example of bonafidehan's post.
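I can't speak to the internal framework, but at the plain-REST end of that spectrum a service-to-service call is just HTTP. A minimal sketch (the service hostname and endpoint here are invented for illustration):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestCallExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical internal endpoint; a real service framework wraps this
        // kind of call with discovery, retries, and serialization.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://pricing-service.internal/items/B000123/price"))
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```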
As for the second question, a page generally doesn't have to make hundreds of requests. You're thinking of a flat architecture. Think of it more like a pipeline: data goes in at A, flows from A->B->C->D, and the page reads D. So you end up having to call a handful of services. That can be scaled by 1) caching, 2) careful selection of service calls (don't call the ordering service unless you're placing an order), and 3) asynchronous requests (you're typically going to be I/O bound on the latency, so just spin up X service requests and then wait on them all). There are also other tricks that are fairly well known for reducing latency, such as displaying a limited set of information and loading the rest via AJAX.
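To make the asynchronous-requests point concrete, here's a rough sketch of the fan-out-and-wait pattern. The service clients are hypothetical stand-ins, not anything Amazon actually exposes:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PageAssembler {

    // Hypothetical downstream calls; each stands in for a blocking service client.
    static String fetchProduct(String id) { return "product:" + id; }
    static String fetchReviews(String id) { return "reviews:" + id; }
    static String fetchSimilar(String id) { return "similar:" + id; }

    public static void main(String[] args) {
        // The calls are I/O bound, so fire them all off and wait on the lot
        // instead of calling each service in sequence.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            String id = "B000123";
            CompletableFuture<String> product = CompletableFuture.supplyAsync(() -> fetchProduct(id), pool);
            CompletableFuture<String> reviews = CompletableFuture.supplyAsync(() -> fetchReviews(id), pool);
            CompletableFuture<String> similar = CompletableFuture.supplyAsync(() -> fetchSimilar(id), pool);

            // Overall latency is roughly the slowest single call, not the sum of all three.
            CompletableFuture.allOf(product, reviews, similar).join();

            List<String> fragments = List.of(product.join(), reviews.join(), similar.join());
            fragments.forEach(System.out::println);
        } finally {
            pool.shutdown();
        }
    }
}
```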
As a disclaimer for the above, my work doesn't involve the Amazon.com website directly, so it's based on my limited view in my domain space.
If you own a page or service that calls a bunch of other services, you typically collect metrics on the latency of your downstream services. Amazon has libraries to facilitate this, and a good internal system for collecting and presenting this data. If one service is particularly troublesome, then you can reach out to that other team and get them to lower their latency. The other option is to pull their data in closer to you, in a format that you can consume quickly.
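Those internal libraries aren't public, so purely as a sketch of the shape of the idea, here's a toy version of wrapping each downstream call with a timer and aggregating per-service latency (all names here are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

public class DownstreamMetrics {

    // Tiny stand-in for a real metrics library: total latency and call count per service.
    private final Map<String, LongAdder> totalMillis = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    public <T> T timed(String service, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            totalMillis.computeIfAbsent(service, s -> new LongAdder()).add(elapsedMillis);
            calls.computeIfAbsent(service, s -> new LongAdder()).increment();
        }
    }

    public void report() {
        calls.forEach((service, count) -> {
            long avg = totalMillis.get(service).sum() / Math.max(1, count.sum());
            System.out.printf("%s: %d calls, avg %d ms%n", service, count.sum(), avg);
        });
    }

    public static void main(String[] args) {
        DownstreamMetrics metrics = new DownstreamMetrics();
        // Hypothetical downstream call wrapped with timing.
        String reviews = metrics.timed("reviews-service", () -> "top reviews for B000123");
        System.out.println(reviews);
        metrics.report();
    }
}
```

A real system would record percentiles and ship the numbers somewhere central, but the wrap-every-call discipline is the part that makes the "which dependency is slow" conversation possible.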
We recently had our 6 billionth photo uploaded, and we (the engineers) just built and rolled out the new geofences feature. Engineers really get to make a difference at Flickr on a daily basis. http://code.flickr.com/blog/2011/08/30/in-the-privacy-of-our... (That's my kitchen.)
We're hiring backend engineers, designers, and operations. Drop me a line at caudill -at- yahoo-inc.com if you're interested. I'd love to talk to you.
The three reasons that jump out at me would be:
1) Redundancy.
2) Cupertino wants them off their grid.
3) Or they're going to have a unique power footprint that they don't want to share with the world.
Indeed. A lot of hard work went into the photo page redesign/rebuild. It's difficult to do a major overhaul of a site that serves so many different needs for so many different kinds of users.
Hi, sorry you've had some difficulties with the site. We've made performance a big focus on the photo page, but there are other sections of the site we're still trying to make faster. Regarding the broken images, we should be all clear on that (we recently brought up a new image cluster), so if you're still seeing that, let me know. I'm caudill @ yahoo-inc.com.
If you're a smart developer who can get things done, I'd like to talk to you. We've got lots of interesting problems to solve on a site that is used and loved by millions of people daily.
We're currently looking for a backend engineer to help build new features that millions and millions of people will use. If the idea of pushing code live a dozen times a day while being the official caretaker of the White House and the British monarchy's photos interests you, I want to talk to you. You can email me at caudill -at- yahoo-inc.com.
Flickr is currently looking for an intern. Email me at caudill at yahoo-inc.com if you want to write a bit of code for us this summer! We're set up in the Financial District in SF.