
You roll out incrementally, and keep interfaces between components backwards compatible for all versions presently out and any you may need to roll back to, if you possibly can.
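To make "backwards compatible" concrete, here is a tiny sketch (field names are invented for illustration) of a server that accepts requests from both old and new clients by treating an added field as optional:

    # Hypothetical RPC handler: newer clients send "currency", older clients
    # predate it, so the server defaults it instead of requiring it.
    def handle_create_order(request: dict) -> dict:
        items = request["items"]
        currency = request.get("currency", "USD")  # tolerate old clients
        total = sum(item["price"] for item in items)
        return {"status": "ok", "total": total, "currency": currency}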

When you cannot, I, personally, believe in partitioning traffic across concurrent versions. This can be done dynamically or statically -- really it depends on the nature of your system; in general, something will stand out as obviously right for your situation.

To take an example, if your service is primarily user-centric, you can partition the system by user and roll out accordingly. Let's say you have four interacting systems: a front end proxy which understands the partition boundaries, an appserver, a caching system, and a database -- pretty typical.

The front end proxy in this system is shared by all users (this need not always be true, as you can play subdomain and DNS games, but that is a different major headache), but everything behind it can be dedicated to the partition (this is not necessarily efficient, but it is easy).
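As a rough sketch of the static variant (the header name, partition count, and hash-based assignment are all my own assumptions, not anything canonical), the proxy's routing decision can be as simple as:

    import zlib

    PARTITIONS = 8  # hypothetical number of user partitions

    # Each partition gets its own dedicated stack behind the shared proxy.
    BACKENDS = {p: f"appserver-p{p}.internal:8080" for p in range(PARTITIONS)}

    def partition_for_user(user_id: str) -> int:
        # Stable hash so every proxy instance agrees on the assignment.
        return zlib.crc32(user_id.encode()) % PARTITIONS

    def route(headers: dict) -> str:
        user_id = headers["X-User-Id"]  # assumed header identifying the user
        return BACKENDS[partition_for_user(user_id)]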

Now, let's say we need to make a backwards-incompatible, coordinated change to the appserver and databases associated with the partition. As we cannot roll these atomically without downtime, we pick an order, let's say appserver first. In this case we will wind up rolling two actual changes to the appserver and one to the databases.

The appserver will go from A (the initial version) to an A' which is compatible with both the A and B databases; then the databases go from A to B, and the appservers from A' to B. You'll do this on one small partition and, once done, let it bake for a while. After that, you'll roll the same change across more, typically going to exponentially more of the system (i.e., 1 partition, 2 partitions, 4 partitions, 8 partitions, etc.).
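To make the A' step concrete, here is a toy example of what "compatible with both A and B databases" can look like in appserver code; the split-name schema change is purely invented:

    # Hypothetical change: schema A has a single "name" column, schema B
    # splits it into "first_name" / "last_name".  A' must handle either.
    def read_user(row: dict) -> dict:
        if "name" in row:                          # schema A row
            first, _, last = row["name"].partition(" ")
        else:                                      # schema B row
            first, last = row["first_name"], row["last_name"]
        return {"first_name": first, "last_name": last}

    def write_user(first: str, last: str, schema_version: str) -> dict:
        if schema_version == "A":
            return {"name": f"{first} {last}"}
        return {"first_name": first, "last_name": last}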

This means you have a (hopefully short-lived) interim release of one or more components, which is probably grossly inefficient, but you wind up in a stable state when complete. The cost of doing this is not pleasant, as you basically triple QA time (final state, interim state, and the two upgrade transitions) and add a non-trivial chunk of development time (the interim state). That said, this is why most folks just take the downtime until the cost of the downtime is greater than the cost of the extra development.

This is, of course, a pain in the ass to coordinate. It is easy to do with relatively small "big" systems (fewer than a few hundred servers, say, assuming you have good deployment automation), and the pain of coordinating is probably still less than the pain of baking component versioning into everything... for a while.

An alternate model, which requires significantly more up front investment, is to support this in a multi-tenant system where you don't (for upgrade purposes) dedicate a clone of the system to each partition. Instead you can bake version awareness into service discovery and dynamically route requests accordingly.
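A sketch of what that version-aware routing can look like (the registry contents and partition map are invented; a real deployment would back this with something like ZooKeeper, etcd, or Consul):

    import random

    # service name -> version -> instances advertising that version
    REGISTRY = {
        "appserver": {
            "A'": ["10.0.1.5:8080", "10.0.1.6:8080"],
            "B":  ["10.0.2.5:8080"],
        },
    }

    # which appserver version each partition has been migrated to
    PARTITION_VERSION = {0: "B", 1: "A'", 2: "A'"}

    def discover(service: str, partition: int) -> str:
        # Route to an instance running the version this partition expects.
        version = PARTITION_VERSION[partition]
        return random.choice(REGISTRY[service][version])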

A very traditional approach (of the blue-suited variety) is to use an MQ system for all RPCs and tag the version into the request, then select from the queue incorporating the version. This makes the upgrade almost trivial from a code-execution point of view, and it can even help with data migration, since you can queue up writes while the intermediate-state database is live and play a catch-up game to flip over to the end-state database. This is the subject for a blog post, though, rather than a comment, as it is kind of hairy :-)
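A minimal illustration of the version-tagged dispatch (an in-memory queue stands in for the real broker, which would do this with routing keys or message selectors):

    import queue
    from collections import defaultdict

    # one logical RPC queue per interface version
    rpc_queues = defaultdict(queue.Queue)

    def send_rpc(method: str, payload: dict, version: str) -> None:
        # callers tag each request with the interface version they speak
        rpc_queues[version].put({"method": method, "payload": payload})

    def serve(my_version: str, handlers: dict) -> None:
        # a worker built against one version only ever sees matching requests
        q = rpc_queues[my_version]
        while True:
            msg = q.get()
            handlers[msg["method"]](msg["payload"])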



