What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line tooling gap, e.g. for monitoring TPU utilization?
My overall impression is that TPUs are pretty awesome, but the software stack is still a bit hard to use compared to mature GPU tools. I’d imagine it’s pretty hard for inexperienced users to get started.
If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch/XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.
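(If you want a quick sanity check after creating a TPU VM, a minimal JAX snippet like the one below should list the attached TPU cores; the exact device count depends on the VM type, so treat the comments as illustrative rather than guaranteed.)

    # Minimal sanity check on a TPU VM: confirm JAX sees the TPU cores.
    # Assumes a recent jax[tpu] install; device counts vary by VM type.
    import jax

    print(jax.devices())             # e.g. 8 TpuDevice entries on a v3-8
    print(jax.local_device_count())  # number of cores attached to this VM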
People seem to find Cloud TPUs especially easy to use, compared to the alternatives, when scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores (see the sketch below). You still need to work through the ML challenges of scaling, but those are more tractable when you aren't simultaneously struggling with systems-level issues.
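To make that concrete, here's a rough sketch of what single-program data parallelism looks like in JAX. This isn't official Cloud TPU sample code; loss_fn and all the shapes are hypothetical placeholders. The gist is that adding pmap plus a gradient all-reduce is most of the systems work:

    # A rough data-parallel sketch in JAX; loss_fn and shapes are made up.
    import functools
    import jax
    import jax.numpy as jnp

    def loss_fn(params, batch):
        # Toy quadratic objective standing in for a real model.
        return jnp.mean((batch @ params) ** 2)

    @functools.partial(jax.pmap, axis_name="batch")
    def train_step(params, batch):
        grads = jax.grad(loss_fn)(params, batch)
        # All-reduce gradients across cores over the TPU interconnect.
        grads = jax.lax.pmean(grads, axis_name="batch")
        return params - 0.1 * grads

    n = jax.local_device_count()     # e.g. 8 on a v3-8 TPU VM
    params = jax.device_put_replicated(jnp.ones((4,)), jax.local_devices())
    batches = jnp.ones((n, 32, 4))   # leading axis: one shard per core
    params = train_step(params, batches)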
In particular, you don't need to master a sequence of different networking technologies as you scale up. The TPU interconnect is also so much faster at scale than other technologies (10X, last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.
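As one illustration (a sketch, not official guidance), JAX's sharding APIs let you lay out a large weight matrix across cores and have XLA's partitioner insert the cross-core communication for you. The 2x4 mesh below assumes an 8-core TPU VM, and the shapes are invented:

    # A hedged sketch of model-parallel sharding with JAX's sharding APIs;
    # the 2x4 mesh assumes 8 cores, and all shapes are hypothetical.
    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices()).reshape(2, 4)   # assumes 8 cores
    mesh = Mesh(devices, axis_names=("data", "model"))

    # Split the weight matrix's columns across the "model" mesh axis.
    w = jax.device_put(jnp.ones((1024, 4096)),
                       NamedSharding(mesh, P(None, "model")))
    x = jnp.ones((8, 1024))

    # Under jit, XLA inserts the communication the sharded matmul needs.
    y = jax.jit(lambda x, w: x @ w)(x, w)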
To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating:
https://github.com/facebookresearch/metaseq/blob/main/projec...