What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line tooling gap, e.g. for monitoring TPU utilization?
My overall impression is that TPUs are pretty awesome, but the software stack is still a bit hard to use compared to mature GPU tools. I’d imagine it’s pretty hard for inexperienced users to get started.
If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch/XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.
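(If you want a quick sanity check after creating a TPU VM, a minimal JAX snippet like the one below should list the attached TPU cores; the exact device count depends on the VM type, so treat the comments as illustrative rather than guaranteed.)

    # Minimal sanity check on a TPU VM: confirm JAX sees the TPU cores.
    # Assumes a recent jax[tpu] install; device counts vary by VM type.
    import jax

    print(jax.devices())             # e.g. 8 TpuDevice entries on a v3-8
    print(jax.local_device_count())  # number of cores attached to this VM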
People seem to find Cloud TPUs especially easy to use, compared to the alternatives, when scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores (see the sketch below). You still need to work through the ML challenges of scaling, but those are more tractable when you aren't simultaneously struggling with systems-level issues.
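To make that concrete, here's a rough sketch of what single-program data parallelism looks like in JAX. This isn't official Cloud TPU sample code; loss_fn and all the shapes are hypothetical placeholders. The gist is that adding pmap plus a gradient all-reduce is most of the systems work:

    # A rough data-parallel sketch in JAX; loss_fn and shapes are made up.
    import functools
    import jax
    import jax.numpy as jnp

    def loss_fn(params, batch):
        # Toy quadratic objective standing in for a real model.
        return jnp.mean((batch @ params) ** 2)

    @functools.partial(jax.pmap, axis_name="batch")
    def train_step(params, batch):
        grads = jax.grad(loss_fn)(params, batch)
        # All-reduce gradients across cores over the TPU interconnect.
        grads = jax.lax.pmean(grads, axis_name="batch")
        return params - 0.1 * grads

    n = jax.local_device_count()     # e.g. 8 on a v3-8 TPU VM
    params = jax.device_put_replicated(jnp.ones((4,)), jax.local_devices())
    batches = jnp.ones((n, 32, 4))   # leading axis: one shard per core
    params = train_step(params, batches)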
In particular, you don't need to master a sequence of different networking technologies as you scale up. The TPU interconnect is also so much faster at scale than other technologies (10X, last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.
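As one illustration (a sketch, not official guidance), JAX's sharding APIs let you lay out a large weight matrix across cores and have XLA's partitioner insert the cross-core communication for you. The 2x4 mesh below assumes an 8-core TPU VM, and the shapes are invented:

    # A hedged sketch of model-parallel sharding with JAX's sharding APIs;
    # the 2x4 mesh assumes 8 cores, and all shapes are hypothetical.
    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices()).reshape(2, 4)   # assumes 8 cores
    mesh = Mesh(devices, axis_names=("data", "model"))

    # Split the weight matrix's columns across the "model" mesh axis.
    w = jax.device_put(jnp.ones((1024, 4096)),
                       NamedSharding(mesh, P(None, "model")))
    x = jnp.ones((8, 1024))

    # Under jit, XLA inserts the communication the sharded matmul needs.
    y = jax.jit(lambda x, w: x @ w)(x, w)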
To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating:
https://github.com/facebookresearch/metaseq/blob/main/projec...