Cloud TPU VMs are generally available

usmannk · on May 10, 2022

TPU VMs are an incredible, incredible product. If you were put off by the previous TPU Node architecture, you should definitely come back and give this a shot. The dev/debug loop is just so much nicer now that the TPU is local to your VM.

That said, the one thing I'm missing on these is orchestration. GKE supports TPU Nodes but it does not support TPU VMs. It seems to me like this is a must-have feature so I'm sure it's on the roadmap.

If anyone from the team is here, do you have updates on this? And if anyone reading this has orchestration techniques they use and recommend for TPU VMs, please let me know what they are!

zak · on May 10, 2022

Thanks for the very kind feedback! We've wanted to provide TPU VMs since the beginning of the Cloud TPU program, and I'm delighted that you're enjoying them. Many people across Google contributed to this launch.

We're definitely working on GKE integration, and we intend to provide more support for orchestration over time since individual Cloud TPU workloads are getting bigger and bigger. We'd love to hear any more detailed feedback you might have.

bigcat12345678 · on May 10, 2022

Hi Zak, great achievement.

Xoogler, had connection with people working on TPU.

AFAIK, TPU support on GCP has been started in 2018? Or earlier? It's been 4 years to GA.

Rumors have it that the project has been incredibly difficult to get out due to Googler's poor cross-org collaboration support. It has been a sad story, another one in the list of things that Google could have been putting more resources and made available sooner to the public.

zak · on May 10, 2022

I started pitching the Cloud TPU program in 2016. Many, many people have contributed since then to build the products that are available today.

Google is a large and complicated place, but we're getting closer to providing the magical interactive supercomputing experience we've wanted for a long time.

The deep learning landscape is evolving very rapidly, and there is a lot of interest in scaling up further, so the next few years will be exciting.

ec109685 · on May 10, 2022

Curious what is the difficulty with GKE integration? From afar, it seems like the flavor of VM shouldn’t impact ability to run K8s.

londons_explore · on May 10, 2022

The GKE team probably has other priorities.

ec109685 · on May 15, 2022

Question was around what makes these instance types harder to support than any other.

zak · on May 10, 2022

Founder of the Cloud TPU program here. If you'd like to experiment with TPU VMs for free and are willing to share your work with the world somehow (e.g. via publications or open-source projects), you can apply to participate in the TPU Research Cloud (TRC) program here: https://sites.research.google/trc/

jasonphang · on May 10, 2022

Hi Zak, this question is a little out of your scope, but perhaps you may know the answer: Do you know if/when Colab TPU runtimes are likely to be updated to support newer JAX functionality like pjit? (Or put another way: are Colab and Cloud TPU runtimes expected to be in-sync at all?) I'd written some model code that worked great on a TPU-VM and that I was excited to share on Colabs (which is likely far more accessible), but then I found out that Colabs simply don't support pjit (https://github.com/google/jax/issues/8300).

Other than that, really like TRC!

zak · on May 10, 2022

We love Colab and would love to upgrade the Colab TPU integration to support TPU VMs! No timeframe yet, but the right folks across JAX / Colab / Cloud TPU are very aware of this issue.

frankchn · on May 10, 2022

Just wanted to pop in to say congrats on making GA! Really happy to see the program develop from the early days :)

zak · on May 10, 2022

Thanks, Frank! You personally helped more Cloud TPU and TRC users than I can count, and you always came through something needed to get done and fast. I really appreciated it!

bertday · on May 10, 2022

What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line gap for TPU utilization, etc.?

My overall impression is TPUs are pretty awesome but the software stack is still a bit hard to use compared to mature GPU tools. I’d imagine it’s pretty hard for inexperienced users to use them.

zak · on May 11, 2022

If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch / XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.

People especially seem to find Cloud TPUs easy to use in comparison to alternatives when they are scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores. You still need to work through the ML challenges of scaling, but that is more tractable when you aren't simultaneously struggling with systems-level issues.

In particular, you don't need to master a sequence of different networking technologies as you scale up, and the TPU interconnect is so much faster at scale than other technologies (10X last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.

To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating: https://github.com/facebookresearch/metaseq/blob/main/projec...

We aim for Cloud TPUs to simplify the process of training models at these scales and far beyond: https://ai.googleblog.com/2022/04/pathways-language-model-pa...

jacquesm · on May 10, 2022

Nice achievement and great to see you are still there after all that time to see it through to GA. Congrats! It must be a bit like seeing your child cycle to school alone for the first time :)

zak · on May 11, 2022

Thanks very much! We've come a long way, but there is always more interesting work required to keep up with the deep learning frontier and enable Cloud TPU customers and TRC users to expand it further.

teleforce · on May 10, 2022

Thanks Zak, already applied.

Just wondering does TPU VM support Vectorflow?

https://github.com/Netflix/vectorflow

zak · on May 10, 2022

No, Vectorflow is not supported out of the box, and I'm not sure the workloads it targets are the right fit for Cloud TPU hardware. However, be sure to check out the "Ranking and recommendation" section of the linked blog post above - Cloud TPUs are able to accelerate the ML models with very large embeddings that are increasingly common in state-of-the-art ranking and recommendation systems.

BooneJS · on May 10, 2022

Congrats Zak!

zak · on May 10, 2022

Thanks, and congratulations to many others across many teams who have supported the Cloud TPU program over the years!

ramchip · on May 10, 2022

For those like me who had no idea what this means:

> Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads.

hoosieree · on May 10, 2022

Is this[1] the best resource for understanding TPU (or TPU-like) architectures in general?

I would love to see a TPU implementation in the style of the Nand2Tetris[2] CPU, reduced to its bare essentials for learning purposes.

Are systolic array processors the best place to start for this kind of info? Thanks in advance if anyone answers. :)

[1]: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...

[2]: https://www.nand2tetris.org/course

barbarbar · on May 10, 2022

I wish I had seen this comment earlier. Maybe here tags/labels could be helpful.

carbocation · on May 10, 2022

Does this solve the data-loading issues raised by sillysaurus in this prior thread?

https://news.ycombinator.com/item?id=24721229

zak · on May 10, 2022

Yes, TPU VMs dramatically improve the Cloud TPU user experience. You now have direct access to the VM on each TPU host whether you are using JAX, PyTorch, or TensorFlow, which provides a lot more flexibility and control and can often improve performance.

zwaps · on May 10, 2022

I struggle to understand precisely what you mean by user experience and ‘often improved performance‘.

Previously, there was no actual support for crucial features of the TPU related to data loading when using PyTorch, say. In turn, using a TPU over a GPU on that setup was frequently not worth it due to that exact issue. Your answer suggests it might be different now: are TF, Jax and PyTorch now on par in all stages?

zak · on May 10, 2022

In the previous Cloud TPU architecture, PyTorch and JAX users had to create a separate CPU VM for every remote TPU host and arrange for these CPU hosts to communicate indirectly with the TPU hosts via gRPC. This was cumbersome and made debugging difficult.

With TPU VMs, none of this is necessary. You can SSH directly into each TPU host machine and install arbitrary software on a VM there to handle data loading and other tasks with much greater flexibility.

The blog post provides an example of training cost improvement using PyTorch / XLA on TPU VMs in the "Local execution of input pipeline" section. Hopefully we will be able to provide more tutorials on using PyTorch / XLA with TPU VMs soon.

With TPU VMs, workloads that require lots of CPU-TPU communication can now do that communication locally instead of going over the network, which can improve performance.

ultrons · on May 10, 2022

Here is an example post that shows how to train a PyTorch/XLA model with data pipeline reading from cloud storage.https://cloud.google.com/blog/topics/developers-practitioner...

Reubend · on May 10, 2022

Do these TPU VMs have preemptible/spot pricing? I'm just a hobbyist, but it always seemed annoying before that one would have to orchestrate VMs to spin up/down alongside spot instances of TPUs manually.

zak · on May 10, 2022

Yes, there are a couple of ways to use Cloud TPUs at lower priority and lower cost. If you are a hobbyist, I highly recommend trying out Cloud TPUs for free via the TPU Research Cloud: https://sites.research.google/trc/

If you are flexible, you (or your scripts) can access a lot of compute power at odd hours!

Filligree · on May 10, 2022

That’s for researchers, however. For hobbyists, I guess we just have colab?

zak · on May 11, 2022

I started the TRC program alongside the Cloud TPU program to make interesting amounts of ML compute available to a broad group of creative people, not only to academic researchers.

The TRC program welcomes hobbyists, artists, students, independent learners, technical writers, and a variety of others. We love it when the TPU Research Cloud enables people to do something that wouldn't have been possible otherwise.

I definitely recommend applying to the TRC program - please feel free to say directly that you are a hobbyist. The sign-up form is short, and it's likely that you can access a lot of compute if you are flexible and persistent.

mahastore · on May 10, 2022

First thing you should explain is what the hell a TPU is/

zak · on May 10, 2022

Here's a quick overview of TPUs from last year's Google I/O keynote: https://www.youtube.com/watch?v=XFFrahd05OM&t=1565s

ChadNauseam · on May 10, 2022

The section on error-correcting qubits right after the bit you timestamped is really interesting too, I wonder what the status of that is

bpodgursky · on May 10, 2022

It's an inter-dimensional temporal portal which tunnels into the future and borrows incomprehensible alien technology to execute deep learning algorithms on GCP client datasets.

armchairhacker · on May 10, 2022

https://en.wikipedia.org/wiki/Tensor_Processing_Unit

AnonMO · on May 10, 2022

Google cloud is part of a company called alphabet which also own a little tool called google search you should try it. it’s very handy. Also if you don’t know what a tpu is the product probably is targeted at you

moritonal · on May 10, 2022

Mostly a jest, but now you're releasing these to the public, does that mean Google's moved onto something better?

formerly_proven · on May 10, 2022

I'd be more interested in the edge TPU / coral stuff for local inference, but it seems perpetually unavailable.

Havoc · on May 10, 2022

It mentions tensorflow specifically. Is there a penalty for using PyTorch instead?

zak · on May 11, 2022

No, Cloud TPUs support JAX, PyTorch, and TensorFlow, and the new TPU VM architecture provides enough low-level access that users could add support for additional frameworks themselves if they are willing to put in substantial effort.

echelon · on May 10, 2022

Do existing PyTorch models work out of the box with cloud TPU, or does it require some tweaking?

Are there cost savings over traditional GPU workloads?

When will GKE support TPU? And will there be preemptable instances?

johndough · on May 10, 2022

Last time I checked, you had to change a few lines to use torch_xla.

If there are cost savings depends on your workload. If you are training huge models, it might be cost effective, but honestly most tasks I have seen would have been fine on a CPU.

The pricing page lists preemptible instances: https://cloud.google.com/tpu/pricing#single-device-pricing

rg111 · on May 10, 2022

Cloud TPUs are the ultimate form of vendor lock-in. I vehemently oppose using it for anything truly critical.

zak · on May 11, 2022

We try to make it easy to switch back and forth between Cloud TPUs and other hardware platforms using JAX, PyTorch, and TensorFlow. This is a difficult technical challenge, but the XLA compiler helps a lot, and switching is easier now than it has ever been.

joelthelion · on May 10, 2022

Any benchmarks comparing this to regular Nvidia gpus?

zak · on May 10, 2022

The MLPerf 1.0 results provided an apples-to-apples comparison of large-scale TPU and GPU systems across several ML workloads: https://cloud.google.com/blog/products/ai-machine-learning/g...

In MLPerf 1.1, we showcased model training at larger scale: https://cloud.google.com/blog/topics/tpus/google-showcases-c...

The deep learning workloads that people find most interesting and the underlying hardware and software systems are all changing very rapidly. In addition to following MLPerf, we generally recommend that people run rigorous performance and cost comparisons on the actual workloads that they care about accelerating.

ramoz · on May 10, 2022

Available in new regions other than us-central?

sreeramb93 · on May 10, 2022

are google cloud TPUs available in mumbai region?