Introducing the TensorFlow Research Cloud (googleblog.com)
364 points by Anon84 on May 17, 2017 | 63 comments



For comparison, at 180 teraflops per TPU (assuming fp32), this is equivalent to offering a cluster with 15,000 Titan Xp GPUs ($1,200 per GPU = $18 million worth of compute power).

If TPU teraflops are reported at fp16, then the number would be half.

The Titan Xp offers 12 TFLOPS per GPU: https://blogs.nvidia.com/blog/2017/04/06/titan-xp/
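
For the curious, here is the back-of-the-envelope arithmetic behind that figure as a rough sketch (the 1,000-TPU cluster size, the 180 TFLOPS per TPU at fp32, and the Titan Xp figures are all assumptions quoted in this thread, not measured numbers):

    # Back-of-the-envelope comparison; every figure is an assumption from this thread.
    tpu_count = 1000          # TPUs in the announced research cluster
    tpu_tflops = 180          # per TPU, assuming the figure is fp32
    titan_xp_tflops = 12      # per Titan Xp GPU
    titan_xp_price = 1200     # USD per GPU

    cluster_tflops = tpu_count * tpu_tflops              # 180,000 TFLOPS
    equivalent_gpus = cluster_tflops / titan_xp_tflops   # 15,000 GPUs
    equivalent_cost = equivalent_gpus * titan_xp_price   # 18,000,000 USD

    print(f"~{equivalent_gpus:,.0f} Titan Xp GPUs, ~${equivalent_cost:,.0f} of compute")

(And as noted above, if the 180 TFLOPS is an fp16 number, the GPU-equivalent count would be roughly halved.)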


180 = 12 x 15

Where are you getting the extra 3 zeros from?

(Also, 'up to 180 TFLOPS' is a bit misleading. As I read in some benchmarking posts by Google as well as by Nvidia, a TPU is much faster than a GPU for doing inference, but they haven't released any data on training performance, which IMO is the real bottleneck.)


Also, I suspect the more relevant number is the "Tensor" ops figure that Nvidia recently reported for Volta at 120 TFLOPS [1], which would make Google's TPU a "mere" 1.5x speedup over Volta. Of course, this is pure speculation on my part.

1. https://devblogs.nvidia.com/parallelforall/inside-volta/


Google's 180 teraflop number is for a module, which contains 4 TPU chips. Meaning each TPU is 45 teraflops. See: https://arstechnica.com/information-technology/2017/05/googl...


I replied to this comment elsewhere in the thread [1], but to reply here as well:

A GPU is also comprised of multiple chips (RAM, etc). I don't think "performance per discrete piece of silicon" is an interesting metric.

[1] https://news.ycombinator.com/item?id=14364169


A single V100 board is vastly smaller than a TPU module with 4 TPU chips.

The closest comparison in terms of size would be 1 Volta DGX-1 (8x V100s) compared to 2 TPU modules (8x TPU2 chips).

Volta DGX-1: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...

TPU Module: https://storage.googleapis.com/gweb-uniblog-publish-prod/ima...

And for completeness, this is the size of a single V100: https://cdn.arstechnica.net/wp-content/uploads/2017/05/34446...

You can see that 8x V100s are still more computationally dense than 8x TPU2 chips. Density is a very important factor in datacenter design.


From my read of that Volta picture (here's a close-up glamour shot of the V100 module: https://devblogs.nvidia.com/parallelforall/wp-content/upload... ), that's just the chip module itself. Note how none of the interconnect hardware is present? The V100 supports NVLink just like the P100. When you actually assemble that chip module onto a board, it looks more like this:

http://hothardware.com/ContentImages/NewsItem/37034/content/...

(That's a P100 server, obviously, but it's the same kind of design. The P100 module looks a lot like the V100 module.)


The interconnect hardware is present on the DGX-1 board.

The point of my previous comment is that it doesn't make sense to compare an entire TPU module (containing 4x TPUs) to a single V100 board.

The closest comparison to a TPU module (with 4x TPUs) would be a DGX-1, which contains the NVLink bus you mention but also 8x V100 boards. That's why in my previous post I said you should compare the compute performance of a Volta DGX-1 (8x V100s) to 2 TPU modules (8x TPUs).

At the end of the day, it's simple: in a given area of space, you get more compute performance by provisioning that area with V100s (in the form of DGX-1s) than by provisioning it with TPUs (in the form of TPU modules).
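
As a rough sketch of that density comparison, using the per-chip figures quoted elsewhere in this thread (120 tensor-op TFLOPS per V100 and 45 TFLOPS per TPU2 chip; both are quoted claims, not measurements):

    # Compute per comparable footprint, using per-chip figures quoted in this thread.
    v100_tensor_tflops = 120   # per V100, "Tensor" ops (claimed)
    tpu2_tflops = 45           # per TPU2 chip (180 TFLOPS per 4-chip module, claimed)

    dgx1_tflops = 8 * v100_tensor_tflops        # a DGX-1 holds 8x V100s -> 960 TFLOPS
    two_tpu_modules_tflops = 8 * tpu2_tflops    # two TPU modules hold 8x TPU2 chips -> 360 TFLOPS

    print(f"Volta DGX-1 (8x V100): {dgx1_tflops} TFLOPS")
    print(f"2 TPU modules (8x TPU2): {two_tpu_modules_tflops} TFLOPS")

Of course, this says nothing about power draw, interconnect, or price, which is where the TCO discussion below comes in.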


> Density is a very important factor in datacenter design.

That's true if you're putting the chips in your own data center, because density affects TCO.

But assuming that Google does not sell TPU hardware, this isn't an important metric to anyone other than Google. The important question is: what does the curve plotting "dollars spent" versus "time spent waiting for my model to train" look like?

The shape of that curve is affected by TCO, and TCO is certainly affected by density. But there's a lot more to it than that.


Sure, no argument there. For Google, producing and using TPUs is probably cheaper than buying Nvidia hardware.


Might well be at 10x less power -- or more.

Might also be TPU gen2, given that the "64 GB of ultra-high-bandwidth memory" was not something the recent TPU paper was talking about, IIRC.

And this is here now, while Volta is paper-launched and will be very scarce until late 2017/2018.


The benchmarking posts were about the previous-generation TPUs, which are detailed in an ISCA paper this year: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...

The previous TPU was aimed at inference. The Cloud TPUs do training. You're right, though - I haven't seen any publications about training performance yet. But they'll be available in the cloud (currently in alpha), and once they go GA, I'd expect to see a raft of benchmarks by third parties. I can't wait. :)


The cluster has 1000 TPUs


Well, then MS Azure and Amazon AWS have their own clusters offering way more than $18 million worth of compute power (and I'm sure they also provide millions of dollars' worth of it for research purposes).

But not to a single client.

Similarly, Google is not offering a 1000 TPU cluster to each client.

Your metric is strange to say the least.


They are offering a dedicated 1000 TPU cluster for researchers, at least that's how I read it.


AFAIK in the original TPU announcement they said the cluster would be shared between all participants of the program.


It is actually 45 teraflops per TPU2 chip and 180 teraflops for a module (4x chips): https://arstechnica.com/information-technology/2017/05/googl...

V100 is 120 teraflops of tensor ops per chip: https://arstechnica.com/gadgets/2017/05/nvidia-tesla-v100-gp...


A GPU is also comprised of multiple chips (RAM, etc). I don't think "performance per discrete piece of silicon" is an interesting metric.


A single V100 board is vastly smaller than a TPU module with 4 TPU chips.

The closest comparison in terms of size would be 1 Volta DGX-1 (8x V100s) compared to 2 TPU modules (8x TPU2 chips).

Volta DGX-1: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...

TPU Module: https://storage.googleapis.com/gweb-uniblog-publish-prod/ima...

And for completeness, this is the size of a single V100: https://cdn.arstechnica.net/wp-content/uploads/2017/05/34446...

You can see that 8x V100s are still more computationally dense than 8x TPU2 chips. Density is a very important factor in datacenter design.


There is virtually no way this would be fp32. If it were, they could sell these devices for many more tasks than just machine learning.


Not including power, motherboard, processor, RAM, etc.


Question for any Google Cloud folks hanging out here: Is it possible to use Cloud TPUs without using TensorFlow? Is there a lower-level library/API to run instructions on TPUs so other frameworks can work with them?


From the slides, it looks like the next layer under TensorFlow is called the StreamExecutor API. Googling for that turns up this, among other things: https://github.com/henline/streamexecutordoc


You can use Keras, with TensorFlow as the backend that runs it.
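
As an illustration, a minimal sketch of that setup (assuming standalone Keras 2.x with TensorFlow installed as its backend; the model and data here are made up for the example, and none of the TPU-specific plumbing is shown):

    # Minimal Keras model running on the TensorFlow backend (illustrative sketch only;
    # TPU-specific setup is not shown and would go through TensorFlow itself).
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(64, activation='relu', input_shape=(32,)),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Dummy data just to demonstrate the training call.
    x = np.random.rand(256, 32).astype('float32')
    y = np.eye(10)[np.random.randint(0, 10, size=256)]
    model.fit(x, y, epochs=1, batch_size=32)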


Keras is higher level; the parent asked about lower level.


That is higher level, not lower level


I might be naive to think this, but are Google Cloud services making money simply through the fees you pay or does Google also have an interest in the data generated by and passing through its services? If the latter, what data do they collect and how do they profit?


Just the former. Cloud is about having paying customers and keeping them happy. In general (IANAL, IANA..anything), cloud customer data is sacrosanct, and Google builds very strong walls keeping that stuff isolated.

As far as I've heard, the hopes for Cloud TPUs in general are that people will think it's an amazing service worth paying for.

The hopes for TFRC are more complicated, as you might imagine when a company is giving away free stuff, but one of them, I'd put in the "long-term benefit altruism" category: It's quite hard to do some kinds of machine learning research in academia, because it can be fantastically expensive. We just blew $5k of Google Cloud credits in a week, and managed only 4 complete training runs of Inception / ImageNet. This was for one conference paper submission. Having a situation where academia can't do research that is relevant to Google (or Facebook, or Microsoft) is really bad from a long-term perspective, and one thing TFRC might do is help make sure that advances in deep learning continue at a rapid pace.

From watching the rate at which my deep learning colleagues get swallowed up by industry, I think it's a very valid concern, and I'm very supportive of all of the industry efforts we're seeing to try to address this. It's good for the entire research ecosystem.

(There are undoubtedly many other reasons, such as those noted below by minimaxir, but this is the one I personally feel the pain of.) Disclaimer: I get paid by Google part-time to do work related to this, but this is not any kind of official statement. The pain-of-ML-in-academia bit is purely from my hat as a CS professor.


Google benefits from this (from the non-altruistic perspective) because it encourages research and model development on TensorFlow, plus promotion when said research is published using the framework.

Not everything has to be a data play in the big picture.


First of all I think they are too smart to jeopardize their entire cloud service in that way; no one would buy that. Ignoring that, I would think inspecting cloud customer data is much too unreliable to be of any use to them.

Before anything else, the format and schema of the customer data would have to be analyzed and converted to data structures that match Google's internal models. While I imagine a computer could do it, I certainly would want a human to verify that the analysis makes sense.

Assuming this has been done, they are then at the mercy of the customer as far as whether the data is accurate, whether it is complete, how often it is updated, etc.

At the end of the day, I don't see how they would build a reliable business around arbitrary data structures which they have no control over. Information you can't trust is pretty useless.

Edit: They would also have to understand how the data was selected. Looking at a series of data points, you would wonder if all these are from Arizona, or all from the year 1976, or all from color blind individuals. Without understanding such limitations, making any sort of deduction from a dataset will just lead you the wrong way.


Ok, so in general I agree with your statements, but...

> how they would build a reliable business around arbitrary data structures which they have no control over

Google Search? The entire web could be described exactly like that.


You make a great point. To me, Google Search is a bit of a special case in that the data it provides has been posted publicly. Unless the Google Cloud ToS give them the right to publicize your business data, that's out of the question.

Another point here is that Google Search isn't an authoritative source of information, it is up to the end user to inspect the returned links and decide if they can trust that site. This is something that I would not try to automate to the point that I could ask users for money in exchange, and if it can't be automated it doesn't seem like a great fit for Google.


The question is whether they can extract value from the customers' data that's on, or passes through, the servers and services that they manage, or whether it's too messy to be of use. And my answer to that is that this is Google's primary business model, so surely they could if they wanted to try. Whether they do (surely not) or should (definitely not) are different questions.


But at least webpages are in a standardized format. Customer data, with the exception of images, might be in arbitrary formats or just not trivial to parse.


> standardized format†

†: Parses unreliably at best, Turing-complete at worst. (aka JavaScript, if you didn't catch that)


Webpages are, but the information isn't. Multiple ways to lay out your page; multiple ways to express your information; multiple languages; etc.


> Google Search? The entire web could be described exactly like that.

Nah, the user's mood may be ruined if the search results are junk, but ruining the model you're trying to build has vastly more costly consequences. I don't think the tech is there (just yet) to have some code simply ingest whatever comes its way, chew it up, and use it well; required xkcd (today's!): https://xkcd.com/1838/


They let others use and sharpen their tools.


One of the goals of this project is:

"Share the benefits of machine learning with the world"

It doesn't say who shares what :)


Google's stated goals seem to be very altruistic, and to their credit, they do genuinely contribute heavily to good causes and to open science and research. I kind of find it hard to believe a for-profit corporation is purely altruistic, but their play here seems to be to encourage research and make it accessible to more people, so that more human minds come up with novel ideas that Google itself might some day use in the future. E.g. I can see how investing millions of dollars in funding PhD students can pay off if even a single one of them discovers an obscure algorithm that increases the efficiency of some process by just 0.1%; at Google's scale that might still save millions of dollars more.


I believe Google is playing long-term here -- if they make it really easy for someone to do ML research using their frameworks, they'll collect rent from the researcher and possibly get something more innovative out of their work later.

Not everything they do is about short-term data plays.


Would it be possible for a pass-through of 'bad' or 'faulty' data to mess up their model at a large enough scale?

If the data can be used to improve the model it seems like data could be used to damage it, in theory.


I think ihws2 was using "sharpen their tools" to mean filing bug reports and feature requests for TensorFlow and the cloud service.


At a minimum it will tell Google exactly what people are up to, which means that if something interesting pops up they have a direct line to that person. After all, the proposals alone will give them all that info; no need to snoop on the machines.


For me the major takeaway is that Nvidia is no longer the only supplier of "decent" DL hardware.

To think that any entrant, no matter the size, could come into the space and successfully develop a device in such short order is amazing.

Not only that, but to have it used within their own infrastructure for some time and later open it up to external usage is, again, amazing from a planning, execution, and product perspective. Ah shucks, from a hardware perspective too.

Especially coming from a company that, in the traditional sense, has no business being in this sector (excuses).

It seems like Nvidia is an army, polished to pump "chips". And this Google special ops team kicked ass.


"To think that any entrant no matter the size could come into the space and successfully develop a device in such short order is amazing."

Is this a fair statement? I mean, this is Google. Hardware development is not easy or cheap, and they've been at it for years. They ponied up lots of money for AI starting as early as 2012, and this company is all about running huge jobs on lots of compute on giant data - it is hard to think of a company more suited to developing this kind of hardware (same for Facebook, in that sense).


You are right. Traditionally it was/is considered a difficult road to come into any space different from your core. Seems like Google and Amazon are really good at executing almost anything.


It is about money. Google and Amazon can offer significantly better compensation than many other companies. So, they can attract the most productive engineers.


If I'm invested in NVDA should I sell? Is this a real competitor that will remove NVDA's AI leadership?


Until you can buy a TPU card for your workstation, at a competitive price, there's no real competitor. And even then, it's just for TensorFlow.


Last I heard, they're creating their own custom chip as well.


Nvidia's not going to suddenly collapse because of this, but a lot of the froth in that company is due to their existence as pretty much the ONLY provider of DL accelerators. They had no competition and there were no competitors in sight; it looked like they were going to dominate the space in the short term. This changes now, I think. The 2nd-gen TPU is competitive even if not strictly better. In a few years, instead of Nvidia being a clear leader, it may just be one of a pack, and that's bad for the current valuation.


No, it's great news for NVidia.

Nvidia is the only current credible provider of DL hardware. If Google starts using TPUs then every other company will be forced to buy more Nvidia cards or they will be left behind.

Microsoft is the only exception here: they have some investment in an FPGA-based ML solution. Even IBM uses Nvidia as their DL solution on Power servers.


I know of one startup working on a TPU killer and there must be others. Amazon is probably either already working on a deep learning ASIC or scouting startups to buy. Apple is probably poaching people from Google/Nvidia to build their deep learning core.


There's at least one person on HN working on DL hardware.


Working on a TPU killer is nothing close to having hardware available at mass-production volumes right now.


That's what Nervana was doing, and they were acquired by Intel.


Nvidia is the TPU killer. V100: 120 TFLOPs, TPU: 45 TFLOPs.


This basically tells me that Google has a new generation of TPUs that has arrived, and they are giving us (researchers) the last generation. Not that I have anything against this. 3-4 years ago, AWS made huge inroads in mindshare by giving us researchers tens of thousands of dollars in education grants for writing a half-page proposal. Google has learnt from them.


I think if you look at some of the other sub-threads comparing TPUs to Pascal and Volta, you'll conclude that the Cloud TPUs are cutting-edge technology. Your conclusion makes sense in the context of TFRC, but not when Google's creating a cloud offering that they hope people will pay for.


I suspect a number of people signing up for this will be getting invited to job interviews in the coming year.


delay 7

say "I'd like to introduce you to the concept of Intelligence Artificial."

delay 2

say "You might have heard of Artificial Intelligence, a term coined by John MiCarthy in the 1950s to label his study of difficult problems that his computer science lab was working on."

delay 2

say "Today we have a resurgence of Artificial Intelligence. You may or may not have noticed, but the Artifically Intelligent are now controlling your lives."

delay 2

say "From Siri, to Facebook, to the Googles, to the Amazon, your data is being fed into 'Machine Learning' algorithms at a rapid pace."

delay 2

say "Custom hardware is being designed this very minute to accelerate the training of these algorithms."

delay 2

say "That is what Artificial Intelligence is these days. Machines learning, about you."

delay 3

say "Intelligence Artificial turns this concept on it's head. By learning more about the machine, YOU, you can learn to control the machine."

delay 1

say "I"

say "A, not, A I."

delay 7

say "BusFactor1 Inc. 2017"

delay 1

say "Putting the ology into technology."

delay 3

say "Or, is it the other way around."

delay 4

say "I am the machine telling you about learning, this is my reference clip and this is where we are going"




