An In-Depth Look at Google's Tensor Processing Unit Architecture (nextplatform.com)
380 points by Katydid on April 5, 2017 | 66 comments



Interesting points I took from the paper[1]:

* They actually started deploying them in 2015; they're probably already hard at work on a new version!

* The TPU only operates on 8-bit integers (and 16-bit at half speed), whereas CPU/GPUs are 32-bit floating point. They point out in the discussion section that they did have an 8-bit CPU version of one of the benchmarks, and the TPU was ~3.5x faster.

* Used via TensorFlow.

* They don't really break out hardware vs. hardware for each model. It seems like the TPU suffers a lot whenever there's a really large number of weights and layers it must handle, but since per-model performance isn't broken out, it's hard to see whether the TPU offers an advantage over the GPU for arbitrary networks.

[1] https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...


Regarding 8-bit numbers, here's a thread on why 8 bits are enough and an old product that used that to good effect:

https://news.ycombinator.com/item?id=10244398

http://www.eetimes.com/document.asp?doc_id=1140287

It's something that keeps getting rediscovered. I know the embedded industry shoehorns all kinds of problems into 8- and 16-bitters. Some even use 4-bit MCUs. It might be worthwhile for someone to do a survey of all the things you can handle easily, or without too much work, on 8- to 16-bit cores. That might help people building systems out of existing parts or people trying to design heterogeneous SoCs.


The lack of real 8-bit comparison data makes the whole paper a little suspect, IMO. It's sort of like the early GPU papers that claimed 100x improvement over the CPU while running x87 scalar CPU instructions. The benefits are definitely still there, but handicapping one architecture when it has features specifically capable of doing this is a bit stupid. It's not like they didn't have to do a lot of work on TF to make it output TPU instructions. When you're down to a 4x improvement, the benefits of specialized accelerators start to become somewhat questionable.

I do like that they highlighted the importance of low-latency output, though; that's even more critical for future non-"Web" applications that have to run in real time.


The difference is the power numbers.

3.5x faster than a CPU doesn't sound special, but when you're building inference capacity by the megawatt, you can fit a lot more of that 3.5x-faster TPU throughput inside that hard power constraint.
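As a rough back-of-the-envelope sketch of that point (the per-server wattages and baseline throughput below are made up for illustration; only the 3.5x ratio comes from the thread):

    # Illustrative only: under a fixed power budget, capacity scales with perf/W.
    # Hypothetical wattages; the 3.5x speedup figure is from the comment above.
    POWER_BUDGET_W = 1_000_000                   # one megawatt of inference capacity

    cpu_server_w, cpu_throughput = 500, 1.0      # assumed CPU server
    tpu_server_w, tpu_throughput = 400, 3.5      # assumed TPU server, 3.5x faster

    cpu_total = (POWER_BUDGET_W // cpu_server_w) * cpu_throughput
    tpu_total = (POWER_BUDGET_W // tpu_server_w) * tpu_throughput

    print(f"CPU fleet: {cpu_total:.0f}  TPU fleet: {tpu_total:.0f}  "
          f"ratio: {tpu_total / cpu_total:.2f}x")   # ~4.4x with these numbers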


"The TPU only operates on 8-bit integers" The 8 bit part is fine, but integers? what the hell. that's a new one for me


Anecdotally, it seems most models can be quantized to 8 bits without much loss of accuracy, and fixed point arithmetic requires much less hardware. Training is still done with floating point though.
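As a minimal sketch of what post-training weight quantization can look like (a simple symmetric scheme in NumPy, not the TPU's or TensorFlow's actual implementation):

    import numpy as np

    def quantize_int8(w):
        """Symmetric linear quantization of a float32 weight tensor to int8."""
        scale = np.abs(w).max() / 127.0                    # map the largest |weight| to 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)       # stand-in for trained weights
    q, scale = quantize_int8(w)
    print("max abs error:", np.abs(dequantize(q, scale) - w).max())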


This. When you get right down to it, a lot of models do fine with only 256 unique weights.


Agreed - however, as we progress I expect a comment like this will come to be akin to Bill Gates's 640K comment.


The brain appears to spend about 4.7 bits per synapse (26 discernible states, given the noisy computational environment of the brain), so that seems to be plenty for general intelligence. This could, of course, merely be a biological limit, and on silicon more fine-grained weights might be the optimum.
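(The 4.7 figure is just the information content of 26 distinguishable levels:)

    from math import log2
    print(log2(26))   # ~4.70 bits to distinguish 26 synaptic states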

Here is another paper demonstrating very good results with just 6 bit gradients: https://arxiv.org/abs/1606.06160


Almost certainly, and depths would have to increase. As in any series expansion, the coefficients on later terms have less and less impact, so their dynamic range matters less and less to the final value, while the dynamic range of the initial terms is proportionately important. I expect the dynamic range of the weights will turn out to be logarithmic with respect to the overall depth of the network.


Yeah...you could implement 16-bit multiplication/addition as two 8-bit multiplications plus carry... so worst case, if you want 16-bit multiplies, you implement it yourself


Four multiplications, not two (three if you only need the low 16b of the result).


You can get the full 32-bit result in only three multiplications using https://en.wikipedia.org/wiki/Karatsuba_algorithm , but it requires more additions.
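A quick sketch of both decompositions (schoolbook with four 8x8 products vs. Karatsuba with three), just to make the counting concrete:

    def mul16_schoolbook(a, b):
        """16x16 -> 32-bit multiply built from four 8x8 -> 16-bit products."""
        a_hi, a_lo = a >> 8, a & 0xFF
        b_hi, b_lo = b >> 8, b & 0xFF
        return (a_hi * b_hi << 16) + ((a_hi * b_lo + a_lo * b_hi) << 8) + a_lo * b_lo

    def mul16_karatsuba(a, b):
        """Same result with three multiplications and a few extra additions."""
        a_hi, a_lo = a >> 8, a & 0xFF
        b_hi, b_lo = b >> 8, b & 0xFF
        hi, lo = a_hi * b_hi, a_lo * b_lo
        # Note: the sums below can be 9 bits wide, one of Karatsuba's practical catches.
        mid = (a_hi + a_lo) * (b_hi + b_lo) - hi - lo      # = a_hi*b_lo + a_lo*b_hi
        return (hi << 16) + (mid << 8) + lo

    assert mul16_schoolbook(51234, 60001) == mul16_karatsuba(51234, 60001) == 51234 * 60001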


Yeah, in practice Karatsuba isn't a performant option for small operands unless your multiplier is catastrophically slow. (And it still doesn't get you to two multiplications.)


There are hints that even less than 8 bits per weight might be usable (for certain cases and on custom hw). Not sure if it's practical but it is definitely interesting.

https://arxiv.org/pdf/1502.02551.pdf

https://arxiv.org/pdf/1610.00324.pdf

I wanted to have some basic idea about hardware so I did some "research" (googling) and ended up giving a short informal talk. My slides with some links are here:

http://ml.kjx.cz/hw4nn2017.pdf


Indeed, any 8-bit x 8-bit function with an 8-bit result is just a 64KiB look-up table.
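For instance, a minimal sketch of that idea (a hypothetical saturating 8-bit multiply, precomputed as a table):

    # Any 8-bit x 8-bit -> 8-bit function fits in a 256 * 256 = 64 KiB table.
    def saturating_mul8(a, b):
        return min(a * b, 255)

    TABLE = bytes(saturating_mul8(a, b) for a in range(256) for b in range(256))

    def lookup(a, b):
        return TABLE[(a << 8) | b]

    assert lookup(16, 15) == 240 and lookup(100, 100) == 255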


I imagine that once they've trained the floating-point models, they'll then quantize them into integers to make inference faster. It's not something I've done, but I imagine that the limited range of the integers may cause problems (though they say in the paper that the 16-bit products can be accumulated into something that's 32-bit). The features to do this will be coming fairly soon to regular TensorFlow too.[1]

[1] https://youtu.be/0r9w3V923rk?list=PLOU2XLYxmsIKGc_NBoIhTn2Qh...
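On the accumulation point above, a rough NumPy sketch of why the wider accumulator matters (not the TPU's actual datapath): each 8x8 product fits in 16 bits, but a long sum of them needs 32.

    import numpy as np

    def int8_matmul(a_q, b_q):
        """Multiply int8 matrices, accumulating in int32 to avoid overflow."""
        return a_q.astype(np.int32) @ b_q.astype(np.int32)

    a_q = np.random.randint(-128, 128, (64, 256), dtype=np.int8)
    b_q = np.random.randint(-128, 128, (256, 32), dtype=np.int8)
    acc = int8_matmul(a_q, b_q)
    # A 256-term sum of 16-bit products can reach ~256 * 127 * 128, far past int16 range.
    print(acc.dtype, np.abs(acc).max())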


Nvidia has shipped a couple of INT8 inference cards as well:

http://www.anandtech.com/show/10675/nvidia-announces-tesla-p...


Yes, how do they perform? The linked article cites 47 TOPS for 8-bit integer arithmetic, and they are useful for training too.


Take a look at Google's gemmlowp low-precision GEMM library:

https://github.com/google/gemmlowp

(Used in TensorFlow)


Note, however, that on Intel it's actually slower than a run-of-the-mill float32 linear algebra library like Eigen or OpenBLAS. Its main forte seems to be ARM.


Here's a paper they published a little while ago about limited numerical precision and deep learning:

https://arxiv.org/abs/1502.02551


Those limitations sound awfully similar to those of an FPGA...


I was going to say that as well. It seems like if caches are the bane of sequential processing (CPU), then routing has to be the counterpart on the parallel (FPGA/ASIC) side of the equation.


And what's amazing is that it was built on 28nm. So TPU 2.0 could increase by another 2x in perf/W just by going to 14nm (most likely) - even more if it's built on newer processes than that.

Intel's latest chips will be even further behind compared to the next-generation TPU than Haswell was compared to TPU 1.0.


28nm was quite a cheap fabrication technology even in 2015, but it costs a lot to have a completely custom production run. My guess is that it approximately works out in savings of power and space over the lifetime of the chip. It probably doesn't make sense for them to move to something smaller (and thus more expensive) whilst the performance benefit remains so substantial. If I were Intel, I probably wouldn't lose too much sleep over it either, because you still need something to attach the highly-specialised TPU to, and that'll be a Xeon for the foreseeable future.


Agreed, the market for these chips is pretty small at the moment. What other platforms would need these other than cloud? Cars? Phones .... maybe?


"This first generation of TPUs targeted inference" from [1]

So they are telling us about inference hardware. I'm much more curious about training hardware.

[1] https://cloudplatform.googleblog.com/2017/04/quantifying-the...


The paper says "we started a high-priority project to quickly produce a custom ASIC for inference (and bought off-the-shelf GPUs for training)."


Using approaches like OpenAI's recent evolution strategies paper would remove the need for backprop, likely allowing these TPUs to be used for training without any changes.
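Roughly, evolution strategies estimate a gradient from score-weighted random perturbations, so only forward passes are needed. A minimal sketch on a toy objective (not OpenAI's code, and nothing TPU-specific):

    import numpy as np

    def es_step(theta, score_fn, npop=50, sigma=0.1, alpha=0.01):
        """One evolution-strategies update: forward evaluations only, no backprop."""
        eps = np.random.randn(npop, theta.size)           # Gaussian perturbations
        scores = np.array([score_fn(theta + sigma * e) for e in eps])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        grad_est = eps.T @ scores / (npop * sigma)        # estimated gradient of the score
        return theta + alpha * grad_est

    # Toy objective: maximize -(theta - 3)^2 elementwise.
    theta = np.zeros(5)
    for _ in range(300):
        theta = es_step(theta, lambda t: -np.sum((t - 3.0) ** 2))
    print(theta)   # drifts toward 3.0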


The evolution strategies method is used in reinforcement learning models. How are you planning to use it for supervised learning?


People have known that training NNs (for any purpose) using evolution works well since the 1990s. The rise of the NN frameworks has made doing differentiation much easier now than it was before (and having gradient hints is intuitively a good idea). But for OpenAI to allow their PR people to declare this as a novel advance is ... surprising.


Citation for training an NN on an image classification task where evolution works well?

Let's say you want to use a genetic algorithm to find a good set of weights: you generate, mutate, combine, and select many random networks, and repeat this process many times. How many networks, and how many times? That depends on the length of your chromosome and the complexity of the task. Networks that work well for image classification need at least a million weights, and the entire set of weights is a single chromosome. You realize now how computationally intractable this task is on modern hardware?


> NN for image classification task

You've created your own straw man here.

> "You realize now how computationally intractable this task is on modern hardware?"

Here are the people that prove it isn't computationally intractable : https://blog.openai.com/evolution-strategies/ - but to say they've discovered a new breakthrough method is over-selling the result.


You said: "training NNs (for any purpose) using evolution works well". I gave you an example of a purpose where it does not work well. So, let me ask you again: can you give an example of evolutionary methods that work well when applied to training NNs, other than this breakthrough by OpenAI, which only works for RL?


GPUs I would guess. Great floating point speed.


It's a pity they omitted a comparison against Maxwell-generation GPUs like the M40/M4. Those were already out in late 2015 and are also on 28 nm.

Perhaps the reason is simply that they don't have them in their servers, but we'll see if Jeff Dean replies on G+ [1].

[1] https://plus.google.com/+JeffDean/posts/4n3rBF5utFQ?cfem=1


Not really, it's not such a pity. It's not for us mere mortals, unless you have billions of predictions to make and megawatts of power to save. For your personal project, where you make a few predictions per day, you can use a GPU or even a CPU.

The TPU excited me too at first, but when I realized that it is not related to training new networks (research) and is useful only for large-scale deployment, I toned down my enthusiasm a little.


Neither Google Cloud nor Amazon Web Services offers Maxwell-series GPUs. Both jumped, or, to be more precise, are in the process of jumping, from the K-series to the P100 series.

When I google around a bit, I see several results talking about the software licensing cost model for the M-series GPUs.


There are no datacenter-class Maxwell-series GPUs. Nvidia never released a version with ECC SRAM, so Amazon and Google never used them in production.

Part of the fault was GDDR5's limitations, which required trickery to make the Kepler series work.

Pascal is coming with ECC because HBM2 comes with ECC built-in.


Several cloud providers have been offering P100 GPUs for a while now: https://www.nimbix.net/nimbix-cloud-demand-pricing/


It's interesting that they focus on inference. I suppose training needs more computational power, but inference is what the end user sees, so it has harder requirements.

Most of us are probably better off building a few workstations at home with high-end cards. The hardware will be more efficient for the money. But if you're considering hiring someone to manage all your machines, power-efficiency and stability become more important than the performance/upfront $ ratio.

There are also FPGAs, but they tend to be much lower quality than the chips Intel or Nvidia put out, so unless you know why you'd want them, you don't need them.


They're also not very interested in making it easier for you to train models at home. Not that it's a big risk for them if you were able to do so - you don't have the data, and your models are only as good as your data - but they'd rather you came to their cloud and paid $2/hr per die for an outdated Tesla K80. Which, to their credit, they've made very easy to hook up to your VM. Literally, you just tell them how many you need and your VM starts with that many GPUs attached. Super slick.


P100s are coming soon!

(I work on GCP)


"learn" once, apply what you learned millions of times...


Right. Or billions -- or trillions. Consider something like the Inception-like convolutional model that's one of the workloads in the paper. Training Inception is "relatively" easy -- one week on 48 K80 GPUs. (I'm lying, of course, because you retrain, and you train many times to do hyperparameter optimization, but still.)

Then consider the possible applications of that at Google scale -- there are "an awful lot" of images on the web, over 13PB of photos in Google photos last year [1], a gajiggle of photos in street view and google maps, an elephant worth in google plus, and probably a few trillion I'm not even thinking of. :)

Same applies, of course, to Translate, and to RankBrain, also mentioned as NNs running on the TPU. 100B words per day translated [2], and .. many, many, many Google Searches per day, even if RankBrain primarily targets the 15% of never-before-seen queries [3].

Add that to the fact that GPUs are poorly-suited to realtime inference because of the large batch size requirements, and it's a solid first target.

[1] https://en.wikipedia.org/wiki/Google_Photos [2] http://www.k-international.com/blog/google-translate-facts/ [3] https://www.bloomberg.com/news/articles/2015-10-26/google-tu...

(work at Google Brain on Mondays, but speakin' for myself here.)


Looking at the analysis in the article, one of the big gains is a busy power usage of 384W, which is lower than the other servers while the performance remains competitive with the other approaches (although restricted to inference).


I was wondering how it compares to other solutions in terms of performance/watt, luckily they address it in the paper[1]:

> The TPU server has 17 to 34 times better total-performance/Watt than Haswell, which makes the TPU server 14 to 16 times the performance/Watt of the K80 server. The relative incremental-performance/Watt—which was our company’s justification for a custom ASIC—is 41 to 83 for the TPU, which lifts the TPU to 25 to 29 times the performance/Watt of the GPU.

[1] https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...


While this is interesting for TensorFlow, I think that it will not result in more than an evolutionary step forward in AI. The reason being that the single greatest performance boost for computing in recent memory was the data locality metaphor used by MapReduce. It lets us get around CPU manufacturers sitting on their hands and the fact that memory just isn’t going to get substantially faster.

I'd much rather see a general purpose CPU that uses something like an array of many hundreds or thousands of fixed-point ALUs with local high speed ram for each core on-chip. Then program it in a parallel/matrix language like Octave or as a hybrid with the actor model from Erlang/Go. Basically give the developer full control over instructions and let the compiler and hardware perform those operations on many pieces of data at once. Like SIMD or VLIW without the pedantry and limitations of those instruction sets. If the developer wants to have a thousand realtime linuxes running Python, then the hardware will only stand in the way if it can’t do that, and we’ll be left relying on academics to advance the state of the art. We shouldn’t exclude the many millions of developers who are interested in this stuff by forcing them to use notation that doesn’t build on their existing contextual experience.

I think an environment where the developer doesn’t have to worry about counting cores or optimizing interconnect/state transfer, and can run arbitrary programs, is the only way that we’ll move forward. Nothing should stop us from devoting half the chip to gradient descent and the other half to genetic algorithms, or simply experiment with agents running as adversarial networks or cooperating in ant colony optimization. We should be able to start up and tear down algorithms borrowed from others to solve any problem at hand.

But not being able to have that freedom - in effect, being stuck with the DSP approach taken by GPUs - is going to send us down yet another road to specialization and proprietary solutions that result in vendor lock-in. I've said this many times before, and I'll continue to say it as long as we aren't seeing real general-purpose computing improve.


Are people really using models so big and complex that the parameter space couldn't fit into an on-die cache? A fairly simple 8MB cache can give you 1,000,000 doubles for your parameter space, and it would allow you to get rid of an entire DRAM interface. It's a serious question, as I've never done any real deep learning...but coming from a world where I once scoffed at a random forest model with 80 parameters, it just seems absurd.


Yes. Each layer can have millions of parameters if your data set is large enough.

Convolutional networks easily get up there, especially if you add a third dimension that the network can travel across (either space, in 3D convnets for medical scans, or time, for videos in some experimental architectures). Say you want to look at a heart in a 3D convnet: that could easily be 512x512x512 for the input alone.

In fully connected models, for training efficiency, many features are implemented as one-hot encoded parameters, which turns a single category like "state" into 50 parameters. I think there is some active research into sparse representations of this with the same efficiency, but I've never seen those techniques used, just people piling on more parameters.
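For example, a minimal sketch of that blow-up (hypothetical feature, not anyone's production pipeline):

    import numpy as np

    STATES = ["AL", "AK", "AZ", "AR", "CA"]   # ...50 of them in practice

    def one_hot(value, vocabulary):
        """Turn one categorical value into len(vocabulary) input parameters."""
        vec = np.zeros(len(vocabulary), dtype=np.float32)
        vec[vocabulary.index(value)] = 1.0
        return vec

    print(one_hot("CA", STATES))   # [0. 0. 0. 0. 1.] -- one weight column per state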


The latest deep learning models are indeed quite large. For comparison, Inception clocks in at "only" 5M parameters, itself a 12x reduction over AlexNet (60M) and VGGNet (180M)! (source: https://arxiv.org/abs/1512.00567)

A further point is that even if the model has relatively few parameters, there are advantages to having more memory -- namely, you can do inference on larger batch sizes in one go.


I think something that "goes without saying" is that their first-rev design has some essential simplicity to it:

- No control flow instructions (though apparently some operations can have a repeat count)

- Fundamentally simple architecture

This allows them to get through validation and tapeout very quickly.


If you'd like a good answer to that question, drop me a line.


hahhahahahhaahah

The SOTA networks are around 300MB+...


Not sure if you meant to laugh at a serious question. I am fully aware of my ignorance of the space.

Since it appears you're in the deep learning hardware business, what would be the impediment to using eDRAM or similar? eDRAM is too costly at those sizes for general purpose processors, but I imagine the reduced latency and increased bandwidth would be a huge win for a ridiculously parallel deep learning processor, and would definitely be a tradeoff worth making.


Sorry, that was more of a laugh at the state of deep learning model sizes than anything.

Okay, so about eDRAM. There are two types of eDRAM: on-die and on-package. On-die eDRAM refers to manufacturing DRAM cells on the logic die, which would be a big boon in terms of density, since eDRAM cells can be almost 3x as dense as SRAM. The problem, however, is that on-die eDRAM has been impossible to scale beyond 40nm, which erodes any advantage you would receive from using eDRAM.

On-package eDRAM is more interesting, but the primary cost in memory access is the physical transportation of the data, which is a physical limit and can't be circumvented. You can call it all sorts of fancy names such as "eDRAM", but the fact of the matter is that you're still moving data. For reference, the projected cost of moving a 64-bit word on 10nm (ON CHIP) according to Lawrence Livermore National Laboratory is ~1pJ, and the cost of a 64-bit FLOP is also estimated to be ~1pJ. So even on-chip, moving a word costs about as much as computing with it, and moving it off the die is far worse: the cost of data movement dwarfs the cost of computation.

Of course, you gain a lot compared to DRAM, but HBM can offer the same efficiency gains.

Didn't mean to be rude with the first response. Let me know if you have any other questions, I'd be happy to answer them :)


Interesting stuff; it really points to the complexity of measuring technical progress against Moore's law. It's more fundamentally about how institutions can leverage information technologies and organize work and computation towards goals that are valued in society.


This appears to be a "scaled up" (as in number of cells in the array) and "scaled down" (as in die size) version of the old systolic array processors (going back quite a ways - to the 1980s and probably further).

As an example, the ALVINN self-driving vehicle used several such arrays for its on-board processing.

I'm not absolutely certain that this is the same, but it has the "smell" of it.


Does anyone have a view on how useful deep kernels might be for riding to the rescue of the rest of us?

https://arxiv.org/abs/1611.00336


Are they using it in feedforward mode only? Or also for learning?


It's mostly designed for inference.


Inference only; 8-bit integers won't work for training without bad accuracy degradation.


They're comparing against 5-year-old Kepler GPUs. I wonder how it would have fared vs. the latest Pascal cards, since they're several times more efficient than Kepler.


5-year-old Kepler GPUs are the best you can get in the cloud right now, and that's Nvidia's fault. So it's relevant to compare against them.


There are several providers which have been offering P100 GPUs for a while now: https://www.nimbix.net/nimbix-cloud-demand-pricing/



