Running TensorFlow at Petascale and Beyond (nextplatform.com)
77 points by rbanffy on Feb 5, 2019 | 28 comments



Petascale was a buzzword 10 years ago. I think it's been outdated for years now.

Case in point: a gaming GPU from a few years ago, the GTX 1080 Ti, easily does more than 10 TeraFlops. So you only need about 100 such GPUs (roughly $70k worth in January 2019) to do a PetaFlop of computation. This doesn't even count the high-end GPUs specifically built up for deep-learning workloads. Furthermore, those DL GPUs are dwarfed by what ASICs like Google's TPU and nVidia's DL ASIC can do (more than 100 TeraFlops a board, I think, though arguably those ASICs are mainly for inference, not learning).
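Rough arithmetic behind that claim (peak FP32 and street price are ballpark figures, not measured throughput from any real cluster):

    # Back-of-envelope: GTX 1080 Ti cards needed for a nominal FP32 petaflop.
    gpu_peak_tflops = 11.3      # GTX 1080 Ti peak single precision
    gpu_price_usd = 700         # rough street price, early 2019
    target_tflops = 1000        # 1 PetaFlop

    gpus_needed = target_tflops / gpu_peak_tflops
    print(f"~{gpus_needed:.0f} GPUs")                           # ~89
    print(f"~${gpus_needed * gpu_price_usd:,.0f} in hardware")  # ~$62k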

DeepMind routinely uses a few PetaFlops for its cutting-edge DRL systems (AlphaGo, AlphaStar, etc.).

IMO, in the next 5 years or so, and somewhere between 0.1 and 1 ExaFlop, we'll probably hit human-level AI.


Traditionally the HPC community means double precision when they talk about FLOPS, which consumer GPUs are not super great at (somewhere around 400 GFlops for the GTX 1080 Ti). The V100 is better at 7 or so TFlops, but also an order of magnitude more expensive. The smaller of the two machines in the article (Cori) is a ~20 PF (Float64) scale machine. However, even then it is quite challenging to get anything close to peak performance out of these systems (a bit easier if you're doing DL, which is more amenable to hand-optimized vendor libraries).

Of course for DL, people routinely use reduced precision, as you alluded to. There, we're looking at exascale at the moment. The second machine mentioned in the article (Summit) has ~3 EF of Float16 performance (of which their application reached ~1 EF). For comparison (and assuming about 50% MXU utilization, which is about the maximum I've seen), you'd need about 200 TPUv2 pods (at $384/hour/pod for the general public) to reach the same scale on this problem. Now, Google does have that scale, but it's certainly non-trivial.
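Spelling out that pod estimate (the per-pod peak is the commonly quoted 64 x 180 TF figure; the utilization is the ~50% assumed above):

    # How many TPUv2 pods to sustain ~1 EF at ~50% MXU utilization?
    pod_peak_pflops = 64 * 180 / 1000      # ~11.5 PFLOPS peak per TPUv2 pod
    utilization = 0.5                      # roughly the best I've seen
    target_pflops = 1000                   # ~1 EF sustained, as in the article

    pods = target_pflops / (pod_peak_pflops * utilization)
    print(f"~{pods:.0f} pods")                             # ~174, i.e. order 200
    print(f"~${pods * 384:,.0f}/hour at public pricing")   # ~$67k/hour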

Half-precision petascale is fairly routine these days. You can get that on the cloud for < $100/hr, and I routinely spin up such systems for ML training. However, petascale fp32/fp64 and exascale (b)f16 as discussed in the article are still fairly rare and usually preceded by months of planning to make sure things go right and the compute power is used usefully.
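For scale, a sanity check on the "< $100/hr" figure (instance size, fp16 peak, and price are rough 2019 cloud ballparks, not a quote from any provider):

    # Half-precision petascale from a single big cloud GPU instance.
    v100_fp16_tflops = 125      # V100 tensor-core peak, mixed precision
    gpus_per_instance = 8
    price_per_hour = 25.0       # ballpark on-demand rate, varies by provider

    peak_pflops = v100_fp16_tflops * gpus_per_instance / 1000
    print(f"~{peak_pflops:.0f} PFLOPS (peak fp16) for ~${price_per_hour}/hr")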

Disclaimer: not involved in the work, but working closely with the folks who are and I've used both the mentioned systems myself.


Thanks for the additional info.

And I don't disagree with anything you said. However, anything between 10 and 800 Petaflops I would consider "super"-petascale (with below 10 being petascale and above 800 being exascale).

Summit is clearly an Exascale machine. I looked at Cori and I was surprised to find out that it has no mention of GPU in the specs, despite the project starting as late as 2015. I personally think trying to scale up CPU-only supercomputers in a super-Petascale era is a losing battle, but that's just my 0.02.


Cori uses Knights Landing Xeon Phi accelerators, not GPUs.


> IMO, in the next 5 years or so, and somewhere between 0.1 and 1 ExaFlop, we'll probably hit human-level AI.

That is a mighty big prediction to just throw around so easily. Do you have any material to back that claim?

In my humble opinion, we are nowhere near general AI, let alone something on the human level. For all intents and purposes, even the most sophisticated NN models used and built these days are "dumb" when compared to biological intelligence.

It's not even clear whether general intelligence is an emergent property of all complex neural networks, or if there is some other secret sauce that is required. Having such a fundamental question unanswered and yet claiming that in just 5 years we will be capable of matching the highest-level intelligence we are currently aware of seems very bold to me.

Had you said something at least a bit more nebulous, like 50 or 100 years, one could take a bit of a leap of faith and posit that a massive breakthrough will happen that will give us insight into how to build proper AI, as opposed to what is, essentially, curve fitting with ever more parameters.

I'm not saying that curve fitting with ever more parameters cannot eventually reach human level intelligence, but so far there is zero indication that it can.


> That is a mighty big prediction to just throw around so easily. Do you have any material to back that claim?

It's part opinion and part based on progress. Call it a highly highly educated guess.

I divide the issues related to human-level AI into three categories: Hardware, Algorithm, Biology

1) Hardware: We have the hardware. As the other comments show, the biggest supercomputers are cranking out an ExaFlop or more, so I consider it a solved problem.

2) Algorithm: We don't have the algorithm yet, but we have made massive gains in the past 5-10 years. 5 years ago, if you asked an ML expert how soon a machine would be able to beat Go champions and StarCraft 2 champions, they would've said something very similar to what you said: that we don't even know how to tackle the problem, and we might do it in 20 to 40 years. Today both Go and StarCraft are solved problems. Research groups like DeepMind have nothing left to do but work directly towards moving closer and closer to human-level AI.

3) Biology: Even if we are completely stuck at figuring out the algorithm, we have a fallback: we look at biology. Actually, it's not even a serial thing. Neuroscientists are already looking at mammalian brains and trying to reverse engineer them, regardless of what CS/ML/AI folks are doing. We have already mapped the full connectome of the human brain [1]. We know very well how biological neurons work, barring some qualifications regarding quantum behavior. We have already simulated, or are in the process of simulating, the CNS of a worm [2]. Once we fully simulate the brain of a mouse, I don't think we'd be very far from scaling it up.

[1] http://www.humanconnectomeproject.org/

[2] http://www.artificialbrains.com/openworm


You're assuming that processor power is still exponentially increasing. What if it isn't? Basic CPUs aren't getting much faster, and ASICs are a one-time performance boost from creating custom chips.

There's an ideology around Moore's law that assumes it's a given indefinitely, but just because it's been true for 50 years doesn't necessarily mean it will be true for the next 50.


> Basic CPUs aren't getting much faster

Not per core, but they are getting faster per dollar and per watt. And when building huge compute clusters, those are the things you care about most.


Not only that - memory isn't getting much faster either. If you need more memory bandwidth, you need more memory controllers, better NUMA access, more logic dedicated to keeping caches coherent and so on. In many applications, memory is the bottleneck and all that FP64 capacity will be begging for numbers to crunch.
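A quick roofline-style illustration of why the FP64 units end up starved (peak and bandwidth are rough V100-class numbers, not anything from the article):

    # Roofline sketch: arithmetic intensity needed to keep FP64 units busy.
    peak_fp64_gflops = 7_000    # ~7 TFLOPS FP64, V100-class
    mem_bw_gbs = 900            # ~900 GB/s HBM2

    # FLOPs per byte moved required to be compute-bound rather than memory-bound
    min_intensity = peak_fp64_gflops / mem_bw_gbs
    print(f"need > {min_intensity:.1f} FLOPs/byte")     # ~7.8

    # A streaming FP64 triad (a[i] = b[i] + s*c[i]) does 2 FLOPs per 24 bytes
    triad_gflops = (2 / 24) * mem_bw_gbs
    print(f"triad sustains ~{triad_gflops:.0f} of {peak_fp64_gflops} GFLOPS")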

Besides, not all problems are embarrassingly parallel.


Building huge compute clusters definitely provides economies of scale, and can certainly reduce the cost per CPU, but it's a one-time fix. After a certain size, doubling the size of your datacenter doesn't reduce costs much, if at all.

Large modern data-centers have certainly added a few years to Moore's law as far as perceived performance goes, but they will also run up against a wall. In fact, the very emergence of large data-centers (as well as the rise in popularity of ASICs) is likely a response to the slowing down of Moore's law.


Petaflops really aren't about just summing up the flops of many independent nodes (at least in a supercomputer context, which this is).

It's about synchronized petaflops. DeepMind's and Google's internal networks aren't like MPI-over-InfiniBand networks: as published, they have significantly lower bandwidth and higher latency.

A lot has changed in the past few years as people figure out the appropriate models for large-scale synchronized flops computing. In the TF world, that includes things like Collective AllReduce (now in TF, originally implemented in projects like Horovod).
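For reference, a minimal sketch of that allreduce pattern with Horovod's TF 1.x API (toy model and data; this is not the code from the article):

    # Data-parallel training with ring allreduce via Horovod (TF 1.x style).
    # Launch one process per GPU, e.g. `horovodrun -np 8 python train.py`.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    x = tf.random_normal([32, 128])                    # stand-in for real data
    loss = tf.reduce_mean(tf.layers.dense(x, 1) ** 2)  # stand-in for a model

    # DistributedOptimizer averages gradients across workers with allreduce
    opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(1e-3 * hvd.size()))
    train_op = opt.minimize(loss)

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]      # rank 0 seeds the weights
    with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
        for _ in range(100):
            sess.run(train_op)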

When I worked in the supercomputer world, I couldn't get my code to scale synchronously beyond 128 nodes, so I couldn't really get time to run my codes there. (So I left for Google and ran Exacycle, which provided more unsynchronized CPU flops than anybody for a while in the mid-2010s.)


> somewhere between 0.1 and 1 ExaFlop, we'll probably hit human-level AI

You mean we'll hit human-level processing power in a single computer. The software/modelling part will still be sorely missing.

A promising path seems to be predictive coding, as popularized in the book "Surfing Uncertainty". Reviewed, for example, here: https://slatestarcodex.com/2017/09/06/predictive-processing-...


> IMO, in the next 5 years or so, and somewhere between 0.1 and 1 ExaFlop, we'll probably hit human-level AI.

The social implications of that will be as cosmic as they are terrifying.


It's still just a human brain the size of a warehouse.

The scary part starts a few years down the road, when the building houses a brain with a human equivalent IQ of 10,000.


No need for desk-sized RISC-humans for it to be damaging.

I sense society is getting depressed because the value and meaning of work has evaporated. How will people react to feeling existentially subpar?


> How will people react to feeling existentially subpar?

If the few SF books I've read on the subject are any guidance we will continue to feel superior to any AI-entity no matter its IQ or "computation speed". I'm also very, very skeptical that the singularity is even remotely within our technical reach, but that's another subject.


Well, I'd be surprised if people really don't feel dead inside when machines out-smart them at everything (and I'm not even thinking of singularity-like times, but anyway). Still weirds me out.


I know I wouldn't be overly concerned. I'd certainly hope the machines would be employed to make human life as complete and fulfilling as possible. There is also no real reason to be sad about not being the smartest animal on Earth (which is debatable anyway; dolphins and mice come to mind), and some reason to be a little proud that we, humans, gave birth to a new kind of life that transcends the limitations of our flesh.

If we are to continue to evolve, intelligent machines will be our partners and our extensions. There is only so much flesh can achieve, and we can dream much bigger than that.

What it will mean to be human 100, 1,000, or a million years from now is an open question, and a fascinating one.


I fear this because I think a vast majority of our society is built on personal skills.

It's possible that humans will now be freed from productivity-driven skill acquisition and go for passion and sharing in a land self-managed by smart assistants. Which could be great, you're right.


I'll continue writing software - it's what I love to do. If I can partner with an AI that can do it much better than me, I'll learn, just like I learn from my friends. In the end, we'll all get better at what we do, even if all the serious, vital work is done with help from or by AIs.


> those DL GPUs are dwarfed by what ASICs like Google TPU, and nVidia's DL ASIC

What is your definition of 'dwarfed'? 15%?


You can take the example of the nVidia Titan Xp (a GPU that targets the ML and scientific computing market) and the Google TPUv2.

- The Titan Xp has single-precision performance of 12 TeraFlops in boost mode.

- A Google TPUv2 device claims 180 TF (consisting of four TPUv2 chips at 45 TF each).

So I guess by dwarfed, I mean 6.67%


This comparison makes 0 sense:

You chose to compare TPUs to a Titan Xp, which is far from being the best ML GPU from NVIDIA, and you are comparing it against 4 TPUs.

I invite you to check out the various benchmarks comparing NVIDIA V100s against TPUv2/v3: they all show that performance is very similar, even slightly better for the V100 overall, iirc.

The only reason TPUs are interesting is if you plan to rent an accelerator (that's also why Google doesn't sell them): you get around 50% savings for the same results using TPUs.


nVidia's Volta is used primarily for training, and we've already hit 1 ExaFlop in some DL HPC systems.

Also, most HPC (and now, slowly, deep learning too) is limited by data movement, not raw FLOPs. LINPACK is out, HPCG is in.


I didn't know that..

https://www.techpowerup.com/238758/nvidia-announces-saturnv-...

Announcement from late 2017 ..

    FP16: exaflop
    FP64: petaflop
Interesting


I got interested in what parameters they were computing; I guess the cosmological constant or anything space-expansion related.

Anyway, cool stuff. I wish they'd given a peek at the results.


OK, so they are running huge amounts of data through TensorFlow.

I am more interested in what the data is and what can be gained by using such a large amount. I was under the impression that the benefits of more training data plateaued after a certain point.


Sure, for a fixed dataset, more training plateaus after a while. However, with more compute we can work on different types of data: for example, instead of using 2D CNNs on images, we can now use 3D CNNs on high-resolution point-cloud data. Also, as the speed of data acquisition grows, being able to train a given model on a given type and size of dataset to a fixed 'level' as fast as possible becomes even more important.
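To make the 2D-vs-3D point concrete, a small hedged example (shapes are made up; only the relative growth matters):

    # Same "3x3, 64 filters" design in 2D vs 3D: the extra dimension multiplies
    # both the kernel size and the activation volume.
    import tensorflow as tf

    m2 = tf.keras.Sequential(
        [tf.keras.layers.Conv2D(64, 3, input_shape=(128, 128, 1))])
    m3 = tf.keras.Sequential(
        [tf.keras.layers.Conv3D(64, 3, input_shape=(128, 128, 128, 1))])

    print(m2.count_params())   # 3*3*1*64 + 64 biases   = 640
    print(m3.count_params())   # 3*3*3*1*64 + 64 biases = 1,792
    # The 3D activation volume is also ~128x larger, so FLOPs grow far faster
    # than the parameter count.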



