Benchmarking TensorFlow on Nvidia GeForce RTX 3090 (evolution.ai)
105 points by rkwasny on Sept 29, 2020 | 102 comments



That second table is a good example of why always including units (or even just a "higher is better") is a good idea... I have no clue what I'm looking at.

Edit: It's been edited, thx Evolution :) (or I totally glossed over it the first time around... but I don't think so)


Even after the edit, it's a bit confusing in that, without looking at the table, you get the impression that FP16 is slower than FP32.


Perhaps it has been edited. Now the table contains the following title "Training performance in images processed per second".

A "Higher is better" might still be interesting although redundant.


Even after being edited, it's still wrong. It shows the significantly lower Inception4 performance as a "40% speedup" instead of 40% of baseline images/sec.


"Training performance in images processed per second"?


Do you mean this table which has a caption over it that reads "Training performance in images processed per second" ?? Looks pretty self-explanatory.


This is a poor comparison of performance. All of these networks are CNNs, and very old architectures at that. They are all probably memory bottlenecked, which is why you see the consistent 50% improvement in FP32 perf.

It is also not clear what batch sizes are being used for any of the tests. If you switch to FP16 training, you must increase the batch size to properly utilize the Tensor Cores.
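
For context, a minimal sketch of what switching to FP16 (mixed precision) training looks like in tf.keras; the model, batch size and data here are placeholders, and older TF releases spell the API tf.keras.mixed_precision.experimental.set_policy:

    import tensorflow as tf

    # Compute in FP16, keep variables in FP32; the optimizer gets loss
    # scaling automatically when compiled under this policy.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Tensor Cores want large batches (and dimensions divisible by 8),
    # so bump the batch size when moving from FP32 to FP16.
    batch_size = 256  # placeholder; pick whatever fits in VRAM

    images = tf.random.uniform([batch_size, 224, 224, 3])
    labels = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int32)
    model.fit(images, labels, batch_size=batch_size, epochs=1)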

If you compare these cards at FP16 performance on large language models (think GPT-style with large model dimension), I am confident you will see the Titan RTX outperform the 3090. The former has 130 TF/s of FP16-with-FP32-accumulate tensor core performance while the latter has only 70 TF/s.

Link: https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/a...


The RTX 3090 is also $1000 cheaper than the Titan, so there's that. It would be nice if there were a good way to express value per dollar. Perhaps in GLUE accuracy and training time.


Totally agree! I think 3090 could be a lot more cost effective for researchers to dabble with NLP. But it really grinds my gears when people post these misleading benchmarks... the 3090 is handicapped at half-rate tensor core performance while the Titan RTX is not.

So if you're someone who does their work mainly in FP32, you will see improved performance with the 3090. On the other hand, if you are an FP16 speed demon who needs to train GPT-3 over the weekend, stick with your Titans :)


What do you think about TF32 on the 3090? Could it replace FP32 with a 5x speedup?


I've done a lot of work in ML numerics, and I think TF32 is a completely safe drop-in for FP32 for ML workloads. NVIDIA seems to think so too, which is why on A100 it won't even be an option, it will be the default mode for any FP32 matrix multiplies.

But on 3090, I don't think the speedup will be 5x, it should be closer to like 2x. The 3090 has 35.6 TF/s at TF32 and the Titan RTX has 16.3 TF/s at FP32. Once again I think there is handicapping going on for 3090.
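
If anyone wants to A/B this themselves, recent TensorFlow builds expose TF32 as a global switch (assuming an Ampere card and a new enough TF; the exact API location has moved between releases):

    import tensorflow as tf

    # TF32 is used for FP32 matmuls/convs on Ampere by default in recent TF;
    # turning it off gives you true FP32 for comparison.
    tf.config.experimental.enable_tensor_float_32_execution(False)
    print(tf.config.experimental.tensor_float_32_execution_enabled())  # False

    tf.config.experimental.enable_tensor_float_32_execution(True)  # allow TF32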


So basically no difference to FP32. That sounds very handicapped.


For many of us, the Inception-style CNN workloads--especially at FP32--are much more realistic than large language models that may be better suited to take advantage of the tensor cores. If I'm going to be memory bottlenecked either way, I probably don't want to spend an extra $1000 on 400 tensor cores I can't take full advantage of.


If I may ask, why are the Inception-style workloads still popular, rather than architectures like EfficientNet?

Also, why FP32? CNNs are some of the most robust models to train in FP16 (much easier than language models) so you could get yourself a quick XXX speedup and 2x memory savings by switching over.

(btw not intending to be accusatory or anything, I just think FP16 training deserves a lot more adoption than it currently seems to have :)
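
For reference, here is roughly what the PyTorch route looks like (a sketch with a placeholder model, optimizer and data; torch.cuda.amp has shipped with mainline PyTorch since 1.6):

    import torch
    from torch import nn
    from torch.cuda.amp import GradScaler, autocast

    # Placeholder model/optimizer/data, just to show the AMP pattern.
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
    ).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = GradScaler()  # loss scaling keeps tiny FP16 gradients from underflowing

    images = torch.randn(64, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 10, (64,), device="cuda")

    for _ in range(10):
        optimizer.zero_grad()
        with autocast():  # run ops in FP16 where it is safe to do so
            loss = nn.functional.cross_entropy(model(images), labels)
        scaler.scale(loss).backward()  # backprop on the scaled loss
        scaler.step(optimizer)         # unscales grads, skips step on inf/NaN
        scaler.update()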


Aside: Nit: Don't use gradients for discrete categories in a graph. Use a discrete color palette that perceptually distances colors as much as possible using a tool like this: https://medialab.github.io/iwanthue/
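
Something like this in matplotlib (the categories and numbers below are made up, purely to show a qualitative palette instead of a gradient):

    import matplotlib.pyplot as plt

    # Hypothetical categories and values, for illustration only.
    cards = ["Titan RTX", "RTX 2080 Ti", "RTX 3090"]
    images_per_sec = [300, 250, 450]

    # 'tab10' is a qualitative (discrete) palette, not a gradient.
    colors = plt.get_cmap("tab10").colors[:len(cards)]
    plt.bar(cards, images_per_sec, color=colors)
    plt.ylabel("Images per second (higher is better)")
    plt.show()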


Seems like a good speedup relative to the Titan, especially for the money. I’d be interested to see the performance relative to the 3080 though. There are obviously vram limitations with the 3080 but it would still be interesting to see the difference in raw compute performance.

In games the 3090 only gives a 15% performance bump relative to the 3080. If that pattern holds for machine learning tasks there is probably a scenario where it makes sense to buy two 3080s rather than one 3090.

If you are VRAM constrained then obviously the 3090 is the way to go.


If this isn't OT...

Could you kindly advise what kind of computer would make sense to purchase to begin learning about ML? I was assuming I'd get a 3080. Should I get a case that could potentially house 2 x 3080's? Does the case require any special cooling considerations, or just whatever will fit the cards? What CPU would you get?


If you're "learning about ML" there is no point in buying anything. Just get the cloud compute instead, and for home use and testing literally anything will do. I have friends who work with ML professionally and even they say it's just hard to justify running any computations at home once you factor in the electricity and hardware cost - GCP compute just beats the cost, easily.


What's the (pre-Ampere) GCP price for a V100? On AWS it was $3/hr, so at 100% use and market prices a Titan V would pay for itself vs the cloud inside a month. Is GCP significantly cheaper? Or are we talking about pricing at ~0% utilization?


100% utilization is a pretty huge assumption.

And if you ARE actually running it that hard, you'd better budget for fairly frequent replacement cards.


At 50% utilization it beats the cloud in 2 months. 10% utilization, it beats the cloud within a year. If you're dabbling, definitely go with the cloud, but if you're turning around experiments on a regular basis, buying gets attractive quickly.

And no, cards don't just keel over in a few months at 100%. Crypto miners ran that experiment. A typical card has years of 100% in it.
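
Spelling out the break-even arithmetic (a rough sketch; the $3/hr figure is from the comment above, while the card price and power cost are assumptions you should substitute with your own):

    # Rough cloud-vs-buy break-even; every number here is an assumption to tweak.
    cloud_rate = 3.00     # $/hr for a V100-class cloud instance
    card_price = 2500.00  # $ up front for a comparable local GPU (assumed)
    power_cost = 0.05     # $/hr electricity + cooling (~300W at $0.15/kWh, assumed)

    for utilization in (1.0, 0.5, 0.1):
        hours_per_month = 730 * utilization
        monthly_saving = hours_per_month * (cloud_rate - power_cost)
        print(f"{utilization:>4.0%} utilization: breaks even in "
              f"{card_price / monthly_saving:.1f} months")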


Only if your power is free (it isn't) and the machine the card is in is free (it isn't) and said machine produces no heat or noise (it does).


Power doesn't cost nearly as much for TCO to get anywhere near even the "preemptible price" of V100 (and probably A100 when it's ready) over a period of half a year. And now that 3090 has 24gb, which is needed for larger models, a solution with a couple of consumer cards is even more competitive for experimentation. You can also sell your cards in a year or two, and recover some of the costs. (All that of course if Nvidia's gimping of consumer cards doesn't significantly affect your code)


So for the people who I know do it professionally, the answer is really simple - when they run calculations for clients, they can add GCP compute time on the invoice. You can't bill a client for your own electricity usage. I mean you can factor it into the price of your service, but then that brings a whole pile of other issues with it.


Cloud if:

- Someone else is paying

- You expect to dabble

- You need burst capability

Buy if:

- Cost sensitive & capable


I rounded down AWS's price by $.06/hr, more than the price of electricity and cooling around here.


Wait, what? I used to mine crypto on GPU back when that was a profitable thing to do, which involves leaving the card maxed out for long periods of time.. no real damage.

Are modern cards really so fragile you can expect them to die off under heavy use even if properly cooled and not overclocked?


No. Obviously people who use them for gaming can play for thousands of hours at 100% utilization and they will last for many years on average.


No, they are not that fragile.


Agreed. If you’re just learning or building hobby stuff, you can use Colab, Paperspace, or any number of other services for free or very cheaply.


I want a gaming computer that won't limit my future ML learning. Are there any suggestions for that use case?


There's no such thing as long as you buy an actual mid to high tier GPU. Even an ancient GTX1070 would be more than enough - and for sufficiently large datasets even an RTX3090 will take hours to process whatever you're crunching.

Just buy a PC that you like for gaming (with an Nvidia GPU) and don't worry about ML yet - it's incredibly unlikely that you can pick something that would limit you in any way. Small datasets will run on anything, large datasets will take hours to process no matter what you run them on. It's not a "limit".


Some off-the-shelf gaming PCs are not very Linux friendly though, so they should watch out for that, especially the laptop varieties. Getting a lot of the ML stuff working locally in Windows is a nightmare.


You're probably better off building your own machine and dual booting Windows and Linux. Here's a good guide for ML requirements, only a little out of date (published before the release of the 3080):

http://timdettmers.com/2018/12/16/deep-learning-hardware-gui...


just make sure it's NVidia. whatever graphics card you want -- all their consumer cards will work great for deep learning.

make sure your motherboard and processor support whatever the newest version of PCIe is -- a major factor with deep learning is bandwidth moving data on/off the GPU.

AMD GPUs can theoretically be used for machine learning, but right now software support is lacking -- you will spend more time configuring and installing than learning. (AMD CPUs are fine though.)

it doesn't really matter that much though -- any gaming PC with a new-ish NVidia card can be used to do quite a bit of interesting ML.


This is also a reason why it might make sense to hold off unless you have some kind of time-sensitive project.

Nvidia came to dominate the market at a time when AMD wasn't making particularly competitive GPUs, but that isn't really the case anymore. For anything not so expensive that nobody is really going to buy it anyway, the current and expected (in less than a month) AMD GPUs are competitive on performance.

The result is that a lot of large customers, who see value in not being locked into a single supplier, are going to be pushing for frameworks that work across multiple vendors. And then you could plausibly be wasting your time learning Nvidia-specific technology which is about to become disfavored. So you might want to wait and see.


I tried to go red twice. Red team has been winning at perf/$ for a decade! I thought I did my homework and established compatibility and suitability for the purposes I cared about. Unfortunately, both times I eventually ran into unanticipated incompatibilities I couldn't work around. I wound up paying the green tax anyway and also the price spread + ebay fees. Oof.

Twice bitten... once shy? In any case, I'm going to let someone else be the guinea pig this time.


i think most people would just use TF/PyTorch and ignore the specific technology on the backend. not much GPU specific stuff to learn -- very, very few deep learning people write their own CUDA code.

so the question is just -- when will it be very simple to install these packages for AMD GPUs, with enough mathematical operations implemented and optimized to let you do the things you want to do.

right now things sort of work, but it's definitely in a bleeding edge early adopter state. it's seemed like AMD is on the cusp of catching up for a couple years now, but it's taken longer than I expected.


> very few deep learning people write their own CUDA code.

True, but even once TF/PyTorch support AMD well it's highly possible that an unanticipated CUDA dependency will pop up in one's computational journey. NVidia subsidized CUDA seminars for a decade and now it's all over the place, both in the flagship frameworks and in the nooks and crannies.


That's terrible advice. While Navi2 might finally be competitive(?), not being able to run most models due to CUDA/ROCm differences would seriously limit one's ML work.


Buy whatever makes you feel happy. I agree with gambiting that anything you choose won’t limit you


I'd honestly start with cloud options if learning is the only reason you're building the computer. You don't want to dump a bunch of money into depreciating GPUs if you're not going to end up using them.

GPUs are only really required in ML if you want to do deep neural network stuff. You can do plenty in CPU on reasonable data sets using any modern laptop.


Well to start off I’m not advising buying two 3080s. I haven’t seen the benchmarks, and on top of that the 3080 doesn’t support SLI so if you do buy two of them you will need to be using software which can utilize two independent GPUs.

If you're just wanting to learn machine learning you don't need anything particularly special. I think you would be happy with a GTX 1070. There is also the cloud computing route where you basically rent the GPU from AWS. That will initially be more cost effective than buying your own hardware.

One thing to keep in mind if you do go with the 3080 is the power consumption. Ampere cards are going to be much more power hungry than previous generations, and you will need to budget about 320W just to the graphics card. The recommended power supply for the 3080 is 850W.


Nvidia recommends 750W for the power supply.


It's worth noting that this recommendation has nothing to do with delivered power... a 500W unit can handle most systems on delivered power.

This has to do with transients: the 3000-series cards have some massive transients that can easily trip the over-current protection (OCP) on power supplies not designed for that kind of transient. A 700W power supply is able to handle those transients much better than a 500W PSU is.


That is true, but it all depends on how much the rest of your system uses. If you just have a mid-tier CPU and SSD, you may find that the rest of your system only uses 200W, and you can fit a 3080 on a 600W power supply. There are multiple reports of high-quality 600W PSUs working, so YMMV.


Go for it! Get a motherboard with reinforced PCIe slots for both GPUs though; the cheaper mobos only have one armoured slot. Also, for a dual setup I guess you really should use the 3080 Founders Edition, since it blows part of its heat out the back. Otherwise you need good airflow and as thin 3080 cards as you can find (so there's some space between them in the case). Still waiting for good thermal benchmarks on this kind of setup...

Of course if money is an issue, you are well off with only a single 3080 :)

There is no point in buying the 20xx series anymore. The 30xx are twice as good for the same money (if you can get one).


Woah, there. They said they were just learning. No need to purchase special hardware until you're trying to run state of the art models.

You can get very far on any laptop before hardware becomes the main blocker. And before building an ML machine, there are cloud compute options available for far cheaper.


The 3080 has no NVLink, so two 3080s would communicate only via PCIe, and it's unlikely they would present a single 20-40GB pool of virtual RAM the way two 2080 Tis with NVLink did under Linux.


If you're just beginning a 3080 is overkill, never mind two. Get a used 1080ti (or even 2080ti if not overpriced), it'll be cheaper and even has a bit more RAM.


I think for high throughput scenarios the 3090 probably has more headroom due to its higher TDP and better (larger) cooling solution, which might really matter here if you're driving the tensor cores at max the whole time.


Most video games probably aren't going to make the most of all of the extra CUDA cores on the 3090. I'm assuming that helps a lot with machine learning; can someone who knows for sure confirm?


Most parallel processing scales linearly with core count. But the 3090 is more interesting for machine learning because of its RAM: 24GB versus the 3080's 10GB. With machine learning, you spend most of the time copying memory between the CPU and GPU, so being able to fit more data on the card reduces computation latency.


“With machine learning, you spend most of the time copying memory between the CPU and GPU”

- this is a sign that you are most likely doing it wrong. Yes, some operations are inherently bandwidth bound, but most important ones such as larger matrix multiplies (transformers) and convolutions are compute bound.


Sure, but compared to the 3080 I'd say the main draw is the bigger RAM (for the copying reasons above) rather than the increased core count.


TIP: if you need to emulate a bigger batch with less RAM available, use the gradient accumulation trick. It's super easy to implement in PyTorch, and it's already available as a single flag (accumulate_grad_batches) in PyTorch Lightning.
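
In plain PyTorch the trick is roughly this (a sketch with placeholder model and data; Lightning's accumulate_grad_batches flag does the equivalent bookkeeping for you):

    import torch
    from torch import nn

    model = nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    accumulation_steps = 4  # effective batch = micro_batch * accumulation_steps
    micro_batch = 32

    optimizer.zero_grad()
    for step in range(100):
        x = torch.randn(micro_batch, 128, device="cuda")
        y = torch.randint(0, 10, (micro_batch,), device="cuda")

        loss = nn.functional.cross_entropy(model(x), y)
        (loss / accumulation_steps).backward()  # grads accumulate across micro-batches

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one update per effective (large) batch
            optimizer.zero_grad()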


Your gradient accumulation trick involves multiple CPU-to-GPU transfers, which is precisely what the parent is trying to avoid by fitting a larger batch in GPU memory.


There will be a 20GB version of the 3080 soon.


There’s been rumors of this, but it’s not confirmed. A big percentage of the cost of the 3090 (and 3080) is the GDDR6X. A 3080 with 20GB of GDDR6X will still be really expensive, so it’s unclear to me that they will actually release something like that. Potentially they could put that much RAM on a card and then use slower GDDR6, but that’s kind of an odd part in Nvidia’s product offerings because then the 3080 with more RAM would be slower in a lot of situations.


There are rumors about 48GB 3090 with older GDDR6 chips so having the same config for 3080 wouldn't be unexpected, at least until Micron could produce 2GB GDDR6X chips.


what would be the benefit there? 20GB can't be that much cheaper than 24GB, right?


It will be for the gamers who think that 10 GB isn't enough VRAM, and as a way for nVidia to have an answer for rumors that AMD's next GPU will have 16 GB.


2x 20gb 3080 for $1000 each is more cost effective than 2x 3090 for $1500 each, if your model fits into 20 gb.


My guess is it’s going to be $999


A 3080 with 20gb is planned already


Very likely not at the same price as the 3080. I am guessing it won't be much cheaper than the 3090, as VRAM is expensive.


Honestly, I had an RTX Titan for home use for a while. Eventually I moved to just using a 2080 Super and it performed at nearly the same level for my models. If you don't need ALL the extra memory and have the space for a triple-slot card, then the better value proposition by far for last gen seemed to be a good Super.


See also Tim Dettmers's fantastic post on GPU performance (which doesn't use benchmarks for the latest cards but instead calculates performance with a model):

https://timdettmers.com/2020/09/07/which-gpu-for-deep-learni...

HN Discussion:

https://news.ycombinator.com/item?id=24400603


Seems to be a good speedup overall relative to the 2080 Ti, including FP16 (see the relative 2080 Ti vs. Titan numbers: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...). Does this suggest we should see another, even more expensive Titan card in the pipeline, given the FP16 performance? Or maybe TF32 performance is what NVIDIA will promote this generation (only if they have better numbers than FP16?)?


Here's hoping for an A100 titan with un-nerfed FP64. The 3090 is twice as nerfed as previous generations, which were also bad at 1:32. Now it's 1:64 :(


It seems that the Radeon VII and Titan V are the last cards with decent FP64 performance for the foreseeable future. Both Nvidia and AMD now basically have different architectures for their consumer and datacenter products.


Yep, sure looks that way. I'll still be dreaming of a Titan A!


The FP64 units are a separate addition that eat a lot of die space, right? I wouldn't use the word "nerf" for the tradeoff between having more SMs versus having more features in the SMs.


They eat die space but not TDP.


What workloads do you run that require fp64?


Simulation.


Radeon VII is the best bang for the bucks in the FP64 space, then older K80s. Titan V/V100 are still expensive.


Yep! I've got CUDA, so I still have to pay the green tax.


FP64 is not needed for deep learning.


What makes you think I'm doing deep learning?


You’re commenting on the post called “Benchmarking deep learning workloads”.


Can someone explain the difference between FP16 and FP32 in these benchmarks, because the difference is pretty dramatic. I assume it's floating-point precision(?), but why would lower precision be relatively slower on the 3090? For training jobs, how does the precision impact the accuracy of the model?

Edit: clarified that I am referring to slower relative performance


Nvidia nerfed the FP16 performance at the software level to disincentivize people from using this card as a Titan / datacenter ML card replacement.


It isn't at the software level; FP16 goes through the tensor cores on Turing onwards: https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-revi...



The ALUs are capable of half precision regardless of the tensor cores and aren’t restricted.

For “tensor ops” on GeForce cards, FP16 with FP32 accumulate is done at half rate, so you don't get the doubled performance that you do get on Quadro and Titan cards using the same die.


FP16 is faster in this article on most models...


That's because of the improved memory bandwidth. See https://timdettmers.com/2020/09/07/which-gpu-for-deep-learni...


FP16 is faster (units are images per second)


The 3090 opted to bundle 2x FP32 units Bulldozer-style, and FP16 is now processed by those cores as well, so FP16 and FP32 have the same peak throughput (35.58 TFLOPS).

https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622


> FP16 is faster (units are images per second)

But does the model take a quality hit: does it need to train for more steps before converging to similar performance, or need more parameters?

FP16 obviously contains less information than FP32.


Sorry I was referring to the relative performance, I edited my question to be clearer


I just want to know how they installed the new nvidia cuda drivers without borking their Ubuntu/tf install.


Hi, we just installed everything using nvidia repo and .deb packages.

+ tf-nightly and other python libraries installed through pipenv


Nvidia has official PPAs with all versions of CUDA, libcudnn and the drivers. If you install from there you will not have problems.

It helps to stick to Ubuntu LTS versions though; that's what they support best.


NVidia drivers b0rked rebooting my box for a long time.

A couple of months ago, I removed all the references in apt sources, and followed the newer instructions (several times to get the right driver/cuda/tensorflow match) and my reboots are great, and only one GPU lock-up so far (probably due to overheating - I've had to replace a couple of components flagged as failed due to the heatwave in summer)

Jupyter hub is just great, I'd like to implement better diagnostics though ... have yet to find a good tutorial for that as yet.


If I have a really remote location and I need to do on-premises inference, am I better off buying one of the gaming GPUs or are they far behind the T4, etc.?


It's the opposite, the T4 is far behind gaming GPUs.

Reasons to buy a T4:

- You need the 16GB of RAM. You usually won't for inference, though.

- You need single-slot cards to use in a server chassis. Even here you are likely better off buying a Quadro than a T4

- You need the high power efficiency of the T4 (75W TDP)


Oh interesting.

Is there a benchmark chart or something like that where I can review the performance of something like YOLOv3 across cards?


I thought I read Nvidia was nerfing the GeForce cards. Does this disprove it?


NVIDIA has been nerfing FP64 performance on consumer GeForce cards for years. It's critical for scientific computing but not needed for ML. They have also banned running GeForce cards in datacenters.


No, the 3090 has nerfed tensor cores, and in some apps the Titan RTX is 5x faster (Siemens NX). FP16 with FP32 accumulate runs at half rate, as on the 2080 Ti, while the Titan's runs at full rate.



