Lots of people are focusing on this being done on a particularly powerful workstation, but the computer described seems to have power of a similar order of magnitude to the many servers that would be clustered together in a more traditional large ML computation. Either those industrial research departments could massively cut costs/increase output by just "magically keeping things in RAM," or these researchers have actually found a way to reduce the computational power that is necessary.
I find the efforts of modern academics to do ML research on relatively underpowered hardware by being more clever about it to be reminiscent of soviet researchers who, lacking anything like the access to computation of their American counterparts, were forced to be much more thorough and clever in their analysis of problems in the hope of making them tractable.
It's a good analogy. In particular, GPUs completely broke the trajectory of benefitting from carefully-managed compute granularity. One could say that CS/EE is now back-filling harder problems that didn't demand to be solved over the last decade because of GPUs.
Optimizing for cache management and branch prediction is very difficult. And most programming is done at a level of abstraction that isn't amenable to staying portable (i.e. staying nimble) after optimization.
Plug: Staying algorithmically nimble after optimization is a problem we've "solved for" at our startup (monument.ai)
If anything it seems to me that doing the most work under constraint of resources is precisely what intelligence is about. I've always wondered why the consumption of compute resources is itself not treated as a significant part of the 'reward' in ML tasks.
At least if you're taking inspiration from biological systems, it clearly is part of the equation, a really important one even.
I dunno... $8000 builds a 64c/128t 256 GB RAM workstation with the same GPU these researchers used (https://pcpartpicker.com/list/P6WTL2). That's arguably in the realm of a home computer for just about anyone making $90,000 and above, and I would think anyone working in those fields could command at least that salary, unless they're in a truly entry-level position. Seems it would be a reasonable investment for someone actively working in the area of machine learning / artificial intelligence.
They used an Nvidia 2080 TI, which is expensive today, but rumours are that the entry level 3XXX series will be similar to the 2080 TI in performance. That should be dropping in about a month!
In a ML setup your GPU is going to be the workhorse. I'd spend less on the CPU and memory and spend more on the GPU. Throwing something together REALLY quick (I'm being lazy and not checking) I'd go something closer to this https://pcpartpicker.com/list/T9Ds27 because of the graphics card. But getting a Ti would improve a lot just for GPU memory. You could get a lot done with a machine like this though (I assume it would be good for gaming too. Disclosure: not a gamer).
tldr: upgrade GPU, downgrade CPU and ram to keep similar pricing.
This depends quite a lot on the domain. In some image processing tasks you can actually be cpu-bound during dataloading. So either you get tons of RAM and preload the dataset, or you use more cores to queue up batches. You still need a good GPU and generally I'd agree to prioritise that first.
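For what it's worth, a minimal PyTorch sketch of those two options (random tensors stand in for real data, and MyImageFolderDataset is a hypothetical placeholder for whatever Dataset class you already use):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Option 1: preload the whole dataset into RAM once (needs enough memory).
    # Random tensors stand in for your real preprocessed data here.
    images = torch.randn(50_000, 3, 96, 96)   # ~5 GB of float32 images
    labels = torch.randint(0, 10, (50_000,))
    in_memory_loader = DataLoader(
        TensorDataset(images, labels),
        batch_size=256, shuffle=True,
        pin_memory=True,                      # faster host-to-GPU copies
    )

    # Option 2: keep data on disk and throw more CPU cores at loading batches.
    # on_disk_loader = DataLoader(
    #     MyImageFolderDataset("/data/train"),  # hypothetical Dataset class
    #     batch_size=256, shuffle=True,
    #     num_workers=8,                        # more workers if loading is the bottleneck
    #     prefetch_factor=4,                    # batches queued per worker (PyTorch >= 1.7)
    #     pin_memory=True,
    # )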
You can get CPU-bound; I'm $100 under budget, and you could put that money towards the CPU or RAM. I did also leave a path to upgrade RAM. But that said, I've been working with image generation a lot lately and CPU really isn't my bottleneck.
If doing image processing (or NLP with recurrent nets), I wouldn't save on VRAM. 11 GB minimum (2080 Ti / 1080 Ti). Otherwise you can't even run the bigger nets at a good image resolution.
I think the more important cuts that went unmentioned are the CPU cooler and mobo. You could arguably cut the CPU cooler completely, since these Ryzen CPUs include one.
The mobo cut is also a pretty useful savings, though it will be an obstacle to multiple-GPU setups.
Also, the parent comment misleadingly suggests that a 3900X costs less than $300. That seems like an error in pcpartpicker, since clicking through reveals a true price of $400+.
Well, an m5ad.24xlarge is 96 threads and 384 GB with your own 2x SSD (saves on EBS bandwidth costs). So fewer threads but a bit more memory. (We'll guess that's a 48-core EPYC 7642 equivalent with 96 threads, since there is no 96-core version.)
How does having your own SSD save on bandwidth? You have to send the data to and from the cores where the computation is taking place.
There are certainly some workloads where it makes sense to own your own storage and rent computation, but you can't assume that by default for a "powerful AI" workload.
There is no exact cloud equivalent - the researchers used commodity hardware for their GPU, something that NVIDIA doesn't allow for use in data centres.
The closest you can get on AWS (more like System #3 in the paper with 4x GPUs) would be something like a p3.8xlarge instance [1] that'll cost you $12.24/hour (on demand) or $3.65 to $5 (spot price, region dependent) [2].
A single-GPU instance (p3.2xlarge, only 16 vCores, though) will cost you $3.06/hour on-demand or $0.90 to $1.20 (spot).
They mention multiple architectures, going down to a 10-core, 128 GB RAM, 1080 Ti setup. The next setup is an improved CPU, double the RAM, and a 2080 Ti. The third setup is four 2080 Tis. See Table A.1.
For shallower models CPUs aren't overshadowed by GPUs as much - but beyond a certain # of parameters the CPU loses out as GPUs do vector math highly efficiently.
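A crude way to see that crossover for yourself (just repeated matrix multiplies, not a careful benchmark; assumes PyTorch and, for the GPU column, a CUDA device):

    import time
    import torch

    def time_matmul(n, device):
        # Average time of a few n x n matrix multiplies on the given device.
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(3):
            a @ b
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / 3

    for n in (256, 1024, 4096):
        cpu = time_matmul(n, "cpu")
        gpu = time_matmul(n, "cuda") if torch.cuda.is_available() else float("nan")
        print(f"{n:4d}x{n:<4d}  cpu {cpu*1e3:8.2f} ms   gpu {gpu*1e3:8.2f} ms")

Small matrices keep the CPU competitive; the gap opens up quickly as n grows.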
"The team also tested their approach on a collection of 30 challenges in DeepMind Lab using a more powerful 36-core 4-GPU machine. The resulting AI significantly outperformed the original AI that DeepMind used to tackle the challenge, which was trained on a large computing cluster."
Well, they presumably tested the same CPU with 4 GPUs (2080 Ti I think) - maybe they wanted to compare.
Kind of a weird headline, the vision-based RL this article dubs 'powerful AI' could already easily be trained on a single (pretty expensive) computer. They say as much in terms of the speed ups they provide:
"Using a single machine equipped with a 36-core CPU and one GPU, the researchers were able to process roughly 140,000 frames per second while training on Atari videogames and Doom, or double the next best approach. On the 3D training environment DeepMind Lab, they clocked 40,000 frames per second—about 15 percent better than second place."
Maybe my AI tasks are too simplistic, but I never had a problem training AI on a single machine, and as many have pointed out, there are always cloud services. Otherwise, I find it impressive work; it takes some courage to say: well, there are so many huge companies with their frameworks, but we can outperform them all on specific problems.
A couple of things that stood out from the GitHub page:
Currently we only support homogenous multi-agent envs (same observation/action space for all agents, same episode duration).
For simplicity we actually treat all environments as multi-agent environments with 1 agent.
My speculation is that this is why they gained such a dramatic performance improvement (but I might be very wrong).
I'm surprised that this is IEEE worthy and not just common sense. Of course there'll be huge speedups if, and only if, your dataset fits into main RAM and your model fits into the GPU RAM.
But for most state-of-the-art models (think GPT with billions of parameters) that is far from being the case.
Yes. The Jukebox model was trained on 512 V100 GPUs for 4 weeks. Try doing that on an $8k workstation.
Not saying it wouldn't be a worthwhile goal to improve the algorithms so that it becomes possible. At least on an 8x V100 machine, for Christ's sake. Because that's all I got.
> At least on an 8x V100 machine, for Christ's sake. Because that's all I got.
Well that's still one powerful supercomputer and allows you to pretrain BERT from scratch in just 33 hours [1].
I mean that's $100,000 in hardware you have at your disposal right there, which is still orders of magnitude beyond 8k-level workstation hardware...
It speaks to the sad state of affairs that is SOTA ML/AI: only well-funded private institutions (like OpenAI) or multinational tech giants can really afford to achieve it.
It's monopolising a technology and papers like this help democratise it again.
Yes, it would be great to see AI training becoming more democratized again, but with its mere ~2x speedup this paper won't help that much. Plus, the most expensive part of training a novel AI might well be hiring all the people you need to create a dataset spanning millions of examples.
Training data isn't always an issue. There are plenty of methods that don't require labels or use "weakly labelled" data.
Since most contemporary methods only make sense if lots of training data is available in the first place, many companies interested in trying ML do have plenty of manually labelled data available to them.
Their issue often is that they don't want to (or can't for regulatory reasons) send their data into the public cloud for processing. Any major speed-up is welcome in these scenarios.
> His group took advantage of working on a single machine by simply cramming all the data to shared memory where all processes can access it instantaneously.
If you can get all your data into RAM on a single computer, you can have a huge speedup, even over a cluster that has in aggregate more resources.
Frank McSherry has some more about this, though not directly about ML training.
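Some back-of-the-envelope arithmetic on why that is (the 140,000 frames/s figure is the one from the article; the 84x84 observation size and the bandwidth numbers are just typical, assumed values):

    ram_bw_gb_s = 50.0          # rough dual-channel DDR4 bandwidth
    nic_bw_gb_s = 10.0 / 8.0    # a 10 GbE link, in GB/s
    frame_bytes = 84 * 84 * 4   # one small preprocessed observation (float32)
    frames_per_s = 140_000      # throughput reported in the article

    traffic_gb_s = frame_bytes * frames_per_s / 1e9
    print(f"observation traffic:       {traffic_gb_s:.2f} GB/s")
    print(f"fraction of RAM bandwidth: {traffic_gb_s / ram_bw_gb_s:.1%}")
    print(f"vs. a single 10 GbE link:  {traffic_gb_s / nic_bw_gb_s:.0%}")

Under those assumptions the observation stream alone is roughly 4 GB/s: a small fraction of local memory bandwidth, but several times more than a commodity network link can carry.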
Hi Alex, cool stuff! A clarification from Section 3.2 on coordinating between rollout workers and policy workers: you say that we want k/2 > t_inf / t_env for maximal performance. If t_env is defined as the time required for one thread to simulate one step, it seems like we want k/2 > N * t_inf / t_env, where N is the number of cores/threads on the machine.
For example, say t_inf = 10 microseconds, t_env = 1 microsecond and we are training on a 30 core machine. When k=600, we batch 300 inference steps in 10 mics, and complete 300 simulation steps on the other half of the rollout workers in 10 mics. Both cohorts of rollout workers finish at the same time, achieving optimal performance.
Am I thinking about this correctly? Also, this equation assumes that we can batch up k/2 inference jobs on the GPU without increasing t_inf, correct?
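For concreteness, here's a tiny check of that balance point under the assumptions above (all N cores simulate one half of the k envs while the GPU runs one batched forward pass for the other half; the microsecond timings are the hypothetical ones from my example):

    t_inf = 10.0    # one batched inference call (assumed independent of batch size)
    t_env = 1.0     # one CPU thread simulating one env step
    n_cores = 30    # rollout worker threads

    for k in (60, 300, 600, 1200):
        sim_time = (k / 2) * t_env / n_cores   # half the envs spread over all cores
        hidden = sim_time >= t_inf             # is the GPU latency hidden behind simulation?
        print(f"k={k:5d}  sim={sim_time:6.1f} us  inf={t_inf:.1f} us  inference hidden: {hidden}")

Only at k = 600 (i.e. k/2 = N * t_inf / t_env) do the two halves line up.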
The article makes the point that your technique will give an advantage to academic teams that don't have the resources of big corporations. To me it seems that your technique optimises the use of available resources, but the amount of available resources remains the deciding advantage. That is to say, both large corporate teams and smaller academic teams can improve their use of resources using your proposed approach, but large corporate teams have more of those resources than the smaller academic teams. So the large corporate teams will still come out ahead and the smaller academic teams will still be left "in the dust", as the article puts it. What do you think?
The key is that large corporations/labs achieve scale through many distributed machines. This paper explores optimizations that are particular to a single multi-core machine. These optimizations exploit low-latency shared memory between threads on one machine, and thus cannot be replicated on a distributed cluster.
I wish machine learning would have become mainstream on a language with more competent multithreading capabilities than python. During my machine learning course I knew I could squeeze more performance out of my code by parallelising data preprocessing and training (pytorch), but python cannot do proper multithreading. The multiprocessing module requires you to move data between processes, which is slow.
First time I've seen this; it looks like it's brand new (Python 3.8). The problem now is that you have to serialize/deserialize your data and modify your numpy/pytorch code to use the shared memory. It's an improvement in performance, true, but not as fast and easy as just sharing variables between threads.
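For reference, this is roughly the ceremony involved (a minimal numpy sketch; the array shape is made up):

    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(shm_name, shape, dtype):
        # Attach to the existing block and view it as a numpy array -- no copy.
        shm = shared_memory.SharedMemory(name=shm_name)
        data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        print("worker sees mean:", data.mean())
        shm.close()

    if __name__ == "__main__":
        # Parent allocates one shared block and fills it once.
        arr = np.random.rand(10_000, 84, 84).astype(np.float32)
        shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
        shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
        shared[:] = arr   # one copy into shared memory; workers then read in place

        p = Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
        p.start()
        p.join()

        shm.close()
        shm.unlink()

Compared to just reading a variable from another thread, you're managing block names, lifetimes and dtypes by hand, which is basically the complaint above.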
Well the core of those libraries is already written in C or C++, so it's a matter of writing bindings to interface with the core. I think tensorflow and pytorch today have some C/C++ bindings, but I don't know how good they are.
Slightly off topic, but where should an indie .io game developer start to build the AI players in their game?
For the record, I am a solo developer in the process of developing a to-be-online browser game. I need to make intelligent bots to keep players busy until the game has a lot of online players.
I had a look at Reinforcement Learning, but I am not sure people are really using it for this use case.
I would start with David Silver's (DeepMind) YouTube series to get an idea of what's possible or not.
Running an already trained reinforcement learning agent is relatively cheap (unless your model is massive).
I suspect the reason people aren't using it yet is because it's a) really difficult to get right in training, even basic convergence is not guaranteed without careful tuning b) really difficult to guarantee reasonable behavior outside of the scenarios you're able to reach in QA.
There's a really old Game AI book by O'Reilly with a lot of clever, simple solutions. You don't need the most complex, innovative, state-of-the-art algorithms for games; most games have a really simple AI concept.
Start by making agents using heuristics rather than AI. Once you have these in your game, at least you can release it. If you are not happy with their performance, then you can start to look at RL methods.
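As a concrete starting point, a toy heuristic bot can be little more than a few prioritized rules. A sketch (the entity attributes and the heading-angle action are hypothetical and would need adapting to your game loop):

    import math
    import random

    def dist(a, b):
        return math.hypot(a.x - b.x, a.y - b.y)

    def heading_towards(a, b):
        return math.atan2(b.y - a.y, b.x - a.x)   # angle to move along

    class HeuristicBot:
        # Chase what you can eat, flee what can eat you, otherwise wander.

        def __init__(self, flee_health=0.3):
            self.flee_health = flee_health

        def act(self, me, visible):
            threats = [e for e in visible if e.size > me.size]
            prey = [e for e in visible if e.size < me.size]

            # Rule 1: if weak and something bigger is nearby, run the other way.
            if threats and me.health < self.flee_health:
                nearest = min(threats, key=lambda e: dist(me, e))
                return heading_towards(nearest, me)

            # Rule 2: otherwise chase the nearest smaller entity.
            if prey:
                nearest = min(prey, key=lambda e: dist(me, e))
                return heading_towards(me, nearest)

            # Rule 3: nothing interesting in sight -- wander randomly.
            return random.uniform(0.0, 2.0 * math.pi)

A handful of rules like this is usually enough to ship, and an RL policy can later be swapped in behind the same act() interface.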
It looks obvious when you write it like that but I think many people are surprised by just how much slower distributed computations can be compared to non distributed systems. Eg the COST paper [1]
There's a lot of research that you can get away with on smaller hardware. Much of my own I can do with a single 2080Ti. BUT at times it is extremely frustrating and I don't have the memory to tackle some problems with just that hardware.
If you want to see major improvements in the academic and home side of AI then NVIDIA and AMD need to bring more memory to consumer hardware. But there isn't much incentive because gamers don't need nearly the memory that researchers do.
I'm reminded of the following episode of Pinky and the Brain:
Brain plans to use a "growing ray" (originally a "shrinking ray") to grow Pinky into super-size while dressed up as Gollyzilla, while Brain would turn himself gigantic and stop him, using the name Brainodo, in exchange for world domination. However, the real Gollyzilla emerges from the ocean and starts to rampage through the city, making Brain think that the dinosaur is Pinky. The episode ends with the ray going out of control and making everything on Earth grow, including the Earth itself, to the point that Pinky, the Brain, and even Gollyzilla are mouse-sized by comparison again.
Being able to train deep RL models on commodity hardware is only an advantage if there isn't anyone that can train on more powerful hardware (or if somehow training on more powerful hardware fails to improve performance with respect to your model). Otherwise, you're still just a little mouse and they have all the compute.
Well, you don't actually have to have your models fight their models. There are lots of things to try that aren't the SOTA rat race.
Like, if you want to show relative improvement of some new variation of an RL algorithm, this could be a good way to do it. Or if you have a new environment that you want to solve for yourself. Right now, if you try to train anything in a moderately interesting environment on a PC, it takes just a little too long to get results, which makes the whole research process pretty painful.
I'm afraid that computing resources are not and have never been the limiting factor for innovative work in machine learning in general and in deep learning in particular. I have quoted the following interview with Geoff Hinton a number of times on HN - apologies if this is becoming repetitious:
GH: One big challenge the community faces is that if you want to get a paper published in machine learning now it's got to have a table in it, with all these different data sets across the top, and all these different methods along the side, and your method has to look like the best one. If it doesn’t look like that, it’s hard to get published. I don't think that's encouraging people to think about radically new ideas.
Now if you send in a paper that has a radically new idea, there's no chance in hell it will get accepted, because it's going to get some junior reviewer who doesn't understand it. Or it’s going to get a senior reviewer who's trying to review too many papers and doesn't understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that's really bad.
What we should be going for, particularly in the basic science conferences, is radically new ideas. Because we know a radically new idea in the long run is going to be much more influential than a tiny improvement. That's I think the main downside of the fact that we've got this inversion now, where you've got a few senior guys and a gazillion young guys.
In other words, yes, unfortunately, everything is the SOTA rat race. At least anything that is meant for publication, which is the majority of research output.
at the same time, if you go to this year's ICML papers and ctrl-F "policy", there are several RL papers that come up with a new variant on policy gradient and validate it using only relatively small computing resources on simpler environments without any claim of being state of the art. probably many would directly benefit from this well-optimized policy gradient code.
Well, that's encouraging. "Pourvu que ça dure !" ("Let's hope it lasts!"), as Letizia Bonaparte said.
It's funny, but older machine learning papers (most of what was published throughout the '70s, '80s and '90s) was a lot less focused on beating the leaderboard and much more on the discovery and understanding of general machine learning principles. As an example that I just happened to be reading recently, Pedro Domingos and others wrote a series of papers discussing Occam's Razor and why it is basically inappropriate in the form where it is often used in machine learning (or rather, data mining and knowledge discovery, since that was back in the '90s). It seems there was a lively discussion about that, back then.
Nah, they pursue the problems that are relevant to them, which are a very small subset of the problems being tackled globally... so you can still win your particular race at your snail pace, if they are not running?
That's not exactly how it works. For example, Go, the board game, is not particularly relevant to Google, owner of DeepMind. Yet they spent considerable resources in creating and training a succession of systems that excel at it. Besides some applications, big technology companies see AI partly as a future investment, partly as very good publicity. My guess in any case.
Then again there are the general problems that are relevant to everyone, like natural language understanding, question answering, image recognition and so on, high-level tasks for which solutions have broad applicability. Rather tautologically, such tasks are always relevant to anyone who can perform them well.
In any case, if this was not the case there wouldn't be any motivation for academic teams to find ways to train with smaller computing resources, as the article reports.
Their results on lasertag are very poor. Looks like their network was incapable of solving it. Which probably means for apple-to-apples comparison to IMPALA they need a better network, which might require them to forego their throughput advantage.
The network in our work is exactly the same as in IMPALA paper. The overall score of our agent is slightly higher, it did better on some envs and it did worse on some other envs. These lasertag levels are exploration problems, and with a bit of hyperparameter tuning they are not difficult, it's just the agent that learned to do 30 different things somehow sucks on these levels.
Were other problems less about exploration? If not, I'd argue lasertag might be more important.
In various benchmarks the geometric mean is usually used to compare total score across different tasks to account for severe issues with specific tasks.
If your network is the same as IMPALA, why do you show its results with different hyperparameters? Were some of them necessary for the optimization (e.g. reduced batch size)?
Interesting article, I will check the GitHub at the weekend. But assuming Atari resolution of 40x192, a single channel, 4-byte floats, and 140,000 images, you get 4,300,800,000 bytes, roughly 4 GB of data. I wonder if they talk about complete training time.
A mid-range gaming laptop can do much more and better in 2020 than in 2018, thanks to novel available batch functions, a number of preliminary hacks and much hw-friendlier frameworks.
While that's likely true, I generally find that it's quite rare that enough attention is paid to how information is moved between disk, RAM, CPU and GPU. And paying close attention to that can be extremely helpful. Taking the RAM up to 11 can eliminate a lot of the art to it, which is a good thing.
You aren't wrong - we've vacillated between personal and cloud (mainframe, shared, or centralized) for many years now and it seems that we'll probably continue on this hybrid approach for a while.
Personally I prefer to own my own compute, my own storage and my own hardware. Cloud isn't any cheaper in the aggregate long-term, but it does spread risk and costs across a longer term.
Considering everything I'd rather own the risk and up-front costs just for my own privacy and self-determination.
I don't need to own a fast workstation unless I want to continuously train my models. I can, however, quickly get a cloud instance that's much larger than that and train the model at a fraction of the time and cost of a desktop workstation.
And continuously training models is very hard, particularly in an RL environment. Even then, cloud services over the long term are possibly a more cost-effective solution than hosting your own small cluster (a few tightly packed racks).