I'm one of the developers of CUDA at NVIDIA, so if anyone has any questions about that or GPUs in general that they'd like answered, feel free to post it here and I'll do my best.
However, answers might be a bit delayed--I'm at SC10 this week (or if you're at SC10, I'm an easy person to find at the NVIDIA booth).
I'll come visit you! I've been learning OpenCL and it would be interesting to get your opinions on it as a CUDA dev :)
I know from my professor that CUDA is much nicer to work with since it's more mature, but I'm interested in OpenCL's future, especially when it might get some of the nice features CUDA has!
This is nice. I work at a University and I spoke to the guy who runs our High Performance Computing system about Amazon EC2 the other day. He made a good point which is worth noting:
You can bring up a lot of EC2 instances to run large jobs in parallel and get a lot of CPU horsepower, but the main drawback of EC2 is the network. Most of the jobs that run on our local "super computer" involve processing terabytes of data. Transferring terabytes of data to the cloud is painful, and will continue to be painful for a long time to come. I suspect even transferring terabytes of data between EC2 instances won't be smooth either.
Still, EC2 is great and I love how fast they're bringing out new features.
I guess that is the reason why you really don't want to spin up a lot of EC2 instances for large jobs. Instead, you may use the MapReduce framework so that you can exploit data locality. And of course, MapReduce only solves a small portion of parallel problems.
The two main points of the cluster compute instance were 10 Gbit/s full-bisection-bandwidth networking and high-end hardware. Now you can add GPUs as a third reason to use cluster compute instances.
I'm not sure I understand how that would help the issue of transferring terabytes of data over the Internet. I've not come across the "Cluster Instance" though, so maybe I'm missing something?
I have to say, running Clojure on instances like these for a couple of hours at a time to get a sense of what Clojure offers in terms of concurrency and parallelism on an 8-core machine with gobs of RAM is great fun - http://dosync.posterous.com/clojure-multi-core-amazon-cluste....
It's the kind of computing excitement I imagine Lisp Machine users had.
I spent a summer with a Symbolics Lisp Machine in 1988. At pretty much every point when Lisp machines were commercially available, they were slower at Lisp tasks than the generic competition. We couldn't wait to get off of the Symbolics onto generic Sun 4s (with 68020 chips, the same as Macintosh IIs).
Lisp machines promised to be unbelievably fast at Lisp because the Lisp interpreter was "in microcode". In reality, the much smaller market for Lisp machines meant that they just weren't developed nearly as quickly as Intel and Motorola were iterating, so by the time they shipped, there were mass market, general purpose CPUs that could interpret Lisp faster than Lisp machines could execute it natively.
Sorry, my point about "excitement" was simply about computing enthusiasm, not the historical performance of Lisp machines against other hardware architectures of the time.
First, Sun 4s were SPARCs, not 680x0s. That was the difference between Sun 3s and Sun 4s.
Second, it actually took several years from when the first Lisp Machine was completed in 1974 to when the first "generic" workstation was available in 1982, the year after Lisp machines became commercially available. It took several more years for Lucid Common Lisp and other such standard-CPU Lisps to improve their performance to the point where they were competitive with the specialized hardware. The exact crossover point depended on your application.
I think that by 1988 CISCy LispMs were a clear dead end for high-performance computing, along with every other 1970s architecture except the 360. The folks at Symbolics didn't think so, and kept designing and shipping new Ivory hardware for the next four years, but I think people were only buying them because of Genera.
The Sun/4 was Sparc-based. I remember, because around 1990-91 I was in a department with a mix of Sun/3s and Sun/4s, and the incompatible CPUs were an occasional spot of friction.
Interesting move.
We just bought two Tesla cards for a university project, so I know how much people could save by just prototyping on a "small" card that can do OpenCL and then using a Quadruple Extra Large instance at $2.10 per hour for the actual computation, instead of buying a workstation that costs five figures in euros.
The problem is that you need to tailor your problem to the core that's actually running it, which means you can't really develop on any old NVIDIA gpu and expect optimal performance on the cluster (see the sketch below).
But on the other hand, a developer who works 8 hours a day is going to cost about 22 dollars in instance time -- which isn't that much when you take his salary into account.
In other words: stop being cheap and start saving money.
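
On the tailoring point: one cheap habit is to query the device at runtime and size the launch from what it reports, so the same code at least runs sensibly both on the desktop card and on the cluster's Teslas. A minimal CUDA sketch, not anyone's actual code -- the kernel and the block-size heuristic are purely illustrative:

    // Query the GPU we actually landed on and size the launch from its limits.
    // Error handling omitted for brevity.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Running on %s: %d SMs, %zu bytes shared mem/block\n",
               prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock);

        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Pick a block size the device supports and enough blocks to cover n.
        int threads = prop.maxThreadsPerBlock >= 256 ? 256 : prop.maxThreadsPerBlock;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

This won't get you optimal occupancy on its own, but it keeps the code from silently assuming one particular card.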
> you can't really develop on any old NVIDIA gpu and expect optimal performance on the cluster.
OTOH, you can optimize your code to run on the bigger iron while you check it on the small one on your desk to see if the results are correct. Computers can make billions of miscalculations per second if the programmer is not watchful enough.
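
For what it's worth, that check can be as simple as running a CPU reference next to the GPU version and diffing the results before scaling up. A rough CUDA sketch of the workflow -- saxpy and the tolerance are placeholders, not anyone's real kernel:

    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 16;
        const float a = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f), ref(n);

        // CPU reference result.
        for (int i = 0; i < n; ++i) ref[i] = a * x[i] + y[i];

        // Same computation on the GPU.
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

        // Diff against the reference; billions of wrong answers per second add up.
        int mismatches = 0;
        for (int i = 0; i < n; ++i)
            if (std::fabs(y[i] - ref[i]) > 1e-5f) ++mismatches;
        printf("%d mismatches\n", mismatches);

        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }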
$22 was not salary, it was some estimate of the cost of developing on the GPU instance, per day. (Seems high though; it should be $2.10/hour * 8 hours = $16.80 unless there are other fees I didn't notice.) In any case, this is very cheap compared to the developer's salary, and since there are multiple cores, it could be shared. OTOH, there is some cost to developing on a remote machine, since it takes time to set up an environment as productive as the local one you already have.
Yeah, I tried an OpenCL FFT example, and it was very sensitive to a number of non-obvious hardware tuning parameters.
That actually turned me off OpenCL at the time. Are there any black-boxish GPU-accelerated FFT implementations? I'd like a drop-in replacement for FFTW.
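
In case it helps: cuFFT, which ships with the CUDA toolkit, is the usual "black box" FFT on NVIDIA hardware. It isn't API-compatible with FFTW, but the plan/execute flow is close enough that porting is mostly mechanical. A minimal single-precision 1D sketch (no error checking; compile with nvcc and link -lcufft):

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int n = 1 << 20;
        cufftComplex *d_signal;
        cudaMalloc(&d_signal, n * sizeof(cufftComplex));
        // ... cudaMemcpy your input into d_signal here ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // roughly fftw_plan_dft_1d
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_signal);
        return 0;
    }

The tuning knobs the OpenCL example exposed are handled inside the library, which is the "black boxish" part you're after.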
Tianhe-1A with a peak performance of 4.701 PetaFLOPS
Nit: Linpack is an astoundingly easy benchmark to optimize, and they only attained 2.57 Pflop/s there -- about 55% of the theoretical peak. Most real science runs at much lower efficiency than Linpack (often more than an order of magnitude lower), primarily for architectural reasons, so the theoretical peak number is even less meaningful than Linpack.
If you are looking at writing GPU code, check out http://www.tidepowerd.com/ -- a startup in this area that just released their first beta, a .NET GPU library/compiler.
edit: gnu->gpu, fixed after a harsh downvote; iPad autocorrect :)
No. Based on current exchange rates, the most optimized GPU-based system consumes more than ten times as much in electricity as it earns in completed bitcoins.
There is currently a massive oversupply of computing power, which drives down the profit of generating them. I find this quite silly -- even if you were certain that bitcoins are going to appreciate massively in value, the best way to invest in them right now is not setting up computing clusters but heading to the nearest exchange and buying them.
I can't imagine that this is what everyone is doing, but the marginal cost of using a VPS/colocated server's idle capacity is usually $0 to the customer.