I'm one of the developers of CUDA at NVIDIA, so if anyone has any questions about that or GPUs in general that they'd like answered, feel free to post it here and I'll do my best.
However, answers might be a bit delayed--I'm at SC10 this week (or if you're at SC10, I'm an easy person to find at the NVIDIA booth).
I'll come visit you! I've been learning OpenCL and it would be interesting to get your opinions on it as a CUDA dev :)
I know from my professor that CUDA is much nicer to work with since it's more mature, but I'm interested in OpenCL's future, especially when it might get some of the nice features CUDA has!
This is nice. I work at a University and I spoke to the guy who runs our High Performance Computing system about Amazon EC2 the other day. He made a good point which is worth noting:
You can bring up a lot of EC2 instances to run large jobs in parallel and get a lot of CPU horsepower, but the main drawback of EC2 is the network. Most of the jobs that run on our local "super computer" involve processing terabytes of data. Transferring terabytes of data to the cloud is painful, and will continue to be painful for a long time to come. I suspect even transferring terabytes of data between EC2 instances won't be smooth either.
Still, EC2 is great and I love how fast they're bringing out new features.
I guess that is the reason why you really don't want to spin up a lot of EC2 instances for large jobs. Instead, you may use the MapReduce framework so that you can exploit data locality. And of course, MapReduce only solves a small portion of parallel problems.
The two main points of the cluster compute instance were 10 Gbit/s full-bisection-bandwidth networking and high-end hardware. Now you can add GPUs as a third reason to use cluster compute instances.
I'm not sure I understand how that would help the issue of transferring terabytes of data over the Internet. I've not come across the "Cluster Instance" though, so maybe I'm missing something?
I have to say, running Clojure on instances like these for a couple of hours at a time to get a sense of what Clojure offers in terms of concurrency and parallelism on an 8-core machine with gobs of RAM is great fun - http://dosync.posterous.com/clojure-multi-core-amazon-cluste....
It's the kind of computing excitement I imagine Lisp Machine users had.
I spent a summer with a Symbolics Lisp Machine in 1988. At pretty much every point when Lisp machines were commercially available, they were slower at Lisp tasks than the generic competition. We couldn't wait to get off of the Symbolics onto generic Sun 4s (with 68020 chips, the same as Macintosh IIs).
Lisp machines promised to be unbelievably fast at Lisp because the Lisp interpreter was "in microcode". In reality, the much smaller market for Lisp machines meant that they just weren't developed nearly as quickly as Intel and Motorola were iterating, so by the time they shipped, there were mass market, general purpose CPUs that could interpret Lisp faster than Lisp machines could execute it natively.
Sorry, my point about "excitement" was simply about computing enthusiasm, not the historical performance of Lisp machines against other hardware architectures of the time.
First, Sun 4s were SPARCs, not 680x0s. That was the difference between Sun 3s and Sun 4s.
Second, it actually took several years from when the first Lisp Machine was completed in 1974 to when the first "generic" workstation was available in 1982, the year after Lisp machines became commercially available. It took several more years for Lucid Common Lisp and other such standard-CPU Lisps to improve their performance to the point where they were competitive with the specialized hardware. The exact crossover point depended on your application.
I think that by 1988 CISCy LispMs were a clear dead end for high-performance computing, along with every other 1970s architecture except the 360. The folks at Symbolics didn't think so, and kept designing and shipping new Ivory hardware for the next four years, but I think people were only buying them because of Genera.
The Sun/4 was Sparc-based. I remember, because around 1990-91 I was in a department with a mix of Sun/3s and Sun/4s, and the incompatible CPUs were an occasional spot of friction.
Interesting move.
We just bought two Tesla cards for a university project, so I know how much people could save by just prototyping on a "small" card that can do OpenCL and then using a Quadruple Extra Large instance at $2.10 per hour for the actual computation, instead of buying a workstation that costs five figures in euros.
The problem is that you need to tailor your problem to the core that's actually running it, which means you can't really develop on any old NVIDIA gpu and expect optimal performance on the cluster (see the sketch below).
But on the other hand, a developer who works 8 hours a day is going to cost about 22 dollars in instance time -- which isn't that much when you take his salary into account.
In other words: stop being cheap and start saving money.
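
On the tailoring point: one cheap habit is to query the device at runtime and size the launch from what it reports, so the same code at least runs sensibly both on the desktop card and on the cluster's Teslas. A minimal CUDA sketch, not anyone's actual code -- the kernel and the block-size heuristic are purely illustrative:

    // Query the GPU we actually landed on and size the launch from its limits.
    // Error handling omitted for brevity.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Running on %s: %d SMs, %zu bytes shared mem/block\n",
               prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock);

        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Pick a block size the device supports and enough blocks to cover n.
        int threads = prop.maxThreadsPerBlock >= 256 ? 256 : prop.maxThreadsPerBlock;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }

This won't get you optimal occupancy on its own, but it keeps the code from silently assuming one particular card.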
> you can't really develop on any old NVIDIA gpu and expect optimal performance on the cluster.
OTOH, you can optimize your code to run on the bigger iron while you check it on the small one on your desk to see if the results are correct. Computers can make billions of miscalculations per second if the programmer is not watchful enough.
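
For what it's worth, that check can be as simple as running a CPU reference next to the GPU version and diffing the results before scaling up. A rough CUDA sketch of the workflow -- saxpy and the tolerance are placeholders, not anyone's real kernel:

    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 16;
        const float a = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f), ref(n);

        // CPU reference result.
        for (int i = 0; i < n; ++i) ref[i] = a * x[i] + y[i];

        // Same computation on the GPU.
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

        // Diff against the reference; billions of wrong answers per second add up.
        int mismatches = 0;
        for (int i = 0; i < n; ++i)
            if (std::fabs(y[i] - ref[i]) > 1e-5f) ++mismatches;
        printf("%d mismatches\n", mismatches);

        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }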
$22 was not salary, it was some estimate of the cost of developing on the GPU instance, per day. (Seems high though; it should be $2.10/hour * 8 hours = $16.80 unless there are other fees I didn't notice.) In any case, this is very cheap compared to the developer's salary, and since there are multiple cores, it could be shared. OTOH, there is some cost to developing on a remote machine, since it takes time to set up an environment as productive as the local one you already have.
Yeah, I tried an OpenCL FFT example, and it was very sensitive to a number of non-obvious hardware tuning parameters.
That actually turned me off OpenCL at the time. Are there any black-boxish GPU-accelerated FFT implementations? I'd like a drop-in replacement for FFTW.
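
In case it helps: cuFFT, which ships with the CUDA toolkit, is the usual "black box" FFT on NVIDIA hardware. It isn't API-compatible with FFTW, but the plan/execute flow is close enough that porting is mostly mechanical. A minimal single-precision 1D sketch (no error checking; compile with nvcc and link -lcufft):

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int n = 1 << 20;
        cufftComplex *d_signal;
        cudaMalloc(&d_signal, n * sizeof(cufftComplex));
        // ... cudaMemcpy your input into d_signal here ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // roughly fftw_plan_dft_1d
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_signal);
        return 0;
    }

The tuning knobs the OpenCL example exposed are handled inside the library, which is the "black boxish" part you're after.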
Tianhe-1A with a peak performance of 4.701 PetaFLOPS
Nit: Linpack is an astoundingly easy benchmark to optimize, and they only attained 2.57 Pflop/s there -- about 55% of the theoretical peak. Most real science runs at much lower efficiency than Linpack (often more than an order of magnitude lower), primarily for architectural reasons, so the theoretical peak number is even less meaningful than Linpack.
If you are looking at writing GPU code, check out http://www.tidepowerd.com/ -- a startup in this area that just released their first beta, a .NET GPU library/compiler.
edit: gnu->gpu, fixed after a harsh downvote; iPad autocorrect :)
No. Based on current exchange rates, the most optimized GPU-based system consumes more than ten times as much in electricity as it earns in completed bitcoins.
There is currently a massive oversupply of computing power, which drives down the profit of generating them. I find this quite silly -- even if you were certain that bitcoins are going to appreciate massively in value, the best way to invest in them right now is not setting up computing clusters but heading to the nearest exchange and buying them.
I can't imagine that this is what everyone is doing, but the marginal cost of using a VPS/colocated server's idle capacity is usually $0 to the customer.