From a Frat Social Network to a GPU Compiler for Hadoop

daniel-levin · on Nov 19, 2013

From the product website:

>> Wish that O(n^3) algorithm could execute in real-time? Now it's possible! We encourage you to push the limits of our platform and disregard what was previously intractable.

One of the most important things to note about asymptotic analysis is that the speed of your computer is immaterial. An O(n^3) algorithm will experience cubic slowdown, irrespective of how it's implemented. A blazing fast, parallelised solution just reduces the constant multiple in front of the n^3, but ultimately, n^3 will outgrow that constant. I'm sure you know this, but it's disingenuous to say that you can disregard an algorithm's complexity because of your data / results. A particular algorithm with particular space complexity (also important) was run on a particular input and was much faster - granted, this is impressive - but not definitive enough to justify labels like '1000x faster' and 'disregard what was previously intractable'.

Also, O(n^3) is considered polynomial, which is absolutely not intractable. The whole point of intractable problems is that they are too complex to solve (given current best known methods). This might be very useful for applications that are currently too slow to run in real-time, but it is absolutely not a solution to problems of much higher-order (read: non-polynomial) complexity.
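To put a number on the constant-factor point, here is a rough sketch. The 1e9 operations per second figure and the one-second budget are made-up assumptions; only the ratio matters. It asks how large n can get for an n^3 algorithm before and after a flat 1000x speedup.

```python
def max_n_in_budget(ops_per_second: float, budget_seconds: float) -> float:
    """Largest n for which n**3 operations fit in the time budget."""
    return (ops_per_second * budget_seconds) ** (1.0 / 3.0)


# ASSUMPTION: a hypothetical machine doing 1e9 useful operations per second.
baseline = max_n_in_budget(1e9, 1.0)
# The same machine with a flat 1000x constant-factor speedup.
accelerated = max_n_in_budget(1e9 * 1000, 1.0)

print(f"baseline n ~ {baseline:.0f}, accelerated n ~ {accelerated:.0f}")
print(f"input size grows by ~{accelerated / baseline:.1f}x")
```

Under these assumptions a 1000x faster machine handles inputs only about 10x larger, because the cube root of 1000 is 10.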

ac2u · on Nov 20, 2013

Strictly speaking, isn't the link between whether an algo is polynomial or not in time complexity only loosely related to its intractability?

i.e., polynomial usually correlates with tractable, but not necessarily?

My knowledge in this is rusty though so please correct me if that's inaccurate.

Hydraulix989 · on Nov 19, 2013

Right, we're not changing the complexity itself; N-ary parallelization in the ideal case only scales the constant, s.t. C' = C_0 / N.

daniel-levin · on Nov 19, 2013

By Amdahl's law, that idealisation assumes that your program is totally parallelisable. Very few real world applications actually are. Out of interest, was yours? (Or at least very close to it?)

Hydraulix989 · on Nov 19, 2013

Yes, our algorithm performed an O(N^2) loop over the graph vertices so B for us was very close to 1.

sanskritabelt · on Nov 20, 2013

If this is the Moyes algorithm which does graph clustering I think you have to ask yourself a couple of things: 1. can a parallel algorithm that optimizes each node individually get the same result as a serial algorithm that optimizes them sequentially (this is not at all obvious, and for some graph clustering algorithms the answer is "no") 2. is a O(n^3) total work algorithm really that much better than one of the multiple existing subquadratic graph clustering algorithms out there.
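A toy illustration of point 1, and explicitly not the clustering algorithm under discussion: the update rule below (each node copies its single neighbour's label) is a made-up example. It only shows that updating all nodes at once from a snapshot can behave differently from updating them one after another.

```python
def sequential_step(labels, neighbor):
    """Update nodes one at a time; later nodes see earlier updates."""
    labels = list(labels)
    for v in range(len(labels)):
        labels[v] = labels[neighbor[v]]
    return labels


def synchronous_step(labels, neighbor):
    """Update all nodes at once from the same snapshot (the parallel case)."""
    return [labels[neighbor[v]] for v in range(len(labels))]


# Hypothetical two-node graph: node 0 is linked to node 1 and vice versa.
neighbor = [1, 0]
start = [0, 1]

print("sequential: ", sequential_step(start, neighbor))   # both nodes agree
print("synchronous:", synchronous_step(start, neighbor))  # labels just swap
```

In this toy case the sequential sweep converges in one pass, while the synchronous update oscillates forever, which is why equivalence has to be checked per algorithm.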

daniel-levin · on Nov 19, 2013

OK, but I think you mean B was close to zero then?
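For reference, a small sketch of Amdahl's law, assuming the convention in which B is the fraction of the runtime that is strictly serial (so B near zero means almost fully parallelisable). The worker counts are arbitrary examples.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Speedup predicted by Amdahl's law.

    serial_fraction is B, the share of the runtime that cannot be parallelised.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


for b in (0.0, 0.01, 0.1, 1.0):
    print(f"B = {b:<4}  speedup on 1000 workers = {amdahl_speedup(b, 1000):.1f}")
```

With B = 0 the speedup equals the worker count (the C' = C_0 / N case); with even 1% serial work, 1000 workers give a little over 90x.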

Hydraulix989 · on Nov 19, 2013

Fair enough.

sanskritabelt · on Nov 20, 2013

I don't want to run an O(n^3) algorithm on even a medium-sized graph if I can help it. (And in the new order of 'big data', even O(n log n) algorithms can be too expensive to run.)

fiatmoney · on Nov 19, 2013

Hadoop & MapReduce are usually an iffy choice for compute-bound workloads and definitely a bad choice for memory-bandwidth bound problems.

Solving problems caused by misuse of familiar tools (SQL, Hadoop, Excel, etc) seems to be a good meta-strategy for finding good business niches.

tonydiv · on Nov 19, 2013

We've proven that we can also speed up IO-bound and disk-bound workloads.

Memory-bound jobs are not an issue either. In fact, 12GB Nvidia and AMD GPUs were announced this week.

fiatmoney · on Nov 19, 2013

I don't doubt that you have, although I would be curious about the mechanism. Spindles are spindles, whether they're feeding to a GPU or a CPU.

I would be careful about phrasing it as "memory bound is not an issue". There's a distinction between memory bandwidth (which GPUs excel at as long as the data is resident), memory latency (which GPUs are pretty bad at), and memory capacity. 12gb is slim in the grand scheme of things; my laptop has 16gb and you can get machines with >1TB of memory. Outside of the GPU's memory limits, you're relying on streaming data to/from the device, and potentially bound on PCIe bandwidth, or partitioning the problem across multiple GPUs/machines.

tonydiv · on Nov 19, 2013

Correct.

PCIe is the least of our concerns at 16GB/s. Some of our tests indicate that our GPU analyzes data at a rate of ~5GB/s. If we can feed it the data in that amount of time, we're in business. We're also parallelizing disk reads, which is one bottleneck at this point.
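A back-of-the-envelope sketch of that reasoning: in a streaming pipeline the sustained rate is set by the slowest stage. The PCIe and GPU figures are the ones quoted above; the disk rate and the dataset size are placeholder assumptions.

```python
def bottleneck(stage_rates_gb_s):
    """Return (stage name, rate) of the slowest stage in a streaming pipeline."""
    name = min(stage_rates_gb_s, key=stage_rates_gb_s.get)
    return name, stage_rates_gb_s[name]


stages = {
    "disk": 1.0,   # ASSUMPTION: aggregate read rate of the disks, in GB/s
    "pcie": 16.0,  # figure quoted in the thread
    "gpu": 5.0,    # approximate figure quoted in the thread
}

name, rate = bottleneck(stages)
dataset_gb = 1000.0  # ASSUMPTION: a 1 TB input
print(f"bottleneck: {name} at {rate} GB/s")
print(f"time to stream {dataset_gb:.0f} GB: {dataset_gb / rate:.0f} s")
```

Raising the assumed disk rate above 5 GB/s moves the bottleneck to the GPU stage, which matches the point that parallel disk reads are what needs work.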

Hydraulix989 · on Nov 19, 2013

Right. These problems are all addressable. We're using a streaming scheme to ensure that the bus bandwidth is fully utilized for each GPU. Moreover, a dispatcher will intelligently partition and distribute jobs across multiple hosts.

hendzen · on Nov 19, 2013

A lot of academic work has been done on using GPUs for general purpose workloads. I would recommend familiarizing yourself with the literature if you have not already. Some work that may be interesting to you is the idea of Persistent Thread Groups [0].

Also, have you considered using Xeon Phi instead of GPUs? Seems like this would be a perfect use case, and it would probably be much easier to work with x86.

[0] - http://cs.olemiss.edu/heroes/persistentThread.pdf

wmobit · on Nov 19, 2013

But the 64-bit pointer support isn't there yet, so it's not super useful. You can't use it all within the same program

nemonemo · on Nov 19, 2013

GPU acceleration sounds all great.. until anyone actually generalizes the framework, implements applications, and measures even the most trivial workload. There should be some niche applications you may get 2-10x benefit, but 1000x is either measurement error, apple-to-banana comparison, or a new marketing gimmick. Unless you have a very fitting application, it is very easy for GPU to have less performance than CPU. I would be very curious what application actually could have 1000x performance enhancement. Even memory bandwidth, the main benefit of GPU, does not differ by that amount.

tonydiv · on Nov 19, 2013

We are not making up these numbers. Even for an IO-bound workload, we are seeing 350x speedup.

The very fitting application is MapReduce, which is embarrassingly parallel.

6ren · on Nov 19, 2013

(not parent) Yes, since highend GPUs have 1,000's of cores, it makes sense (though I wouldn't expect one GPU core to be faster than one CPU core.)

Please, could you quote the CPU and GPU for which you got that 350x speedup? Also, for the 1000x speedup.

(also, it would be interesting to know the CPU/GPU that gave the original 1 hour to 0.2 sec (18,000x speedup) - I'm guessing other factors like network latency, low-end CPU + high-end GPU, optimized code etc were part of it.)

nemonemo · on Nov 19, 2013

This is a very bold statement. IO-bound means that the bottleneck is at IO. GPU has little benefit in IO.

Also, mapreduce is not embarrassingly parallel. Out of mapreduce, map is the only embarrassingly parallel phase. It has input parsing, shuffle, reduction phases that are not embarrassingly parallel.
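To make the phases concrete, here is a minimal single-process word count split into map, shuffle and reduce. It is an illustration only, not Hadoop's API; the function names are made up.

```python
from collections import defaultdict


def map_phase(records):
    """Each record is processed independently: the embarrassingly parallel part."""
    return [(word, 1) for line in records for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key; on a cluster this moves data between workers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Combine the values of each key; parallel across keys, not within one."""
    return {key: sum(values) for key, values in groups.items()}


records = ["a b a", "b c"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

The shuffle is the step where every map output may have to reach a different worker, which is why it is usually the part that limits scaling.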

Hydraulix989 · on Nov 19, 2013

I should clarify -- initially I/O bound, but with pipelining, we are able to "impedance match" the flow of data coming off the backing store with the GPU's computational progress. I suppose the better word is "I/O heavy," rather than "I/O bound."

We're certainly not the first to describe mapreduce as "embarrassingly parallel," but I can defend myself nonetheless:

A shuffle executes in parallel where each unit re-indexes into another unit. Reduction is parallelized over each key, and even finer granularity can be achieved by considering a tree-based approach:

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x...

In fact, I used this tree reduction approach to merge non-disjoint clusters of friend groups in my social network clustering algorithm.

Parallelizing the input parsing is problem-specific, but even that is possible in many cases.
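For readers unfamiliar with the tree-based approach, a minimal CPU-side sketch follows. It assumes the combining operation is associative. On a GPU, each pass of the loop would be one parallel step across all pairs; here the pairs are simply combined in a list comprehension.

```python
import operator


def tree_reduce(values, op):
    """Combine values pairwise, level by level, in about log2(n) passes."""
    level = list(values)
    if not level:
        raise ValueError("tree_reduce needs at least one value")
    while len(level) > 1:
        # All pairs in a level are independent, so they could run in parallel.
        merged = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # an odd element carries over unchanged
        level = merged
    return level[0]


print(tree_reduce(range(1, 9), operator.add))
print(tree_reduce([{1, 2}, {2, 3}, {4}], operator.or_))
```

Eight values are combined in three passes instead of seven sequential steps, and the same function works for set union or any other associative operation.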

x0x0 · on Nov 19, 2013

350x on what possible workload? how do you speed up i/o bound workloads -- are you offloading compression/decompression to the gpu or something?

steven2012 · on Nov 19, 2013

They mention above they went from Python to presumably C or something else closer to the metal. That probably explains a huge part of the speedup.

tonydiv · on Nov 19, 2013

You're correct, regarding the 18,000x speedup. It's worth mentioning that Java isn't very fast either.

mimog · on Nov 20, 2013

>> It's worth mentioning that Java isn't very fast either.

Lies!

alanctgardner2 · on Nov 19, 2013

Why bring Hadoop into it? MR is a paradigm for parallel programming, and having it compiled into OpenCL for you is hugely convenient. But the best Hadoop use cases always leverage massive storage capacity: people who do ML on Hadoop are doing it because the data is already in Hadoop, not because it seemed like a nifty thing to do. Do you actually leverage HDFS and the Hadoop scheduling infrastructure to run ParallelX jobs? Or can you safely jettison them and just sell this as: "Write your ML/graph analysis/whatever code in Java, have it execute on a GPU"

Also, what are the limits like for input data sizes? I've done a little OpenCL, but I've never gone past the GPU RAM size.

tonydiv · on Nov 19, 2013

"Do you actually leverage HDFS and the Hadoop scheduling infrastructure to run ParallelX jobs?" -- Yes.

We could also provide what you're suggesting outside of Hadoop, no problem!

"What are the limits like for input data sizes?" --- There are no limits. You can store your data in AWS, and we will crunch it for you. That said, initially, for IO-bound and disk-bound jobs, ParallelX might not be ideal. This is a problem we are solving as we scale.

Thanks for the feedback! We appreciate it!

espeed · on Nov 20, 2013

Yeah, it's unclear how the Hadoop cluster comes into play here. However, in general, a Spark cluster (http://spark.incubator.apache.org/) will have better IO performance than Hadoop, and Spark code is much simpler than equivalent Hadoop code.

Spark is a new computing framework out of Berkeley's AMPLab (https://amplab.cs.berkeley.edu/software/), and it might be an interesting platform to target.

It's being adopted by Twitter, Yahoo, Amazon (http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-am...), and it's now commercially backed by Databricks (http://databricks.com/), which just received funding from Andreessen Horowitz.

See https://news.ycombinator.com/item?id=6466222

jbooth · on Nov 19, 2013

Are you porting string processing routines to OpenCL? The grid model of GPU computing seems to be a bit of a mismatch with the jagged-edge model of a bunch of variably-sized records in two 1TB files that I want to join on in my typical Hadoop use-case.

I guess I've only delved into GPU stuff for dense matrix math, though, which is a pretty bad fit with Hadoop. Maybe you guys can come up with some other use-cases for them.

Hydraulix989 · on Nov 19, 2013

Yes, we are porting string processing routines to OpenCL. We are going to have many of the Java libraries accessible within the compiled GPU code, including file system I/O.

alanctgardner2 · on Nov 19, 2013

What's a typical input size, if you don't mind me asking?

tonydiv · on Nov 19, 2013

What do you mean 'typical?'

sanskritabelt · on Nov 19, 2013

"My co-founder, Chuck Moyes, discovered a clustering algorithm that identified friend groups in a given social graph. However, the algorithm took an hour to run parallelized across 6 computers. Then Chuck had the clever idea of coding the algorithm on a single GPU. The run time? 0.2 seconds! WHOA."

I do a lot of work with graph clustering algorithms and have done a little with GPUs and I have to say this is, at best, rather unexpected.

groups · on Nov 19, 2013

I genuinely don't understand what "'at best' 'rather' 'unexpected'" means.

tonydiv · on Nov 19, 2013

I'm not sure either! Either way, implementing a clustering algorithm on a GPU was thinking out of the box for us :)

sanskritabelt · on Nov 19, 2013

What it means is I'm hesitant to believe an 18000x speedup.

enos_feedler · on Nov 19, 2013

This was my reaction as well. I think there should be a whitepaper explaining the comparison as many GPU/FPGA application acceleration companies tend to do. I would say a typical GPU speedup would be in the 10-20x range with 100x being possible for highly "regular" data parallelism. I have no doubt the GPU wallclock time is correct so I am guessing the CPU implementation is just very poor.

tonydiv · on Nov 19, 2013

We used parallel Python to implement the first version of the clustering algorithm. Although the CPUs weren't particularly beefy, the implementation was by no means poor. It's worth mentioning that the laptop GPU wasn't powerful either.

sanskritabelt · on Nov 19, 2013

In that case, how much of your 18000x speedup is due to the GPU and how much of it is due to re-implementing the algorithm in a compiled language? Python->C/C++/etc is a 100x-1000x speedup right off the bat.

That's not a fair CPU/GPU comparison at all.
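Spelling out that arithmetic as a rough sketch: the run times are the ones quoted from the article, and the 100x-1000x language factor is the estimate given above, not a measurement.

```python
cpu_seconds = 3600.0  # "an hour", from the article quote
gpu_seconds = 0.2     # from the article quote
total_speedup = cpu_seconds / gpu_seconds
print(f"total reported speedup: {total_speedup:.0f}x")

# ASSUMPTION: the Python -> compiled-language factor is somewhere in this range.
for language_factor in (100.0, 1000.0):
    residual = total_speedup / language_factor
    print(f"language factor {language_factor:.0f}x -> left for the GPU: {residual:.0f}x")
```

Under those assumptions, the share attributable to the hardware would be somewhere between roughly 18x and 180x.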

tonydiv · on Nov 19, 2013

You're right, it isn't a perfectly fair comparison. However, even if we had coded it in C++, the run time difference would still have been significant.

We're just sharing our story at this point. Next, we'll be sharing our whitepaper and data. We'd rather start gathering feedback now, as opposed to after writing 100k LoC for the compiler.

aktau · on Nov 19, 2013

Ah, python... That explains it.

6ren · on Nov 19, 2013

Love the serendipity moral; loved surviving the vicissitudes. The story reads well.

There's a lacuna between Codentical and Parallel X: 1. people willing to pay for it; 2. a spark went off; 3. "validate before you build"

You say you did have paying users, so why did you stop? What was the spark? I'm guessing that there weren't enough paying users - but that's not stated. This makes an inexplicable gap as you race down the homestretch of the story. It's not a huge problem, but since the rest of the story is so great, it's a shame to mar it.

Also, the story would be absolutely compelling if you mentioned the specific CPU, GPU and task that gave the incredible speedups. Your reader wonders, "Why aren't they mentioned?"

tonydiv · on Nov 19, 2013

We had paying users for GraphMuse, not Codentical. GraphMuse was shut down by Facebook, whereas we quit Codentical.

As I stated, we had a good number of people "willing to pay for it," but they weren't paying customers yet.

We will include this information on our website soon! :)

steven2012 · on Nov 19, 2013

Not to be negative, but how soon before these founders quit ParallelX for their next great idea?

I would be extremely hesitant investing the engineering resources to use a product where the founders themselves don't have any conviction in their previous companies. I'd hate to be the victim of another flight of fancy, where the response is "Sorry, but we're pivoting."

tonydiv · on Nov 19, 2013

ParallelX has been an idea in the making for 2.5 years. Your skepticism is valid, and only as we continue developing, iterating, and working with customers will we convince you that we're serious.

This is the first time we'll be working full time, in the same city, if that matters for anything.

Hydraulix989 · on Nov 20, 2013

To be fair, Facebook shutting down GraphMuse left us with no other choice but to abandon the project, as mentioned in the article. Greekdex still has active users, and we'll still push the occasional update on there.

idoh · on Nov 19, 2013

Hey Tony - nice to see that you've bounced back after GraphMuse! I was the PM for Bingo and GraphMuse really did help with the #'s, and Facebook really did shut the whole thing down with no notice.

The lesson with FB is more nuanced though - if you want a startup then dependence is bad, but you can have a very successful lifestyle-type business on them and do well.

tonydiv · on Nov 19, 2013

Hey Idoh! I'm happy you saw our post, and thank you for the kind words.

You're definitely right about the lesson with FB. I'm just bitter about it :)

iamleppert · on Nov 19, 2013

1000x? I won't believe it until you can provide a simple, open source example illustrating your technology, or at least a technical write-up on your testing methodology and associated raw data from the benchmarks.

Or maybe all that time building apps in college you forgot to attend your statistics class?

enos_feedler · on Nov 19, 2013

There should definitely be some kind of whitepaper accompanying this marketing claim.

tonydiv · on Nov 19, 2013

We are working on that paper right now. If you're interested, feel free to provide us your email on the site.

laglad · on Nov 19, 2013

I've known Tony now for 6 years. He's smart and extremely resilient - two great qualities of an entrepreneur.

I know little about GPU compilers though. What are the key factors to make this successful?

tonydiv · on Nov 19, 2013

I can't thank you enough!

There are a number of factors that will make this product successful:

1. The compiler needs to work well. 2. We need to teach average developers that they can build new applications/algorithms that were previously unfeasible. 3. We need to reduce Hadoop costs for large corporations.

Simple enough, right? :) Also worth noting that ParallelX will be a Heroku add-on as well (with a freemium plan!)

jerven · on Nov 19, 2013

Very nice read! How does this compare to the work on Java 8 to bring GPU compute to the JVM by AMD, ARM and Oracle? Do you see this as competition, or do you feel you are in a separate niche?

Hydraulix989 · on Nov 19, 2013

Yes, we are in our own separate niche. Rather than optimizing the computation of individual loops and the like, we're attacking the problem from a "Big Data" perspective where we seek to optimize processing large data sets. We're talking about special filesystems a la FB's Haystack to optimize disk sector layout coupled with RAID schemes to bring data to the GPU as efficiently as possible to overcome I/O bottlenecks. We are initially interop'ing with Hadoop, and we are going to extend this to support Pig and Hive queries (along with database support). I really do think this is an "apples and oranges" comparison, and it's great that they are taking the C++ AMP route of parallelizing loops (perhaps with some parallel primitives mixed in), but at the same time, there's a lot more to the problem we are trying to solve than just parallel compute.

pjmlp · on Nov 19, 2013

Great work!

How would this overlap with the upcoming support for GPU computing target for Java 9, being developed by Oracle and AMD under the HSA foundation?

Assuming it really gets integrated, that is.

Hydraulix989 · on Nov 19, 2013

From what I understand, the GPU computing target is a loop parallelizer a la Microsoft's C++ AMP.

Keyframe · on Nov 19, 2013

"Extraordinary claims require extraordinary evidence".

guiomie · on Nov 19, 2013

"It was originally designed for traffic lights and vending machines, whereas it ended up fueling the computer revolution" ... I thought silicon chips/CPUs were designed for military purposes initially, such as radars?

tonydiv · on Nov 19, 2013

Not from what I've read, but I could be wrong:

"The microprocessors that made the microcomputer possible had originally been developed to run traffic lights and vending machines. They had never been intended to power computers. The first entrepreneurs to attempt this were laughed at; the computers they had created looked hardly worthy of the name--they were so small and could do so little. But they caught on with just enough people for whom they saved time, and slowly, the idea took off."

http://www.fastcompany.com/3002810/aha-moments-made-paul-gra...

catdaddy · on Nov 19, 2013

Great read - I've known Tony for some time, and seen him through a few of his ventures, and this strategy seems like an incredibly savvy way of solving a real problem. Congrats, Tony!

tonydiv · on Nov 19, 2013

Who is this? Either way, thank you very much :D

sitkack · on Nov 20, 2013

One can already run Java->OpenCL workloads on Hadoop using Aparapi (https://code.google.com/p/aparapi/); you don't need to buy any magic sauce. It compiles kernels written in Java down to OpenCL, which can run on SSE or GPU backends.

j_s · on Nov 19, 2013

>> Then we received this email from Facebook’s Platform Policy Team [Notice of Violation] I used the Wayback Machine to see when the clause was added… 3 days prior

Always reserve the right to change your API TOS!

tonydiv · on Nov 19, 2013

While I realize that a TOS is always subject to change, I still think it's a low blow to pull something like that. They intended it to be a one-way conversation as well. We had no way to respond.

Facebook is notorious for doing this too.

scott_usa · on Nov 19, 2013

I love how the email from FB began with "hi," as if it were a note from your dear grammy checking in with you.

The deadline was 5PM PST - how many days (or hours) did you have to comply?

tonydiv · on Nov 19, 2013

We had less than 3 days to comply if I recall. Considering we had multiple customers in a handful of countries, it wasn't much time.

truthteller · on Nov 19, 2013

Where is the paper for this novel clustering algorithm?

tonydiv · on Nov 19, 2013

Here you go - http://www.eecs.berkeley.edu/~isabelle/papers/cci.pdf

bouk · on Nov 19, 2013

Cool story, and good luck with your company!

booop · on Nov 19, 2013

Great read, and ParallelX looks great too.

officialjunk · on Nov 19, 2013

i like this piece. great journey. thanks for sharing

tonydiv · on Nov 19, 2013

You're very welcome. I appreciate the kind words.

crowhack · on Nov 19, 2013

Enjoyed this :)

pbhojwani · on Nov 19, 2013

Great read!