Fast Database Emerges from MIT Class, GPUs and Student’s Invention (data-informed.com)
160 points by signa11 on April 23, 2013 | 43 comments



Hi all, MapD creator here - I'd be happy to answer any questions. The 700K figure was done by rendering polygon files to textures and using them as lookup tables - it ran pretty fast on the CPU as well, so it's not all the GPU there. However, if a point falls on a border - perhaps 1% of cases for, say, a medium-sized raster of the US - you have to do a geometric lookup as usual - which I didn't benchmark. Probably better would be a tiled implementation - basically something like textured quad-trees.
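
In rough terms the lookup-table idea looks something like the CUDA sketch below - made-up names, and a plain array standing in for the texture, not the actual MapD code. Border cells would still fall back to the exact geometric test:

    // Polygons pre-rendered into a regular grid where each cell holds a
    // district ID, so point-in-polygon becomes one array read per point.
    __global__ void lookupDistrict(const float *lon, const float *lat,
                                   const int *idRaster,  // W x H grid of IDs
                                   int *out, int n,
                                   float lonMin, float latMin,
                                   float cellW, float cellH, int W, int H)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int x = (int)((lon[i] - lonMin) / cellW);
        int y = (int)((lat[i] - latMin) / cellH);
        if (x < 0 || x >= W || y < 0 || y >= H) { out[i] = -1; return; }
        out[i] = idRaster[y * W + x];  // border-flagged cells -> exact test
    }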

And as I told the journalist, I wasn't sure if I did the PostGIS indexing right - I feel like the quote got taken a little out of context - but the speedups were "high" to "quite high" regardless.


> By opening the code, Mostak opens up the possibility of having a competitor rapidly develop a system and force him to the edge of the market. That possibility doesn’t bother him, he said. “If worse comes to worst, and somebody steals the idea, or nobody likes it, then I have a million other things I want to do too, in my head,” Mostak said. “I don’t think you can be scared. Life is too short.”

Can I just compliment you on your attitude?


Heh ... when I saw the speedup numbers, the first thing I wondered was ... index? Regardless, this is impressive work. Do you have a tech report you can point to for someone to get more details? I'd love to learn more about your setup. Also, would you mind describing your programming background? I have a CS PhD and I admit I never had the guts to get into GPUs! You've given me a whole lot of motivation!


Thanks! Yeah, I'd like to revisit this benchmark soon (I'm actually working on rendering choropleths on the GPU now). I'm working on a database paper that will hopefully be ready for VLDB submission by next month - that would probably be the best place for details.

In terms of CS background - nothing formal except for the AP comp sci I took in high school like 12 years ago. However, I've always loved hacking on things on the side, and a combination of my research on the Middle East and a burning desire to make an iPhone app kind of got me back into it. Then, to be fair, I took a GPU graphics class (basically OpenGL) at Harvard and then a database course at MIT while doing my master's in ME Studies, which is where this project started.


Can you tell me whether the main reason for the speedup is the GPU itself, or do other factors also play in (such as new algorithms)?


This seems really interesting as a concept, and much more so as a working product!

Is MapD written to any specific GPU language, like CUDA or OpenCL? I'm in the market for a new GPU and this might be a factor in purchasing... ;)

Also, I can sympathize with the mismatch of creating vs. running a business.


Thanks for the clarification! I was highly suspicious of the 700k figure, it makes a lot more sense with the context of precomputing and then looking up.


The same story was submitted to /., and the person who wrote the database answered several questions about it there. See http://search.slashdot.org/story/13/04/22/2225240/harvardmit... for more.


(MapD creator again) - Also, check out http://worldmap.harvard.edu/tweetmap for a live demo - even though it's been a bit buggy under Slashdot load and I had to disable GetFeatureInfo requests (i.e., click to get a tweet). Also see here for a demo of a new interface I'm messing with that does animation - http://www.youtube.com/watch?v=Foo8pYbSPv4 .


And roughly half an hour ago, I was hoping for something that could help me map GIS data and animate the changes over time so that I could see before-and-after effects.

I look forward to seeing this released.


Is your data relational? I can import anything from Postgres, MySQL, Solr (but has to be flattened), or shapefiles. If you tell me more perhaps I could put it into the system for you and we could try it out.


It's company data, so I use that greatest of all databases: Excel.

I'll have to think/work on this a bit - my test sets are clean but not massive enough yet for animation, while my data source is messy messy data.

Do note I'm not in the States - I'm new to this, but my experience so far has been that getting usable geolocation data for the region I am looking at is not easy.

(Incidentally, having some contact info on your profile page would make it a little easier for anyone who wanted to get in touch with you)


Hi Todd! Do you have an email? (mine is sergio.correia@duke.edu )

I'm working on a project linking bank credit with the geographic distance between individuals/firms and bank branches. It's a fairly big dataset (~4MM geocoded individuals with matched credit data) and MapD would seem like a cool tool to visualize how the data changes with time.

Cheers


Hi, yes, I'm tmostak at csail dot mit dot edu (does that fool the spambots these days?). Yeah, email me and we can chat!


Hrrm. Since it's light on details, how about zooming out for some random conjecture about industry trends?

Notion: The database layer is definitely evolving faster now than it did in the previous decade. It follows that, for longer-term systems, a schema-naive architecture becomes far more attractive. How exactly should we achieve that?

For all the 2005-enterprise-2.0-ishness of it, message orientation (i.e., on-the-wire protocol first) seems a fair play. This harks back slightly toward the traditional (waterfall) notion of a nominal pre-coding design time - an extroverted design orientation that focuses on understanding inputs and outputs before coding. Contrast the modern fire-and-forget webdev norm: model some data in some language-specific framework that can abstract databases as long as they fit a prior model, and iterate.

Some of this is just common sense but perhaps we are about to see a shift in architectural paradigm. Of course, the modelling/code generation people have been harping on about this for years!


He is not the first person to try to implement a GPU-based database, and this is not a trivial undertaking by any means.

http://gpgpu.org/tag/databases

Does anybody know of a source with more details on this work?



Curious to learn more as the specifics come out.

I am familiar with PG-Strom, which was using the Postgres FDW interface with CUDA; it had some pretty severe limitations last I looked. I also remember someone doing a sort implementation on CUDA on the mailing list, which was pretty slick. Tim Child also did some presentations (I never saw code personally; would love a link if anyone has one) on using OpenCL as a procedural option, which I think would still be really cool, and potentially easier to deal with as an extension where applicable.

I am going to be very interested to see what is possible with Postgres 9.3's writable FDW and potentially something like this. It may be that PostGIS will find its way to similar performance given access to the same tech.

35k LOC is compact - cannot wait to see the code!


Slides from the Conference where he won the prize: https://cga-download.hmdc.harvard.edu/publish_web/Annual_Spr...

Didn't see a link to a published thesis.


The claim of a 700,000x speedup makes me suspicious of pretty much everything else about the work.


The question is what kind of processing he did. Usually database systems are heavily constrained by the I/O bandwidth of the system - even the cheapest consumer CPU can process data orders of magnitude faster than disks can deliver it.

That is why database servers usually have the fastest disks money can buy, as much memory as you can fit in order to keep as much data as possible in main memory, and the largest possible caches to avoid hitting the relatively slow memory bus too often. The situation becomes much worse if the database is not read-only, because all changes have to be persisted and you have to hit the disk on every change.

Therefore I think using GPUs will buy you nothing for common database applications - processor speed was never your problem. But this does not exclude the possibility that there are operations, for example correlating millions of data points from your data set, where the processing power of GPUs comes in handy - it is just most likely not what you will see in one of your line-of-business applications.

The number 700,000 is probably the result of a highly tuned GPU implementation versus a general-purpose CPU implementation. A quad-core Core i7 at 3.0 GHz has a theoretical peak performance of 96 GFLOPS [1]; the currently fastest supercomputer, Titan, peaks at 27,112,500 GFLOPS (560,640 cores, 8.2 MW [2]). 700,000 times the mentioned Core i7 is 67,200,000 GFLOPS - still more than twice the performance of Titan.

[1] http://stackoverflow.com/questions/15655835/flops-per-cycle-... [2] http://www.top500.org/lists/2012/11/


It appears from the graphs (yeah, I know) that he's probably doing trig (to locate the tweet) and collision detection (to determine the area of the city). He could also have more control over single-precision algorithm choice than he would have via his original solution. It could be something as simple as pre-calculating the geolocation with a parallel CUDA kernel during ETL into PostGIS and bypassing its index generation.
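
To make that last guess concrete, such an ETL kernel might look like the sketch below - purely illustrative, not necessarily what he did, and the choice of Web Mercator as the projection is my own assumption:

    // Project lon/lat to Web Mercator once, in parallel, during load,
    // so the per-query trig disappears.
    __global__ void toMercator(const float *lon, const float *lat,
                               float *x, float *y, int n)
    {
        const float PI = 3.14159265f;
        const float R  = 6378137.0f;  // WGS84 radius in meters
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        x[i] = R * lon[i] * PI / 180.0f;
        y[i] = R * logf(tanf(PI / 4.0f + lat[i] * PI / 360.0f));
    }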

I also have an itchy feeling that there may have been some confusion regarding conversions between "times" and "percent", possibly multiple times. [1]

Extrapolating that an iterative solution would take 40 days, without running it for 40 days, is tricky as well. Given multiple tweets in close proximity, PostGIS should benefit from caching the R-tree index.

I'm not assuming the inventor is doing these things, since his DB prof was impressed, as well as CSAIL and the body awarding his prize. [2] I think, if anything, the reporting is intended to be high-level enough to have broad appeal, which makes it more difficult for us to evaluate the results in a thought experiment. "Reduce slowness here..."

[1] The sidebar says,'Some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce,' while the quote in the article says,'“The speed ups over PostGIS … I’m not an expert, and I’m sure I could have been more efficient in setting up the system in the first place, but it was 700,000 times faster,” Mostak said. “Something that would take 40 days was done in less than a second.”'

[2] Hey, logical fallacies do not necessarily mean the argument is wrong!


This is what I was wondering as well. I work on a big data BI solution, and in my experience I/O, not processing time, is the final bottleneck. I would expect working on the GPU to be slower, not faster, than working on the CPU, because of the extra time needed to copy the data/results into/out of VRAM. Also, 40M tweets is what I would consider the very bottom of the big data scale. I would guess the use case here is one where the calculation is very complex, much more so than the data itself.


Here's a technical talk: http://www.youtube.com/watch?v=WSvh5ZPrR4w - I couldn't find a paper, but fortunately the video is only 10 minutes.

He's got an in-memory column store, so (AFAIK) those are immediate advantages over PostGIS. He's also using pre-rendering of shapes into bitmaps (a form of caching/indexing) along with GPU hardware to accelerate intersection queries. It looks like this is a case where a problem (2D geo data) maps quite naturally to GPU hardware (2D textures) if only you can discover the mapping.


Some really nice ideas in there - I especially like the tweet-to-composite-number-via-word-to-prime-mapping-thing.


Not me. PostGIS has done an amazing job given its constraints, but it is pretty obvious that there are better ways to do certain operations. It's not that I believe him... it's just that I think it's plausible enough for the claim not to be unbelievable.


So his comment is more along the lines of, "I was using a screwdriver to pound in this nail and it was taking forever, but then I started using a hammer and man...it went fast!"


Yes...that is how I see it.


Hi - MapD creator here - I love Postgres/PostGIS and use them all the time - there are just some things that are going to execute faster in parallel with high memory bandwidth. And then there is the algorithm - see my post above for an explanation.


It says 70?


And, later: "“The speed ups over PostGIS … I’m not an expert, and I’m sure I could have been more efficient in setting up the system in the first place, but it was 700,000 times faster,” Mostak said. “Something that would take 40 days was done in less than a second.”"




Very cool project! I'm glad to see this area is getting some attention. I did research in GPU databases a couple summers ago at NEC, if anyone is interested the project is at https://github.com/bakks/virginian .

Some background mixed with some opinion:

This is an area still without a major commercial competitor, and there aren't yet any public product efforts that I know of (IBM has a couple of researchers publishing on the topic). The difficulty is that really exploiting the GPU requires rethinking the architecture from the ground up, including how your query plan handles parallelism, how your data is stored, and, most significantly, how to efficiently move data to the GPU for processing. Most databases aren't partitioned nearly well enough to shoehorn GPU processing in on the backend, and the vast majority of recent database development has been around highly distributed systems and commodity hardware, e.g., Hadoop. A German company called ParStream has a DB that uses GPUs for certain indexing operations, though last time I spoke to them it sounded like there is much, much more that can be done with GPU hardware.

In general, the speedup is highly variable and depends on the data set, indexing characteristics, data type (fixed vs. variable length), the individual query, etc., but it's fairly common to see 10x speedups over optimized multicore query execution. Though the extreme parallelism of the GPU is responsible for some of this, in my opinion the big win is just absolute memory throughput. Newer Tesla GPUs have a memory throughput of > 100 GB/s without breaking a sweat, while most CPUs are around 30 GB/s. If you can push data through your hardware that quickly, it's easy to see how the GPU can have such an advantage.
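
To make the throughput point concrete with a rough, illustrative calculation: a full scan of a 4-byte column over one billion rows touches 4 GB of data. At ~100 GB/s that's on the order of 40 ms; at ~30 GB/s it's more like 130 ms - a 3-4x gap from memory bandwidth alone, before any compute advantage enters the picture.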

The difficulty, however, is actually moving the data into GPU memory (generally 1-8 GB) to exploit this throughput, which is a major bottleneck (some of my research was around doing this better). I understand that GPU direct disk access is just becoming a reality, which could make this easier. The holy grail is a full-powered integrated server CPU/GPU: if both components had equivalent main-memory access and the architecture became cheap and commoditized, GPU databases would immediately become extremely relevant.
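
For what it's worth, the standard mitigation for that PCIe bottleneck is pinned host memory plus streams, so chunk k+1 uploads while chunk k is being processed. A minimal sketch - illustrative names, and processChunk is a stand-in for real query work:

    #include <cuda_runtime.h>

    #define CHUNK   (1 << 20)   // elements per chunk
    #define NCHUNKS 64

    __global__ void processChunk(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;  // stand-in for real query work
    }

    int main() {
        float *hostBuf, *devBuf[2];
        cudaHostAlloc((void **)&hostBuf,
                      (size_t)CHUNK * NCHUNKS * sizeof(float),
                      cudaHostAllocDefault);          // pinned => fast DMA
        cudaMalloc((void **)&devBuf[0], CHUNK * sizeof(float));
        cudaMalloc((void **)&devBuf[1], CHUNK * sizeof(float));

        cudaStream_t stream[2];
        cudaStreamCreate(&stream[0]);
        cudaStreamCreate(&stream[1]);

        for (int k = 0; k < NCHUNKS; ++k) {
            int s = k % 2;  // double-buffer: copy one chunk while computing the other
            cudaMemcpyAsync(devBuf[s], hostBuf + (size_t)k * CHUNK,
                            CHUNK * sizeof(float), cudaMemcpyHostToDevice,
                            stream[s]);
            processChunk<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(devBuf[s], CHUNK);
        }
        cudaDeviceSynchronize();
        return 0;
    }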

I expect that there will be a serious commercial competitor in this area within 3 years, either in the relational data warehousing space (think Vertica) or closely integrated into MapReduce-type execution. My guess as to why it doesn't exist yet is that 1) production DB development is really hard and 2) GPU processing is only a really big win for certain workloads, namely heavy analytical/research workloads over fixed-size data. As I said, if GPUs become commoditized in server hardware or fully integrated with CPUs, I expect this domain will quickly become extremely important.

Always happy to chat about this stuff, feel free to shoot me a message.


Actually I've enjoyed reading your papers and would love to talk... will email you!


Why are CPUs built like CPUs and not like GPUs? I mean, this running-things-on-the-GPU business clearly means that the CPU does not meet programmers' needs adequately.

Shoehorning things to run on the GPU seems to me a dirty hack, indicating that there is an underlying problem that could and should be addressed in a better way.


Because most consumer software is not as insanely parallel as doing millions (billions?) of isolated lookups. If you ran most applications on the GPU, you'd see something like a 1000x slowdown because you're only able to use a single path (instead of hundreds / thousands at a time), and each path is way way slower than your average CPU.


>Because most consumer software is not as insanely parallel as doing millions (billions?) of isolated lookups.

I think this is more due to tradition than necessity. But I have no hard data or sources for this.


I doubt this is so. Lots of consumer software ends up being I/O bound, not CPU bound - even if it were completely rewritten to be crazy-parallel, you wouldn't notice. Good portions of games could probably be parallelized significantly more than they are, but games are quite a bit different from most software, so that's not very indicative.

Then there's stuff like parsing data. Lots of times data is context-sensitive - that "<item>" could be a tag, or it could be wrapped in <![CDATA[ <item> ]]>. Without reading through it in order, you can only make certain assumptions. Some of this can be redone in a more parallel-friendly way; some of it can't - if it's encrypted or compressed, you're probably forced to pre-process it sequentially, or lose some of the quality of the security/compression.

For those (and think of how much crypto and compression there is in consumer stuff! Every SSL web endpoint, secure wifi connection, zipped file...), a powerful-core CPU or a custom piece of hardware (that only does a few algorithms) is your only option.


>I doubt this is so. Lots of consumer software ends up being I/O bound, not CPU bound - even if they were completely rewritten to be crazy-parallel, you wouldn't notice.

So for these programs it wouldn't matter either way.

>Then there's stuff like parsing data

How many states can you be in, exactly? You can be in a CDATA section, in a quoted attribute, in a tag, or between tags. Most other states are very local, such as a started entity ( &l ) or a started tag name ( <it ). You could process a text snippet for all of the more global possibilities at the same time (or just chance it and throw your guess out if you are wrong).

I suppose you are right about compression and crypto.


Alenka is another open-source GPU database engine that looks quite interesting:

https://github.com/antonmks/Alenka


Fantastic project, and this absolutely sounds like an embarrassingly parallel processing task. To the original itch, so to speak:

"Then he plotted the Islamist indicators from 40 million tweets, ranging from August 2011 through March 2012, against 5,000 political districts from the Egyptian census."

So the core problem is that you have 5,000 non-overlapping polygons and you want to determine which one a given (x, y) falls within? I have to confess I'm surprised that just 40 million such points would take several days.


Let's assume a naive approach. Also, assume each point falls within a polygon.

For every point to test, you need, on average, to test half the polygons. Testing a point against a polygon means testing the point against, on average, about half the polygon's edges. Each of those tests takes a multiplication, an addition, and a comparison.

That's 4E7 points times 2500 polygons times, at the very least, 1.5 edges (half the edges of a minimal, triangular polygon). Total: 1.5E11 multiplications, additions, and comparisons.

With 100-sided polygons (50 edge tests per polygon on average), it would be around 5E12 multiplication/addition/comparison steps - call it 1E13 to stay conservative.

Memory pressure is dominated by polygon storage. 5,000 polygons at 100 points each is 500,000 vertices at two doubles each, or 8 MB. That fits comfortably in a large L3 cache.

So, 1E9 multiplication/addition/comparison steps per second should be possible. That's 10,000 seconds, or about 3 hours.

=> Should be doable on one CPU in hours (famous last words)
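
For reference, the per-edge step being counted above is the standard ray-casting ("crossing number") test - a sketch in plain C, not tuned code:

    // Classic ray-casting point-in-polygon (after W. R. Franklin's PNPOLY).
    // Each edge costs roughly one multiply, one divide/add, and a couple of
    // comparisons - in line with the per-edge estimate above.
    int pointInPoly(float px, float py, const float *vx, const float *vy, int n)
    {
        int inside = 0;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            if (((vy[i] > py) != (vy[j] > py)) &&
                (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]))
                inside = !inside;
        }
        return inside;
    }

The naive scan is then just a triple loop over points, polygons, and edges - exactly the 4E7 x 2500 x (edges/2) count estimated above.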


It really is a fascinating-sounding problem. If I'm understanding it right, I'd really like to see the source data (regions and points) to explore alternative approaches.



