Using GPUs to Speed Through the 1.2B Record Taxi Dataset (mapd.com)
203 points by jtsymonds on Oct 15, 2016 | 47 comments



One of the big data points missing from this article is price. Unless you need features specific to the high-end cards, such as unlocked 64-bit or 16-bit performance or anti-aliased lines, the consumer cards have much higher performance per dollar [1]. It would be really interesting if they compared their 8 K80s (8 × ~$4K for 8 TFLOPS each) against a set of GTX 1080s (~$650 for 8 TFLOPS each).

[1] https://www.youtube.com/watch?v=LC_sx6A5Wko & http://www.videocardbenchmark.net/gpu.php?gpu=Tesla+C2050
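For a rough sense of the gap, the back-of-the-envelope math on the numbers above (2016 list prices and peak FP32 figures, which real workloads won't hit):

    // $/TFLOPS using the figures quoted above -- rough 2016 list prices and
    // peak single-precision throughput, so treat as order-of-magnitude only.
    interface Card { name: string; price: number; tflops: number }

    const cards: Card[] = [
      { name: "Tesla K80", price: 4000, tflops: 8 }, // dual-GPU board
      { name: "GTX 1080",  price: 650,  tflops: 8 },
    ];

    for (const c of cards) {
      console.log(`${c.name}: $${(c.price / c.tflops).toFixed(0)}/TFLOPS`);
    }
    // Tesla K80: $500/TFLOPS
    // GTX 1080: $81/TFLOPS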


Hi SXP, thanks for your comment. You might want to check out Mark Litwintschik's posts (he's an independent blogger who has benchmarked this dataset across many different databases) for performance on GeForce GTX TITAN Xs. 4 x GeForce GTX TITAN X: http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tit.... 8 x K80s: http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tes.... He has additional posts on MapD running on Pascal Titan Xs and on AWS as well. In full disclosure, I work at MapD...


Nice demo, but it would be even nicer if you could get the blooper fixed: "Mapbox's Openstreetmap"

MapBox is an active and respected participant in the OpenStreetMap project and uses our data in some of its products, but that is it.


Blooper Fixed :)


Thanks for the info. The tl;dr is "It's fantastic to see that I've been able to use a machine that costs 1/10th of the one used in the 8 x Tesla K80s benchmark but still have queries running within 33% of the previous performances witnessed."

However, I'm suspicious of the numbers in those articles, since the author lists only 4 data points per trial and doesn't mention the standard deviation of his measurements. One of his measurements was 0.964 vs. 0.891, so it looks like the Titan Xs were ~92% as fast as the K80s, if the numbers can be trusted.


Both of those links got shortened for some reason, so they're invalid.


Yep, sorry about that:

Here is the Titan X link http://bit.ly/2e6C3Gg

Here is the K80 link: http://bit.ly/2eiIwvp


It's the memory size that's likely important here: 8 GTX 1080s have a third of the total global memory of 8 K80s. And aggregate memory bandwidth matters far more than FLOPS, since this workload is almost certainly not bound by arithmetic throughput.
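To put numbers on that (published board specs: a GTX 1080 is 8 GB at ~320 GB/s; a K80 is 24 GB at ~480 GB/s across its two dies), here's the scan-time arithmetic:

    // A memory-bound full scan is limited by aggregate bandwidth, not FLOPS:
    // time to stream every resident byte once = total GB / total GB/s.
    const gtx1080 = { memGB: 8,  bwGBs: 320 };
    const k80     = { memGB: 24, bwGBs: 480 }; // 2 x 240 GB/s per board

    function fullScanMs(n: number, c: { memGB: number; bwGBs: number }) {
      return ((n * c.memGB) / (n * c.bwGBs)) * 1000;
    }

    console.log(`8x GTX 1080: ${8 * gtx1080.memGB} GB, ~${fullScanMs(8, gtx1080)} ms/scan`);
    console.log(`8x K80: ${8 * k80.memGB} GB, ~${fullScanMs(8, k80)} ms/scan`);
    // 64 GB vs 192 GB resident: per-scan times are the same order
    // (~25 vs ~50 ms), but the K80 box keeps 3x as much data on-GPU.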


One more thing to note is that gamer cards aren't meant to be abused like workstation cards are. You can leave a K80 running at full blast for a week doing stuff, but the same workload will significantly reduce the lifespan of a high-end gamer video card. They're meant for gaming sessions that last a few hours with some outliers going for maybe a full day, but not much more than that.

If you really have a tight budget and need to use gamer cards as workstation cards (e.g. a two-person startup that needs to crunch things on 4 GPUs), find yourself some aftermarket coolers, preferably liquid cooling.


Almost irrelevant, because MapD software charges dwarf the price of the hardware.

https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606...


There's probably a good reason they are using server hardware. But sure, just like you could slap consumer CPUs into a server for a cheaper unit cost, you could use consumer GPUs.


For Graphistry's GPU platform, we suggest that our users go with server-grade GPUs because they get (a) more memory and (b) great multitenancy. So using MapD as a personal system is an expensive use of resources, but when a system is architected and billed as an elastic, multitenant system, the total cost of ownership for a team is lower. Not all platforms are built for this (and I don't know enough about MapD vs. other GPU databases), but that's the engineering view.

And mini-disclaimer: Graphistry is a related platform focused on scaling & automating investigations. Part of that is a GPU compute stack that we started building around the same time as MapD, though we're not in the database business. E.g., our customers will generally use us to look across multiple other systems that already feature high availability, long-term storage, and scale-out querying for TB+ storage. Some examples: SQL engines, Spark, Splunk, Datastax, and various graph databases.


Why are anti-aliased lines specifically a high-end feature? I thought that anti-aliasing was done by over-sampling and then down-sampling, so all drawing primitives would work with it uniformly.


The question, though, is what are you oversampling? Depending on how the line gets rasterized, supersampling (or multisampling) may or may not help you at all.


High quality anti-aliased lines aren't a bottleneck in video games, but they're very important in CAD applications, so enterprise customers paid a large premium for Quadro cards with CAD-specific functionality. The specific list of premium vs consumer features has varied over time, but previous generations of consumer cards could have their professional features unlocked via modded drivers.


That's one way of doing anti-aliasing (supersampling, i.e. SSAA; MSAA is a cheaper variant that only oversamples coverage); there are many others.
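For concreteness, the oversample-then-downsample idea (SSAA) in miniature: render everything at 2x resolution, then box-filter down. It works uniformly across primitives, which is why the grandparent expected it to just work, but it's also the most expensive approach:

    // Toy SSAA resolve: `hi` is a grayscale framebuffer rendered at
    // (2w x 2h); average each 2x2 block down to one output pixel.
    function downsample2x(hi: Float32Array, w: number, h: number): Float32Array {
      const lo = new Float32Array(w * h);
      const W = 2 * w; // row stride of the high-res buffer
      for (let y = 0; y < h; y++) {
        for (let x = 0; x < w; x++) {
          const x2 = 2 * x, y2 = 2 * y;
          lo[y * w + x] = (hi[y2 * W + x2] + hi[y2 * W + x2 + 1] +
                           hi[(y2 + 1) * W + x2] + hi[(y2 + 1) * W + x2 + 1]) / 4;
        }
      }
      return lo;
    }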


> we use the GPU to render the image, compress it to a .png (about 100KB) and send it to the browser as a tile. This allows for lightning fast rendering and the perception by the user that all of this data is actually in their browser.

With the enormous caveat that you need low latency to their server to get this illusion of client-side rendering. Considering they only have this one cluster of K80s, geographically close to them, rather than clusters spread out globally, this isn't a usable approach in much of the world.

Now, I don't expect them to roll out K80 clusters world-wide just for the sake of a demo, but it's still pretty important.
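For anyone curious what that pattern looks like server-side, here's a minimal sketch as a Node/Express tile endpoint. `renderTileOnGpu` is a hypothetical stand-in, since MapD's actual renderer isn't public:

    import express from "express";

    // Hypothetical stand-in for the GPU renderer: rasterize the points that
    // fall in the given web-mercator tile and return an encoded PNG buffer.
    function renderTileOnGpu(z: number, x: number, y: number): Buffer {
      return Buffer.alloc(0); // real code would rasterize + encode on the GPU
    }

    const app = express();

    // Serve raster tiles exactly like a map tile server; the browser just
    // drops <img> tiles onto the map, so the 1.2B rows never leave the server.
    app.get("/tiles/:z/:x/:y.png", (req, res) => {
      const { z, x, y } = req.params;
      res.type("png").send(renderTileOnGpu(Number(z), Number(x), Number(y)));
    });

    app.listen(8000);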


I'm in Eastern Europe and it loads in about half a second. Much faster than I'd expect a browser to process queries on a 1.2B-row dataset, and without taking up untold gigabytes of memory.


GPUs sound cool, but here's what I did with the same dataset, once using flat files and once using Cassandra:

https://www.michaelfogleman.com/static/yellow/

https://www.michaelfogleman.com/static/density/


This is 77M rows, not the full 1.2B-row dataset shown in the MapD demo (with 60 variables). It also looks like the map is pre-rendered, as opposed to being dynamically rendered with filters applied.

Pretty cool but a different animal.


The link to the demo is broken in the blog post. It's actually: https://www.mapd.com/demos/taxis/


They left their source map open. Interesting tech choices:

- React / Redux / mapbox-gl

I always look at the data table implementation to see how far people go. And here they made their own implementation based on d3.

Here's the sources for those curious: https://github.com/d8d/mapd-sources/tree/master/out


d3/dc.js/crossfilter, to be precise. Having worked on something similar, I found dc.js to be redundant if you already use redux; it's much cleaner to use a lighter-weight charting library with crossfilter.
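Roughly what that looks like with crossfilter2 and a hypothetical ride record (the redux wiring and the chart library are elided; any chart library can consume `group.all()`):

    import crossfilter from "crossfilter2";

    interface Ride { hour: number; fare: number }

    // Hypothetical sample rows standing in for the taxi data.
    const rides: Ride[] = [
      { hour: 8, fare: 12.5 },
      { hour: 8, fare: 7.0 },
      { hour: 17, fare: 22.0 },
    ];

    const cf = crossfilter(rides);
    const byFare = cf.dimension(r => r.fare);
    const ridesPerHour = cf.dimension(r => r.hour).group().reduceCount();

    // A redux reducer would own the filter state; on each change, re-apply
    // it and hand the fresh group data to whatever chart library you like.
    byFare.filterRange([5, 20]); // keep fares in [5, 20)
    console.log(ridesPerHour.all()); // [{ key: 8, value: 2 }, { key: 17, value: 0 }]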


Thanks for the tip. I found this table/d3 implementation, but it looks like they use it only for the grouping table: https://github.com/d8d/mapd-sources/blob/master/out/home/jen...


Is this the data set? http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtm...

Why wouldn't they be hosting this in compressed form? A quick shot through pigz got it down to < 50% of the original size.


Twrrim,

Slightly different. We have appended all of the data from Factual as well. This includes the location of every business in NYC.


Why don't you publish the real data then?


Because it's their business?


What are the 'commuter confidential' tricks around bridges? I know some bridges are tolled...


Look at the coloring around the rides near bridges. People take the subway down to the closest point and then take a cab home. The hybrid trip is both pocketbook friendly and probably faster.


Interesting data point for the scale-up vs scale-out debate.

https://www.mapd.com/assets/static/images/barchart.png


That's an interesting slide, but without knowledge of the size of the dataset it could be misleading (especially considering communication costs between nodes in a cluster).


Hi infinite8s, to get additional information on how that chart was made, you can go to https://www.mapd.com/product/, scroll down to the bar chart, and click “See Details” under the chart. It shows the machines used, the queries, and the source data set and its size. Note that the machine configurations used to generate the chart were normalized for equivalent cost on AWS, i.e. the chart is hardware-dollar normalized.


Source: https://www.mapd.com/product/

It tops out at 192 GB (8 x 24 GB). A couple more generations of growth in slot counts and per-GPU memory should put it in the 1 TB range.


This is fucking awesome.


This blog post has since been deleted. :(


Failed to Load Dashboard TypeError: this.painter is undefined

HN effect?


Very impressive technology, but is there an open source version? Even a limited one? That one can try on something more modest than 100 grand's worth of pro GPUs?


There is not an open source version as yet, but you can spin up these instances on an hourly basis on AWS https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606... and on IBM Softlayer.


thanks, but at 5 bucks an hour for an entry-level instance (a single 12 GB GPU) I'm looking at 120 bucks a day if I don't want to constantly re-upload my dataset into MapD (a very slow operation, judging by Mark Litwintschik's posts linked by you). That's a very, very high price for such a modest hardware configuration, not to mention the more credible configuration, which goes for an eye-watering 30 bucks an hour, i.e. not much change from a grand a day. Not for us startup folk, clearly.

I have to say that, for such a new entrant that hasn't yet built market share, your pricing seems bound to attract very stiff competition from newcomers. "Interesting" business model.


You could try BlazingDB if what you're interested in is the GPU-powered SQL component. There's a free community edition available here: https://docs.blazingdb.com/docs/quickstart-guide-to-blazingd...

You can install this on AWS or on your own infrastructure (I run this on my laptop for example).


MapD has a persistent store and normally customers would keep that on an EBS volume, so they don't have to reload their data every time they spin up an AWS instance.


very interesting. Do you have a link to documentation? I'd like to take a look before I try it out.


fair enough, but software costing 3-4x the (already high) hourly hardware cost seems excessive.


If you find this too pricey, then build a cheaper or free competitor.


How long before we stop bollocking about and just call them all PUs?


GPU: Generic Processing Unit



