CUDA grep (bkase.github.io)
182 points by ColinWright on June 1, 2013 | 31 comments



Interesting work! But I'm unclear on the end-to-end performance comparison, which includes the cost of copying data to and from the device.

I know you guys are eyeballs deep in your data, so what's obvious to you may not be obvious to a reader. From what I can tell, your performance numbers are reported as an average per regex. But you're calculating that without including the string copying cost; you're only timing the cost of performing the computation on the GPU.

You show the overhead in absolute numbers. It seems small compared to the cost of processing the whole file, but I need to know the cost for grep to process that whole file as well. Basically, you show me a number, but you don't explain the significance of this number.

In my experience, the best way to deal with such issues is to always report entire-application performance. (This can be in addition to other numbers, of course.) That way I can say "Ah, calling grep took x seconds on average, calling their program took y seconds on average, and y < x, so they've improved performance." If you report entire-application performance, your reader doesn't need to understand the particulars to at least determine if your technique is a performance win.
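
For concreteness, here's roughly what I mean, as a sketch in CUDA host code (the kernel name, buffer sizes, and function name are placeholders, not from their project): start the clock before the host-to-device copy and stop it after the device-to-host copy, so the number you report is what the user actually waits for.

    #include <cuda_runtime.h>

    // Sketch: time the whole round trip -- allocation, both copies, and the
    // kernel -- not just the on-GPU matching.
    float end_to_end_ms(const char *h_buf, size_t n, char *h_hits, size_t n_lines) {
        char *d_buf, *d_hits;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMalloc(&d_buf, n);
        cudaMalloc(&d_hits, n_lines);
        cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);
        // match_kernel<<<blocks, threads>>>(d_buf, n, d_hits);  // placeholder
        cudaMemcpy(h_hits, d_hits, n_lines, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        cudaFree(d_buf);
        cudaFree(d_hits);
        return ms;  // directly comparable to `time grep ...` on the same file
    }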


Thanks for the advice. As soon as we get this project to run successfully on CUDA 5, we'll update the final report document with the entire-application performance numbers (hopefully before the end of tomorrow).


Don't mean to hijack this thread or anything, but as someone who has spent many years researching pattern matching, I was intrigued by this posting and comment. I decided to put my own library (dClass) through the end-to-end performance comparison:

https://github.com/TheWeatherChannel/dClass/wiki/dClass-vs-g...


Well I woke up to a nice surprise!

Since people seem really interested in this, mburman and I are going to spend the day fixing up this project.

This was a final project for 15-418 in Spring of 2012 (that was a great class). We need to update it for the newer versions of CUDA, and we'll probably clean up some of our code since we're both better programmers now.

Pull-requests always welcome!


Always interesting to see applications/algorithms sped up with GPGPU implementations. It looks like they get a great amount of speedup by parsing each line in parallel (in the thread blocks), followed by parsing each individual regular expression in separate blocks.

Admittedly, I assumed this was already the case after reading the article title, since parsing each line in parallel would present a great deal of parallelism with growing file sizes. Indeed, they also came to the same conclusion:

> we realized that it was a much better idea to parallelize across the lines of the input file against which we had to match regexs

That was also a great choice for future scalability, as no doubt you would get greater parallelism with a growing input file (while also allowing opportunities to speed up more complex expressions).
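
If anyone wants to picture the shape of that decomposition, here's a minimal sketch (my reading of the write-up, not their actual kernel; the flat transition table and the line-offset arrays are assumptions):

    // Sketch: one thread walks one line through a table-driven DFA.
    // table[state * 256 + byte] gives the next state; accept[s] marks matches.
    __global__ void match_lines(const char *buf,
                                const int *line_start, const int *line_end,
                                int n_lines, const int *table,
                                const char *accept, char *hit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_lines) return;
        int state = 0;
        char found = accept[0];
        for (int p = line_start[i]; p < line_end[i] && !found; p++) {
            state = table[state * 256 + (unsigned char)buf[p]];
            found = accept[state];
        }
        hit[i] = found;  // one flag per line; the host copies these back and prints
    }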


It seems to be a project for

http://dolores.sp.cs.cmu.edu/15418_spr13/

Which answers the first question that comes to mind: 'why would anyone do this for an I/O-bound process?' For homework, of course.


I think that getting decent speedup on this kind of non-obvious GPU problem is an impressive result. Their presentation here is somewhat lacking in academic rigor (I for one would have liked to see a description of how they got their timing results, as well as some tests with a smaller source file, or possibly many small source files), but given that they've provided the source code their results should be easy enough to verify.


Oh, I didn't mean to imply there was anything wrong or obvious about it, or for that matter, with homework.


I just want to clarify that it wasn't really homework. This was for an open-ended final project during the last third of the class, where we got to choose to work on anything that interested us.


You are right, 'homework' makes it sound needlessly disparaging. 'Class project' would have been better.


Not necessarily I/O bound: they clocked grep with the simple single regexp at 2 GB/sec, which is about even with a fast net connection or SSD.


Doesn't this approach require that you start with a sequential scan of the file to break it into lines? And does the fact that you beat grep by 8-10x, even after this up-front cost, then imply that such a scan is at least 10 times faster than an actual grep? If so, that's surprising; I thought that grep was a lot closer to the raw cost of scanning a string.


I am not an expert on CUDA, but couldn't you split the file into relatively large chunks, find the nearest newline to each chunk boundary, re-split at those positions, and use that division of labor?
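
Something like this on the host side, I'd think (a sketch, assuming the file is already read into one buffer):

    #include <stddef.h>

    // Sketch: nudge each chunk boundary forward to the next '\n' so every
    // chunk holds whole lines; each chunk can then be scanned independently.
    void split_on_newlines(const char *buf, size_t n, size_t n_chunks,
                           size_t *begin, size_t *end) {
        size_t approx = n / n_chunks;
        size_t pos = 0;
        for (size_t c = 0; c < n_chunks; c++) {
            begin[c] = pos;
            size_t cut = pos + approx;
            if (c + 1 == n_chunks || cut >= n) {
                cut = n;                                    // last chunk takes the rest
            } else {
                while (cut < n && buf[cut] != '\n') cut++;  // nearest newline at/after the cut
                if (cut < n) cut++;                         // keep the '\n' with its line
            }
            end[c] = cut;
            pos = cut;
        }
    }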


Yes, that is a good approach, and it was used in some of the faster implementations of Tim Bray's Wide Finder competition.

http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finde...


Very nice work!

It seems like this is only fast on large files, though, because the text needs to be copied from main memory to the GPU, which introduces latency. I wonder what the latency would be like if this algorithm were instead run on the kind of unified memory architecture you see in, e.g., the PS2 and Xbox One.

Also, I don't quite follow why they're compiling the finite automata on the GPU. To me, their explanation that they didn't want to copy the automaton node by node sort of sounds like there's a lot of room for optimization here. E.g. maybe the regular expressions could be compiled to OpenCL code.

Then again, they did also find that pattern matching is a memory-bound problem, so maybe emitting native code is pointless. Anyone know if there are regular expression engines that emit native x86 code?
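
On the copy question: the usual fix for pointer-chasing structures is to flatten the automaton into contiguous arrays on the host and ship it with one copy per array. A sketch of that idea (my layout, not theirs; the node struct is an assumption):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    struct DfaNode { int next[256]; int accept; };  // assumed host-side node

    // Sketch: pack pointer-based nodes into one contiguous table so a single
    // cudaMemcpy moves the whole automaton, instead of one transfer per node.
    void upload_dfa(const DfaNode *nodes, int n_states,
                    int **d_table, char **d_accept) {
        size_t tbytes = (size_t)n_states * 256 * sizeof(int);
        int  *h_table  = (int *)malloc(tbytes);
        char *h_accept = (char *)malloc(n_states);
        for (int s = 0; s < n_states; s++) {
            for (int b = 0; b < 256; b++)
                h_table[s * 256 + b] = nodes[s].next[b];
            h_accept[s] = (char)nodes[s].accept;
        }
        cudaMalloc(d_table, tbytes);
        cudaMalloc(d_accept, (size_t)n_states);
        cudaMemcpy(*d_table, h_table, tbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(*d_accept, h_accept, (size_t)n_states, cudaMemcpyHostToDevice);
        free(h_table);
        free(h_accept);
    }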



V8, SpiderMonkey, and JSC also all have regular expression JITs.


SpiderMonkey uses JavaScriptCore's regular expression JIT, YARR (Yet Another RegExp Runtime).


I believe pypy also has a regex jit.


Yup we do.


Very cool overall. However:

> Searching for hay in the haystack is not a common use case for a regular expression matcher.

Where'd that conclusion come from? Surely "does this pattern exist anywhere in this string?" with the answer "no" can't be that unusual... I'm pretty sure I do this fairly often, in fact.

EDIT: just realized I probably misinterpreted. Looking at the context more, I suppose they're referring to the frequency of matches, not their presence or absence. However, is 'grep -v commonpattern' (another not-too-unusual use-case) kind of equivalent?


I suppose the fastest grep with no GPU support is beagrep - https://news.ycombinator.com/item?id=5645040


jonstewart, you're hellbanned. Nobody can see your comments.


Wow, that's one of the most egregious bans I've seen yet. He has one quippy but non-offensive political comment from 3 years ago that was voted down to -7, and then a series of solid on-topic technical comments every few months. Maybe a mod could just restore him without requiring him to email asking for forgiveness? He seems worth keeping around.


That's why everyone should toggle on showdead.

Deferring to taste police ought to be opt-in.


Am I the only one who doesn't have a clue what you're talking about, and who is jonstewart?


HN "hellbans" users who have karma that's below a certain threshold (0, I think) - the site looks normal to them, but no one else can see their posts. You can see the post if you turn on "show dead" in your profile.


now unhellbanned by PG, yay


There are already a few regular expression matchers built for GPU based computation. They often have modified finite state machine implementations in order to exploit GPUs.

There's an NFA implementation called iNFAnt, which explores multiple paths through the NFA simultaneously, avoiding the backtracking normally required in NFAs.
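
The core trick, roughly, is to advance the whole set of active states in lockstep, one input byte at a time, so no path ever has to be re-explored. A sketch of that general technique (not iNFAnt's actual code; epsilon transitions are assumed pre-expanded, matched is assumed zeroed by the host, and the block must be launched with exactly n_states threads):

    #define MAX_OUT 4  // assumed cap on successors per (state, byte) pair

    // Sketch of set-based NFA simulation: each thread owns one NFA state and
    // the block steps the entire active set forward together -- no backtracking.
    // trans[(s * 256 + byte) * MAX_OUT + k] lists successor states, -1-terminated.
    __global__ void nfa_match(const char *input, int len, int n_states,
                              const int *trans, const char *accept,
                              char *matched) {
        extern __shared__ char sets[];            // 2 * n_states bytes
        char *cur = sets, *nxt = sets + n_states;
        int s = threadIdx.x;                      // one thread per NFA state
        cur[s] = (s == 0);                        // state 0 is the start state
        __syncthreads();
        for (int p = 0; p < len; p++) {
            nxt[s] = 0;
            __syncthreads();
            if (cur[s]) {
                const int *out = &trans[(s * 256 + (unsigned char)input[p]) * MAX_OUT];
                for (int k = 0; k < MAX_OUT && out[k] >= 0; k++)
                    nxt[out[k]] = 1;              // benign race: all writers store 1
            }
            __syncthreads();
            if (nxt[s] && accept[s]) *matched = 1;
            char *t = cur; cur = nxt; nxt = t;    // swap the two state sets
        }
    }

    // launch: nfa_match<<<1, n_states, 2 * n_states>>>(d_in, len, n_states, ...);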


Since you're streaming the input dataset you may want to explore the option of mmap()ing the files and using cudaHostRegister()/cudaHostUnregister() and letting the GPU operate directly on CPU memory rather than copying it over. Should saturate the PCIe bus, which is way faster than a disk.
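
For reference, a sketch of how that would look (cudaHostRegister/cudaHostGetDevicePointer and the flags are the standard runtime API; the kernel is a placeholder, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Sketch: map the file, pin the mapping, and give the GPU a device pointer
    // straight into host memory, so reads stream over PCIe on demand instead of
    // requiring an up-front cudaMemcpy of the whole file.
    void grep_mapped(const char *path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *h = (char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        cudaHostRegister(h, st.st_size, cudaHostRegisterMapped);
        char *d = NULL;
        cudaHostGetDevicePointer((void **)&d, h, 0);

        // match_kernel<<<blocks, threads>>>(d, st.st_size, ...);  // placeholder
        cudaDeviceSynchronize();

        cudaHostUnregister(h);
        munmap(h, st.st_size);
        close(fd);
    }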


I don't see what hardware they tested on. What are the specs of this GHC 5000 computer they mention?



