Interesting work! But I'm unclear on the end-to-end performance comparison, which includes the cost of copying data to and from the device.
I know you guys are eyeballs deep in your data, so what's obvious to you may not be obvious to a reader. From what I can tell, your performance numbers are reported as an average per regex. But you're calculating that without including the string copying cost; you're only timing the cost of performing the computation on the GPU.
You show the overhead in absolute numbers. It seems small compared to the cost of processing the whole file, but I'd need to know what grep costs on that whole file to judge it. Basically, you show me a number, but you don't explain its significance.
In my experience, the best way to deal with such issues is to always report entire-application performance. (This can be in addition to other numbers, of course.) That way I can say "Ah, calling grep took x seconds on average, calling their program took y seconds on average, and y < x, so they've improved performance." If you report entire-application performance, your reader doesn't need to understand the particulars to at least determine if your technique is a performance win.
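To make this concrete, here's the sort of harness I mean, in plain C (run_gpu_grep is a made-up stand-in for your whole pipeline, not your actual API): put the clock around the entire run, copies and all, and compare against `time grep pattern file`.

    /* End-to-end timing sketch. run_gpu_grep() is hypothetical; the
       point is that the clock covers file I/O, host<->device copies,
       AND the kernels, not just the GPU computation. */
    #include <stdio.h>
    #include <time.h>

    /* stand-in: replace with the real copy-in + match + copy-out pipeline */
    void run_gpu_grep(const char *pattern, const char *path) {
        (void)pattern; (void)path;
    }

    int main(int argc, char **argv) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_gpu_grep(argv[1], argv[2]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("end-to-end: %.3f s\n", secs); /* compare with `time grep ...` */
        return 0;
    }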
Thanks for the advice. As soon as we get this project running successfully on CUDA 5, we'll update the final report with the entire-application performance numbers (hopefully before the end of tomorrow).
Don't mean to hijack this thread or anything, but as someone who has spent many years researching pattern matching, I was intrigued by this posting and comment. I decided to put my own library (dClass) through the same end-to-end performance comparison:
Since people seem really interested in this, mburman and I are going to spend the day fixing up this project.
This was a final project for 15-418 in Spring of 2012 (that was a great class). We need to update it for the newer versions of CUDA, and we'll probably clean up some of our code since we're both better programmers now.
Always interesting to see applications/algorithms sped up with GPGPU implementations. It looks like they get a great amount of speedup by matching each line in parallel (in the thread blocks), with each individual regular expression handled in a separate block.
Admittedly, I assumed this was already the case after reading the article title, since matching each line in parallel would expose a great deal of parallelism as file sizes grow. Indeed, they came to the same conclusion:
> we realized that it was a much better idea to parallelize across the lines of the input file against which we had to match regexs
That was also a great choice for future scalability, as no doubt you would get greater parallelism with a growing input file (while also allowing opportunities to speed up more complex expressions).
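If I'm reading the writeup right, the grid layout would look something like this (my own sketch of the idea, not their kernel; the names and the DFA encoding are assumptions):

    /* blockIdx.x selects a line, blockIdx.y selects a regex; each block
       walks its line through that regex's DFA transition table.
       Assumes state 0 is the start state and state 1 is accepting. */
    __global__ void match_lines(const char *text, const int *line_start,
                                const int *line_len, const int *dfa,
                                int num_states, int *hit)
    {
        int line  = blockIdx.x;
        int regex = blockIdx.y;
        if (threadIdx.x != 0) return; /* one walker per block, to keep it simple */

        const int  *table = dfa + regex * num_states * 256; /* byte alphabet */
        const char *s     = text + line_start[line];
        int state = 0;
        for (int i = 0; i < line_len[line]; ++i) {
            state = table[state * 256 + (unsigned char)s[i]];
            if (state == 1) { hit[line * gridDim.y + regex] = 1; return; }
        }
    }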
I think that getting decent speedup on this kind of non-obvious GPU problem is an impressive result. Their presentation here is somewhat lacking in academic rigor (I for one would have liked to see a description of how they got their timing results, as well as some tests with a smaller source file, or possibly many small source files), but given that they've provided the source code, their results should be easy enough to verify.
I just want to clarify that it wasn't really homework. This was for an open ended final project during the last third of the class where we got to choose to work on anything that interested us.
Doesn't this approach require that you start with a sequential scan of the file to break it into lines? And does the fact that you beat grep by 8-10x, even after this up-front cost, then imply that such a scan is at least 10 times faster than an actual grep? If so that's surprising; I thought that grep was a lot closer to the raw cost of scanning a string.
I am not an expert on CUDA, but couldn't you split the file into relatively large chunks, find the nearest newline to the chunk points, re-split on those positions, and use that division of labor?
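Something like this host-side helper, I'd imagine (an untested sketch; the names are mine):

    /* Cut the buffer at roughly equal byte offsets, then slide each cut
       forward to just past the next '\n' so no line straddles two chunks. */
    #include <stddef.h>

    void split_on_newlines(const char *buf, size_t len, int nchunks,
                           size_t *begin, size_t *end) {
        size_t cut = 0;
        for (int i = 0; i < nchunks; ++i) {
            begin[i] = cut;
            size_t target = (i == nchunks - 1) ? len
                          : len * (size_t)(i + 1) / nchunks;
            if (target < cut) target = cut; /* a long line may swallow a boundary */
            while (target < len && buf[target] != '\n') target++;
            cut = (target < len) ? target + 1 : len;
            end[i] = cut;
        }
    }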
It seems like this is only fast on large files, though, because the text needs to be copied from main memory to the GPU, which introduces latency. I wonder what the latency would be like if this algorithm were instead run on the kind of unified memory architecture you see in e.g. the PS2 and Xbox One.
Also, I don't quite follow why they're compiling the finite automata on the GPU. Their explanation, that they didn't want to copy the automaton node by node, sounds to me like there's a lot of room for optimization here. E.g. maybe the regular expressions could be compiled to OpenCL code.
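For instance (speculating here, I haven't read their code): flatten the compiled automaton into one contiguous transition table on the host and ship it in a single cudaMemcpy, rather than chasing pointers node by node:

    /* Flattened-DFA upload: one allocation, one copy. Kernels then
       index the table as table[state * 256 + input_byte]. */
    #include <cuda_runtime.h>

    int *upload_dfa(const int *host_table, int num_states) {
        size_t bytes = (size_t)num_states * 256 * sizeof(int);
        int *dev_table = NULL;
        cudaMalloc((void **)&dev_table, bytes);
        cudaMemcpy(dev_table, host_table, bytes, cudaMemcpyHostToDevice);
        return dev_table;
    }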
Then again, they did also find that pattern matching is a memory-bound problem, so maybe emitting native code is pointless. Anyone know if there are regular expression engines that emit native x86 code?
> Searching for hay in the haystack is not a common use case for a regular expression matcher.
Where'd that conclusion come from? Surely "does this pattern exist anywhere in this string?" with the answer "no" can't be that unusual... I'm pretty sure I do this fairly often, in fact.
EDIT: just realized I probably misinterpreted. Looking at the context more, I suppose they're referring to the frequency of matches, not their presence or absence. Still, isn't 'grep -v commonpattern' (another not-too-unusual use case) kind of equivalent?
Wow, that's one of the most egregious bans I've seen yet. He has one quippy but non-offensive political comment from 3 years ago that was voted down to -7, and then a series of solid on-topic technical comments every few months. Maybe a mod could just restore him without requiring him to email asking for forgiveness? He seems worth keeping around.
HN "hellbans" users who have karma that's below a certain threshold (0, I think) - the site looks normal to them, but no one else can see their posts. You can see the post if you turn on "show dead" in your profile.
There are already a few regular expression matchers built for GPU-based computation. They often use modified finite state machine implementations in order to exploit the GPU's parallelism.
There's an NFA implementation called iNFAnt, which explores multiple paths through the NFA simultaneously, avoiding the backtracking that path-at-a-time NFA engines need.
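The core trick is the one Thompson-style simulation uses: track the set of live states rather than one path. A toy CPU version of that idea (my sketch, not iNFAnt's actual layout; assumes at most 64 states, with epsilon moves precomputed into the table):

    /* next[state][byte] is a bitmask of successor states. Advancing
       every live state on each input byte means there is never a
       failed path to backtrack to. */
    #include <stdint.h>

    int nfa_matches(const uint64_t next[][256], uint64_t start,
                    uint64_t accept, const unsigned char *s, int n) {
        uint64_t live = start;
        for (int i = 0; i < n && live; ++i) {
            uint64_t step = 0;
            for (uint64_t m = live; m; m &= m - 1)      /* each live state */
                step |= next[__builtin_ctzll(m)][s[i]]; /* union successors */
            live = step;
        }
        return (live & accept) != 0;
    }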
Since you're streaming the input dataset, you may want to explore mmap()ing the files and using cudaHostRegister()/cudaHostUnregister(), letting the GPU operate directly on CPU memory rather than copying it over. That should saturate the PCIe bus, which is way faster than a disk.
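Roughly like this (untested, and worth double-checking the flag and alignment requirements on your CUDA version; error handling omitted):

    /* mmap the file, pin the pages, and get a device pointer so kernels
       read straight across PCIe -- no explicit staging copy. */
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    char *map_for_gpu(const char *path, size_t *len) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *host = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); /* the mapping survives the close */

        cudaHostRegister(host, st.st_size, cudaHostRegisterMapped);
        char *dev = NULL;
        cudaHostGetDevicePointer((void **)&dev, host, 0);

        *len = st.st_size;
        return dev; /* cudaHostUnregister(host) + munmap() when done */
    }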