Sort Faster with FPGAs (hackaday.com)
86 points by szczys on Jan 20, 2016 | 38 comments



I wonder how quickly routing resources will cause the fan-out time to hurt the speed of the sort as well.

Also fun is Radix Sort. If you can guarantee a fixed-size key (say 32 bits), a Radix Sort runs in O(kN), where k is usually 1-4 based on your cache line size. It destroys any other sort thanks to linear data reads, by almost an order of magnitude in some cases. I've used it on some previous gigs to sort thousands of quads front-to-back in record time on some older ARM chips.
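A minimal sketch of the idea in C, assuming unsigned 32-bit keys and 8-bit digits (so k = 4 passes); the function name and digit width are illustrative choices, not from the comment:

    #include <stdint.h>
    #include <string.h>

    /* LSD radix sort for 32-bit keys: four stable counting-sort passes,
       one per 8-bit digit. tmp must hold n elements. */
    void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n)
    {
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[256] = {0};

            for (size_t i = 0; i < n; i++)        /* histogram this digit */
                count[(a[i] >> shift) & 0xFF]++;

            size_t total = 0;                     /* prefix sums -> offsets */
            for (int d = 0; d < 256; d++) {
                size_t c = count[d];
                count[d] = total;
                total += c;
            }

            for (size_t i = 0; i < n; i++)        /* stable scatter pass */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];

            memcpy(a, tmp, n * sizeof *a);
        }
    }

Every pass reads the input linearly, which is where the cache friendliness comes from.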


I remember we had a 'challenge' in our algorithms class to develop the fastest sort. I read a paper on how to write a fast Radix sort and built what I thought was a pretty fast implementation (if a little memory hungry). Combined with some memory mapping and some other tweaks, it ran in fractions of a second, 5-10x faster than the C library's qsort (if I recall correctly). The memory mapping in particular was a huge win.

Unfortunately, that professor was a bit of a... well, an odd fellow. His "benchmark framework" included a program that ran between test runs and would essentially allocate all the memory on the system until it was using tons of swap. So my butt-kicking algorithm actually performed quite poorly, since memory mapping and large allocations don't work particularly well on a machine in such a horrendous state. That would have been fine, except that he never told us about this detail, so I only found out after the fact.

Still, that challenge was tremendous fun, and I learned a whole lot about both the algorithm and the machine in the process. There are lots of optimizations you can make when you know more about the kind of data (and keys) you have and how it's distributed. Very fun.


Yeah, Radix can get a bit memory hungry if you don't pick a small enough k (although you trade that for more passes through the data).

That doesn't sound like much of a benchmark to me. Did he have any reasoning behind it?


O(kN)

If this is trolling you're doing a very good job.


Not particularly.

I'll probably get downvoted, but I'm not a huge fan of Big-O notation (hence the k). It's very hand-wavy about things that have a significant impact on performance (like cache misses).

Great from the ivory tower of academia, but down in the real world we like things that run fast :).


Perhaps you already know this, in which case I'm sorry for being pedantic.

There are portions of academia that also care about constant factors. They would write what I think you mean by O(kN) as k * N.

If you need more precision, such as "there's an order-N term with constant factor k, plus some lower-order terms whose constants I don't know (or don't care to calculate)", you could write that as k * N + o(N). The little o means terms that grow strictly slower than N.

Edit: I should have conceded that I'm not aware of anyone who attempts to use O-notation to talk about cache goodness. That certainly is swept under the covers of "assume a perfect RAM machine" :P


I agree with you, and I meant it as a genuine compliment.


Thanks, I'll have to re-train my compliment detector :).

I enjoy breaking out Radix because it's one of those things that's so unintuitive when you come at it from a traditional CS perspective. The only book I'm aware of that approaches data structures with a consideration for this kind of performance is Real-Time Collision Detection (which is basically data structures for 3D spatial domains).


If you're complaining because it's faster than O(N*log(N)): the fixed key size guarantees you'll never meet the sufficiently large N that big-O analysis requires.

If you're complaining because of the k: k is the number of bits in the key, and it's perfectly reasonable to have that spelled out explicitly.


I wouldn't say it was a complaint but the k is definitely what caught my attention.

https://en.wikipedia.org/wiki/Big_O_notation#Multiplication_...


Radix sort really does have linear time complexity, but it requires fixed-size keys. If your keys are variable-sized, e.g. strings, it's O(n log n) as usual.

(I just looked it up on Wikipedia.)


The three things to keep in mind about radix sort are: A) it's noncomparative, while other sorting algorithms' performance is measured in the number of comparisons, so there's a bit of apples-and-oranges going on. B) what is `k` for `n` unique keys? Log(n). So O(kn) is...? C) despite those considerations, in practice radix sort can often outperform everything else, so use it where appropriate.


There's a fourth as well: the sort is stable for keys of equal value. So if you care about ordering (which a lot of rendering algorithms that use Radix do), it's useful there too.


You can also sort strings in linear time with a trie (for example). This takes O(n) time, where n is the total length of the input (i.e. the sum of the lengths of the strings).
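A toy C sketch of that approach, restricted to lowercase ASCII for brevity; the node layout and names are invented for the example:

    #include <stdio.h>
    #include <stdlib.h>

    /* Trie over 'a'..'z'; end_count handles duplicate strings. Insertion
       and traversal each touch every character once, so the total work is
       proportional to the sum of string lengths. */
    typedef struct Node {
        struct Node *child[26];
        int end_count;
    } Node;

    static void trie_insert(Node *root, const char *s)
    {
        for (; *s; s++) {
            int c = *s - 'a';
            if (!root->child[c])
                root->child[c] = calloc(1, sizeof(Node));
            root = root->child[c];
        }
        root->end_count++;
    }

    /* Pre-order walk emits the strings in sorted order. */
    static void trie_dump(const Node *n, char *buf, int depth)
    {
        for (int i = 0; i < n->end_count; i++)
            printf("%.*s\n", depth, buf);
        for (int c = 0; c < 26; c++)
            if (n->child[c]) {
                buf[depth] = 'a' + c;
                trie_dump(n->child[c], buf, depth + 1);
            }
    }

    int main(void)
    {
        const char *words[] = {"banana", "apple", "app", "cherry", "app"};
        Node root = {0};
        char buf[64];
        for (size_t i = 0; i < sizeof words / sizeof *words; i++)
            trie_insert(&root, words[i]);
        trie_dump(&root, buf, 0);  /* app, app, apple, banana, cherry */
    }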


A better solution, which only uses local connections between adjacent cells, is Odd-Even Transposition Sort (sketched below).

When the cells are arranged in a 2D grid, Shearsort sorts all the values into snake-like order in O(sqrt(n) log(n)). A slightly more complicated algorithm based on Shearsort, due to Schnorr and Shamir, runs in O(sqrt(n)).
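For reference, here's odd-even transposition sort rendered in plain C; in hardware, all of the compare-exchanges within a phase fire simultaneously, so the n phases give O(n) time on n cells:

    #include <stddef.h>

    /* n phases; each phase compare-exchanges disjoint adjacent pairs,
       so a phase is a single parallel step in hardware. Even phases
       pair (0,1),(2,3),...; odd phases pair (1,2),(3,4),... */
    void odd_even_transposition_sort(int *a, size_t n)
    {
        for (size_t phase = 0; phase < n; phase++) {
            for (size_t i = phase & 1; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int t = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = t;
                }
            }
        }
    }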


There's a far better way to do this, which essentially decomposes into an insertion sort. Details here: https://github.com/theocean154/dau/tree/master/sort


Yeah, I was thinking the same thing: just compare all the registers to the value of the next item @posedge_clk, and use that bit to tell the registers whether or not they need to shift their value into the next register. It replaces the `memmove()` (the abstract concept) with a single-cycle shift.


The benefit of my way is that you get pipelining (which lets you maintain O(1) throughput without a fill/flush cycle), better scaling (due to said pipelining), and better energy usage due to smaller fanout.


O(N) is not particularly insightful for fixed/limited values of N. This would be more interesting if the author reported sorts/s for the FPGA vs. software. Because obviously, as N grows, the FPGA only performs the full sort in "one cycle" if you are willing to lower the clock rate more and more (until the FPGA blows up).


Yes, it's really an insertion sort where the insert step is done in parallel hardware. That's cool for small N but it's really still an O(N^2) algorithm.


> each sorting step is fast, bubbling down to N identical and independent questions asked in parallel at every clock cycle.

> The first empty cell could take the new data since it’s empty, but since 4 just got kicked out from the cell above, this empty cell gets the 4 instead

These are incompatible. The result in cell 3 depends on the result of cell 2 (and so on up the chain of cells).

Therefore, the steps have to run in order, making this O(n) just for adding one item to the set.

The end goal might still be possible, but I don't see how this can work.


I initially thought this as well, but reading the code showed me it's actually O(1) to add each item to the set.

Each cell computes in parallel (new_data < cell_data) & (cell_state == OCCUPIED). If this is true, the cell will be pushed down, either by the new data or by the data from the cell above. Each cell can also grab this value from the cell above: if the cell above's value is true, the cell grabs the data from the cell above; if it's false, the cell grabs the data being inserted. This is O(1) because there's no cascading value shift. I personally found this pretty clever, and I'm looking forward to implementing some of these bounded-size hardware sorts on a GPU.
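As a software analogy (not the article's Verilog), here's a C sketch of one insert cycle in which every cell's next state depends only on local signals; the struct and names are invented for illustration:

    #include <stdbool.h>

    #define N 8

    /* Cell 0 holds the smallest element; if the array is full, the
       largest value falls off the bottom. */
    typedef struct {
        int  data[N];
        bool occ[N];
    } SortCells;

    void insert_cycle(SortCells *s, int new_data)
    {
        int  next_data[N];
        bool next_occ[N];

        for (int i = 0; i < N; i++) {  /* all cells decide "at once" */
            /* signals from the cell above; virtual cell -1 is occupied */
            bool above_occ  = (i == 0) ? true  : s->occ[i - 1];
            bool above_push = (i == 0) ? false :
                              (s->occ[i - 1] && new_data < s->data[i - 1]);
            bool my_push    = s->occ[i] && new_data < s->data[i];

            if (above_push) {               /* catch the evicted value */
                next_data[i] = s->data[i - 1];
                next_occ[i]  = true;
            } else if (my_push || (!s->occ[i] && above_occ)) {
                next_data[i] = new_data;    /* new value lands here */
                next_occ[i]  = true;
            } else {                        /* keep current contents */
                next_data[i] = s->data[i];
                next_occ[i]  = s->occ[i];
            }
        }
        for (int i = 0; i < N; i++) {       /* the "clock edge": latch */
            s->data[i] = next_data[i];
            s->occ[i]  = next_occ[i];
        }
    }

No cell reads another cell's freshly computed result, only the previous state, which is why the whole step fits in one cycle.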


> Each cell can also grab this value from the cell above

Therein lies the problem I see: grabbing this value from the cell above implies that the value has already been calculated, but it hasn't, as these cells run in parallel. The results of a cell in a single round of calculations are not accessible to the other cells during the same round.

In short: how does a cell know whether its new value is the value being inserted or the value being pushed down, without knowing the result of the cell above it?

And having asked that, I can actually think of one answer: a cell can calculate independently whether its neighbor will be pushing. So each cell is really doing the calculation "given these two numbers (my cell's value and the cell above's), where does this new number fit in sorted order with them: before, between, or after?"


Arguably it's O(N^2) in spacetime, because it's O(N) steps on O(N) hardware (N parallel units).

You can imagine different hardware architectures taking O(1) steps on O(N^2) parallel units or O(N^2) steps on O(1) unit(s), or anything in between.


Here's an old but great reference on the subject: http://www.ccrc.wustl.edu/~roger/papers/cg09.pdf

I used to teach digital design; sorting was one of the first topics for FPGAs :)


A much faster sorting-algorithm implementation for hardware is the sorting network [1]. It can be implemented with just comparators and wires, so there's no clock involved: it's a combinational circuit. It's quite a simple circuit that can be built from the bottom up with logic gates. Knuth's TAOCP has a good explanation, and there's lots of information on the internet. (See the sketch after the link below.)

---

[1] https://en.wikipedia.org/wiki/Sorting_network
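For concreteness, a C sketch of the classic optimal 4-input network; in hardware each cmpswap is a comparator plus a pair of muxes, and the comparators within a stage operate in parallel:

    #include <stdio.h>

    /* Compare-exchange: the only primitive a sorting network needs. */
    static void cmpswap(int *a, int *b)
    {
        if (*a > *b) { int t = *a; *a = *b; *b = t; }
    }

    /* Optimal 4-input network: 5 comparators in 3 stages. Comparators
       within a stage share no wires, so they can fire simultaneously. */
    void sort4(int v[4])
    {
        cmpswap(&v[0], &v[1]); cmpswap(&v[2], &v[3]);  /* stage 1 */
        cmpswap(&v[0], &v[2]); cmpswap(&v[1], &v[3]);  /* stage 2 */
        cmpswap(&v[1], &v[2]);                         /* stage 3 */
    }

    int main(void)
    {
        int v[4] = {3, 1, 4, 2};
        sort4(v);
        printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);  /* 1 2 3 4 */
    }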


We'll see a lot of FPGA use in the coming years. Consider this: the newest FPGAs have 1 TB/s (that's a terabyte per second, not a terabit) of memory bandwidth and a 1 KB-wide memory bus. Short of developing your own silicon (which takes years and tons of money), you simply can't get this kind of throughput with anything else. That's basically why Intel bought Altera.


HBM and HMC are already in the process of bringing most of those advantages to GPUs and probably some CPUs. Last year's high-end GPUs introduced HBM with a 4096-bit (0.5 KB) wide memory bus and 0.5 TB/s of bandwidth. The only reason they didn't have more memory bandwidth was that they didn't need it, and the die area was better put to use for compute resources. FPGAs may be the only general-purpose chips so far to actually ship that kind of memory bus, but they aren't the only ones that can.


GPUs aren't custom logic, and they come with significant and fundamental programming-model limitations, so I don't see much functional overlap between FPGAs and GPUs. In particular, GPUs suck mightily at anything that involves branching. They're also not ideal for things like SDN, crypto, search retrieval/ranking, real-time DSP, and so on.


My point was that memory bandwidth won't sell anybody on FPGAs if they didn't already have strong reasons to be considering FPGAs.


I built a test chip for a class project in college that did sorting in linear time, without all the routing issues of this algorithm.


Care to share some details? It sounds interesting.


Perhaps implement a sorting network? https://en.wikipedia.org/wiki/Sorting_network


The article referenced in the comments is very cool:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83....

It seems related to certain pipeline bypass structures.


I have the pleasure of working with Josh, and first heard him talk about this in a car ride to get lunch. :) Very cool stuff, and very excited to see the things he builds in the future.


Rearranged my repo; here is the scalable, pipelined way of doing this. It ends up being an insertion sort with the benefits of small fanout (for lower energy), better scalability (faster, due to lower fanout and a shorter critical path with pipelining), and O(1) throughput (no fill/drain cycle).

https://github.com/theocean154/dau/blob/master/docs/part_sor...


More explanation is in the paper linked in the comments below; this is just a simple way of visualizing how it works.


Trades time for parallelism.

This isn't going to scale to large data sets...



