Sort Faster with FPGAs (hackaday.com)
86 points by szczys on Jan 20, 2016 | 38 comments



I wonder how quickly routing resources will cause the fan-out time to hurt the speed of the sort as well.

Also fun is Radix Sort. If you can guarantee a fixed-size key (say 32 bits), a Radix Sort runs in O(kN), where k is usually 1-4 based on your cache line size. It destroys any other sort thanks to linear data reads, by almost an order of magnitude in some cases. I've used it on some previous gigs to sort thousands of quads front-to-back in record time on some older ARM chips.
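A minimal sketch of the idea in C, assuming unsigned 32-bit keys and 8-bit digits (so k = 4 passes); the function name and digit width are illustrative choices, not from the comment:

    #include <stdint.h>
    #include <string.h>

    /* LSD radix sort for 32-bit keys: four stable counting-sort passes,
       one per 8-bit digit. tmp must hold n elements. */
    void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n)
    {
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[256] = {0};

            for (size_t i = 0; i < n; i++)        /* histogram this digit */
                count[(a[i] >> shift) & 0xFF]++;

            size_t total = 0;                     /* prefix sums -> offsets */
            for (int d = 0; d < 256; d++) {
                size_t c = count[d];
                count[d] = total;
                total += c;
            }

            for (size_t i = 0; i < n; i++)        /* stable scatter pass */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];

            memcpy(a, tmp, n * sizeof *a);
        }
    }

Every pass reads the input linearly, which is where the cache friendliness comes from.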


I remember we had a 'challenge' in our algorithms class to develop the fastest sort. I read a paper on how to write a fast Radix sort and built what I thought was a pretty fast implementation (if a little memory hungry). Combined with some memory mapping and some other tweaks, it ran in fractions of a second, 5-10x faster than the C library's qsort (if I recall correctly). The memory mapping in particular was a huge win.

Unfortunately, that professor was a bit of a... well, an odd fellow. His "benchmark framework" included a program that ran between test runs and would essentially allocate all the memory on the system until it was using tons of swap. So my butt-kicking algorithm actually performed quite poorly, since memory mapping and large allocations don't work particularly well on a machine in such a horrendous state. That would have been fine, except that he never told us about this detail, so I only found out after the fact.

Still, that challenge was tremendous fun, and I learned a whole lot about both the algorithm and the machine in the process. There are lots of optimizations you can make when you know more about the kind of data (and keys) you have and how it's distributed. Very fun.


Yeah, Radix can get a bit memory hungry if you don't pick a small enough k (although you trade that for more passes through the data).

That doesn't sound like much of a benchmark to me. Did he have any reasoning behind it?


O(kN)

If this is trolling you're doing a very good job.


Not particularly.

I'll probably get downvoted, but I'm not a huge fan of Big-O notation (hence the k). It's very hand-wavy about things that have a significant impact on performance (like cache misses).

Great from the ivory tower of academia, but down in the real world we like things that run fast :).


Perhaps you already know this, in which case I'm sorry for being pedantic.

There are portions of academia that also care about constant factors. They would write what I think you mean by O(kN) as k * N.

If you need more precision, such as "there's an order-N term with constant factor k, plus some lower-order terms whose constants I don't know (or don't care to calculate)", you could write that as k * N + o(N). The little o means terms that grow strictly slower than N.

Edit: I should have conceded that I'm not aware of anyone who attempts to use O-notation to talk about cache goodness. That certainly is swept under the covers of "assume a perfect RAM machine" :P


I agree with you, and I meant it as a genuine compliment.


Thanks, I'll have to re-train my compliment detector :).

I enjoy breaking out Radix because it's one of those things that's so unintuitive when you come at it from a traditional CS perspective. The only book I'm aware of that approaches data structures with a consideration for this kind of performance is Real-Time Collision Detection (which is basically data structures for 3D spatial domains).


If you're complaining because it's faster than O(N*log(N)): the fixed key size guarantees you'll never meet the sufficiently large N that big-O analysis requires.

If you're complaining because of the k: k is the number of bits in the key, and it's perfectly reasonable to have that spelled out explicitly.


I wouldn't say it was a complaint but the k is definitely what caught my attention.

https://en.wikipedia.org/wiki/Big_O_notation#Multiplication_...


Radix sort really does have linear time complexity, but it requires fixed-size keys. If your keys are variable-sized, e.g. strings, it's O(n log n) as usual.

(I just looked it up on Wikipedia.)


The three things to keep in mind about radix sort are: A) it's noncomparative, while other sorting algorithms' performance is measured in the number of comparisons, so there's a bit of apples-and-oranges going on. B) what is `k` for `n` unique keys? Log(n). So O(kn) is...? C) despite those considerations, in practice radix sort can often outperform everything else, so use it where appropriate.


There's a fourth as well: the sort is stable for keys of equal value. So if you care about ordering (which a lot of rendering algorithms that use Radix do), it's useful there too.


You can also sort strings in linear time with a trie (for example). This takes O(n) time, where n is the total length of the input (i.e. the sum of the lengths of the strings).
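A toy C sketch of that approach, restricted to lowercase ASCII for brevity; the node layout and names are invented for the example:

    #include <stdio.h>
    #include <stdlib.h>

    /* Trie over 'a'..'z'; end_count handles duplicate strings. Insertion
       and traversal each touch every character once, so the total work is
       proportional to the sum of string lengths. */
    typedef struct Node {
        struct Node *child[26];
        int end_count;
    } Node;

    static void trie_insert(Node *root, const char *s)
    {
        for (; *s; s++) {
            int c = *s - 'a';
            if (!root->child[c])
                root->child[c] = calloc(1, sizeof(Node));
            root = root->child[c];
        }
        root->end_count++;
    }

    /* Pre-order walk emits the strings in sorted order. */
    static void trie_dump(const Node *n, char *buf, int depth)
    {
        for (int i = 0; i < n->end_count; i++)
            printf("%.*s\n", depth, buf);
        for (int c = 0; c < 26; c++)
            if (n->child[c]) {
                buf[depth] = 'a' + c;
                trie_dump(n->child[c], buf, depth + 1);
            }
    }

    int main(void)
    {
        const char *words[] = {"banana", "apple", "app", "cherry", "app"};
        Node root = {0};
        char buf[64];
        for (size_t i = 0; i < sizeof words / sizeof *words; i++)
            trie_insert(&root, words[i]);
        trie_dump(&root, buf, 0);  /* app, app, apple, banana, cherry */
    }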


A better solution, which only uses local connections between adjacent cells, is Odd-Even Transposition Sort (sketched below).

When the cells are arranged in a 2D grid, Shearsort sorts all the values into snake-like order in O(sqrt(n) log(n)). A slightly more complicated algorithm based on Shearsort, due to Schnorr and Shamir, runs in O(sqrt(n)).
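For reference, here's odd-even transposition sort rendered in plain C; in hardware, all of the compare-exchanges within a phase fire simultaneously, so the n phases give O(n) time on n cells:

    #include <stddef.h>

    /* n phases; each phase compare-exchanges disjoint adjacent pairs,
       so a phase is a single parallel step in hardware. Even phases
       pair (0,1),(2,3),...; odd phases pair (1,2),(3,4),... */
    void odd_even_transposition_sort(int *a, size_t n)
    {
        for (size_t phase = 0; phase < n; phase++) {
            for (size_t i = phase & 1; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int t = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = t;
                }
            }
        }
    }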


There's a far better way to do this, which essentially decomposes into an insertion sort. Details here: https://github.com/theocean154/dau/tree/master/sort


Yeah, I was thinking the same thing: just compare all the registers to the value of the next item @posedge_clk, and use that bit to tell the registers whether or not they need to shift their value into the next register. It replaces the `memmove()` (the abstract concept) with a single-cycle shift.


The benefit of my way is that you get pipelining (which lets you maintain O(1) throughput without a fill/flush cycle), better scaling (due to said pipelining), and better energy usage due to smaller fanout.


O(N) is not particularly insightful for fixed/limited values of N. This would be more interesting if the author reported sorts/s for the FPGA vs. software. Because obviously, as N grows, the FPGA only performs the full sort in "one cycle" if you are willing to lower the clock rate more and more (until the FPGA blows up).


Yes, it's really an insertion sort where the insert step is done in parallel hardware. That's cool for small N but it's really still an O(N^2) algorithm.


> each sorting step is fast, bubbling down to N identical and independent questions asked in parallel at every clock cycle.

> The first empty cell could take the new data since it’s empty, but since 4 just got kicked out from the cell above, this empty cell gets the 4 instead

These are incompatible. The result in cell 3 depends on the result of cell 2 (and so on up the chain of cells).

Therefore, the steps have to run in order, making this O(n) just for adding one item to the set.

The end goal might still be possible, but I don't see how this can work.


I initially thought this as well, but reading the code showed me it's actually O(1) to add each item to the set.

Each cell computes in parallel (new_data < cell_data) & (cell_state == OCCUPIED). If this is true, the cell will be pushed down, either by the new data or by the data from the cell above. Each cell can also grab this value from the cell above: if the cell above's value is true, the cell grabs the data from the cell above; if it's false, the cell grabs the data being inserted. This is O(1) because there's no cascading value shift. I personally found this pretty clever, and I'm looking forward to implementing some of these bounded-size hardware sorts on a GPU.
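As a software analogy (not the article's Verilog), here's a C sketch of one insert cycle in which every cell's next state depends only on local signals; the struct and names are invented for illustration:

    #include <stdbool.h>

    #define N 8

    /* Cell 0 holds the smallest element; if the array is full, the
       largest value falls off the bottom. */
    typedef struct {
        int  data[N];
        bool occ[N];
    } SortCells;

    void insert_cycle(SortCells *s, int new_data)
    {
        int  next_data[N];
        bool next_occ[N];

        for (int i = 0; i < N; i++) {  /* all cells decide "at once" */
            /* signals from the cell above; virtual cell -1 is occupied */
            bool above_occ  = (i == 0) ? true  : s->occ[i - 1];
            bool above_push = (i == 0) ? false :
                              (s->occ[i - 1] && new_data < s->data[i - 1]);
            bool my_push    = s->occ[i] && new_data < s->data[i];

            if (above_push) {               /* catch the evicted value */
                next_data[i] = s->data[i - 1];
                next_occ[i]  = true;
            } else if (my_push || (!s->occ[i] && above_occ)) {
                next_data[i] = new_data;    /* new value lands here */
                next_occ[i]  = true;
            } else {                        /* keep current contents */
                next_data[i] = s->data[i];
                next_occ[i]  = s->occ[i];
            }
        }
        for (int i = 0; i < N; i++) {       /* the "clock edge": latch */
            s->data[i] = next_data[i];
            s->occ[i]  = next_occ[i];
        }
    }

No cell reads another cell's freshly computed result, only the previous state, which is why the whole step fits in one cycle.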


> Each cell can also grab this value from the cell above

Therein lies the problem I see: grabbing this value from the cell above implies that the value has already been calculated, but it hasn't, as these cells run in parallel. The results of a cell in a single round of calculations are not accessible to the other cells during the same round.

In short: how does a cell know whether its new value is the value being inserted or the value being pushed down, without knowing the result of the cell above it?

And having asked that, I can actually think of one answer: a cell can calculate independently whether its neighbor will be pushing. So each cell is really doing the calculation "given these two numbers (my cell's value and the cell above's), where does this new number fit in sorted order with them: before, between, or after?"


Arguably it's O(N^2) in spacetime, because it's O(N) steps on O(N) hardware (N parallel units).

You can imagine different hardware architectures taking O(1) steps on O(N^2) parallel units or O(N^2) steps on O(1) unit(s), or anything in between.


Here's an old but great reference on the subject: http://www.ccrc.wustl.edu/~roger/papers/cg09.pdf

I used to teach digital design; sorting was one of the first topics for FPGAs :)


A much faster sorting-algorithm implementation for hardware is the sorting network [1]. It can be implemented with just comparators and wires, so there's no clock involved: it's a combinational circuit. It's quite a simple circuit that can be built from the bottom up with logic gates. Knuth's TAOCP has a good explanation, and there's lots of information on the internet. (See the sketch after the link below.)

---

[1] https://en.wikipedia.org/wiki/Sorting_network
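For concreteness, a C sketch of the classic optimal 4-input network; in hardware each cmpswap is a comparator plus a pair of muxes, and the comparators within a stage operate in parallel:

    #include <stdio.h>

    /* Compare-exchange: the only primitive a sorting network needs. */
    static void cmpswap(int *a, int *b)
    {
        if (*a > *b) { int t = *a; *a = *b; *b = t; }
    }

    /* Optimal 4-input network: 5 comparators in 3 stages. Comparators
       within a stage share no wires, so they can fire simultaneously. */
    void sort4(int v[4])
    {
        cmpswap(&v[0], &v[1]); cmpswap(&v[2], &v[3]);  /* stage 1 */
        cmpswap(&v[0], &v[2]); cmpswap(&v[1], &v[3]);  /* stage 2 */
        cmpswap(&v[1], &v[2]);                         /* stage 3 */
    }

    int main(void)
    {
        int v[4] = {3, 1, 4, 2};
        sort4(v);
        printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);  /* 1 2 3 4 */
    }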


We'll see a lot of FPGA use in the coming years. Consider this: the newest FPGAs have 1 TB/s (that's a terabyte per second, not a terabit) of memory bandwidth and a 1 KB-wide memory bus. Short of developing your own silicon (which takes years and tons of money), you simply can't get this kind of throughput with anything else. That's basically why Intel bought Altera.


HBM and HMC are already in the process of bringing most of those advantages to GPUs and probably some CPUs. Last year's high-end GPUs introduced HBM with a 4096-bit (0.5 KB) wide memory bus and 0.5 TB/s of bandwidth. The only reason they didn't have more memory bandwidth was that they didn't need it, and the die area was better put to use for compute resources. FPGAs may be the only general-purpose chips so far to actually ship that kind of memory bus, but they aren't the only ones that can.


GPUs aren't custom logic, and they come with significant and fundamental programming-model limitations, so I don't see much functional overlap between FPGAs and GPUs. In particular, GPUs suck mightily at anything that involves branching. They're also not ideal for things like SDN, crypto, search retrieval/ranking, real-time DSP, and so on.


My point was that memory bandwidth won't sell anybody on FPGAs if they didn't already have strong reasons to be considering FPGAs.


I built a test chip for a class project in college that did sorting in linear time, without all the routing issues of this algorithm.


Care to share some details? It sounds interesting.


Perhaps implement a sorting network? https://en.wikipedia.org/wiki/Sorting_network


The article referenced in the comments is very cool:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83....

It seems related to certain pipeline bypass structures.


I have the pleasure of working with Josh, and first heard him talk about this in a car ride to get lunch. :) Very cool stuff, and very excited to see the things he builds in the future.


Rearranged my repo; here is the scalable, pipelined way of doing this. It ends up being an insertion sort with the benefits of small fanout (for lower energy), better scalability (faster, due to lower fanout and a shorter critical path with pipelining), and O(1) throughput (no fill/drain cycle).

https://github.com/theocean154/dau/blob/master/docs/part_sor...


More explanation is in the paper linked in the comments below; this is just a simple way of visualizing how it works.


Trades time for parallelism.

This isn't going to scale to large data sets...



