I wonder how quickly routing resources will cause the fan-out time to hurt the speed of sort as well.
Also fun is Radix Sort. If you can guarantee a fixed-size key (say 32 bits), a Radix Sort will do O(kN), where k is usually 1-4 based on your cache line size. It destroys any other sort, by almost an order of magnitude in some cases, thanks to its linear data reads. I've used it on some previous gigs to sort thousands of quads front-to-back in record time on some older ARM chips.
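Roughly, the byte-at-a-time LSD variant looks like this (a minimal sketch, assuming 32-bit keys and radix-256 digits; not the exact code from those gigs):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal LSD radix sort sketch: four counting-sort passes over 32-bit keys,
   one byte (radix 256) per pass, so k = 4 linear passes over the data. */
void radix_sort_u32(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int pass = 0; pass < 4; pass++) {
        size_t count[256] = {0};
        int shift = pass * 8;

        /* Histogram the current byte. */
        for (size_t i = 0; i < n; i++)
            count[(keys[i] >> shift) & 0xFF]++;

        /* Prefix-sum the histogram into starting offsets. */
        size_t total = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = total;
            total += c;
        }

        /* Stable scatter into the temporary buffer, then copy back. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & 0xFF]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
}
```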
I remember we had a 'challenge' in our algorithms class to develop the fastest sort. I read a paper on how to write a fast Radix sort, and made what I thought was a pretty fast approach (if a little memory hungry). Combine that with some memory mapping and some other tweaks, and it was running in fractions of a second - 5-10x faster than C's built-in quicksort (if I recall correctly). The memory mapping in particular was a huge win.
Unfortunately that professor was a bit of a ... well an odd fellow. On his "benchmark framework" he had a program that ran between test runs that would essentially allocate all memory on the system until it was using tons of swap. So my butt-kicking algorithm actually performed quite poorly due to memory mapping and large allocations not working particularly well on a machine that's in such a horrendous state. That would have been fine, except for the fact that he didn't inform us of this particular detail at all, so I only found out about it after the fact.
Even still, that challenge was tremendous fun, and I learned a whole lot about both the algorithm and the machine in the process. There are lots of optimizations you can make when you know more about the kind of data (and keys) and the distribution of that data. Very fun.
I'll probably get downvoted, but I'm not a huge fan of Big-O notation (hence the k). It's very hand-wavey about things that have a significant impact on performance (like cache misses).
Great from the ivory tower of academia but down in the real world we like things that run fast :).
Perhaps you already know this, in which case I'm sorry for being pedantic.
There are portions of the academy that also care about constant factors. They would write what, I think, you mean by O(kN) as k * N.
If you need more precision, such as: "There's an order N term with constant factor k and some lower order terms whose constants I don't know (or don't care to calculate)", you could write that as: k * N + o(N). The little o means terms that grow strictly slower than N.
Edit: formatting
Edit2: I should have conceded that I'm not aware of anyone who attempts to use O-notation to talk about cache goodness. That certainly is swept under the covers of "Assume a perfect RAM machine" :P
Thanks, I'll have to re-train my compliment detector :).
I enjoy breaking out Radix because it's one of those things that's so unintuitive when you come at it from a traditional CS perspective. The only book I'm aware of that approaches data structures with a consideration for this type of performance is Real-Time Collision Detection (which is basically data structures for 3D spatial domains).
If you're complaining because it's faster than O(N*log(N)): the fixed key size guarantees you'll never reach the sufficiently large N that big-oh analysis requires.
If you're complaining because of the k, k is the number of bits in the key and it's perfectly reasonable to have that spelled out explicitly.
Radix sort really does have linear time complexity, but requires fixed size keys. If your keys are variable sized, e.g. strings, it's O(n log n) as usual.
The three things to keep in mind about radix sort are A) It's noncomparative, while other sorting algorithms' performance is measured in number of comparisons, so there's a bit of apples-and-oranges going on. B) What is `k` for `n` unique keys? Log(n). So O(kn) is...? C) Despite those considerations, in practice radix sort can often outperform everything else, so use where appropriate.
There's a 4th as well, which is that the sort is stable on keys of the same value. So if you care about ordering (which a lot of rendering algorithms that use Radix do), it's useful there as well.
You can sort strings also in linear time with a trie (for example). This requires O(n) time, where n is the total length of the input (i.e. sum of lengths of strings).
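As a rough sketch of that idea (hypothetical code, not tuned: it uses a full 256-way fan-out per node and never frees the trie), insert every string, then walk the trie depth-first in byte order:

```c
#include <stdio.h>
#include <stdlib.h>

/* Trie-based string sort sketch: total work is proportional to the
   sum of the string lengths (plus the per-node fan-out). */
typedef struct Node {
    struct Node *child[256];
    int count;               /* how many input strings end exactly here */
} Node;

static Node *node_new(void) { return calloc(1, sizeof(Node)); }

static void trie_insert(Node *root, const char *s)
{
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!root->child[c])
            root->child[c] = node_new();
        root = root->child[c];
    }
    root->count++;
}

/* Depth-first walk in byte order emits the strings in sorted order. */
static void trie_emit(const Node *n, char *buf, size_t depth)
{
    for (int i = 0; i < n->count; i++)
        printf("%.*s\n", (int)depth, buf);
    for (int c = 0; c < 256; c++) {
        if (n->child[c]) {
            buf[depth] = (char)c;
            trie_emit(n->child[c], buf, depth + 1);
        }
    }
}

int main(void)
{
    const char *words[] = { "banana", "apple", "apricot", "banana", "cherry" };
    char buf[64];
    Node *root = node_new();
    for (size_t i = 0; i < sizeof words / sizeof *words; i++)
        trie_insert(root, words[i]);
    trie_emit(root, buf, 0);   /* prints the words in lexicographic order */
    return 0;
}
```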
A better solution which only uses local connections between adjacent cells is Odd-Even Transposition Sort.
When the cells are arranged in a 2D grid, Shearsort sorts all the values in snake-like order in O(sqrt(n) log(n)). A slightly more complicated algorithm based on Shearsort is that of Schnorr and Shamir, which runs in O(sqrt(n)).
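For the 1D case, a sequential sketch of Odd-Even Transposition Sort looks like this; in hardware, every compare-exchange within a phase runs in parallel, and each cell only ever talks to its immediate neighbors:

```c
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* n phases of alternating even/odd compare-exchanges suffice to sort n values. */
void odd_even_transposition_sort(int *v, size_t n)
{
    for (size_t phase = 0; phase < n; phase++) {
        /* Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ... */
        for (size_t i = phase % 2; i + 1 < n; i += 2)
            if (v[i] > v[i + 1])
                swap_int(&v[i], &v[i + 1]);
    }
}
```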
Yeah, I was thinking the same thing: just compare all the registers to the value of the next item @posedge_clk and use that bit to tell each register whether or not it needs to shift its value into the next register. It replaces the `memmove()` (the abstract concept) with a single-cycle shift.
The benefit of my way is that you get pipelining (which lets you maintain O(1) throughput without having a fill/flush cycle), as well as better scaling (due to said pipelining) and better energy usage due to smaller fanout.
O(N) is not particularly insightful for fixed/limited values of N. This would be more interesting if the author reported sorts/s with the FPGA vs. software. Because obviously, as N grows, the FPGA only performs the full sort in "one cycle" if you are willing to stretch the cycle time more and more (until the FPGA blows up).
Yes, it's really an insertion sort where the insert step is done in parallel hardware. That's cool for small N but it's really still an O(N^2) algorithm.
> each sorting step is fast, bubbling down to N identical and independent questions asked in parallel at every clock cycle.
> The first empty cell could take the new data since it’s empty, but since 4 just got kicked out from the cell above, this empty cell gets the 4 instead
These are incompatible. The result in cell 3 is dependent on the result of cell 2 (and so on up the chain of cells).
Therefore, the steps have to run in order, making this O(n) just for adding one item to the set.
The end goal might still be possible, but I don't see how this can work.
I initially thought this as well, but reading the code showed me it's actually O(1) to add each item to the set.
Each cell computes in parallel (new_data < cell_data) & (cell_state == OCCUPIED). If this is true, then the cell will be pushed down, either by the new data or by the data from the cell above. Each cell can also grab this comparison bit from the cell above: if the cell above's bit is true, the cell grabs the data from the cell above; if it is false, the cell grabs the data being inserted. This is O(1) because there's no cascading value shift. I personally found this pretty clever, and I'm looking forward to implementing some of these bounded-size hardware sorts on GPU.
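Here's a rough software model of that per-cell logic as I understand it (the names and structure are mine, not the article's Verilog). Every cell's next state is derived only from registered values, so nothing cascades within a cycle and all N cells update at once:

```c
#include <stdbool.h>
#include <stdio.h>

#define N 8

typedef struct { int data; bool occupied; } Cell;

/* One "clock cycle" of the parallel insert, simulated in software. */
void insert_one(Cell c[N], int new_data)
{
    Cell next[N];
    for (int i = 0; i < N; i++) {
        /* Combinational per-cell signals, computed from registered state only. */
        bool push_me     = c[i].occupied && new_data < c[i].data;
        bool push_above  = i > 0 && c[i - 1].occupied && new_data < c[i - 1].data;
        bool first_empty = !c[i].occupied && (i == 0 || c[i - 1].occupied);

        if (push_me || first_empty) {
            /* Take the value kicked out of the cell above, or the new value. */
            next[i].data = push_above ? c[i - 1].data : new_data;
            next[i].occupied = true;
        } else {
            next[i] = c[i];          /* keep current contents */
        }
    }
    for (int i = 0; i < N; i++) c[i] = next[i];   /* the "clock edge" */
}

int main(void)
{
    Cell cells[N] = {0};
    int input[] = {5, 2, 7, 1, 4};
    for (int i = 0; i < 5; i++) insert_one(cells, input[i]);
    for (int i = 0; i < N && cells[i].occupied; i++)
        printf("%d ", cells[i].data);   /* prints 1 2 4 5 7 */
    printf("\n");
    return 0;
}
```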
> Each cell can also grab this value from the cell above
Therein lies the problem I see. Grabbing this value from the above cell implies that this value has already been calculated, but it hasn't, as these cells run in parallel. The results of a cell in a single round of calculations are not accessible to the other cells during the same round.
In short- how does a cell know whether its new value is the value being inserted or the value being pushed down without knowing the result of the cell above it?
And having asked that, I can actually think of one answer: it can calculate whether its neighbor will be pushing or not independently. So each cell is really doing the calculation of "given these two numbers (my cell's value and the above cell's), where does this new number fit in sorted order with them: before, between, or after?"
A much faster sorting approach for hardware is a sorting network [1]. It can be implemented with just comparators and wires, so there's no clock involved: it's a purely combinational circuit. It's quite a simple circuit that can be built from the bottom up out of logic gates. Knuth's TAOCP has a good explanation, and there's lots of information on the internet.
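As a concrete illustration (my sketch, not taken from TAOCP's figures), here's the classic 5-comparator network for 4 inputs written as straight-line compare-exchanges; each `cex` is one comparator (a min/max pair in hardware), and comparators in the same layer evaluate in parallel:

```c
/* Compare-exchange: one comparator in the network. */
static void cex(int *a, int *b)
{
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* Fixed 4-input sorting network: three layers, five comparators, no loops. */
void sort4_network(int v[4])
{
    cex(&v[0], &v[1]);   /* layer 1: these two comparators are independent */
    cex(&v[2], &v[3]);
    cex(&v[0], &v[2]);   /* layer 2 */
    cex(&v[1], &v[3]);
    cex(&v[1], &v[2]);   /* layer 3 */
}
```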
We'll see a lot of FPGA use in the coming years. Consider this: newest FPGAs have 1TB/s (that's a terabyte per second, not terabit) memory bandwidth and 1KB-wide memory bus. Short of developing your own silicon (which takes years and tons of money), you simply can't get this kind of throughput with anything else. That's basically why Intel bought Altera.
HBM and HMC are already in the process of bringing most of those advantages to GPUs and probably some CPUs. Last year's high-end GPUs introduced HBM with a 4kb (0.5kB) wide memory bus with 0.5TB/s bandwidth. The only reason they didn't have more memory bandwidth was because they didn't need it and the die area was better put to use for compute resources. FPGAs may be the only general-purpose chips so far to actually ship that kind of memory bus, but they aren't the only ones that can.
GPUs aren't custom logic. They come with significant and fundamental programming-model limitations. So I don't see much of a functional overlap between FPGAs and GPUs. In particular, GPUs suck mightily at anything that involves branching. They're also not ideal for things like SDN, crypto, search retrieval/ranking, realtime DSP, and so on.
I have the pleasure of working with Josh, and first heard him talk about this in a car ride to get lunch. :) Very cool stuff, and very excited to see the things he builds in the future.
Rearranged my repo; here is the scalable, pipelined way of doing this. It ends up being an insertion sort with the benefits of small fanout (lower energy), better scalability (faster due to the smaller fanout and a shorter critical path with pipelining), and O(1) throughput (no fill/drain cycle).