CRC is a bitwise linear function over XOR (+). Digesting a message ABCD, crc(ABCD), is simply the expression crc(crc(crc(crc(A) + B) + C) + D). To parallelize, this can be rearranged (via distributivity of CRC over XOR) as crc(crc(crc(crc(A) + B))) + crc(crc(C) + D). This is simply the CRC of the second half, XORed with the CRC of the first half passed through the function crc(crc(_)) (crc^2), which is itself linear and can be precomputed just like CRC is.
This extends to arbitrary block sizes and/or arbitrary strides through the message. So e.g. you can have four threads compute the CRCs of separate 64-byte (cache-line-sized) blocks within 256-byte chunks, bridging the 192-byte gap between a thread's successive blocks with crc^192, offsetting each thread k's result as appropriate with crc^(64k), and then combining them with XOR.
(Protip: these fun properties are exactly what make CRCs a TERRIBLE choice for cryptography!)
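To make the linearity concrete, here's a minimal bit-at-a-time sketch (zero initial value and no final XOR, so the map is strictly linear; the usual CRC-32 init/final XOR makes it affine instead):

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32 over the reflected polynomial 0xEDB88320,
   zero init and no final XOR, so crc(a ^ b) == crc(a) ^ crc(b). */
static uint32_t crc32_pure(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return crc;
}

int main(void)
{
    uint8_t a[8] = {1,2,3,4,5,6,7,8}, b[8] = {9,8,7,6,5,4,3,2}, x[8];
    for (int i = 0; i < 8; i++) x[i] = a[i] ^ b[i];
    /* Linearity over XOR: crc(a + b) == crc(a) + crc(b). */
    assert(crc32_pure(x, 8) == (crc32_pure(a, 8) ^ crc32_pure(b, 8)));
    return 0;
}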
You could. You could also have a billion threads. But the article is talking about single-core instruction level parallelism where normal tricks like loop unrolling won't work.
Regardless of what the post is about, its lede is demonstrably incorrect. Speeding up CRC is great, but "we have to because we can't parallelize it" is a false reason.
Anyway the tricks I outlined would probably help even on a single core, since it allows you to pipeline fetches to the CRC table. (If the table is sitting in L2 you've got ~10 cycles latency between successive CRCs. That's enough downtime to run a second CRC in parallel.)
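For illustration, a sketch of that interleaving (hypothetical names; the two halves' partial CRCs still need a merge step like the crc64_combine discussed further down the thread):

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Standard byte-at-a-time table for the reflected CRC-32 polynomial. */
static void crc_table_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1u));
        crc_table[n] = c;
    }
}

/* Run two byte-at-a-time CRC streams in one loop, one over each half of
   the buffer, so two table lookups can be in flight at once. */
static void crc32_two_streams(const uint8_t *buf, size_t len,
                              uint32_t *crc_a, uint32_t *crc_b)
{
    const uint8_t *p = buf, *q = buf + len / 2;
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len / 2; i++) {
        /* The two updates are independent, so the CPU can overlap
           their table-lookup latencies. */
        a = crc_table[(a ^ *p++) & 0xFF] ^ (a >> 8);
        b = crc_table[(b ^ *q++) & 0xFF] ^ (b >> 8);
    }
    *crc_a = a;  /* still needs advancing over the second half's length */
    *crc_b = b;  /* ...before the two are merged with XOR */
}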
1. I don't know why Redis is using CRC-16 and CRC-64, but if you want CRC-32 then using the CRC32 instruction in SSE 4.2 yields pretty impressive performance (see the sketch after this list).
2. CRCs are parallelizable, it just takes a bit more work (and math).
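On point 1, a minimal sketch of the hardware route (note the CRC32 instruction computes CRC-32C, the Castagnoli polynomial, not the zlib/IEEE CRC-32; assumes x86-64 with SSE 4.2, compile with -msse4.2):

#include <nmmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* CRC-32C via the SSE 4.2 crc32 instruction, 8 bytes per step. */
uint32_t crc32c_hw(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;                       /* conventional initial value */
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, buf, 8);           /* unaligned-safe load */
        crc = (uint32_t)_mm_crc32_u64(crc, w);
        buf += 8; len -= 8;
    }
    while (len--)                     /* tail bytes */
        crc = _mm_crc32_u8(crc, *buf++);
    return ~crc;                      /* conventional final XOR */
}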
I believe, based on context, that you and the article are talking about different kinds of parallelism. Yours is the more usual sense; the author seems to be referring to the fact that CRC does not exhibit the instruction-level parallelism that out-of-order execution exploits. This is because the loop is very tight (probably a few instructions) and each iteration depends on the results of the previous one, so there is only a tiny window for reordering instructions.
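Concretely, the serial bottleneck is the classic table-driven (Sarwate) loop:

/* Each iteration's table index depends on the crc produced by the
   previous iteration, so the lookups form one long dependency chain
   and out-of-order execution has almost nothing to overlap. */
uint32_t crc32_sarwate(uint32_t crc, const uint8_t *buf, size_t len,
                       const uint32_t table[256])
{
    while (len--)
        crc = table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return crc;
}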
Why the heck did they decide to focus on CRC? Premature optimization?
First, CRC-16 is in fact being used as a "perfect hash function". It is confusing to focus on the implementation rather than the purpose: it is used to distribute data evenly across buckets. The "perfect hash function" is what they should use and focus on (CRC being one special case among the totally legitimate candidates for a perfect hash function; if tomorrow a super hyper fast perfect hash function comes out, they should be able to swap it in).
The CRC-64 is a valid use of a CRC, but in an invalid domain.
Mathematically, I could prove they are wrong. I know where they have a probability problem: they are sharpening the transition between the functional and dysfunctional states, but making it so irreversible that when it fails they will be FUBAR.
1) They are, of course, leaving themselves open to a preimage attack by a malevolent attacker with access to the storage they are checksumming (the checksum is almost used as a digital signature, so it should be a cryptographic hash function).
2) The probability of random collisions is not balanced by the fact that a cause altering the signal normally also has some odds of altering the CRC (and the signal normally also validates the CRC; that is why, in real life, the CRC should be a fixed fraction of the signal, sized according to the probability of both being altered together). So if you have a tolerance for faulty storage you will end up relying on it, and you might reconstruct data from a randomly corrupted copy that happens to have the same CRC. It will be a mess on the very exceptional but predictable day that something goes wrong, that's all I am saying. And every safeguard will be green :)
1. from the antirez interview currently on the front page: 'One of the main characteristics of stuff I make is that they are "strange", don't resemble how a given problem was solved in the past' — sure, SSE CRC32 exists, but it's safer to use a bigger CRC, so performance wasn't a deciding factor there.
An easy argument to see that CRCs can be parallelized [1] is as follows. Note that CRC(a XOR b) = CRC(a) XOR CRC(b) (per http://en.wikipedia.org/wiki/Cyclic_redundancy_check#Data_in...). This allows you to compute, e.g., CRC(first_half || second_half) = CRC(first_half || 000...0) XOR CRC(000...0 || second_half). A bit of cleverness allows you to speed up the part with lots of zeroes.
[1] I've never implemented this, and I'm not at all sure that the proposed implementation is the fastest possible - and it's definitely not the most general solution!
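A quick check of the identity (using a zero-init, no-final-XOR CRC so the XOR form holds exactly; the standard init/final XOR adds a constant a real implementation has to account for):

#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bit-at-a-time CRC-32, zero init, no final XOR: strictly linear. */
static uint32_t crc32_pure(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return crc;
}

int main(void)
{
    uint8_t msg[16], lo[16] = {0}, hi[16] = {0};
    for (int i = 0; i < 16; i++) msg[i] = (uint8_t)(i + 1);
    memcpy(lo, msg, 8);          /* first_half || 00000000 */
    memcpy(hi + 8, msg + 8, 8);  /* 00000000 || second_half */
    assert(crc32_pure(msg, 16) == (crc32_pure(lo, 16) ^ crc32_pure(hi, 16)));
    return 0;
}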
Yes. Or to go a bit further, taking advantage of the fact that CRCs are not just linear but also cyclic:
CRC(first_half || second_half) = (CRC(first_half) * x^n mod p(x)) XOR CRC(second_half), where p(x) is the generator polynomial and n is the bit length of the second half.
The number of parts you'll want to split the data into will depend on the throughput:latency ratio of your CRC reduction (subject, of course, to asymptotic optimizations not being useful for small inputs).
Both the article and cperciva use the algorithmic structure of CRC; but the article's trick still runs front-to-back. You need a "combine chunks" operation if you want to outsource part of the work to another processor.
It exists as well, but isn't mentioned in the article. The article just cares about single process speed-up.
The other "merge CRCs" solution starts out resembling:
/* Return the CRC-64 of two sequential blocks, where crc1 is the CRC-64 of the
first block, crc2 is the CRC-64 of the second block, and len2 is the length
of the second block. */
uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2)
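...and a hedged sketch of how the body can go. POLY here is the reflected Jones polynomial (which I believe Redis' crc64 tables are built from; swap in whatever yours uses). This version advances crc1 one zero bit at a time, so it is O(len2); production code (e.g. zlib's crc32_combine) gets O(log len2) by squaring the GF(2) "advance" operator matrix instead:

#include <stdint.h>

#define POLY UINT64_C(0xad93d23594c935a9)  /* assumed reflected CRC-64 poly */

uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2)
{
    /* Feed len2 bytes' worth of zero bits through the shift register
       holding crc1: this multiplies crc1 by x^(8*len2) mod p(x). */
    for (uintmax_t i = 0; i < len2 * 8; i++)
        crc1 = (crc1 >> 1) ^ (POLY & -(crc1 & 1));
    /* With the usual conventions where the init and final-XOR constants
       match, those constants cancel, so a plain XOR merges the halves. */
    return crc1 ^ crc2;
}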
I doubt this is as big a win as claimed. Sure, a benchmark managed to achieve 1.6GB/s CRC-64 throughput, but that requires 16KB (!) of lookup tables. If those don't reliably stay in at least L2 cache, the throughput could be very poor in practice.
If you want both speed and security, SipHash-2-4 is a cryptographically secure hash algorithm that runs at ~1.4 cycles/byte for long inputs (faster than even crc16speed), and yet is still fast for short inputs (2.4 cpb for 64 bytes). It also doesn't require 16KB of lookup tables. https://131002.net/siphash/ http://bench.cr.yp.to/results-auth.html#amd64-titan0
I've collected quite a few CRC32 methods and compared them all; it would be interesting to do the same for CRC64 to get the whole range of options too.
"crc-mark-adler-hw" uses SSE4.2 just like your "crc-sse42". The latter one is just very poorly written, since it only uses _mm_crc32_u8() once per iteration (instead of 3 times).
Nice! I've ported it to Java, and got roughly a 2x speed increase over the version posted at http://stackoverflow.com/a/20586626/58962 (same question Mark Adler posted in).
On Java 8, I get 647 MB/s on a 2.3 GHz MBP (the non-optimized version was 320 MB/s). Not bad, but Matt's C version manages 1628.09 MB/s.
Update: with some optimizations and eliminating I/O in the benchmark, we now clock in at a respectable 1150 MB/s (up from 390 MB/s unoptimized).
I really hope this gets into Java 9: http://objectlayout.org/
It gives the same benefits as C structs while still behaving like Java objects, i.e. it has struct memory semantics while keeping Java function/method semantics associated with that memory.
That's a good one; Gil Tene is really a good speaker. For anyone who (like me) thinks "structs are just classes without methods, why are they faster?": the answer is that this is about arrays, which are currently slow in Java because they are:
- mutable: elements can be changed, nulled etc.
- polymorphic: they can contain T or subtype-of-T
This makes fast streaming of objects impossible, because you need to dereference each object. If this changes, object addresses can be calculated as in C, which makes things fast: flat memory is fast, and prefetchers work their magic.
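In C terms, the difference is roughly the following (a toy sketch, not the ObjectLayout API):

#include <stddef.h>

struct point { double x, y; };

/* What Java object arrays look like today: an array of pointers, so
   summing a field costs a dereference (potential cache miss) per element. */
double sum_via_pointers(struct point *ptrs[], size_t n)
{
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += ptrs[i]->x;           /* chase a pointer per element */
    return s;
}

/* What a flat layout enables: elements are contiguous, addresses are
   computable, and the hardware prefetcher streams them. */
double sum_flat(const struct point pts[], size_t n)
{
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += pts[i].x;             /* sequential, prefetch-friendly */
    return s;
}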
If your lookup table is too big it'll blow out your L1 cache too.
apparently unrolling
It's not just _unrolling_: it's merging position-dependent pre-computed values and XORing them together. A little more complicated than a quick loop unrolling (hence nobody discovered it until 2006).
> more complicated than just a quick loop unrolling
Yeah, or else it would be 8 repetitions of the same line.
But its speed comes from eliminating the *byte++ fetch in each of the unrolled lines. Instead it just fetches one 64-bit word into a register and works on that for all the unrolled lines.
I guess this surprised me - how much of a difference that makes.
I mistook it for simple loop unrolling too, and was surprised at the difference. Take a look at the "cycles per byte": the difference is big, and I guess the linear access patterns involved mask the memory latencies very well.
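For the curious, the trick goes by the name "slicing-by-N" (Intel's version from 2006 was slicing-by-8). A narrower slicing-by-4 sketch with the same structure (assumes a little-endian host and a reflected CRC-32; table and names are mine, not the post's):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t t[4][256];

/* t[0] is the ordinary byte-at-a-time table; each t[k][n] advances
   t[k-1][n] by one extra zero byte. */
static void slicing_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1u));
        t[0][n] = c;
    }
    for (int k = 1; k < 4; k++)
        for (uint32_t n = 0; n < 256; n++)
            t[k][n] = (t[k-1][n] >> 8) ^ t[0][t[k-1][n] & 0xFF];
}

static uint32_t crc32_slice4(uint32_t crc, const uint8_t *buf, size_t len)
{
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, buf, 4);   /* one 32-bit fetch instead of 4 byte loads */
        w ^= crc;
        /* Four independent, position-dependent lookups, XORed together. */
        crc = t[3][w & 0xFF] ^ t[2][(w >> 8) & 0xFF]
            ^ t[1][(w >> 16) & 0xFF] ^ t[0][w >> 24];
        buf += 4; len -= 4;
    }
    while (len--)             /* tail: plain byte-at-a-time */
        crc = t[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return crc;
}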
Not entirely the same. The problem isn't the polynomial, but the entire model.
crcutil also has many fatal flaws for inclusion in other projects: the code is horrifically ugly, it's hosted on google code so it looks unmaintained and abandoned, it's C++, it uses unreadable ASM to be "fast" but that's not portable across all the architectures Redis runs on, and it has a horrible build system (because of ASM and C++ and other issues).
True, it's a pain to embed. We embedded it in my current project, and it only took a few hours to add a CMakeLists.txt and make it reasonably embeddable.
I guess if your project doesn't want to include C++ (and you don't want to write a simple C wrapper), you've got a fair point. But it sounds like their implementation is significantly faster than the one in the linked post.
No.