CRC is a bitwise linear function over XOR (+). Digesting a message ABCD, crc(ABCD), is simply the expression crc(crc(crc(crc(A) + B) + C) + D). To parallelize, this can be rearranged (via distributivity of CRC over XOR) as crc(crc(crc(crc(A) + B))) + crc(crc(C) + D). This is simply the CRC of the second half, XORed with the CRC of the first half passed through the function crc(crc(_)) (crc^2), which is itself linear and can be precomputed just like CRC is.
This extends to arbitrary block sizes and/or arbitrary strides through the message. So e.g. you can have four threads compute the CRCs of separate 64-byte (cache-line-sized) blocks within 256-byte chunks, bridging the 192-byte gap between a thread's successive blocks with crc^192, offsetting each thread k's result as appropriate with crc^(64k), and then combining them with XOR.
(Protip: these fun properties are exactly what make CRCs a TERRIBLE choice for cryptography!)
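To make the linearity concrete, here's a minimal bit-at-a-time sketch (zero initial value and no final XOR, so the map is strictly linear; the usual CRC-32 init/final XOR makes it affine instead):

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-32 over the reflected polynomial 0xEDB88320,
   zero init and no final XOR, so crc(a ^ b) == crc(a) ^ crc(b). */
static uint32_t crc32_pure(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return crc;
}

int main(void)
{
    uint8_t a[8] = {1,2,3,4,5,6,7,8}, b[8] = {9,8,7,6,5,4,3,2}, x[8];
    for (int i = 0; i < 8; i++) x[i] = a[i] ^ b[i];
    /* Linearity over XOR: crc(a + b) == crc(a) + crc(b). */
    assert(crc32_pure(x, 8) == (crc32_pure(a, 8) ^ crc32_pure(b, 8)));
    return 0;
}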
You could. You could also have a billion threads. But the article is talking about single-core instruction level parallelism where normal tricks like loop unrolling won't work.
Regardless of what the post is about, its lede is demonstrably incorrect. Speeding up CRC is great, but "we have to because we can't parallelize it" is a false reason.
Anyway the tricks I outlined would probably help even on a single core, since it allows you to pipeline fetches to the CRC table. (If the table is sitting in L2 you've got ~10 cycles latency between successive CRCs. That's enough downtime to run a second CRC in parallel.)
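For illustration, a sketch of that interleaving (hypothetical names; the two halves' partial CRCs still need a merge step like the crc64_combine discussed further down the thread):

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Standard byte-at-a-time table for the reflected CRC-32 polynomial. */
static void crc_table_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1u));
        crc_table[n] = c;
    }
}

/* Run two byte-at-a-time CRC streams in one loop, one over each half of
   the buffer, so two table lookups can be in flight at once. */
static void crc32_two_streams(const uint8_t *buf, size_t len,
                              uint32_t *crc_a, uint32_t *crc_b)
{
    const uint8_t *p = buf, *q = buf + len / 2;
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len / 2; i++) {
        /* The two updates are independent, so the CPU can overlap
           their table-lookup latencies. */
        a = crc_table[(a ^ *p++) & 0xFF] ^ (a >> 8);
        b = crc_table[(b ^ *q++) & 0xFF] ^ (b >> 8);
    }
    *crc_a = a;  /* still needs advancing over the second half's length */
    *crc_b = b;  /* ...before the two are merged with XOR */
}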
1. I don't know why Redis is using CRC-16 and CRC-64, but if you want CRC-32 then using the CRC32 instruction in SSE 4.2 yields pretty impressive performance (see the sketch after this list).
2. CRCs are parallelizable, it just takes a bit more work (and math).
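On point 1, a minimal sketch of the hardware route (note the CRC32 instruction computes CRC-32C, the Castagnoli polynomial, not the zlib/IEEE CRC-32; assumes x86-64 with SSE 4.2, compile with -msse4.2):

#include <nmmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* CRC-32C via the SSE 4.2 crc32 instruction, 8 bytes per step. */
uint32_t crc32c_hw(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;                       /* conventional initial value */
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, buf, 8);           /* unaligned-safe load */
        crc = (uint32_t)_mm_crc32_u64(crc, w);
        buf += 8; len -= 8;
    }
    while (len--)                     /* tail bytes */
        crc = _mm_crc32_u8(crc, *buf++);
    return ~crc;                      /* conventional final XOR */
}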
I believe, based on context, that you and the article are talking about different kinds of parallelism. Yours is the more usual sense; the author seems to be referring to the fact that CRC does not exhibit the instruction-level parallelism that out-of-order execution exploits. This is because the loop is very tight (probably a few instructions) and each iteration depends on the results of the previous one, so there is only a tiny window for reordering instructions.
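Concretely, the serial bottleneck is the classic table-driven (Sarwate) loop:

/* Each iteration's table index depends on the crc produced by the
   previous iteration, so the lookups form one long dependency chain
   and out-of-order execution has almost nothing to overlap. */
uint32_t crc32_sarwate(uint32_t crc, const uint8_t *buf, size_t len,
                       const uint32_t table[256])
{
    while (len--)
        crc = table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return crc;
}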
Why the heck did they decide to focus on CRC? Premature optimization?
First, CRC-16 is in fact being used as a "perfect hash function". It is confusing to focus on the implementation rather than the purpose: it is used to distribute data evenly across buckets. The "perfect hash function" is what they should use and focus on (CRC being one special case among the totally legitimate candidates for a perfect hash function; if tomorrow a super hyper fast perfect hash function comes out, they should be able to swap it in).
The CRC-64 is a valid use of a CRC, but in an invalid domain.
Mathematically, I could prove they are wrong. I know where they have a probability problem: they are sharpening the transition between the functional and dysfunctional states, but making it so irreversible that when it fails they will be FUBAR.
1) They are, of course, leaving themselves open to a preimage attack by a malevolent attacker with access to the storage they are checksumming (the checksum is almost used as a digital signature, so it should be a cryptographic hash function).
2) The probability of random collisions is not balanced by the fact that a cause altering the signal normally also has some odds of altering the CRC (and the signal normally also validates the CRC; that is why, in real life, the CRC should be a fixed fraction of the signal, sized according to the probability of both being altered together). So if you have a tolerance for faulty storage you will end up relying on it, and you might reconstruct data from a randomly corrupted copy that happens to have the same CRC. It will be a mess on the very exceptional but predictable day that something goes wrong, that's all I am saying. And every safeguard will be green :)
1. from the antirez interview currently on the front page: 'One of the main characteristics of stuff I make is that they are "strange", don't resemble how a given problem was solved in the past' — sure, SSE CRC32 exists, but it's safer to use a bigger CRC, so performance wasn't a deciding factor there.
An easy argument to see that CRCs can be parallelized [1] is as follows. Note that CRC(a XOR b) = CRC(a) XOR CRC(b) (per http://en.wikipedia.org/wiki/Cyclic_redundancy_check#Data_in...). This allows you to compute, e.g., CRC(first_half || second_half) = CRC(first_half || 000...0) XOR CRC(000...0 || second_half). A bit of cleverness allows you to speed up the part with lots of zeroes.
[1] I've never implemented this, and I'm not at all sure that the proposed implementation is the fastest possible - and it's definitely not the most general solution!
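A quick check of the identity (using a zero-init, no-final-XOR CRC so the XOR form holds exactly; the standard init/final XOR adds a constant a real implementation has to account for):

#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bit-at-a-time CRC-32, zero init, no final XOR: strictly linear. */
static uint32_t crc32_pure(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return crc;
}

int main(void)
{
    uint8_t msg[16], lo[16] = {0}, hi[16] = {0};
    for (int i = 0; i < 16; i++) msg[i] = (uint8_t)(i + 1);
    memcpy(lo, msg, 8);          /* first_half || 00000000 */
    memcpy(hi + 8, msg + 8, 8);  /* 00000000 || second_half */
    assert(crc32_pure(msg, 16) == (crc32_pure(lo, 16) ^ crc32_pure(hi, 16)));
    return 0;
}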
Yes. Or to go a bit further, taking advantage of the fact that CRCs are not just linear but also cyclic:
CRC(first_half || second_half) = (CRC(first_half) * x^n mod p(x)) XOR CRC(second_half), where p(x) is the generator polynomial and n is the bit length of the second half.
The number of parts you'll want to split the data into will depend on the throughput:latency ratio of your CRC reduction (subject, of course, to asymptotic optimizations not being useful for small inputs).
Both the article and cperciva use the algorithmic structure of CRC; but the article's trick still runs front-to-back. You need a "combine chunks" operation if you want to outsource part of the work to another processor.
It exists as well, but isn't mentioned in the article. The article just cares about single process speed-up.
The other "merge CRCs" solution starts out resembling:
/* Return the CRC-64 of two sequential blocks, where crc1 is the CRC-64 of the
first block, crc2 is the CRC-64 of the second block, and len2 is the length
of the second block. */
uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2)
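...and a hedged sketch of how the body can go. POLY here is the reflected Jones polynomial (which I believe Redis' crc64 tables are built from; swap in whatever yours uses). This version advances crc1 one zero bit at a time, so it is O(len2); production code (e.g. zlib's crc32_combine) gets O(log len2) by squaring the GF(2) "advance" operator matrix instead:

#include <stdint.h>

#define POLY UINT64_C(0xad93d23594c935a9)  /* assumed reflected CRC-64 poly */

uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2)
{
    /* Feed len2 bytes' worth of zero bits through the shift register
       holding crc1: this multiplies crc1 by x^(8*len2) mod p(x). */
    for (uintmax_t i = 0; i < len2 * 8; i++)
        crc1 = (crc1 >> 1) ^ (POLY & -(crc1 & 1));
    /* With the usual conventions where the init and final-XOR constants
       match, those constants cancel, so a plain XOR merges the halves. */
    return crc1 ^ crc2;
}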
I doubt this is as big a win as claimed. Sure, a benchmark managed to achieve 1.6GB/s CRC-64 throughput, but that requires 16KB (!) of lookup tables. If those don't reliably stay in at least L2 cache, the throughput could be very poor in practice.
If you want both speed and security, SipHash-2-4 is a cryptographically secure hash algorithm that runs at ~1.4 cycles/byte for long inputs (faster than even crc16speed), and yet is still fast for short inputs (2.4 cpb for 64 bytes). It also doesn't require 16KB of lookup tables. https://131002.net/siphash/ http://bench.cr.yp.to/results-auth.html#amd64-titan0
I've collected quite a few CRC32 methods and compared them all; it would be interesting to do the same for CRC64 to get the whole range of options too.
"crc-mark-adler-hw" uses SSE4.2 just like your "crc-sse42". The latter one is just very poorly written, since it only uses _mm_crc32_u8() once per iteration (instead of 3 times).
Nice! I've ported it to Java, and got roughly a 2x speed increase over the version posted at http://stackoverflow.com/a/20586626/58962 (same question Mark Adler posted in).
On Java 8, I get 647 MB/s on a 2.3 GHz MBP (the non-optimized version was 320 MB/s). Not bad, but Matt's C version manages 1628.09 MB/s.
Update: with some optimizations and eliminating I/O in the benchmark, we now clock in at a respectable 1150 MB/s (up from 390 MB/s unoptimized).
I really hope this gets into Java 9: http://objectlayout.org/
It gives the same benefits as C structs while still behaving like Java objects, i.e. it has struct memory semantics while keeping Java function/method semantics associated with that memory.
That's a good one; Gil Tene is really a good speaker. For anyone who (like me) thinks "structs are just classes without methods, why are they faster?": the answer is that this is about arrays, which are currently slow in Java because they are:
- mutable: elements can be changed, nulled etc.
- polymorphic: they can contain T or subtype-of-T
This makes fast streaming of objects impossible, because you need to dereference each object. If this changes, object addresses can be calculated as in C, which makes things fast: flat memory is fast, and prefetchers work their magic.
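In C terms, the difference is roughly the following (a toy sketch, not the ObjectLayout API):

#include <stddef.h>

struct point { double x, y; };

/* What Java object arrays look like today: an array of pointers, so
   summing a field costs a dereference (potential cache miss) per element. */
double sum_via_pointers(struct point *ptrs[], size_t n)
{
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += ptrs[i]->x;           /* chase a pointer per element */
    return s;
}

/* What a flat layout enables: elements are contiguous, addresses are
   computable, and the hardware prefetcher streams them. */
double sum_flat(const struct point pts[], size_t n)
{
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += pts[i].x;             /* sequential, prefetch-friendly */
    return s;
}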
If your lookup table is too big it'll blow out your L1 cache too.
apparently unrolling
It's not just _unrolling_: it's merging position-dependent pre-computed values and XORing them together. A little more complicated than a quick loop unrolling (hence nobody discovered it until 2006).
> more complicated than just a quick loop unrolling
Yeah, or else it would be 8 repetitions of the same line.
But its speed comes from eliminating the *byte++ fetch in each of the unrolled lines. Instead it just fetches one 64-bit word into a register and works on that for all the unrolled lines.
I guess this surprised me - how much of a difference that makes.
I mistook it for simple loop unrolling too, and was surprised at the difference. Take a look at the "cycles per byte": the difference is big, and I guess the linear access patterns involved mask the memory latencies very well.
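For the curious, the trick goes by the name "slicing-by-N" (Intel's version from 2006 was slicing-by-8). A narrower slicing-by-4 sketch with the same structure (assumes a little-endian host and a reflected CRC-32; table and names are mine, not the post's):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t t[4][256];

/* t[0] is the ordinary byte-at-a-time table; each t[k][n] advances
   t[k-1][n] by one extra zero byte. */
static void slicing_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1u));
        t[0][n] = c;
    }
    for (int k = 1; k < 4; k++)
        for (uint32_t n = 0; n < 256; n++)
            t[k][n] = (t[k-1][n] >> 8) ^ t[0][t[k-1][n] & 0xFF];
}

static uint32_t crc32_slice4(uint32_t crc, const uint8_t *buf, size_t len)
{
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, buf, 4);   /* one 32-bit fetch instead of 4 byte loads */
        w ^= crc;
        /* Four independent, position-dependent lookups, XORed together. */
        crc = t[3][w & 0xFF] ^ t[2][(w >> 8) & 0xFF]
            ^ t[1][(w >> 16) & 0xFF] ^ t[0][w >> 24];
        buf += 4; len -= 4;
    }
    while (len--)             /* tail: plain byte-at-a-time */
        crc = t[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return crc;
}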
Not entirely the same. The problem isn't the polynomial, but the entire model.
crcutil also has many fatal flaws for inclusion in other projects: the code is horrifically ugly, it's hosted on google code so it looks unmaintained and abandoned, it's C++, it uses unreadable ASM to be "fast" but that's not portable across all the architectures Redis runs on, and it has a horrible build system (because of ASM and C++ and other issues).
True, it's a pain to embed. We embedded it in my current project, and it only took a few hours to add a CMakeLists.txt and make it reasonably embeddable.
I guess if your project doesn't want to include C++ (and you don't want to write a simple C wrapper), you've got a fair point. But it sounds like their implementation is significantly faster than the one in the linked post.
No.