An easy way to see that CRCs can be parallelized [1] is as follows. Note that CRC(a XOR b) = CRC(a) XOR CRC(b) for equal-length inputs to a raw (unconditioned) CRC, per http://en.wikipedia.org/wiki/Cyclic_redundancy_check#Data_in... This lets you write, e.g., CRC(first_half || second_half) = CRC(first_half || 000...0) XOR CRC(000...0 || second_half), and a bit of cleverness speeds up the parts that are mostly zeroes.
[1] I've never implemented this, and I'm not at all sure that the proposed implementation is the fastest possible - and it's definitely not the most general solution!
Yes. Or to go a bit further, taking advantage of the fact that CRCs are not just linear but also cyclic:
CRC(first_half || second_half) = CRC(first_half) * x^n mod p(x) XOR CRC(second_half), where n is the bit length of second_half and p(x) is the generator polynomial.
The number of parts you'll want to split the data into will depend on the throughput:latency ratio of your CRC reduction (subject, of course, to asymptotic optimizations not being useful for small inputs).
Both the article and cperciva exploit the algebraic structure of CRCs, but the article's trick still runs front-to-back: you need a "combine chunks" operation if you want to outsource part of the work to another processor.
Such a combine operation exists as well, but isn't mentioned in the article, which only cares about single-process speed-up.
The other "merge CRCs" solution starts out resembling:
/* Return the CRC-64 of two sequential blocks, where crc1 is the CRC-64 of the
first block, crc2 is the CRC-64 of the second block, and len2 is the length
of the second block. */
uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uintmax_t len2)