Fast Reed-Solomon Coding in Go (klauspost.com)
75 points by pandemicsyn on June 23, 2015 | 24 comments



Whilst I quite enjoy Go, one annoyance I have is that it seems there is no middle ground between alright performance using pure Go and great performance using Go with assembler. Many other languages allow you, with some contortion, to get good performance without leaving the language. In this code it was at least well-sourced (and hopefully battle-tested) SSSE3 assembler for the Galois field arithmetic.

I hit this repeatedly when writing a pet project, govarint[1], which performs integer encoding and decoding. I don't expect blazingly fast performance if I don't use assembler, but bitwise operations are painfully slow compared to the equivalent in C. C would also win handsomely even if the compilers were equivalent, due to the absence of boxing. There's also the additional overhead of two divergent code paths - the Go code (the fallback when there's no assembler for your architecture) and the assembler code - that both need to be kept up to date and checked for bugs / errors / inconsistencies.
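
For reference, the hot path is essentially this sort of bit twiddling (a simplified sketch along the lines of encoding/binary.PutUvarint, not the actual govarint code):

    // Base-128 varint encoding: emit 7 bits per byte, with the
    // high bit of each byte acting as a continuation flag.
    func putUvarint(buf []byte, x uint64) int {
        i := 0
        for x >= 0x80 {
            buf[i] = byte(x) | 0x80 // low 7 bits plus continuation bit
            x >>= 7
            i++
        }
        buf[i] = byte(x)
        return i + 1
    }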

Am I missing something? I only ever see assembler checked in to repositories - is there a typical sane build process I've just not seen? Is it just a matter of waiting for the Go compiler to "level up" optimizations to match GCC/Clang? Or should I just learn to live with the fact that to write Go I'll also need to reacquaint myself with assembler?

[1]: https://github.com/Smerity/govarint


> Is it just a matter of waiting for the Go compiler to "level up" optimizations to match GCC/Clang?

The compiler currently uses a syntax-tree-based IR that is not very easy to optimize. A new SSA backend for the compiler is planned for 1.6:

https://docs.google.com/document/d/1szwabPJJc4J-igUZU4ZKprOr...

(This was blocked until the compiler was ported from C to Go.)


There's no free lunch. One of Golang's primary goals is to compile fast (which is admirable!), and it achieves that in part by leaving out lots of optimizations. Bringing 6g/8g up to the same level of optimizations as LLVM/GCC will, in all likelihood, slow down its compilation time. I can't speak for where along the line of speed versus code quality Go's developers want to place the compiler in the long run, nor am I saying you can't do better than both GCC and LLVM in compilation speed, but I do think there's a fundamental tradeoff there that no project can eliminate (short of writing in assembler, which, for all its downsides, is an approach that does eliminate the tradeoff).


> Bringing 6g/8g up to the same level of optimizations as LLVM/GCC will, in all likelihood, slow down its compilation time.

I fully agree that it is a trade-off, but I wanted to point out that introducing better optimizations may also speed up the compiler itself to some extent (once it is written in Go as of 1.5).

So, there may be a limited free lunch there ;).


...I don't see it.

Compile-time optimization is something you can do for your production build only. Unless Go's design implicitly inhibits certain types of optimizations - but that would seem odd given how opinionated it is. What you do in Go, and what you mean, are much clearer than in most other languages.


The go tool has no distinction between development builds and production builds; every build is a production build. This itself is a manifestation of Go's opinionated philosophy.

https://golang.org/cmd/go/#hdr-Compile_packages_and_dependen...


>one annoyance I have is that it seems there is no middle ground between alright performance using pure Go and great performance using Go with assembler.

I don't think it is a Go-specific thing. It just happens that this is an algorithm that lends itself extremely well to assembler, because the 'pshufb' instruction allows you to make 16 parallel lookups into a 16-byte table in 3 cycles.

The speedup between Go and assembler is the same as the speedup between C and assembler.
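
To give an idea of what pshufb buys you: a multiply-by-constant in GF(2^8) can be done as two 4-bit table lookups XORed together, and pshufb performs 16 of those lookups in one instruction. The scalar shape of it looks roughly like this (a sketch; table construction omitted):

    // Multiply every byte of 'in' by a fixed GF(2^8) constant.
    // 'low' holds the 16 products for low nibbles, 'high' the 16
    // for high nibbles; XOR combines them. PSHUFB vectorizes this
    // table lookup 16 bytes at a time.
    func galMulSlice(low, high, in, out []byte) {
        for i, b := range in {
            out[i] = low[b&0x0f] ^ high[b>>4]
        }
    }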

>but bitwise operations are painfully slow compared to the equivalent in C.

I think the biggest difference is in variable shifts, since Intel doesn't have instructions for shifts bigger than (register size - 1). This means that if you shift by a non-constant amount, Go has to check whether you are shifting by more than the register size, and execute special code if so.

If a = uint64(1) and b = uint(64), then a << b always gives 0 in Go. I believe that in C (at least older C) the result is undefined, and therefore the compiler can choose the fastest option.
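
In other words, for a non-constant shift the compiler effectively has to emit something like this (a sketch of the semantics, not literal compiler output):

    // What `a << b` must mean in Go when b is not a constant:
    var r uint64
    if b >= 64 {
        r = 0 // Go defines over-wide shifts to produce zero
    } else {
        r = a << b // the single-SHL-instruction case
    }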

Also, being able to skip bounds checks in assembler is of course a (usually) minor speedup.

In general I don't see this as a huge issue, and I only use assembler if I know it will bring at least a 2x speedup.

>[...]need to be kept up to date and checked for bugs / errors / inconsistencies

Use build tags. In the reedsolomon package, I can run `go test -tags=noasm` to check that my pure Go version passes the tests.
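
The pattern is one assembler-backed file and one pure Go fallback, guarded by build tags - something like this (illustrative file and function names, not the exact files in the package):

    // galois_amd64.go - assembler-backed version
    // +build amd64,!noasm

    package reedsolomon

    // implemented in galois_amd64.s
    func galMulSlice(low, high, in, out []byte)

    // galois_noasm.go - pure Go fallback
    // +build !amd64 noasm

    package reedsolomon

    func galMulSlice(low, high, in, out []byte) {
        for i, b := range in { // the scalar loop from above
            out[i] = low[b&0x0f] ^ high[b>>4]
        }
    }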

>Is it just a matter of waiting for the Go compiler to "level up" optimizations to match GCC/Clang?

GCC has had many, many years of development, and llvm/clang still isn't better (in my experience). I have battled with buggy code optimizers, and I don't want Go to go there. I want a solid compiler first and a fast one second.


"Whilst I quite enjoy Go, one annoyance I have is that it seems there is no middle ground between alright performance using pure Go and great performance using Go with assembler."

I've banged the drum here on HN several times about how I consider Go more of a scripting language than a "systems language", and this is another example of why. For a scripting language it's actually screamingly fast, even when you use interfaces, but if you want to bit-bash inside the language itself using sophisticated instructions (i.e., not just the bitwise and, or, etc.), it hasn't really got a great story for that, nor is it likely to develop one, because it's not in the core thrust of the language. It just isn't a "low level" language; it's a high-level language that happens to include true arrays and structs. It's not "high-level assembly" like C, and lately even C isn't really "high-level assembly" unless you mean "high-level assembly for a 1970s machine".

That said, I'm not particularly familiar with any high-level language right now that really and truly opens the power of modern assembler to you, within the language itself. I don't consider "sometimes the optimizer secretly and silently manages to use SSE" to be "opening the power of assembler", nor is "allowing you to inline assembler". Partially that's because, contrary to popular belief, hardware actually moves much more quickly than programming languages do, and hardware's been moving fast on this lately. While I don't know that you could wring a Next Big Language out of this, there's probably room for a language that would make this stuff more directly available to people. (I'd focus on making it easy to integrate with other languages rather than trying to make it stand on its own. Partially that's so you could keep pace with the rapidly-moving hardware... you won't be able to keep up if you're trying to drag an entire general-purpose programming language around with you.)


Did you actually profile pure Go vs. Go assembly implementation of your code?

You make a pretty strong statement - "bitwise operations are painfully slow compared to the equivalent in C" - and I find it hard to square that with my understanding of how things work.

Compiling simple arithmetic is about the simplest thing a compiler does. There's no reason the Go compiler should produce slower code for it than a C compiler.

Also, on modern processors arithmetic operations are cheap compared to memory access.

Looking at govarint code: it could be much faster without reaching for assembly.

First, it uses interfaces. Interfaces are great, but a call through an interface is more expensive than a direct call (and most likely prohibits inlining). In C++ terms it's like calling a virtual function. C doesn't even have anything like that; you only get straight calls.

Given that you incur that overhead on every single Put()/Get() call, even though the cost is small in absolute terms, it's large compared to the cost of the code inside those functions.

If you care about absolute performance, don't use interfaces.

Second, you make another function call through an interface (r.ReadByte()) for every byte of data. Again, a function call is not expensive in absolute terms, but in this case it'll be more expensive than the rest of the computation. Grab 4k into a temp buffer and process that. It'll make the code more complicated, but it's not dropping to assembly.
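
Something along these lines (a hypothetical sketch; assumes r also implements io.Reader, and decodeByte stands in for your per-byte varint state machine):

    // Amortize the interface-call overhead by reading 4 KB at a
    // time instead of calling r.ReadByte() once per byte.
    buf := make([]byte, 4096)
    for {
        n, err := r.Read(buf)
        for _, b := range buf[:n] {
            decodeByte(b) // hypothetical per-byte decode step
        }
        if err != nil {
            break // io.EOF or a real error
        }
    }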

Finally, I can be completely wrong. Profile your code (http://blog.golang.org/profiling-go-programs).

Now, even after those fixes I don't expect Go code to be as fast as C code. C can use slightly faster calling conventions, memory access isn't bounds-checked, at O3 it'll inline more aggressively, and you can force inlining manually or tell the compiler to do profile-guided optimization - all of those contribute to generating faster code.

However, Go code shouldn't be "painfully" slower. If it's more than 2x slower than C, then there's probably something else going on other than weaker code generation.


> C doesn't even have anything like that, you only get straight calls.

C does have function pointers. But because each call involves setting up and tearing down a call frame, they're slow for calling small functions a large number of times. Functionally they're equivalent to C++ virtual calls (through a vtable).

At least golang 1.2/1.3 did occasionally generate an excessive number of bounds checks. Once I counted no fewer than 6 nearly equivalent bounds checks in an inner loop. It took a while of twiddling with the code to get rid of all but 2.
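
The usual twiddling is one explicit check up front so the compiler can, in principle, drop the per-access checks - a sketch; how much 1.2/1.3 actually eliminated varied:

    // One up-front check covers all four accesses; without it,
    // each index can get its own bounds check.
    func load4(b []byte, i int) uint32 {
        _ = b[i+3] // panics here if the slice is too short
        return uint32(b[i]) | uint32(b[i+1])<<8 |
            uint32(b[i+2])<<16 | uint32(b[i+3])<<24
    }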

However, golang performance is mostly pretty good. I like the language, so I'd just rather write a C/SSE/AVX library if I really need something to be very fast.


> I like the language, so I'd just rather write a C/SSE/AVX library if I really need something to be very fast.

I feel exactly the same, and once you get to know the quirks of the Go assembler it is actually pretty easy to use. That said, unless I know it can be done in SSE, I usually don't bother, because the speedup likely isn't worth the effort.


Very nice work, and thanks for releasing under MIT.

Interesting that Backblaze decided to go with a 17+3 configuration. As I read more on the topic, I realized I haven't quite wrapped my head around how data and parity are allocated across symbols, stripes, and disks, and how all that impacts degraded read performance, but at first glance I would assume 17+3 has pretty terrible degraded read performance, since reconstructing a missing shard means reading from 17 surviving ones.

Another great read if you're interested in home brew RAID or using parity is Plank's paper "Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems". [1]

I found that gem while reading Adam's blog on implementing triple-parity RAID in ZFS. [2]

[1] - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103...

[2] - https://blogs.oracle.com/ahl/entry/triple_parity_raid_z


> Very nice work, and thanks for releasing under MIT.

Cheers. The hard work goes to BB and their initial library, though :)

> Interesting that Backblaze decided to go with a 17+3 configuration.

Yeah. It seems they don't use any RAID within a storage pod, but instead use raw ext4 and distribute shards across pods rather than across the hard drives in a single pod. [1]

This of course means that when a drive is replaced, the rebuild load doesn't hit a single pod, but is distributed across the other pods in the vault.

[1] https://www.backblaze.com/blog/vault-cloud-storage-architect...


(BrianB from Backblaze here)

Yes, that's right. When a drive is replaced, its contents are rebuilt from the data on the 19 corresponding drives in the other 19 pods in the vault.


The article contains an excellent introduction to Galois field arithmetic, available here: http://www.snia.org/sites/default/files2/SDC2013/presentatio....

There is also a Linux kernel paper on this by hpa: https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf


Great links, thanks!

At the end of the presentation they mention rotated Reed-Solomon array codes, which led me to another paper Plank worked on, "Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads", which was very interesting as well. It directly addresses the problem of read performance in a degraded state.

http://www.cs.jhu.edu/~okhan/fast12.pdf


I wonder why Rabin's IDA ("Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance") is not mentioned anywhere.

http://web.archive.org/web/20070221140752/http://discovery.c...


I was under the impression that R-S has been replaced with turbo codes in most use cases. Why not use those? Patent issues?


I believe the difference is that in this case RS is being used as an erasure code (meaning you either have the data or you don't; a drive has either failed or it hasn't). Failures are approximately uncorrelated (assuming different parts are stored on different hosts), so for this use case RS approaches optimal in terms of storage used (some metadata needs to be stored, which is overhead). With 17+3, for example, any 17 of the 20 shards recover the data, so you tolerate 3 failures for roughly 18% extra storage, where plain replication would need 3 extra full copies.


Not for this use case, no.

Turbo codes are more CPU-heavy as well (and require FP math or some workarounds).

For just detecting and correcting bit errors in blocks of data, RS is fine. If you're getting data back from the other side of the solar system, turbo codes are worth it.


For short block lengths, R-S codes and turbo codes have similar performance (but turbo codes are harder to design). I think the patents on turbo codes have expired.


Quick question about the API: why separate the Split phase and the Encode phase?

In other words, are there any scenarios where one would want to Split data but not Encode it, or where data arrives already split and all that is left to do is to Encode?
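
For reference, my reading of the flow from the post (treat the details as approximate):

    // 17 data shards + 3 parity shards, as in the Backblaze setup.
    enc, err := reedsolomon.New(17, 3)
    if err != nil {
        log.Fatal(err)
    }
    // Split pads the input and slices it into equal-sized shards...
    shards, err := enc.Split(data)
    if err != nil {
        log.Fatal(err)
    }
    // ...while Encode computes the parity shards from the data shards.
    if err := enc.Encode(shards); err != nil {
        log.Fatal(err)
    }

My guess would be that Split is separate so callers whose data already arrives as fixed-size shards (say, streamed from disk or the network) can skip it and call Encode directly.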


On a somewhat related note, what is the meaning of 6a and 6c? I suppose 6g is the Go compiler and 6l is the linker. Is 6a an assembler and 6c a C compiler?


Yev from Backblaze here -> Nice work!



