The essence of Reed-Solomon coding

userbinator · on Nov 6, 2022

I've found that, just as with CRCs, there's an abundance of articles that show the theoretical explanation of RS, but aren't much help for those wanting to actually implement it. Here's a good practical explanation of implementing RS, including the GF operations: https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_f...

corsix · on Nov 6, 2022

Alternatively, the following pair of articles, the first of which is already referenced as a footnote in the OP: http://www.corsix.org/content/galois-field-instructions-2021... http://www.corsix.org/content/reed-solomon-for-software-raid

nayuki · on Nov 6, 2022

Using Reed-Solomon coding to recover from erasures (at known positions) is relatively straightforward. It can be understood with a basic knowledge of linear algebra and finite field arithmetic.

Using RS for error correction (at initially unknown positions) is quite difficult. I wrote a step-by-step guide on it including demo code, and it doesn't even cover the most efficient decoding algorithm (I used PGZ instead of Berlekamp-Massey): https://www.nayuki.io/page/reed-solomon-error-correcting-cod...

Hanschri · on Nov 6, 2022

To understand how error correction works and to learn more about Hamming codes & Reed-Solomon, 3Blue1Brown and Ben Eater were invaluable. 3Blue1Brown and Ben Eater are by far some of the best educational content creators within their fields, mathematics and computer engineering respectively.

I would strongly recommend anyone interested in the topic to check out any of these videos:

How to send a self-correcting message (Hamming codes): https://www.youtube.com/watch?v=X8jsijhllIA

Hamming codes part 2, the elegance of it all: https://www.youtube.com/watch?v=b3NxrZOu_CE

And any of Ben Eater's five videos on error correction: https://eater.net/crc

As an aside, Ben Eater does all of his videos and demonstrations using an 8-bit computer he has built step by step in videos on a breadboard. Very impressive and inspiring.

aortega · on Nov 6, 2022

One important stage of error correction is spreading, that use a pseudo-random function to 'spread' the damaged data over several blocks that can be then completely recovered, instead of concentrating the damage in a single block (like often happen in erasures) that cannot be recovered.

userbinator · on Nov 7, 2022

Optical media like CDs and DVDs use a similar interleaving scheme, allowing for large amounts of errors to be corrected; in fact, even during normal operation with what appears to be a completely clean and perfect disc and drive, read errors are always occurring but corrected silently so there is no loss of data:

https://en.wikipedia.org/wiki/Cross-interleaved_Reed%E2%80%9...

https://en.wikipedia.org/wiki/Optical_disc#Surface_error_sca...

nayuki · on Nov 7, 2022

By spreading, I think you're referring to interleaving https://en.wikipedia.org/wiki/Burst_error-correcting_code#In... , which doesn't need randomness.

Pseudo-random scrambling is used in places like https://en.wikipedia.org/wiki/64b/66b_encoding to reduce the chances of many consecutive 0s or 1s which could cause clock desynchronization between the sender and receiver.

aortega · on Nov 8, 2022

>By spreading, I think you're referring to interleaving https://en.wikipedia.org/wiki/Burst_error-correcting_code#In... , which doesn't need randomness.

I stand corrected!

joshspankit · on Nov 8, 2022

> I stand corrected!

Heh

rogers18445 · on Nov 6, 2022

Usually, you would see RS used in a setting where any error is going to be erasure (non erasure error -> failed checksum -> erasure error).

Then you use something like LDPC codes for in-packet ECC. RS for multi-packet ECC.

less_less · on Nov 6, 2022

RS codes are also good for contexts where you don't know where the errors are. The codes are always maximum distance separable, and the algorithms to locate the errors are quite efficient.

In my experience, LDPC is usually better for soft-decision decoding, where you have information about how reliable each bit is. RS is usually better for hard-decision where you don't know anything about the error locations. Also LDPC is usually bit-oriented, and RS is always a symbol-oriented code, so RS works well for burst errors.

abdullahkhalids · on Nov 6, 2022

Is there a resource which lists out different codes and the decoding algorithms that work with them?

terramars · on Nov 6, 2022

I implemented the RS encoder that ended up being used on the production satellite bus for the telemetry system in space systems / Loral upgraded bus during my engineering coop in school! All in verilog. It's complicated especially the decoder but understandable with time. The turbo codes and more modern extensions that enable lte / 5g and other low power / small antenna applications are absolute black magic though. Such a cool field!

jl2718 · on Nov 7, 2022

Similar story. I got my first job assignment at to do a comparison of hardware for the EUMETSAT METOP ground station. This was 2004 and the leading options were custom hardware ASICs in the $1-10M range. There was a snow storm and I had ridden my bike to work, and actually got hit by a car on the way, so I didn’t really feel like trying to make it home, and $10M seemed like a lot of money when I didn’t even know how fast my new Pentium 4 desktop could do it. I slept in my office for the next three days while most of the lab took snow days, trying to write the decoder in C. How hard could it be, right? I’d never taken a CS course, and assumed it was just a little bit of bit-banging to decode a structure or whatever. It was, until I came across this error-correction thing called Reed-Solomon. I knew about CRC, and thought that was the end of the story, but this used a bunch of totally unintelligible ‘new math’ called Galois fields, and it just happened to have been invented at the place I was working at, and by the way, the guy who gave me to assignment was named Solomon, unsure of any relation to Gustave, and I felt I was setting myself up for great embarrassment. Well anyway, this place had a great library, so I read the original paper, still getting nowhere, and then a book on the subject, also unintelligible, except that it included lots of very detailed pseudo code. When the lab returned, I had something working really slowly, but the hardware engineer thought it was cool enough to keep going, and built a PCI DAQ card to read the I/Q inputs directly into the computer. Eventually I made it fast enough, order of 10Mb, and that’s how NOAA made our weather reports during the decade when NPOESS VIIRS was delayed. I adapted it to VIIRS too, but only stuck around long enough to see it used in testing. I was very proud of myself, and was recognized with an award of zero dollars and the non-exclusive right to post a reply about it on Hacker News when the topic finally came up.

tromp · on Nov 6, 2022

This is also the essence behind Shamir's Secret Sharing where the sample at x=0 is a secret shared by k+t parties any k of which can recover the secret.

[1] https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing

aortega · on Nov 6, 2022

There are many types of Reed-Solomon codes, each with different amount of redundancy, they are a special case of BCH codes, a kind of code that approach the Shannon limit (the maximum rate of error-free data that can be transferred over a noisy channel). The bandwidth of a channel (like radio, or optical fiber) is not really limited by the media, but only by the noise it has. Optical fiber have almost zero noise, that's why they are so fast.

There are much better codes nowadays that are closer to the Shannon limit, like LDPC, or convolutional codes. But they are usually much more computationally intensive. They are used in space probes where computation time don't matter, but you often have channels with much more noise than signal.

I keep a repo of C implementation of several error-correction codes including Reed-Solomon, that can be used as standard unix filters like gzip: https://github.com/ortegaalfredo/eccchain

vmilner · on Nov 7, 2022

LDPC coding (combined with BCH coding) is now used for most Western Europe digital and satellite TV in the DVB-T2 and DVB-S2 standards.

vbuterin · on Nov 6, 2022

Shameless plug of my own explanation on how binary fields work:

https://vitalik.ca/general/2019/05/12/fft.html

Ctrl+F for "binary fields".

nsteel · on Nov 6, 2022

I think my first exposure to this was parchive files from usenet. That feeling when your 3-month download of some "huge" 700MB iso was corrupt, you load up quickpar and suddenly (quite a few minutes later) it's all fixed! No idea how it worked at the time, just magic.

klodolph · on Nov 6, 2022

Some additions, as "exercise for the reader":

1. The finite field you choose has a minimum size. What is the minimum size field 2^bits for an RS(N,K) coding system? What happens when you try to construct a Reed-Solomon code with a finite field that is too small?

2. Consider a Reed-Solomon coding system which uses a lookup table for the finite field multiplication operation that fits in L1 cache. Given that the table already fits in L1 cache, how could you make the encoder/decoder faster, if you had a smaller finite field?

nayuki · on Nov 6, 2022

> What is the minimum size field 2^bits for an RS(N,K) coding system?

Your field size must be at least N+1, noting that you shouldn't use the value 0 in the encoding matrix.

> What happens when you try to construct a Reed-Solomon code with a finite field that is too small?

Your system of linear equations doesn't have enough linearly independent equations.

> how could you make the encoder/decoder faster

Maybe put the lookup table in a 64-bit general-purpose register and use bit shifting, or in a 128/256/512-bit SIMD register and use extraction instructions (shuffle bytes, etc.).

shachaf · on Nov 6, 2022

Hmm, do you mean N+K+1 (to have enough points for both the data and parity shards)? Why isn't N+K sufficient, fitting a polynomial to N points and emitting K more?

corsix · on Nov 7, 2022

Using the notation from the article, N+K is sufficient for RS(N,K). One point of confusion is that different authors use different notation; some use RS(num data shards, num parity shards), some use RS(total num shards, num data shards), and some use RS(total num shards, num parity shards). Per the article, I'll use RS(num data shards, num parity shards).

As for where the +1 comes from, the clue is in the "noting that you shouldn't use the value 0 in the encoding matrix" remark. The TLDR is that the +1 isn't required, and arises from an (incorrect) attempt to fix an incorrect construction. The non-TLDR is rather long. First, we need to move from polynomials (per the article) to matrices (per the quoted remark). For this movement, let F denote the polynomial from the article, then it so happens that F(k+1) can be expressed as a linear combination of F(1), F(2), ..., F(k). Similarly, F(k+2) can be expressed as a linear combination of F(1), F(2), ..., F(k). This continues to be true up to F(k+t) (and beyond). These various linear combinations can be written as a k-by-t matrix, which is what the quoted remark means by "encoding matrix". Second, once thinking with matrices rather than with polynomials, people want to construct the encoding matrix directly, rather than deriving it from a polynomial. In this direct construction, the requirement is that every square submatrix (of any size) of the k-by-t matrix is invertible. Accordingly, no element of the k-by-t matrix can be zero, as the 1-by-1 submatrix containing just that zero element isn't invertible. Third, one common (but incorrect) direct construction is to create a k-by-t Vandermonde matrix. Such a matrix is usually constructed from some number of distinct elements, but if zero is used as such an element, then the resultant matrix will contain zeroes, which is problematic. Excluding zero causes the Vandermonde construction to _sometimes_ work, but it still doesn't _always_ work. Per https://www.corsix.org/content/reed-solomon-for-software-rai..., there's a slightly different Vandermonde construction that _does_ always work, and also a Cauchy construction that always works, both of which work for zero. Both of these correct constructions have a strong parallel to the polynomial construction: they involve choosing N+K distinct field elements, which is akin to choosing the N+K distinct x co-ordinates for the polynomial.

sizzzzlerz · on Nov 6, 2022

On its two Voyager missions, NASA used (still uses!) a RS encoder/decoder that can correct up to 16 error bits per frame. It became a standard for a number of subsequent missions. This was in an era before ASICs and FPGAs yet the hardware still had to meet strict size, weight, and power requirements. Not to mention maintaining comms over distances greater than 14 billion miles and received signal strengths of 10^-16 watts. Truly humbling.

est · on Nov 6, 2022

I think Reed-solomon should be considered in future network protocols designs to combat censorship. Every byte should be demuxed into bits and transferred in independent data streams, so MITM boxes can only intercept incomplete streams, and aggregate streams back to original would be insanely difficult. Let transport layers do only one job and no distinguish whatever the content might be inside.

Currently H2 does support M:N stream muxing but popular browsers only support N:1 mode.

vlovich123 · on Nov 6, 2022

It’s a comparatively expensive operation (CPU and memory) compared with just encrypting the information which also blinds the network operator to the same extent. Unless you’re saying that you’d send the stream across multiple disparate networks. But if you’re able to get packets out of one, what’s stopping you from getting the whole stream out that network?

est · on Nov 7, 2022

> comparatively expensive operation (CPU and memory)

Which is good, because it means higher cost of middle boxes

> But if you’re able to get packets out of one, what’s stopping you from getting the whole stream out that network?

It's practically impossible, unless the MITM box were setup very close to both ends on the edge. In real world packets were routed slightly different, the server might have several IPs or CDNs, so if your middlebox were placed in backbone it will be useless as packets were transfered out-of-order and not in the same stream.

> just encrypting the information which also blinds the network operator

Yes, but the network operator was sure every information is inside one exact stream, just with a thick layer of protection, state-of-the-art classifiers are able to match metadata patterns to the individual websites, so protocol designers would then take huge amount of time to fight it. You either have a very fast TTFB protocol, or you'd have to add some padding redundancy (noise) to disguise the metadata. By metadata I mean packet length and frequency pattern.

not2b · on Nov 6, 2022

That seems an inferior approach to just using encryption.

vlovich123 · on Nov 6, 2022

I really wish physical and OS network stacks would be able to give you the ability to send uncorrected bit streams. That way you can tune the error rate at the application level that makes sense. For example, with video streaming, you probably don’t need much error correction on the data stream as periodic i-frames would correct any transient glitches (you’d only bother to EC the control headers for the video). Then WiFi networks maybe wouldn’t have to be as careful about time multiplexing all the coex streams and some noise due to conflicts would be fine and not require retransmission (because the application layer could handle it).

This does come with tradeoffs (eg it may take your application longer to recover from the noise than a quick retransmit at the physical layer).

From a cost perspective it’s also maybe impractical because the computer industry gets efficiency gains by solving a problem for everyone at some quality threshold by giving up optimality for applications that could do something with it. Also you would still need to correct the control layer of the network (IP + MAC) just to make it work at all so it may be a wash (ie the incremental cost of correcting the data vs control + data may be insignificant).

Still, at least having the option as a switch that could be flipped for experimentation purposes would be quite neat to allow the curious to find new techniques / layers of abstractions vs what’s orthodoxy today.

sgtnoodle · on Nov 6, 2022

You can do it in linux, with drivers that implement packet injection. You're giving up a lot, though. I think you're right about it being a wash with regard to control overhead. In particular, because of the data ACKs, the firmware can aggressively ramp the 802.11 frame data rate up and down. Thanks to packet aggregation, the cost of the frame preamble gets amortized across all buffered up data. Letting each application schedule its own transmission would quickly eat up the available channel bandwidth in frame overhead. It works for a single specialized application, but monopolizes the entire shared channel.

Folk do that sort of thing when transmitting low latency first-person-video from drones, using commodity wifi hardware.

jesprenj · on Nov 7, 2022

CNLohr at Youtube did this to use ESP8266 at absurdly low power usage, only waking up every couple of minutes to transmit a 802.11 frame without connection handshake. https://youtu.be/I3lJWcRSlUA

jonathanlydall · on Nov 6, 2022

This is already done by many audio/video chat compression algorithms which use UDP.

The idea being that with something like voice or video, if a packet doesn’t arrive, instead of stalling everything while you wait for a retransmit, you instead just carry on regardless.

The occasional missing packet in voice is almost imperceptible, however the more packets that get lost the more the voice seems to “break up”.

With video the symptoms are an accumulation of visual corruption until the next key frame.

FPS games also do this (or at least used to), servers would send a stream of UDP packets of entity states (as opposed to sending deltas of state changes), so things like player1 is at x,y,z coordinate with velocity a and heading vector of b.

If clients missed a packet they would just wait for the next one since it’s useless to know later where someone else was, all you actually care about is where they are, or what’s happening, now.

YZF · on Nov 6, 2022

I think the ask is for something like let a bit flip inside the UDP packet or let a byte fall out of the UDP packet. The integrity of UDP packets is still checked for (at different layers).

For most networks the error rate is so low that I don't think it's that valuable. Also you'd still want your metadata checked otherwise it'll get potentially delivered to the wrong place.

nsteel · on Nov 6, 2022

That's what I assumed they meant but there's so many layers of correction/protection going on and most of it is non-optional if you want a working system. For example, if you disabled the FEC that's used over high-speed serdes links within a core router you'll be left with a broken system, the error rate is much too high. In the designs I've worked on you can't disable the internal CRC/ECC on the databuses without risking corrupting the control data, which won't end well. Nobody provides separate data and control ECC protection, that's pointless overhead.

I guess they probably meant disable checking the ethernet FCS, that might still work but I think it's a very bad plan. I doubt this option is even exposed to network operators.

ultrahax · on Nov 6, 2022

Most engines I've seen will delta all the entity states against the client's last acknowledged state. Costs some memory and computation on both sides to keep the deltas valid, but keeps the state update to under an MTU, generally.

agsamek · on Nov 6, 2022

Reed-Solomon is the foundation of today's computing. It is used in data storage (hdd, ssd) and in data transfer protocols. It allows for building of a reliable system on top of an unreliable real life fenomens with desired level of certainty. This is so incredible tech that once implemented we can just forget about it in the higher level abstractions.

QuinnyPig · on Nov 6, 2022

This is, for example, how Amazon S3 works.

kccqzy · on Nov 6, 2022

It's also how GCS (and the underlying Colossus system) works.

https://news.ycombinator.com/item?id=11713406

nayuki · on Nov 6, 2022

This is how Backblaze works. https://www.backblaze.com/blog/reed-solomon/ , https://github.com/Backblaze/JavaReedSolomon , https://news.ycombinator.com/item?id=9726890

markc · on Nov 6, 2022

And Sia distributed storage network https://gitlab.com/NebulousLabs/Sia/-/blob/master/modules/er...

aortega · on Nov 7, 2022

Also, how RAID works, and how ECC memory works.