Practical Reed-Solomon for Programmers (berthub.eu)
229 points by heydenberk on June 14, 2021 | 43 comments



Amusing: the linked library page, https://www.ka9q.net/code/fec/, says (from 2007): "The new Intel Macs are not yet supported. Stay tuned."

Ignore an issue for long enough, and it stops being an issue :)


If anyone wants to read more about Reed-Solomon, this is the article that helped me understand it earlier this year: https://innovation.vivint.com/introduction-to-reed-solomon-b...


(author here) Nice! Added that to the page, very worthwhile. Thanks.


This BBC engineering paper is also useful.

https://downloads.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP031...


Just to add to this topic: usually Reed-Solomon codes are combined with Turbo codes or LDPC, with an interleaver in between.

The reason is this: Reed-Solomon corrects entire symbols, but can only correct so many symbols per block. LDPC and Turbo correct single-bit errors. If you have a lot of single-bit errors mixed with completely corrupt blocks, it's best to use both.

Interleaving usually mixes the bits around too, to spread long bursts of errors out across symbols.

This setup is used in digital radio, 3G, CD and DVD, DSL, etc.
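
To make the interleaving step concrete, here's a minimal sketch in plain Python (hypothetical 8x16 dimensions, not from any particular standard): data is written into a matrix row by row and transmitted column by column, so a burst of channel errors gets spread thinly across many RS codewords.

    ROWS, COLS = 8, 16              # e.g. 8 RS codewords of 16 symbols each

    def interleave(symbols):
        # write row by row, read out column by column
        assert len(symbols) == ROWS * COLS
        return [symbols[r * COLS + c] for c in range(COLS) for r in range(ROWS)]

    def deinterleave(symbols):
        assert len(symbols) == ROWS * COLS
        out = [None] * (ROWS * COLS)
        i = 0
        for c in range(COLS):
            for r in range(ROWS):
                out[r * COLS + c] = symbols[i]
                i += 1
        return out

    data = list(range(ROWS * COLS))
    tx = interleave(data)
    # a burst wiping out 8 consecutive transmitted symbols...
    for i in range(40, 48):
        tx[i] = None
    rx = deinterleave(tx)
    # ...lands as at most 1 erased symbol per original 16-symbol codeword
    for r in range(ROWS):
        assert sum(s is None for s in rx[r * COLS:(r + 1) * COLS]) <= 1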


Thanks for the article and code.

Also check out https://github.com/Parchive/par2cmdline

I wish something like this were integrated into a modern and free archive format. RAR has it. RAR also has volume recovery: when creating split multi-volume archives you can also create recovery volumes, and each recovery volume can replace any one missing volume. It was a life-saver in floppy-disk times.


We're not using floppies anymore, and media is extremely reliable.

On the internet you can just retransmit if something goes wrong, and if you're worried about the media specifically then there are media-specific measures, like RAID and filesystem-level checksums.


Cosmic rays are a thing, and wasn’t there just a study linked here a few days ago about how unreliable hardware is at scale?


Oh, I'm all for ECC and checksumming and whatnot.

All I'm saying is that adding recovery info to archives is probably not ideal -- we can do it in a task- or media-specific way that works best for whatever context the data is in.

Redundancy in .RAR made perfect sense back when you were transporting data on floppies, and if the floppy was bad, it was a huge pain. These days we transport over the network and don't need to carry any redundant data at all. All we need is a checksum and to retransmit whatever didn't arrive correctly.


There are other erasure codes that go beyond RS. For example, LRC (Local Reconstruction Codes) is very interesting; it improves on Reed-Solomon by requiring less bandwidth/IO for repair reads.

https://www.usenix.org/conference/atc12/technical-sessions/p...


I loved this article, and while going down the rabbit hole of associated links I was excited to see a discussion from Russ Cox (golang) about how QR codes rely on RS (another great article):

https://research.swtch.com/qart


Are there any use cases for Reed-Solomon at the application level?

My impression was always that error checking was implemented at the hardware level, or else at the OS/driver level.

But just curious if there are some applications I'm missing.


> use cases for Reed-Solomon at the application level

Hadoop has an RS implementation inside the filesystem (called "erasure coding"): instead of storing 3 copies of the same data, it can store ~1.5 copies as (6+3) or (10+4).

Previously, I've run into this tech in satellite internet gateways, but distributed filesystems are where I've gone through the math and failure probabilities properly.

I work on perf, and the extra network hops (with 3 replicas you read 100% of the data locally; when you stripe it, that no longer works) and the math for the error correction are hot spots when you're trying to keep all cores busy.
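
For a rough back-of-the-envelope feel for those overhead numbers, a tiny sketch in plain Python (figures taken from the comment above):

    def overhead(data_blocks, parity_blocks):
        # raw bytes stored per byte of user data; the loss of up to
        # `parity_blocks` blocks is survivable
        return (data_blocks + parity_blocks) / data_blocks

    print(overhead(6, 3))    # (6+3):  1.5x storage, tolerates any 3 lost blocks
    print(overhead(10, 4))   # (10+4): 1.4x storage, tolerates any 4 lost blocks
    # versus plain 3-way replication: 3.0x storage, tolerates 2 lost copies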


Linux RAID-6: RS(10,8)
Google File System II (Colossus): RS(9,6)
Quantcast File System: RS(9,6)
Intel & Cloudera’s HDFS-EC: RS(9,6)
Yahoo Cloud Object Store: RS(11,8)
Backblaze’s online backup: RS(20,17)
Facebook’s f4 BLOB storage system: RS(14,10)
Baidu’s Atlas Cloud Storage: RS(12,8)
.....


Are the numbers the size in bits? If so, those are very small blocks. It seems like you would get more effective redundancy with larger blocks. Does Reed-Solomon scale so poorly that only tiny blocks are computationally feasible?


Generally blocks. So Backblaze takes 17 blocks of input and generates 20 blocks of output. Recovery works as long as any 17 or more of those 20 blocks survive.
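
The same idea at the byte level, as a sketch assuming the third-party Python reedsolo package (pip install reedsolo); real systems apply this per block/shard rather than per byte, but the arithmetic is identical.

    from reedsolo import RSCodec

    rsc = RSCodec(3)                             # 3 parity symbols per codeword
    codeword = rsc.encode(b'17 data bytes....')  # 17 data bytes -> 20-byte codeword

    damaged = bytearray(codeword)
    damaged[0] = damaged[7] = damaged[19] = 0    # lose any 3 symbols at known positions

    # with the erased positions known, all 3 can be reconstructed;
    # recent reedsolo releases return (message, full_codeword, errata_positions)
    decoded, full, errata = rsc.decode(damaged, erase_pos=[0, 7, 19])
    assert bytes(decoded) == b'17 data bytes....'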


I worked on a research project in undergrad that used Reed-Solomon -- we showed a series of images to users, who would later pick those images out of a set of untrained images in order to authenticate or access some privileged data, sans encryption. The error correction allowed for some degree of misremembering or forgetfulness, as is normal in human image-recognition tasks.

The professor later spun up a startup that has since been acquired by Dropbox, but I'm unaware what kind of product they're currently working on.


Parchive 2 was, I believe, an implementation of Reed-Solomon at the application level; mostly used (when it was introduced) to counter loss from missing chunks on binary newsgroups. I think RAR might have picked up a similar feature around that time too?

I believe parchive version 1 was simple XOR parity.


No, parchive v1 also used Reed-Solomon. The main difference between v1 and v2 was that v1 worked on the file level, but v2 divided the set of files into blocks.

Also, the parity matrix for v1 would sometimes be singular (non-invertible), which v2 tried to fix, but it didn't quite work.


You may want to be able to run highly resilient databases even on non-error-detecting hardware. (Most consumer disks, and even a lot of enterprise disks, don't reliably detect/report disk corruption.)

You might also be building a distributed system, where blocks are spread across multiple disks. In that case, an "error" you're trying to correct might be the loss of a disk or server in the distributed system; there is no one single disk or OS responsible for all the relevant blocks in that case.

But both of these assume you're building some kind of data store, which is not a typical user application. Writing robust data stores is hard for many reasons; error correction is just one of them.


I used a Reed-Solomon library recently in a project that encodes and decodes binary data stored in a video channel. I added it to compensate for the compression/noise introduced by the re-encoding process on third-party sites like YouTube. Here's an example of what it looks like: [0]

[0] https://cicada.jollo.org/xg/sp.mp4


QR codes use Reed Solomon codes


So do PDF-417 barcodes. I remember writing the code for pdf417_decode and pdf417_encode (freeware) back in the day. The Blahut book on error-correcting codes was a great help, but all the examples were using binary Galois fields. PDF-417 uses a field of powers of 3 mod 929, so my brain had to figure out how that worked and put it into code.
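
For the curious, a small sketch (plain Python, hypothetical helper names) of what arithmetic in that field looks like: GF(929) is a prime field, so addition and multiplication are plain mod-929 arithmetic, and inverses come from Fermat's little theorem.

    P = 929                                    # PDF-417's prime field size

    def gf_add(a, b): return (a + b) % P
    def gf_sub(a, b): return (a - b) % P
    def gf_mul(a, b): return (a * b) % P
    def gf_inv(a):    return pow(a, P - 2, P)  # a^(P-2) == a^-1 mod P, for a != 0

    # the "powers of 3" mentioned above, which PDF-417 uses when building
    # its generator polynomial
    powers_of_3 = [pow(3, i, P) for i in range(8)]

    assert gf_mul(5, gf_inv(5)) == 1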


If you need to transmit data over an unreliable link where retransmission isn't practical, you will want some sort of FEC. If hardware isn't available to do the job, then it gets done in software.




A semi-related question/puzzle: is it possible to reverse-engineer the parameters of a Reed-Solomon function?

Given one or more known inputs/outputs of a Reed-Solomon function with known symbol size, message length, and number of parity symbols, is it possible to calculate the other parameters (polynomial coefficients, primitive element, etc.)?

I failed to find an answer a few years ago, when trying to interoperate with chirp.io (now gone). https://math.stackexchange.com/questions/663643/discover-par...


Doesn’t seem too hard actually, given enough example input. The parity is just a linear combination of the input, so you can set up a linear system and solve for the encoding matrix rows pretty easily. From there you may need to make some assumptions about what type of matrix was used (e.g. Vandermonde is popular) and perhaps the field size being used (e.g. GF(2^8)), and with those it should be easy to determine the reducing polynomial. Some of these choices are small enough to brute force (finite fields used for practical problems tend to be small, naturally). If you have a dataset, I might actually toy with it myself.. sounds fun.
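
A brute-force sketch along those lines, assuming the Python reedsolo package and its find_prime_polys helper; the "secret" codec below is only a stand-in for the unknown encoder -- in practice you'd plug in a (message, codeword) pair captured from the real system.

    from reedsolo import RSCodec, find_prime_polys

    secret = RSCodec(4, fcr=1, prim=0x11b, generator=3)  # pretend we don't know this
    known_msg = b'hello, world'
    known_codeword = bytes(secret.encode(known_msg))
    nsym = len(known_codeword) - len(known_msg)          # parity symbol count

    for generator in (2, 3):                             # common generator elements
        # every GF(2^8) reducing polynomial for which `generator` is primitive
        for prim in find_prime_polys(generator=generator, c_exp=8):
            for fcr in range(3):                         # first consecutive root
                cand = RSCodec(nsym, fcr=fcr, prim=prim, generator=generator)
                if bytes(cand.encode(known_msg)) == known_codeword:
                    print('match: prim=%#x generator=%d fcr=%d'
                          % (prim, generator, fcr))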


Thanks. I have 30 examples I got from the API many moons ago: https://github.com/moreati/chirppy/blob/master/examples.json. My finite-field knowledge is not up to solving it analytically; my attempts to brute force it are in https://github.com/moreati/chirppy/blob/master/trials.py. That includes functions to go between the character and integer representations.


Mathematical derivation and runnable code for a Reed-Solomon error-correcting code decoder: https://www.nayuki.io/page/reed-solomon-error-correcting-cod...


At the risk of oversimplifying, is there any value in the analogy, "It's like RAID but for data streams"?


The problem with that is, as the article states, that RAID6 _can_ use RS as its error correction method, which turns this description into something a bit too circular for my taste.


What is the difference between RS and Hamming codes?


Hamming codes can only correct a single flipped bit per codeword (extended Hamming can detect, but not correct, a second one); with RS the number of correctable symbols is configurable.
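
To make the contrast concrete, here's a minimal Hamming(7,4) sketch in plain Python (bit positions 1..7, parity bits at positions 1, 2 and 4): it protects 4 data bits and can repair at most one flipped bit per codeword.

    def hamming74_encode(d):              # d: list of 4 data bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7

    def hamming74_decode(c):              # c: list of 7 bits, at most one flipped
        c = list(c)
        # syndrome = XOR of the (1-based) positions of all set bits
        syndrome = 0
        for pos, bit in enumerate(c, start=1):
            if bit:
                syndrome ^= pos
        if syndrome:                      # nonzero syndrome names the bad position
            c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]   # data bits back out

    msg = [1, 0, 1, 1]
    cw = hamming74_encode(msg)
    cw[5] ^= 1                            # flip one bit in transit
    assert hamming74_decode(cw) == msg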


What are your thoughts on storj.io? They utilize RS for their entire storage platform.


Is this what TCP/IP uses?


No. TCP, IP and UDP all use the Internet Checksum (RFC 1071). Ethernet uses a CRC.
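
For comparison, a minimal sketch of the RFC 1071 checksum in plain Python: a 16-bit one's-complement sum that can detect many errors but, unlike an FEC code such as Reed-Solomon, cannot correct any.

    def internet_checksum(data: bytes) -> int:
        if len(data) % 2:
            data += b'\x00'                            # pad to whole 16-bit words
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]      # sum big-endian 16-bit words
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)   # end-around carry
        return ~total & 0xFFFF                         # one's complement of the sum

    msg = b'hello, world'
    csum = internet_checksum(msg)
    # a receiver checksums the data plus the transmitted checksum and expects 0
    assert internet_checksum(msg + bytes([csum >> 8, csum & 0xFF])) == 0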


I went and skimmed the RFC you're citing, and it's from 1988: it's old enough to have C code samples where the "register" keyword is still used (AFAIK most compilers nowadays ignore it), along with assembly code for the Motorola 68020, Cray CPUs and the IBM 370.

EDIT: for reference: https://datatracker.ietf.org/doc/html/rfc1071.html


All of these are techniques for verifying a message. Galileo also uses a 24-bit CRC to verify its messages after performing forward error correction.


TCP doesn't do forward error correction at all; it uses a retransmission scheme instead. TCP's retransmission is not only there to ensure a consistent data stream, but also to share limited bandwidth more equitably.


Forward error correction is usually at the PHY layer, not the transport layer. So the more appropriate question is: does 10GbE/100GbE/WLAN/etc. use RS coding? (And yes, some of them do.)


No, but Reed-Solomon codes are used on CDs.


Indeed, I think there are 3 layers of encoding to protect against different corruption scenarios. The trick is that some equipment only does a partial implementation, hence the need for a tool like cdparanoia to recover data from corrupted CDs.



