I've never understood, in explanations of run length encoding, why they always default to putting counts before literals. Isn't it sensible to know what a thing is before knowing how much of it there is?
Count-first also tends to waste encoding space, since it's not clear what a code ending on a count should mean. Literal-first makes it more intuitive to make the code bijective (all codes corresponding to exactly one decoding), and why wouldn't you want to?
Some RLE encodings use negative values for a block of non-repeating data. So in its simple form AAABCD would be encoded as 3A -3BCD . Without negative values this would be encoded as 3A 1B 1C 1D. The .TGA RLE format worked roughly like this. All of this means you need the length before you can look at the next bytes.
The reason in explanations is clear, it's how people naturally think about it "You have 1 B, then 3 Ws...". People don't say aloud "You have B times 1, then W times 3".
Whenever there are many ways to encode something, there's wasted room in the encoding space :) Not usually a big deal, but when the goal is compression, I think it's nice if it can be avoided.
Regardless of how you do your RLE, you're almost certainly going to have many ways to encode something: you can almost certainly encode a run of four as two runs of two (you wouldn't want to, of course, but you could). It's not an encoder you tend to use if you want to use the absolute smallest amount of space, it's good if your data has enough repeats to get compression and your encoding/decoding systems are very resource constrained.
It makes a nice interview problem because compression is a real problem, but RLE needs almost zero previous knowledge, has many reasonable answers, doesn't have a gotcha solution, and most people haven't done it before. Lots of typically better compression algorithms out there, but I don't think any are suitable for general interview.
Yeah, in a bijective RLE, one needs to take into account that, say, A2A2 can't mean the same as A4.
However, since you also want to be able to represent runs longer than 255 (if you're doing it on byte level), you can solve both problems. If you encounter repeated runs of the same literal, just combine them into a bigger number. So A2A2 should be read as "A times (0b00000010 00000010)" whereas A4 should be read as "A times (0b00000100)."
There are some subtleties still to make it a bijective code, related to end handling and how to represent variable length numbers uniquely, but it's perfectly possible, and once you get it right it isn't really larger than "regular" RLE.
It's an interesting challenge to write bijective encodings. I feel it's a bit like writing quines, you have to keep in mind a thing you usually don't. For quines, "what this does to the text you need to output", for bijective encodings, "what this does for the inverse function". Not saying you should give it as an interview question, though!
Well that one is more understandable. You can't tell literal from counts except by convention. If we assume ASCII encoding, you don't know if it's a 'B' or a count of 66. It makes sense when you're encoding something with lots of runs, that you use every second word for literals and every second word for counts.
I just don't understand why they start with counts, that's all.
I've never understood, in explanations of run length encoding, why they always default to putting counts before literals. Isn't it sensible to know what a thing is before knowing how much of it there is?
Count-first also tends to waste encoding space, since it's not clear what a code ending on a count should mean. Literal-first makes it more intuitive to make the code bijective (all codes corresponding to exactly one decoding), and why wouldn't you want to?