Fast character case conversion (or how to compress sparse arrays) (github.com/apankrat)
70 points by donmcc on Oct 13, 2021 | 11 comments



I did a four-level radix tree for the Unicode character class and case conversion support in JOE (Joe's Own Editor, so that you can edit Unicode even on non-Unicode systems): index sizes of 7, 5, 5 and 4 bits, plus a fast-path table for ASCII.
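
The lookup itself is just shifts and masks. Roughly like this (a sketch only -- names and node counts are made up, not JOE's actual code; 7+5+5+4 = 21 bits covers U+0000..U+10FFFF):

    #include <stdint.h>

    #define NODES2 64   /* node counts depend on the data; */
    #define NODES3 128  /* these are illustrative          */
    #define NLEAF  256

    static uint16_t ascii_fast[128];       /* direct-mapped fast path  */
    static uint16_t lvl1[1 << 7];          /* bits 20..14 -> lvl2 node */
    static uint16_t lvl2[NODES2][1 << 5];  /* bits 13..9  -> lvl3 node */
    static uint16_t lvl3[NODES3][1 << 5];  /* bits  8..4  -> leaf node */
    static uint16_t leaf[NLEAF][1 << 4];   /* bits  3..0  -> value     */

    uint16_t lookup(uint32_t c)
    {
        if (c < 128)
            return ascii_fast[c];          /* no tree walk for ASCII   */
        uint16_t a = lvl1[c >> 14];
        uint16_t b = lvl2[a][(c >> 9) & 0x1f];
        uint16_t d = lvl3[b][(c >> 4) & 0x1f];
        return leaf[d][c & 0x0f];
    }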

The main difference is that it computes the tables at startup from the sorted Unicode intervals, so the construction code has to be fast. The same code is also used for user-defined character classes in regular expressions.

Anyway, it builds them in two passes. In the first pass it de-duplicates nodes, but only the most recently constructed node is a candidate for de-duplication, which keeps memory usage low during construction. De-duplicated nodes can still be modified at this stage, so they may be re-duplicated (a reference counter determines when this happens).

In the second pass (after all data is loaded and no more changes are allowed), it globally de-duplicates the leaf nodes using a hash table. Many of the leaf nodes are duplicates (and not just the all-zero ones).
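
The second-pass dedup can be as simple as hashing each leaf and chaining. A sketch of the general idea (again, not JOE's actual code):

    #include <stdint.h>
    #include <string.h>

    #define LEAF_SIZE  16
    #define MAX_LEAVES 4096

    static uint16_t leaves[MAX_LEAVES][LEAF_SIZE];
    static int remap[MAX_LEAVES];      /* old leaf index -> kept index */

    static uint32_t leaf_hash(const uint16_t *l)
    {
        uint32_t h = 2166136261u;      /* FNV-1a over the leaf entries */
        for (int i = 0; i < LEAF_SIZE; i++)
            h = (h ^ l[i]) * 16777619u;
        return h;
    }

    void dedup_leaves(int nleaves)
    {
        static int bucket[1024];       /* hash slot -> kept leaf + 1   */
        static int next[MAX_LEAVES];   /* per-slot chain, same +1 enc. */
        memset(bucket, 0, sizeof bucket);

        for (int i = 0; i < nleaves; i++) {
            uint32_t slot = leaf_hash(leaves[i]) & 1023;
            int j;
            for (j = bucket[slot]; j; j = next[j - 1])
                if (!memcmp(leaves[j - 1], leaves[i], sizeof leaves[i]))
                    break;             /* found an identical leaf      */
            if (j) {
                remap[i] = j - 1;      /* duplicate: reuse earlier one */
            } else {
                remap[i] = i;          /* unique: keep, add to chain   */
                next[i] = bucket[slot];
                bucket[slot] = i + 1;
            }
        }
        /* The caller then rewrites level-3 pointers through remap[]. */
    }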


Oh wow, JOE was the first editor I used on Linux -- ages ago -- and it's still so much better than other very popular choices today. It's friendly, fast, feature-packed, and the keyboard shortcuts (of WordStar heritage) are lovely.

JOE is one of my favourite software projects, one I can only look up to. I don't know anything about you, but thank you!


My experience with this is to fast-path the common cases (i.e. ASCII and vast swathes of emptiness) with hardcoded bounds comparisons, so the memory access only happens on the slow path, when it's actually needed.
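
Something along these lines (the cutoffs are illustrative, and table_lookup is a stand-in for whatever compressed lookup you use):

    #include <stdint.h>

    extern uint32_t table_lookup(uint32_t c);  /* hypothetical slow path */

    uint32_t to_lower(uint32_t c)
    {
        if (c >= 'A' && c <= 'Z')      /* ASCII uppercase: no memory load */
            return c + 32;
        if (c < 0xC0)                  /* rest of ASCII plus the caseless */
            return c;                  /* Latin-1 punctuation block       */
        return table_lookup(c);        /* memory access only from here on */
    }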


Although, as with any "common case performance improvement", this can be risky if the common case for a given server changes. We learned what our server's common case was when we launched in a new country and our auth server started needing 1.5x the compute per query. There was an ASCII special case and then an especially expensive Unicode lookup when it failed, and we started triggering the latter far more often.



Last year I wrote a case conversion for Ryzom Core that processes UTF-8 directly. The tables look like a mess, but it massively improved performance over the code it replaced. Case changes seem to be called more often in the game than I expected. I do wonder if there's a cleaner and faster way.

https://github.com/ryzom/ryzomcore/blob/core4/nel/src/misc/s...


Working directly on encoded UTF-8 sequences is a nice trick that allows looking up Unicode properties without even decoding a character. I did something similar for Apache Lucy [1]. Note that you can store the data for each "level" in a single table and compute the index with bit operations, as explained in the article.

[1] https://gitbox.apache.org/repos/asf?p=lucy.git;a=blob;f=core...
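
Roughly, the byte-indexed lookup looks like this for two-byte sequences (a sketch with hypothetical tables; the real, generated tables cover all sequence lengths):

    #include <stdint.h>

    #define NODES2 32   /* lead bytes 0xC0..0xDF -> 32 possible nodes */

    static uint8_t prop_ascii[128];
    static uint8_t prop2[NODES2][64];  /* indexed by continuation byte */

    /* Lead byte 110xxxxx picks a node; continuation byte 10yyyyyy
     * picks the entry. No decoding to a code point is needed. */
    unsigned prop_lookup(const unsigned char *s)
    {
        if (s[0] < 0x80)
            return prop_ascii[s[0]];                 /* 1-byte sequence */
        if (s[0] < 0xE0)
            return prop2[s[0] - 0xC0][s[1] & 0x3F];  /* 2-byte sequence */
        /* 3- and 4-byte sequences add one more table level each */
        return 0;
    }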


quickjs has pretty cool case conversion too. It special-cases ASCII, then does the rest with a binary search through a kind of run-length encoded table. There's more code, but the table itself is only about 2K for both case directions.

https://github.com/bellard/quickjs/blob/b5e62895c619d4ffc75c...
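
The general shape of that approach (not quickjs's actual table format, which packs things tighter) is a sorted array of ranges, each with a delta to add:

    #include <stdint.h>

    struct case_range { uint32_t start; uint16_t len; int32_t delta; };

    static const struct case_range to_upper[] = {
        { 'a',    26, -32 },   /* a-z -> A-Z                   */
        { 0x00E0, 23, -32 },   /* U+00E0..U+00F6 -> À-Ö        */
        /* ... generated from the Unicode data files ...       */
    };

    uint32_t case_convert(uint32_t c, const struct case_range *t, int n)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {                    /* binary search over ranges */
            int mid = (lo + hi) / 2;
            if (c < t[mid].start)
                hi = mid - 1;
            else if (c >= t[mid].start + t[mid].len)
                lo = mid + 1;
            else
                return (uint32_t)((int32_t)c + t[mid].delta);
        }
        return c;                             /* no mapping: unchanged     */
    }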


The ASCII special case is especially important when you apply casing operations to a run of code points: you can leverage SIMD or bitwise trickery to convert a whole block of ASCII characters at once, and fall back to slow per-character iteration otherwise.
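
For example, here's a pure-bitwise (SWAR) sketch that lowercases 8 bytes per step and bails out on the first non-ASCII byte (tolower here is just a stand-in for a real Unicode-aware slow path):

    #include <ctype.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void tolower_run(char *s, size_t n)
    {
        const uint64_t msb = 0x8080808080808080ULL;
        const uint64_t one = 0x0101010101010101ULL;
        size_t i = 0;

        for (; i + 8 <= n; i += 8) {
            uint64_t w;
            memcpy(&w, s + i, 8);
            if (w & msb)                      /* non-ASCII byte in block:  */
                break;                        /* fall through to slow path */
            uint64_t ge_A  = w + (0x80 - 'A') * one;  /* MSB set: b >= 'A' */
            uint64_t gt_Z  = w + (0x7F - 'Z') * one;  /* MSB set: b >  'Z' */
            uint64_t upper = (ge_A & ~gt_Z) & msb;    /* MSB set: 'A'..'Z' */
            w |= upper >> 2;                  /* 0x80 -> 0x20, the case bit */
            memcpy(s + i, &w, 8);
        }
        for (; i < n; i++)                    /* tail and non-ASCII bytes  */
            s[i] = (char)tolower((unsigned char)s[i]);
    }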


How do these benchmark? I’d like to know if the extra cost of the memory lookups is balanced by the smaller table and better cache locality.


Me too, especially since the uncompressed table is tiny anyway, so saving RAM for its own sake isn't a useful goal here.



