Fast UTF-8 validation (lemire.me)
266 points by ingve on Oct 20, 2020 | 84 comments



> There are other standards. Some older programming languages like C# and Java rely on UTF-16. In UTF-16, you use two or four bytes per character. It seemed like a good idea at the time ...

IIUC this isn't quite right. UTF-16 never seemed like a good idea on its own. These languages started with UCS-2, which did seem like a good idea. But then UCS-2 ran out of bits, and UTF-16 was the least-incompatible way to adopt Unicode.

So as I understand it UTF-16 was always a compromise for backwards compatibility.


Right. UTF-8 was intentionally created while UTF-16 was a known hack to deal with the mistake of UCS-2. Ideally neither should exist. There's at least some argument for the existence of UCS-4.


> There's at least some argument for the existence of UCS-4.

I think advocacy for 4 bytes per codepoint misses the mark. People tend to erroneously assume that this takes away all complexity and brings us back to an ASCII-like reality of random access and 1 integer per glyph. However Unicode is full of complexities that mean that is not attainable. (Joiners, combining characters, and bi-di are a few.)
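
A tiny illustration (just a sketch; the codepoint values are standard Unicode, the C is only there to make the point concrete): "é" can be one codepoint or two, and both render as one glyph, so even 4 bytes per codepoint doesn't buy you "1 integer per glyph" random access.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* "é" as a base letter plus a combining accent: two codepoints, one glyph */
        uint32_t decomposed[]  = { 0x0065, 0x0301 }; /* 'e' + COMBINING ACUTE ACCENT */
        /* the same glyph as a single precomposed codepoint */
        uint32_t precomposed[] = { 0x00E9 };         /* LATIN SMALL LETTER E WITH ACUTE */

        printf("decomposed:  %zu codepoints\n", sizeof decomposed / sizeof decomposed[0]);
        printf("precomposed: %zu codepoints\n", sizeof precomposed / sizeof precomposed[0]);
        return 0;
    }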


And the crazy thing is that UTF-8 came before UTF-16.

https://en.wikipedia.org/wiki/UTF-8#History, published in 1993.

https://en.wikipedia.org/wiki/UTF-16#History, published in 1996.

And yet they ruined Unicode for everyone with surrogates in order to introduce UTF-16. (Surrogates and UTF-16 are by far the worst thing about Unicode, because they’re a pain and technically completely unnecessary, only having been introduced for poor social reasons.)


And UCS-2 ran out of bits because the Unicode consortium keeps adding garbage like emojis to keep their jobs...


As well as this being rather closed-minded, it's also not true. The contents of the 0000-FFFE codepoints are public knowledge, and the biggest users of space are:

  1. the private use area
  2. the general "CJK" area
The second of which has a truly mind-boggling number of characters, including every possible composite Hangul glyph used in modern Korean, despite them being constructable from the basic Hangul codepoints.

Emojis and other symbols which aren't used for language appear relatively rarely. Certainly there is no reason to believe that UCS-2 would be sufficient for writing if they were removed. The number of scripts included in Unicode would exhaust even the private use area, and UTF-16 would have been invented regardless.


> [...] despite them being constructable from the basic Hangul codepoints.

Unicode strives for round-trip compatibility with source character sets, and in this case KS X 1001 (KS C 5601 at the time) is the main culprit: it had 2,350 (out of 11,172) common syllables precomposed. But it happens that Korea had supplementary character sets beyond KS X 1001, which were subsequently added to Unicode 1.1 (bringing it up to some 6,000 characters), before it was decided that an algorithmically derived block of all 11,172 syllables was better. This whole situation is now known as the "Hangul mess".


>The second of which has a truly mind-boggling number of characters, including every possible composite Hangul glyph used in modern Korean, despite them being constructable from the basic Hangul codepoints.

Also true of most Chinese characters, but the proposal to encode them component-wise was a no-go (for adoption in China, IIRC), and separate codepoints per character were what we ended up with. I never managed to dig up the reasons behind it.


What was adopted was a mapping of the existing encodings, as per the round-trip convertibility policy. If Hangul had had a working composable encoding, it would've been used instead.


The problem is that Hangul had too many composable encodings, one from each vendor. As a result the government went with yet another standard (KS X 1001) that fit better into the ISO/IEC 2022 infrastructure. It was too late when the standardized composable encoding was specified as an annex to the original standard in 1992: Windows 95 didn't care about the annex and introduced its own extension to KS X 1001, now known as code page 949 and standardized in the WHATWG Encoding standard [1].

[1] https://encoding.spec.whatwg.org/#index-euc-kr


Yes, and that's how Koreans got themselves a block up in 0x11xx for composable jamo, which means 3 bytes per jamo, and 9 bytes per vowel :-O


Although I think we have enough of them by now, the way people use emojis cements my view that they are a good thing.

Nothing trivial annoys me more than people writing in "I luv u" shorthand, so if an emoji can convey a more emotional message in fewer characters I'm all for it. Even if it's a thinly veiled sexual euphemism.

Emojis in official corporate communication can burn - I got one recently when applying for a relatively serious job: it sends a strange message, and it reminds me that I want to save the emojis for my friends and family (sadly not the people we spend most of our time with).


Yeah, but then you should consider the effort taken to say "I love you" in that way. Such a low effort. The message of sending a heart is typically habitual for people, and has no real meaning behind it. Same could be said with "I luv u", but a bit less so, I would say. I think it has a bit more weight to it.


Come on there's tons of good stuff in the astral planes like 𐌾𐍈𐍊𐌷𐌹𐌴, runes, full metal alchemy, egyptian, cuneiform (which has had a lot of impact in the past helping with those hefty Go hello world binaries), and 𝐁𝚹𝙇𝗗 math that doesn't need a <b> tag. See https://justine.storage.googleapis.com/astralplanes.txt


Congrats, you've overloaded CoreText on my machine. Safari refused to load that page for me and running it through less made Terminal hang a lot.

Also, fun fact that I just learned: CoreText synchronously calls through to a font registry in fontd using XPC to draw text on the main thread in your app.


Fascinating, I see for the first time how many of those are present in iOS. I’ll have to check them all in other platforms.


Emojis were added in Unicode 6.0 in 2010. Surrogate pairs were introduced with Unicode 2.0 in 1996. It should be pretty clear from that timeline that emojis had nothing to do with it. Unicode as of today contains 92,856 CJK Unified Ideographs, so that alone made UCS-2 insufficient.


As far as I'm aware, the best-known multibyte character in Unicode isn't even an emoji or some odd symbol: it's 𐍈 (which looks like a theta, but is actually the Gothic letter hwair). There's a lot more complexity for things that you can actually expect people to make use of.


Why are emojis garbage?


They required every graphics stack to add support for color fonts. And because new emoji compositions are invented every year, it's common to see uncomposed glitches like <facepalming woman><male gender marker> or <thumbs up><white skin color marker> on all sorts of semi-smart devices.

In my opinion Unicode went from a slow-moving standard that carefully absorbed the world's languages, to a pop culture product that serves to make old software obsolete even faster.


Here was my reply at the time (2018), describing a non-SIMD lookup-table approach:

http://darkcephas.blogspot.com/2018/10/validating-utf8-strin...


One thing that is unsatisfactory here is that when you follow such a fast operation by another operation, you lose on memory access time if the string doesn't fit in the cache. So if you want to compute g(f(s)), then you will always have to optimize g and f together.


Good point.

In that case you might, for example, need to slice off roughly 8-16 kB substrings to UTF-8 validate, to strike a reasonable balance between loop setup costs etc. and not spilling the L1 cache. You'd of course need to check that every slice ends at a valid UTF-8 boundary.

Composing streaming operations without sacrificing memory bandwidth or adding overhead the CPU doesn't like is pretty cumbersome. You definitely don't want to spill the cache, but you also need to avoid overhead and make sure vectorized code can take full advantage of the CPU's performance.

Anyone who tries to achieve high performance in this kind of operation without a profiler is doomed.
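
To make the slicing idea concrete, a rough sketch (validate_utf8() here is a hypothetical fast block validator like the one in the article; the only interesting part is the boundary handling):

    #include <stdbool.h>
    #include <stddef.h>

    /* assumed to exist: a fast block validator, e.g. the SIMD one from the article */
    bool validate_utf8(const char *buf, size_t len);

    #define CHUNK (16 * 1024) /* stay well inside L1/L2 */

    bool validate_utf8_chunked(const char *buf, size_t len) {
        size_t pos = 0;
        while (pos < len) {
            size_t end = pos + CHUNK;
            if (end >= len) {
                end = len;
            } else {
                /* don't split a codepoint: a valid sequence has at most 3
                   continuation bytes (10xxxxxx), so back up at most 3 bytes */
                int back = 0;
                while (back < 3 && ((unsigned char)buf[end] & 0xC0) == 0x80) {
                    end--;
                    back++;
                }
                if (((unsigned char)buf[end] & 0xC0) == 0x80)
                    return false; /* 4+ continuation bytes in a row: invalid anyway */
            }
            if (!validate_utf8(buf + pos, end - pos))
                return false;
            pos = end;
        }
        return true;
    }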


The approach here can be used to process strings in chunks without too much added complexity, and it has low setup/teardown costs. The lookup tables fit in three 128-bit SIMD registers so there's minimal cache pollution, which is my usual worry with clever table-driven schemes for accelerating things.


This is a validation step, though.

I'd understand if it was a parser, but you want to validate UTF-8 to avoid reading into garbage (or evil code injection).

So at least for this case it's mostly going to be validate(x) -> parse(x)


This is a cool post, but I'm still not convinced I want to do UTF-8 validation in the first place. UTF-8 text, even when invalid, has the great property of degrading gracefully when displayed, in a way that a boolean "valid" or "invalid" verdict can never come close to.


Routines that manipulate UTF-8 strings can produce unsafe results on invalid UTF-8. For example, finding the end of a string may place you into invalid memory if there are missing trailing bytes.

To achieve safety, one option is to use routines that always assume that the data they are operating on may be invalid — but such routines are always slow, because they are essentially validating the UTF-8 data continually.

Therefore, what most systems do is establish the validity of UTF-8 data as an invariant by running a sanity check once before supplying the data to unsafe-but-fast internal routines. And that's where validators like the one described in this article come in.


> finding the end of a string may place you into invalid memory if there are missing trailing bytes

I struggle to imagine such a function. If its signature is "const char * utf8end(const char * s, size_t len)", then the implementation is obviously "return s + len". If the len is missing then, uh, you can't find the end unless there is a terminator character, in which case the implementation is "return s + strlen(s)".

And if you mean finding the beginning of the last codepoint, then again, you simply go to the very last byte of the string and rewind back until you see a non-trailing byte (or you've run off the beginning of the string, or you've seen more than 4 bytes in which case you have an invalid string).
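
Something like this sketch, with the bounds checks spelled out (so it tolerates truncated input):

    #include <stddef.h>

    /* Return a pointer to the first byte of the last codepoint,
       or NULL if the string is empty or ends in a malformed sequence. */
    const char *utf8_last_codepoint_start(const char *s, size_t len) {
        if (len == 0)
            return NULL;
        const char *p = s + len - 1;  /* last byte */
        int seen = 1;
        /* rewind over trailing bytes (10xxxxxx) */
        while (p > s && ((unsigned char)*p & 0xC0) == 0x80) {
            if (++seen > 4)
                return NULL;          /* more than 4 bytes: invalid string */
            p--;
        }
        if (((unsigned char)*p & 0xC0) == 0x80)
            return NULL;              /* ran off the beginning of the string */
        return p;
    }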


> I struggle to imagine such a function.

The following function will read invalid memory if there are missing trailing bytes:

    int32_t
    read_last_code_point(const char *s, size_t len) {
        s += len - 1; // place pointer at the last byte of the string
        while (!HEADER_BYTE(s)) { // read backwards until header byte found
            s--;
        }
        return DECODE_UTF8(s); // BOOM! May read beyond string end
    }
To fix it, you would either need to establish that the passed-in string was valid UTF-8, or you would need to add a bounds check to DECODE_UTF8, making it slower.

When you write string manipulation code, you come across this sort of issue all the time. In isolation, maybe you'd want to fix this function by adding the bounds check. But since you're going to see this problem everywhere, you'll probably solve it through an initial validation pass instead.


Can you give an example of how one might make use of the last code point in a string?


In practice, this sort of thing might be most likely to come up with a chars iterator. The natural iteration logic is something like:

1) If position == end, the iterator is done.

2) Otherwise, read the next byte to determine the number of bytes in the next character.

3) Read all the bytes of the next character, convert them to a regular 32-bit code point, and emit that.

In the presence of invalid UTF-8, that iterator is broken. There needs to be an extra check in step 2.5: "Check that the number of bytes reported for the next character doesn't exceed the number of bytes remaining in the string." Iterating over the characters of a string is a pretty common thing to do, and that extra branch is not free.
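
A sketch of that iterator in C (utf8_len() is just a hypothetical lead-byte-to-length helper, and the sketch still trusts the continuation bytes, so it's not a full validator; the interesting part is the step 2.5 check that the reported length doesn't exceed the remaining bytes):

    #include <stdint.h>

    typedef struct {
        const unsigned char *pos;
        const unsigned char *end;
    } utf8_iter;

    /* bytes implied by the lead byte (1-4), or 0 if it can't start a sequence */
    static int utf8_len(unsigned char lead) {
        if (lead < 0x80)           return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;
    }

    /* next code point, or -1 when done or on malformed input */
    int32_t utf8_iter_next(utf8_iter *it) {
        if (it->pos == it->end)
            return -1;                          /* step 1: done */
        int n = utf8_len(*it->pos);             /* step 2: length from lead byte */
        if (n == 0 || n > it->end - it->pos)
            return -1;                          /* step 2.5: the extra branch */
        int32_t cp;                             /* step 3: decode and emit */
        if (n == 1) {
            cp = *it->pos;
        } else {
            cp = *it->pos & (0xFF >> (n + 1));  /* payload bits of the lead byte */
            for (int i = 1; i < n; i++)
                cp = (cp << 6) | (it->pos[i] & 0x3F);
        }
        it->pos += n;
        return cp;
    }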


If the natural logic is broken then you've misunderstood the thing's nature. UTF-8 behaves more like a communications stream that was shoehorned into the purpose of character arrays. If you think about it in that way, then decoding can be done simply and elegantly:

    #include <stdio.h>

    /* bsr(u): bit index of the most significant set bit of u */
    #define bsr(u)          ((sizeof(int) * 8 - 1) ^ __builtin_clz(u))
    /* payload bits of a UTF-8 lead byte, found via the position of its highest clear bit */
    #define ThomPikeByte(x) ((x) & (((1 << ThomPikeMsb(x)) - 1) | 3))
    #define ThomPikeMsb(x)  (((x)&0xff) < 252 ? bsr(~(x)&0xff) : 1)

    void ThompsonPikeDecoder(const char *s) {
      unsigned c, w = 0, t = 0;
      do {
        c = *s++ & 0xff;
        if (0200 <= c && c < 0300) { /* continuation byte 10xxxxxx */
          w = w << 6 | c & 077;      /* accumulate 6 payload bits */
        } else {
          if (t) {
            printf("%04x\n", w);
            t = 0;
          }
          if (c < 0200) {
            printf("%04x\n", c);
          } else {
            w = ThomPikeByte(c);
            t = 1;
          }
        }
      } while (c);
    }
That code generalizes to any 32-bit number. It's the full superset of intended behaviors including things like stream synchronization. It'll even decode numbers that were arbitrarily banned by the IETF e.g. \300\200. Most importantly, since it doesn't require a validation pass beforehand, does that mean it goes faster than the OP's code in praxis? D:


I strongly question "doesn't require a validation pass".

1. Your code drops naked continuation bytes entirely.

2. If a valid multibyte character is followed by extraneous continuation bytes it will not decode correctly.

3. Using the wrong number of continuation bytes is a giant mess and can give you arbitrary numbers in multiple ways.

The second one especially is a behavior that's hard to defend.


The check only needs to be done once, when the iterator is constructed. You can also zero pad the buffer by 3 bytes to avoid a bounds check if you really want.

This seems like a premature optimization, unless you're only iterating over tiny strings.


That's a good point. I know in the case of Rust in particular there are other invariants, like that each 32-bit code point is guaranteed to have a valid value. (The compiler knows this and may use unset bits to stash enum state or something like that.) But maybe in other languages without such strict validity guarantees, it's less of a big deal?


I tried to simplify the example code for the sake of clarity.

While I acknowledge it would be unusual to provide a library API for finding the value of the last code point specifically, most string libraries provide something like a "codePointAt" function to return the code point at a specific offset. You'll run into the same problem with DECODE_UTF8 there.

    int32_t
    code_point_at(const char *s, size_t len, size_t offset) {
        const char *end = s + len;
        while (s < end && offset > 0) {
            // Move forward by 1-4 bytes depending on header byte value.
            s += UTF8_SKIP(s);
            offset--;
        }
        if (s >= end) {
            return -1;
        }
        return DECODE_UTF8(s); // BOOM! May read beyond string end
    }


The iteration count is at most 4 anyway, so we're arguing about a nanoscopic performance difference. Just check the boundary.


> If the len is missing then, uh, you can't find the end unless there is a terminator character, in which case the implementation is "return s + strlen(s)".

Sure, and if you do strlen(s) without checking if there is actually a NUL first then:

> finding the end of a string may place you into invalid memory if there are missing trailing bytes.

As was originally said. This doesn't mean it's impossible to make such functions but <rest of original comment>.


> Sure, and if you do strlen(s) without checking if there is actually a NUL first then:

>> finding the end of a string may place you into invalid memory if there are missing trailing bytes.

You're really losing me here. Either you have a NUL terminated string or you already know the length of the string or you have no way of determining the length of the string that exists in memory that it is safe for you to access and what you have is garbage.

Just because data is "valid UTF-8" doesn't make it safe to read. If the end of your string isn't marked and you don't know how long the string is already, it's over.


Or you reserved the string ahead of time or it was reused and not zeroed properly or there is more than 1 null in the data and copying must be done with care.

Again, it's doable, or you can validate the string once (including more than just "it's valid encoding") and make a ton of safe assumptions later, instead of re-checking in each thing that interacts with the string. UTF or not.


> if you do strlen(s) without checking if there is actually a NUL first

Then your problems have nothing to do with invalid UTF-8, because that issue occurs even in pure ASCII.


A function that finds the end of a string and breaks on invalid UTF-8 would be silly and is a bad example.

A function that counts the number of characters/codepoints, or maybe decodes UTF-8 -> UCS-4, may wish to operate on entire codepoints instead of always checking for a potential end of buffer mid-decoding. Or perhaps to avoid handling invalid codepoints at all.


> may wish to operate on entire codepoints instead of always checking for potential end of buffer mid-decoding

That sounds like the sort of reasoning that results in compiler bugs like:

  thing_t* nextp=p+1;
  // other declarations for this function
  if(!p) abort_thing("null pointer"); // optimised out
  /* CVE-20XX-#####: user-space code can compromise kernel $THING if it mmap()s $STUFF at address zero */


> A function that counts the number of characters/codepoints

Have you ever met a use case for the number of codepoints in a unicode string?

(One that didn't involve the moral equivalent of "because codepoints are what Twitter counts"?)


When converting from UTF-8 to UTF-32, you need to know the number of codepoints so that you know how much memory to allocate.

I feel like you're hinting at something. What is it?
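
That count is cheap to get, for what it's worth: in (assumed valid) UTF-8 every codepoint contributes exactly one byte that isn't a continuation byte, so a sketch looks like:

    #include <stddef.h>

    /* codepoint count of a buffer assumed to already be valid UTF-8 */
    size_t utf8_codepoint_count(const unsigned char *s, size_t len) {
        size_t count = 0;
        for (size_t i = 0; i < len; i++)
            count += (s[i] & 0xC0) != 0x80; /* count non-continuation bytes */
        return count;
    }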


Sure. You could also multiply by 4, wasting at worst twice as much memory as optimal on pathological input.

But wait, why are you converting from utf-8 to utf-32? To iterate through? You don't need to count your codepoints for that, you can just advance through your utf-8 linearly. To index into it? You might as well index into byte offsets—sure, that makes some possible indices invalid, but so does the finite length of your buffer.

And what are you doing with the codepoints anyway? utf-32 turns out to be profoundly useless for pretty much every concrete use case, because its whole appeal, that its code units map 1:1 to code points, presupposes that there's something you'd want to do with the codepoints.

There isn't. Code points are practically orthogonal to all of: user-perceived characters, user-perceived graphical symbols, grapheme clusters, text width. They combine arbitrarily in ways that you cannot infer without knowing the semantics of every possible code point. There is no linguistically sensible text operation you can do at the codepoint level. They are completely useless to you.

The worst part is, there are a lot of operations you can do with codepoints that seem, to cursory inspection, to do something you want. But they're wrong, all of them, necessarily so, because you're working at a level of abstraction that does not have the information needed to do just about anything correctly, but the cases that are broken are ones where none of your developers even know what the correct result should be.


It's not very sporting to rule out making things fit as a use case for measuring something.

Even measuring the width of rendered text is done to make it fit some layout.


The width of rendered text is not measured in codepoints.


If you pre-validate the string, because maybe you do a lot of transformations and operations with it later, then any functions processing strings become simpler. If you have the start of a 3-byte character, you can just read the next two bytes and don't need to check whether either of the two trailing bytes is actually a null char (in the case of null-terminated strings) or whether pos+2 < len. Because otherwise you'd do an OOB read. Also you don't need to check that the trailing bytes of your char actually start with 10xxxxxx; you can just mask off the first two bits right away.
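
E.g. the decode of a pre-validated 3-byte sequence reduces to masks and shifts, along these lines (a sketch, only valid for the pre-validated case):

    #include <stdint.h>

    /* decode a 3-byte sequence whose validity was already established:
       no bounds check, no 10xxxxxx check, just mask and shift */
    static uint32_t decode3_prevalidated(const unsigned char *p) {
        return ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6)  |
                (uint32_t)(p[2] & 0x3F);
    }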


The self-synchronizing aspect of utf-8 is one of the best things about the encoding, for sure.

But there are a lot of reasons to validate utf-8. The article mentions security vulnerabilities, and as a langsec devotee, I approve of always parsing inputs to prevent that sort of thing.

But even if you're reasonably confident that an invalid string isn't going to power a weird machine in your codebase (and are you, really?), validating is useful, because at that point you can trust algorithms that work on codepoints to actually work on codepoints, and don't have to constantly insert logic to detect malformation and re-synchronize your read, when all you want to be doing is working with your string.

An even simpler reason: It's easy to write something to work on some utf-8, and be quite sure that, for your application, it's going to be fed valid utf-8 or 'close enough'. And then some wire gets crossed and you're feeding it an mp3 or something.


It always depends, but most of the time it's better to fail fast. Invalid UTF-8 is a good indication something went wrong. The earlier the failure occurs, the easier it'll be to troubleshoot it.

Of course there are situations where you want to make do with your input and try to interpret it the best you can. Sometimes you need to be lenient in what you accept [0]. That's how I'd expect, say, a web browser to operate.

Default to not accepting invalid input. Just fail fast.

[0]: Be liberal in what you accept, and conservative in what you send - Jon Postel


> Of course there are situations where you want to make do with your input and try to interpret it the best you can. Sometimes you need to be lenient in what you accept

The best way to do this is to TURN the input into valid UTF8 though. Like replacing any uninterpretable-as-utf8 bytes with the unicode replacement character (U+FFFD).

Which is what most well-behaved software I've seen does with bad UTF-8.

I work with lots of text processing, including processing old legacy files etc. Just refusing to go forward with bad input ("just fail fast") would work a lot less well.

Can you imagine if you tried to open up something in UTF8 in a text editor, and it just refused to open it, telling you nothing but "Bad UTF8" or maybe "Bad UTF8 at byte 19734"? Not super helpful.

But yeah, you want to TURN it into valid UTF8 as early in the pipeline as possible, while logging/flagging/notifying what was done. You definitely don't want to let the bad UTF8 just continue through the pipeline like that unless you really really have to, that way lies madness.
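
A minimal sketch of that kind of repair (byte-by-byte, so nowhere near the speed of the article's validator, and it only checks sequence lengths and continuation bytes - a real one would also reject overlong forms, surrogates, and out-of-range codepoints):

    #include <stddef.h>
    #include <string.h>

    /* copy `in` to `out`, replacing structurally invalid bytes with U+FFFD;
       `out` must have room for 3 * len bytes in the worst case */
    size_t utf8_scrub(const unsigned char *in, size_t len, unsigned char *out) {
        static const unsigned char repl[3] = { 0xEF, 0xBF, 0xBD }; /* U+FFFD */
        size_t o = 0;
        for (size_t i = 0; i < len; ) {
            unsigned char b = in[i];
            int n = b < 0x80 ? 1
                  : (b & 0xE0) == 0xC0 ? 2
                  : (b & 0xF0) == 0xE0 ? 3
                  : (b & 0xF8) == 0xF0 ? 4 : 0;
            int ok = n != 0 && i + n <= len;
            for (int k = 1; ok && k < n; k++)
                ok = (in[i + k] & 0xC0) == 0x80;
            if (ok) {
                memcpy(out + o, in + i, n); o += n; i += n;
            } else {
                memcpy(out + o, repl, 3);   o += 3; i += 1; /* skip one byte, resync */
            }
        }
        return o;
    }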


> Can you imagine if you tried to open up something in UTF8 in a text editor, and it just refused to open it, telling you nothing but "Bad UTF8" or maybe "Bad UTF8 at byte 19734"? Not super helpful.

I think text editor is a good example where you need to be lenient. Furthermore, in that use case, any unnecessary modification of the data in any way is a bad idea. This includes replacing invalid code points with something else, unless they're edited by the user. The data might not even be UTF-8 in the first place.


> Be liberal in what you accept, and conservative in what you send

Terrible advice, with a long history proving why it's terrible, starting with IE6.


There have apparently been some injection attacks where an overlong ";" or something has been missed by a string sanitizer. Overlong means e.g. a 7-bit ASCII character is encoded in two or more bytes, neither of which would be noticed by an 8-bit delimiter checker; for instance ";" (0x3B) can be overlong-encoded as the two bytes 0xC0 0xBB. The only giveaway is that the payload bits in those bytes have 0's in all the high positions, i.e. they decode to a value below 0x80.


I mostly agree. In many, perhaps most cases I'd rather not specify that the input to a function is UTF-8; I'd prefer to say it's an array of bytes that will usually be valid UTF-8.

Even valid UTF-8 can turn out to be fairly unusable because it requires unimplemented features such as some corner case of bidirectional text, unexpected combinations of diacritics, or just plain characters that were added to the Unicode standard after your font was designed. So leaving validation until the point of rendering can be a good strategy, and it also means that your validation will be even more ridiculously fast in the case where the string never gets rendered at all.


I think you're confusing two concepts. There are two meanings of "invalid" here - UTF-8 strings which contain invalid/unknown code point sequences, and strings whose raw byte encoding is invalid (e.g. where the last byte is missing from a multi-byte code point).

I agree that you should usually allow and preserve unknown code point sequences, because rendering degrades gracefully and Unicode keeps changing. But the algorithm this article is talking about checks that a UTF-8 string is 'valid' at the level of the raw byte encoding. These checks are much simpler, and they're needed in basically any software that interacts with strings.


I've seen serious bugs and even security issues due to invalid UTF-8 and code just assuming it's valid (e.g. if half of a surrogate pair is present, the code just assumes the other half is there too). Display is not the only thing done with UTF-8 text, and when parsing UTF-8 strings, being able to quickly validate them is a great capability.


I'm not convinced I should be coding my own UTF-8 validation in the year 2020, unless I'm working on the run-time of a programming language.

I expect the I/O facility of a modern programming language to either reject bad UTF-8 or "quarantine" it somehow (map the bad bytes to a safe representation according to some documented rules).


Regardless of whether it’s a good idea, there is enough inertia behind it by now that making it faster is a worthwhile project. In Rust and probably many other languages, if you want to work with non-utf8-validated text you’ll be fighting the ecosystem every step of the way.


> Thus you must validate that the strings you receive are valid UTF-8.

Depending on your application, this is either unnecessary or insufficient. Consider a1="\xC3\xA4", a2="\x61\xCC\x88", a3="\xE0\x83\xA4". Assuming you're using a1 as your canonical representation of "ä", there's no legitimate difference between a2 and a3 despite the former being 'valid' UTF-8 and the latter not. (There are decent reasons why you might use a2 as your canonical representation instead, whereas the reasons for using a3 are satirically flimsy (ironically they'd better support using a4="\xF0\x80\x83\xA4"), but that would just leave no legitimate difference between a1 and a3.)


There are rather good reasons to not allow overlong encoding. For instance it allows smuggling data through unicode-unaware filters into unicode-aware applications, and while you should also catch the issue on the other side of the filter, overlong encoding allows trivially bypassing one line of defence for no reason. Such filters usually don't care much about precomposition or decomposition.
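
The nice part is that overlong forms are also cheap to reject at the byte level: the first two bytes already tell you. Roughly (ranges per RFC 3629, as a sketch):

    #include <stdbool.h>

    /* true if (b0, b1) begins an overlong sequence */
    static bool starts_overlong(unsigned char b0, unsigned char b1) {
        if (b0 == 0xC0 || b0 == 0xC1)                       return true; /* 2-byte form of U+0000..U+007F */
        if (b0 == 0xE0 && (b1 & 0xC0) == 0x80 && b1 < 0xA0) return true; /* 3-byte form of U+0000..U+07FF */
        if (b0 == 0xF0 && (b1 & 0xC0) == 0x80 && b1 < 0x90) return true; /* 4-byte form of U+0000..U+FFFF */
        return false;
    }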


That's why I said:

> > this is either unnecessary or insufficient.

There are certainly plenty of circumstances where you don't want to allow non-canonical representations; my point was that there are more non-canonical representations than just the ones that UTF-8-level canonicality checks catch.


Overlong encoding is not just non-canonical UTF-8. It is actually forbidden; it is not UTF-8. A decoder without checks may well not return the codepoints you expect at all (that would be curious, but if you try to heavily optimize one, maybe you end up with a result that does that).

So at the point where you do UTF-8 -> codepoint decoding, it is perfectly reasonable to... validate UTF-8 -> codepoint decoding? (or do a safe decoding that copes with invalid things)

Yes you might need to do other kind of validations after that, maybe (or not) some related to Unicode, maybe even some completely unrelated to Unicode and unique to your application, but so what? That does not make UTF-8 encoding validation useless.


You also said:

> there's no legitimate difference between a2 and a3

Which could not be less true. Decomposition and overlong encoding have nothing whatsoever in common.

> There are certainly plenty of circumstances where you don't want to allow non-canonical representations; my point was that there are more non-canonical representations than just the ones that UTF-8-level canonicality checks catch.

This has nothing to do with overlong encoding. Overlong encoding is nothing like decomposition. Composition and decomposition are different valid sequences of codepoints for the same grapheme cluster. Overlong encoding is the invalid non-minimal encoding of a single codepoint. It's not just a difference of opinion as to the convenience of precomposed codepoints.


That seems a bit like saying that there are more ways to die than just not wearing a seatbelt in a vehicular collision, so why bother with seatbelts.


> so why bother with seatbelts.

No, it is specifically not like that. It's like saying that, in any situation where you need a seatbelt, you also need airbags and a lack of ready-to-become-shrapnel sharp objects scattered around the inside of the vehicle.


That version just strikes out at a "seatbelts are all you need" strawman argument which nobody made, and adds nothing to the seatbelt topic.

For instance, even if a browser validates all incoming UTF-8 from all sources and rejects bad inputs, users can still be duped with URLs that use funny Unicode characters to look legit.

But that concern is simply not in the scope of UTF-8 validation and its requirements.

Whereas I might want a low-level I/O library to reject overlong UTF-8 forms at the stream level, I probably would not want it deal with Unicode ambiguity issues at that level.

A: "I have some code to do fast UTF-8 validation"

B: "That's nice, but you need to restrict more than just UTF-8 to canonical form in order to catch all attacks."

B is just signaling that they know a lot about the larger issue from various angles.


Bob Steagall gave excellent talks at CppCon on using SIMD for utf8:

https://youtube.com/watch?v=5FQ87-Ecb-A

https://youtube.com/watch?v=qejTqnxQRcw


I had a follow-up thread with the author on Twitter, and one part of the talk ventured into whether UTF-8 made sense for Eastern scripts. I looped in one of my ex-Apple colleagues who worked on the transition of Swift from UTF-16 (the Mac default) to UTF-8, for the rationale of why this made sense.

That led to a whole discussion of the benefits of self synchronising bytes in UTF-8 as well as a gratuitous self promotion of my “history of Unicode” talk at https://speakerdeck.com/alblue/a-brief-history-of-unicode

The end of the Twitter thread is here, if you’re interested:

https://twitter.com/alblue/status/1319017092004315138


I like the (pseudo-?)code snippet in the article:

  simd8 classify(simd8 input, simd8 previous_input) {
    auto prev1 = input.prev<1>(previous_input);
    auto byte_1_high = prev1.shift_right <4>().lookup_16(table1);
    auto byte_1_low = (prev1 & 0x0F).lookup_16(table2);
    auto byte_2_high = input.shift_right <4>().lookup_16(table3); 
    return (byte_1_high & byte_1_low & byte_2_high);
  }
Is this coming from an existing library? If not, would it be generally possible to write something like this rather than dealing with some garbage like _mm_storeu_si128?
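
For what it's worth, the three operations used in that snippet map pretty directly onto SSSE3 intrinsics, so a thin wrapper along these lines looks doable (a sketch in plain C; the function names are mine, not from any library):

    #include <tmmintrin.h> /* SSSE3: _mm_alignr_epi8, _mm_shuffle_epi8 */

    /* input.prev<1>(previous_input): every byte shifted down one position,
       with the last byte of the previous 16-byte block shifted in at slot 0 */
    static __m128i prev1(__m128i input, __m128i previous_input) {
        return _mm_alignr_epi8(input, previous_input, 15);
    }

    /* shift_right<4>(): logical right shift of every byte (extract high nibble);
       there's no 8-bit shift in SSE, so shift 16-bit lanes and mask */
    static __m128i shift_right_4(__m128i v) {
        return _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
    }

    /* lookup_16(table): PSHUFB, a 16-entry byte table indexed by nibbles 0..15 */
    static __m128i lookup_16(__m128i table, __m128i nibbles) {
        return _mm_shuffle_epi8(table, nibbles);
    }

With those three helpers plus _mm_and_si128, the classify() step reads almost one-to-one with the article's version.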


On the other hand, I have absolutely no idea what this code is doing. I'd rather read the intrinsics or the actual assembly code.

Like, what is `input.prev<1>(previous_input)` supposed to do? And with this syntax, I don't know where to start searching either.


There's actually a comment in the paper about what it does:

> shift the input by 1 byte, shifting in the last byte of the previous input

I'm unsure that I think the naming or parameter order is very good, but being able to lift the code into something nicer than intrinsics seemed nice to me.


How does it compare in performance with the optimized code from ClickHouse? https://github.com/ClickHouse/ClickHouse/blob/c0fef71507b43c...


The algorithm used by ClickHouse was "inspired" by https://github.com/cyb70289/utf8/ ... Which is known to be slower than this new method from simdjson.


Long ago, I wrote a database program in Turbo Pascal to browse a 300k text database under MS-DOS. My search function took about 30 seconds or so, if you had a hard drive. I was challenged to make it faster... I got it down to 3 seconds, in the days of 286s.

I believe I could now get it down to 1 millisecond on my laptop using the techniques shown.

This hack amazes me.


The author seems to have ended up in a silly debate when defending the use of SSE3 (and not SSE2).

"You are absolutely not going to 100% SSE2 out in the real world. In fact, you have many 16-bit x86 systems out there. There are people running companies on Apple II computers."

Right, "companies running business on Apple II computers" is a relevant argument for... absolutely nothing.


Note that the author is not defending the use of SSE3 (https://en.wikipedia.org/wiki/SSE3), the author is defending the use of SSSE3 (https://en.wikipedia.org/wiki/SSSE3), which is something completely different. In fact, the whole point of this post is a novel use of the PSHUFB instruction, which comes from SSSE3.

I happen to have a somewhat old laptop right next to me (it originally came with Windows Vista). According to /proc/cpuinfo, it does have SSE, SSE2, and SSSE3 (but not SSE3), so even that old laptop is already new enough to run the proposed algorithm.


AMD added SSSE3 with the FX processors; Phenom didn't have it. It's not unreasonable to still expect there to be users without SSSE3 - e.g. across all Steam users, over 1% don't have SSSE3 [0].

[0] https://store.steampowered.com/hwsurvey/


Is there any indication of the size of the lookup tables in question?


Source code here: https://github.com/lemire/validateutf8-experiments/tree/mast...

Looks like a very tiny LUT, perhaps it's all in SIMD registers.

In any case, for anything to go at 12 GB/s, the LUT needs to be tiny and pretty much in SIMD register(s).

Any indirect memory accesses would drastically slow things down, even if it's in L1 cache [0]. At least by an order of magnitude.

Even when using something like x86 V[P]GATHER [0] (an element-wise vector random-access load). x86 gather is currently only slightly better than scalar code, even if all the data is on the same cache line and present in L1D!

[0]: https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd


48 bytes


To be specific, three lookup tables of 16 bytes each, each table having 4-bit keys.


From the paper: The first lookup table is 256 entries.

I assume there are more lookup tables as well, but I haven't read far enough to know what they all are.



