Not really. A Unicode string is more like a sequence of data built from simple binary structs, drawn from a smallish set of valid structs. Additionally, some, but not all, of these structs can be used to infer the validity of subsequent structs in the sequence if you're parsing in a byte-at-a-time fashion. Alternatively, if you're happy with a little less forward compatibility, you can explicitly enumerate every group of valid bytes and be a lot more sure of things, but it's harder to make that method as performant as the byte-at-a-time one, which, given the complete ubiquity of string processing in software, leads to the dominance of the byte-at-a-time method.
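To make the byte-at-a-time idea concrete, here's a minimal sketch of a UTF-8 validator in that style: the leading byte of each encoded code point tells you how many continuation bytes must follow, and that's all the state you carry forward. This is my own illustrative code, not any particular library's implementation, and it deliberately skips the extra checks (overlong encodings, surrogates, the U+10FFFF ceiling) that the explicit-enumeration approach gets for free:

```python
def utf8_valid(data: bytes) -> bool:
    """Byte-at-a-time UTF-8 check: each lead byte predicts how many
    continuation bytes (0b10xxxxxx) must follow it.

    Sketch only: accepts some overlong/surrogate sequences that a
    strict validator would reject."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: ASCII, stands alone
            needed = 0
        elif b >> 5 == 0b110:       # 110xxxxx: 1 continuation byte
            needed = 1
        elif b >> 4 == 0b1110:      # 1110xxxx: 2 continuation bytes
            needed = 2
        elif b >> 3 == 0b11110:     # 11110xxx: 3 continuation bytes
            needed = 3
        else:                       # stray continuation or invalid lead
            return False
        for j in range(1, needed + 1):
            if i + j >= n or data[i + j] >> 6 != 0b10:
                return False        # truncated or bad continuation byte
        i += needed + 1
    return True
```

The strict alternative replaces those shift-and-compare branches with an explicit table of allowed byte ranges per state, which is exactly the enumeration trade-off described above.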