What I'm really curious about is how bugs/errors in the iOS typesetting algorithm result in a crash, rather than just wrong or nonsense typesetting -- and how the last time this happened, they apparently just fixed the specific case, but not the ability of bugs/errors in the typesetting algorithm to crash the system.
I am not surprised there are errors/bugs in the typesetting algorithm; as the OP demonstrates, this stuff is extraordinarily complicated to do for every language/all of Unicode. But if you have something so complicated that bugs/errors aren't surprising, you'd think you'd want to make sure they didn't cause hard crashes.
I wrote some similar bugs to this in the complex text handling in Chrome.
In text layout you do a lot of indexing into various arrays -- like the array of code units of the input string, or an array of metadata collected per code point, or an array of data collected per grapheme. Often those arrays are all the same length (as in simple text like Chinese), and mixing up which index to use where is no problem. And then when it goes wrong you're violating array bounds, which is a crashable offense.
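A minimal sketch of that hazard, in Rust for brevity (the string and the stray index are made up; this is not Chrome's code). The same text has a different length in UTF-8 code units, UTF-16 code units, and code points, so an index computed against one array can be out of bounds in another:

    fn main() {
        let s = "క్ష"; // Telugu ka + virama + ssa: one user-perceived letter

        println!("{} UTF-8 bytes", s.len());                        // 9
        println!("{} UTF-16 code units", s.encode_utf16().count()); // 3
        println!("{} code points", s.chars().count());              // 3

        // Per-code-point metadata: 3 entries, one per code point.
        let is_virama: Vec<bool> = s.chars().map(|c| c == '\u{0C4D}').collect();

        // Mistakenly using a byte index to look up per-code-point data:
        // Rust's checked access returns None; an unchecked C/C++ [] would
        // silently read out of bounds here.
        let byte_index = 5;
        println!("{:?}", is_virama.get(byte_index)); // None
    }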
I agree that it's stupid to have code that causes crashes here. My only defense is that I had a lot of other work to do, and complex-text pathological cases only affect a fraction of users, none of whom were the ones yelling at me about other bugs. I am rooting for Servo, where they are using a language that defends against bad programmers like me.
PS: If a web page wants to crash, it can easily do so by allocating memory in a loop, so making web pages that crash isn't as exciting as it is in Core Text in general. Of course crashes can often be escalated into RCE, but the Chrome sandbox was there to mitigate that.
> PS: If a web page wants to crash, it can easily do so by allocating memory in a loop, so making web pages that crash isn't as exciting as it is in Core Text in general.
To allocate memory in a loop, you need some control over the JS. Websites try hard not to serve untrusted JS.
But websites serve untrusted text without a second thought. For example, I could post a comment on a news article and cause the article to be unviewable by anyone with a vulnerable browser.
How would using Rust help this case? An out-of-bounds array access would still lead to a crash, and thus a DoS in the crashing application. You could sandbox the text rendering into its own process to solve that, but then you could do that with unsafe languages anyway.
Fully sandboxing an unsafe renderer might have unacceptable performance. E.g. you'd have to reset the internal state after every call; otherwise invalid text on a phishing website might be able to subvert the renderer and make text in the URL bar read as something different.
An out-of-bounds access doesn't have to lead to a crash. For most types in the standard library, the [] operator crashes, but the "get" function returns an Option which you can deal with.
The difference is that in Rust the [] operator does bounds checking: it'll reliably panic before accessing memory if the index is out of bounds, while in C++ the [] operator will happily let the program read or write outside the array bounds.
Depending on compile-time options, a Rust panic can either cause an immediate crash, or do something similar to throwing a C++ exception, complete with stack unwinding.
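A minimal sketch of the two access styles (assuming nothing beyond Rust's standard library):

    fn main() {
        let advances = vec![10, 12, 9];

        // Checked access: absence is a value the caller must handle.
        assert_eq!(advances.get(7), None);

        // Indexing is also bounds-checked: this would panic deterministically
        // ("index out of bounds") rather than touch memory past the array.
        // let x = advances[7];
    }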
Yeah, and because it's reliable, you don't need to worry about security issues -- the worst this can do is abort the application.
Whereas for this bug it's quite possible that it may be exploitable. Especially given that the crash backtrace doesn't always appear in the same place -- something is corrupting memory that only gets discovered later. (This explains why I can sometimes get the string to render for a split second before crashing.)
Wouldn't any language with exceptions work here? You just define an indexing operator that throws an exception instead of crashing and handle that exception outside the unicode handling function.
The advantage of Rust's error handling is that it is explicit. The compiler knows that a function might result in an error and forces the programmer to deal with it, or pass it along.
In an exception based language you might forget to deal with the error and have it crash "higher up" in the code.
I also suspect that there might be a performance benefit, but I could be completely wrong about that.
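For illustration, a sketch of what "pass it along" looks like when the possibility of failure is part of the function's type (the names here are made up):

    // The signature itself says the lookup can fail; callers can't ignore that.
    fn advance_at(advances: &[u32], i: usize) -> Option<u32> {
        let a = advances.get(i)?; // propagate the failure explicitly
        Some(*a)
    }

    fn main() {
        match advance_at(&[10, 12, 9], 7) {
            Some(a) => println!("advance: {a}"),
            None => println!("index out of range -- handled, not crashed"),
        }
    }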
It's actually slower to not use exceptions, and what you describe is not an advantage -- exceptions also force you to deal with them or pass them along, if the exception is checked. Bounds-check failures aren't checked, of course, because that would be incredibly inconvenient and unwieldy; and anyway, you'd just pass it all the way up the stack to some much higher-level point, which is the only place you can sanely do something (like not render the string at all).
> The advantage of Rust's error handling is that it is explicit. The compiler knows that a function might result in an error and forces the programmer to deal with it, or pass it along.
Checked exceptions are explicit as well, though I'm not aware of a language that _only_ has checked exceptions.
> I also suspect that there might be a performance benefit, but I could be completely wrong about that
Yes, this is a big advantage. Stack unwinding is very expensive.
> PS: If a web page wants to crash, it can easily do so by allocating memory in a loop,
isn't this also a bug :)
And probably one we should fix... I think all browsers are susceptible, though last I tried in Firefox one had to not just allocate but also fill the memory with garbage.
He means crash the single web page doing the loop... how would you fix this? Isn't the correct behavior when an application goes past resource limits to crash the app? It doesn't crash the browser or any other open tabs.
> Isn't the correct behavior when an application goes past resource limits to crash the app?
Maybe, or maybe it would be better to exponentially decrease the CPU time awarded to a tab over time. And ask the user to confirm that this tab should be allowed to do CPU/memory intensive computations, and otherwise reduce the frame rate, the number of ticks, or crash early.
Maybe big memory allocations and CPU intensive stuff should only be allowed in background workers. And even then, we should still reduce resource allocations to save battery, etc., and allow users to award more resources.
> He means crash the single web page doing the loop... how would you fix this?
It used to be that you could bring my entire system to a halt by allocating too much memory in the browser (mostly Chrome). Mostly due to swapping and just plain bad system configuration, I suspect :)
Better process management and limitations between tabs on browsers can help.
For instance, this site will crash most Firefox browsers but not Chrome because they limit the modal dialog rates: fan-pages[dot]herokuapp[dot]com (be warned that it may crash your browser even with JS disabled).
> And then when it goes wrong you're violating array bounds, which is a crashable offense.
Why is this a crashable offense? Probably a bit naive as a C# application developer, not a systems developer, but shouldn't all this code be in the equivalent of a try/catch? If it throws an out-of-bounds error, just log the error and return, rendering what you've got.
Probably caused by an out of bounds write on some heap buffer. If this bug is also present on OS X, it would be interesting to see where it crashes with some of the malloc debugging flags enabled (https://developer.apple.com/library/content/documentation/Pe...), hopefully to get a crash a bit closer to the root cause.
The crash seems to be in CoreText. CoreText is embedded/linked in Messages, Spotlight, Springboard, etc. CoreText is written in C.
The fix would be to rewrite CoreText in a memory safe language like Swift. This would be “hard”. Or put CoreText in an XPC container. This would both be “hard” and result in terrible performance.
For more details on how hard C, memory management, systems programming, and operating system development are, please refer to your local copy of Modern Operating Systems by Andy Tanenbaum.
Or just move CoreText into its own process, and restart it when it crashes.
The big issue is that when CoreText crashes right now, the kernel panics and the device restarts. If CoreText itself could crash safely, get restarted, and the OS continue running, then these bugs would go from "significant" to "annoying." Even if CoreText crashing caused individual apps to also crash, that would be a big improvement over the current situation.
Obviously we'd all like bug free fonts and text rendering, but if we call that goal aspirational (read: impossible), the best we can hope for today is handling the fault cases better than they're handled today. Bootloops are a pretty lame user experience.
From what I've gathered so far, this doesn't hit the kernel but the process. It turns out that on iOS one such process happens to be Springboard, hence the UI (but not the kernel) gets a kick and restarts.
Yeah, Springboard is responsible for not just the home screen but also notifications. So if that sequence of characters arrives in a notification, Springboard will restart… and try to show the notification again… and restart…
On the other hand I had the same thought (but better self-restraint than you) -- however, I was thinking "nah, scrubbing bad messages server side is just an s/badstring//, and I am sure the major non-encrypted messenger apps (where the server knows the strings) added that server-side, so people couldn't crash their contacts' apps, which the app company might get blamed for." This kind of hotfix shouldn't have negative effects; I'm sure there are already a few server-side manipulations of text (stuff like adding a space to very long lines, maybe a blacklist of certain malicious URLs, that sort of thing).
So I'm surprised your message was delivered as sent (if it's not encrypted end to end), unless you did this right when the news broke.
Though it seems like providers have not yet figured out the full set of crashy things (an overly conservative thing to do would be to filter out zwnjs in <consonant, virama, consonant, zwnj, vowel> for the three languages listed). Twitter blocks the original one but not any Bengali variants; গ্য + zwnj + a bengali vowel will still crash it.
Unfortunately, enumerating badness just a stopgap measure - as this seems, so far, to triggered by a specific combination of character classes, it at least possible that there a non-malicious yet crashy string: what now, if the Knights of Ni cannot stand to hear it, but if it a part of the message? The recipient might feel that something not right with the message, and the sender might not even know that the message has censored because a part of it seems to harmful tó intermediary code. (See what I have doing here?)
So, you're right - and the point you raise at the end (with your illustrative example) is a good one. It would be wrong for HN software to silently not deliver your message to me without telling you - just because tó was on some blacklist for some reason.
If it's possible to write "Your message could not be delivered" when messages match the blacklist (even leaving the sender to guess at what they did wrong) it would be better.
As a practical matter if you haven't built the infrastructure into your clients to tell the sender that their message won't be delivered, none of the choices the platform operator has seem great:
- Silently drop a few kinds of messages without informing sender. Seems bad for the reason you outlined.
- Silently modify messages before delivery, modifying them so they won't crash clients. This seems potentially very wrong.
- Deliver messages even if you know for sure they will crash the client upon view.
Doesn't seem great to me either.
I guess the real solution is to have robust forced-upgrade on the client (after all, it's your software, you're responsible for it, and if you build it to include updates then that is on you), but some users object to that, and I suppose they could be justified -- it is also a massive responsibility.
I guess there really aren't any perfect answers here.
You'd probably need one CoreText process per application, which seems suboptimal. If you didn't, you'd end up having one crash impact all processes (plus it opens you up to things like trying to steal data between processes). There's another problem, which is that CoreText is intended to be an extremely efficient API for processing lots of text. It would seem hard to do that while maintaining the performance requirements.
There are usually less dramatic fixes, like changing all the array accesses to be checked, or putting pages that will trigger a fault around the buffers that the library uses and handling the fault that hitting those pages generates.
Putting a buggy system in a memory safe environment is certainly not 'the fix'. The fix is to find the precise bug or architectural deficiency and fix it.
It's easier to fail-safe something than to make it perfect.
Even better, when you fail-safe you plan for the (unknown) future.
That's why we have circuit breakers, hydraulic and electric fuses, pressure relief valves, etc.: because no one thinks they can know all the things that can go wrong in the future (with catastrophic consequences) and plan for them.
That’s the reasoning behind the Erlang “let it crash” philosophy. It’s not advocating poor programming; it’s asking processes to handle whatever issues they can within reason, but otherwise to crash and be restarted by their supervisor process, rather than try to carry on in a probably erroneous state.
It’s also a recognition that in complex systems, something unanticipated is going to go wrong sometimes, and rather have a plan for handling the failure than pretend that the system will never hit a really bizarre failure mode.
Your circuit breaker analogy made me think of this.
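A toy sketch of that philosophy in Rust, using an in-process thread as a stand-in for a supervised process (the "renderer" and its failure condition are invented):

    use std::thread;

    fn main() {
        for attempt in 1..=3 {
            let worker = thread::spawn(move || {
                // Stand-in for a renderer that can hit an unrecoverable state.
                if attempt < 3 {
                    panic!("invariant violation");
                }
                println!("rendered fine on attempt {attempt}");
            });
            // The supervisor doesn't try to patch up the bad state --
            // it just notices the crash and restarts the worker.
            if worker.join().is_ok() {
                break;
            }
            eprintln!("worker crashed (attempt {attempt}); restarting");
        }
    }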
Huh? C is fast (compared to Swift) because using it doesn't imply sprinkling lots of sugar (like ARC) into the resulting machine code.
Simpler languages like Fortran can turn into even faster code than a C implementation. UB optimizations aren't that relevant for real-world performance.
Code generated by C compilers for C64, Spectrum, Atari, Atari ST, Amiga, Mac, CP/M, MS-DOS, Windows 3.x, Nintendo, Mega Drive, ... systems meant that many times the code would be 80% like this:
void some_func(/* params */) {
    asm {
        /* actual "C" code as inline Assembly */
    }
}
Lots of Swift sugar also gets optimized away, and there is plenty of room for improvement.
The code that current C compilers manage to not generate is, many times, related to taking advantage of UB.
They also generate extra code for handling stuff like floating point emulation though.
Just as an example, IBM did their whole RISC research using PL/8, including an OS and optimizing compiler using an architecture similar to what LLVM uses.
They only bothered with C, after making the business case that RISC would be a good platform for UNIX workstations.
Why bring these ancient home computer platforms into play? Those were totally different to program for. Why not compare a C compiler from 1998 to one from 2018, on x86 (no SSE of course)? C compilers have gotten better, but not spectacularly.
>> The code that current C compilers manage to not generate is, many times, related to taking advantage of UB
Compilers are really smart at optimizing things that aren't relevant to the real world.
For example, this code would reduce to "return 32" in most modern compilers:
int return32() {
    int x = 1;
    for (int i = 0; i < 5; i++) {
        x *= 2;
    }
    return x;
}
Does that make an impact in real-world code? Almost certainly not; it's a contrived case. Most UB cases fall into the same category.
>> They also generate extra code for handling stuff like floating point emulation though.
> Why bring these ancient home computer platforms into play? Those were totally different to program for. Why not compare a C compiler from 1998 to one from 2018, on x86 (no SSE of course)? C compilers have gotten better, but not spectacularly.
To clear up the myth among young generations that C compilers always generated fast code, regardless of the platform.
As for something more modern: in 1998, C code quality was still at a similar level to other systems languages, before those started to fade away thanks to the increase in UNIX, Linux and BSD adoption.
For example, given that Delphi and C++ Builder share the same backend, their generated code was quite similar, even if it would require disabling some of Delphi's security checks.
You realize that there's no such thing as a "don't ever crash" fix, right? Maybe they added some defensive code and maybe they didn't, but if they're using an unsafe language, there's always the possibility of more such issues.
> You realize that there's no such thing as a "don't ever crash" fix, right?
You answered universally for all contexts, ever, everywhere. That's usually quite a foolish thing to do. In ObjectStudio Smalltalk, there was actually a place where you could define an empty lambda as the "top-level" exception handler. Once you did that, all Smalltalk exceptions did nothing. There you go: a "don't ever crash" fix, in a language many would call "unsafe." You are now technically wrong, which is the best kind!
Exceptions can be caught in C++, Objective-C, and in Swift. You can even do this for C. There is apparently a history of similar bugs where certain data crash processes in iOS. Depending on how you count, this is either #3 (strings) or #5.
Given that, why wouldn't Apple take steps to make sure that certain critical processes are architecturally immune to this sort of thing? Springboard going away is pretty horrendous. Messages app, given that it's a core functionality for a phone, is almost as bad.
> if they're using an unsafe language, there's always the possibility of more such issues.
There are software projects where certain things simply can't be allowed to happen, ever. Apparently, Apple isn't operating at that level.
Yikes. Instead of having heap corruption which immediately causes a hard failure, you want heap corruption which is silently ignored and (?????) happens on a device in which people access banking and all their personal identity stuff?
No. But the heap corruption shouldn't make certain facilities go away, or leave them hanging around looking broken. For example, Springboard could be separated into two processes (display and monitor) and built on some kind of event queue. Then, if an event containing some kind of poison pill brought down the display, the monitor could note the crash and, after a few retries, evict the poison-pill event and bring the display back up without it.
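A sketch of that monitor/evict idea (a hypothetical design, not how Springboard works; an in-process panic stands in for a child-process crash):

    use std::collections::VecDeque;
    use std::panic::{self, AssertUnwindSafe};

    fn display(event: &str) {
        // Stand-in for rendering; one event is a poison pill.
        if event.contains("poison") {
            panic!("renderer crash");
        }
        println!("rendered: {event}");
    }

    fn main() {
        let mut queue: VecDeque<(&str, u32)> =
            vec![("hello", 0), ("poison pill", 0), ("world", 0)].into();

        while let Some((event, failures)) = queue.pop_front() {
            let ok = panic::catch_unwind(AssertUnwindSafe(|| display(event))).is_ok();
            if !ok {
                if failures + 1 >= 3 {
                    eprintln!("evicting {event:?} after {} crashes", failures + 1);
                } else {
                    queue.push_front((event, failures + 1)); // retry a few times
                }
            }
        }
    }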
Someone who gives up on something which causes such a huge hole in the user experience, saying to themselves, "Uhhhh, there's no such thing as a never-crash fix," is not even a 2nd-rater. You're not a 1st-rate programmer if you only think one step ahead and say, "Uhhh, you can't guarantee no crashes," then leave such a huge hole in your system. The 1st-rater engages in a bit of lateral thinking. The real problem isn't to eliminate crashes. The real problem is to eliminate the hole in the UX! Anything you can detect, you can "fix," and sometimes a guaranteed "fix" is every bit as good or better.
(It's exactly this kind of mediocre thinking that led to the hardware quality doldrums in the '90s. The OS crashed so often, hardware manufacturers started to make cheap machines that could only stay up for a few days anyway.)
> this stuff is extraordinarily complicated to do for every language/all of Unicode.
Perhaps this sounds a bit Anglocentric, but isn't it unfortunate that this string crashes the devices even of people who have never heard of, and likely won't ever need to use, the language it's written in? The majority of people use a tiny fraction of Unicode -- the parts that cover the languages they use; everything else is useless to them, or in cases like this, even a liability. It would greatly reduce the number of affected devices, especially with bugs like this having possible security implications, if the text rendering system were more modular and perhaps divided into separate optional components: Latin (maybe not optional), CJK, and other complex scripts. I know Windows has/had a similar feature:
This way, those who have no need for anything other than Latin scripts get the simple and hopefully much less buggy rendering algorithm, while those who do need the others can install them without unnecessarily burdening everyone else.
- linebreaking support for East and Southeast Asian languages: many of them don't use spaces, so if you want to know where you can break lines it is best to know what the words are. For this you need a pretty large dictionary file stored.
- fonts
- probably various assets for RTL languages (changed backgrounds, changed layout files, etc)
This is Windows providing this option for a very specific thing (and also downloading some files), for which there is okayish fallback functionality. Not cfg'ing the entire text stack.
I don't think you can easily slice and dice a text stack so that you can only pull in the components needed for Latin scripts without making it even more prone to bugs. You could write entirely separate stacks specialized for each group of scripts. But you'd probably end up with one for Latin/Cyrillic/Greek, one for Chinese/Japanese (not Korean) and one for "all the rest". There's enough feature overlap between most of the complex languages that there's questionable benefit to separating that out.
For example, most of the underlying text functionality in Telugu or other Indic scripts is not overall different from that in Hangul or Arabic (Arabic is more complex, actually); it's just that Telugu has certain features that press the buttons in just the right way to cause this crash. Which means that if you want to prevent this crash for other language users, what you need to do is not attempt to render Telugu, not swap out the font stack.
Like, looking at the last iOS crash that happened -- with the Arabic text -- that was because chopping off the end of a string of Arabic text doesn't guarantee that the string will get shorter. Really, you can replicate this for most scripts, it's just subtler. (Even English has support for this, if your kerning is extreme enough. As long as your stack supports kerning, it supports everything necessary to hit this bug.) So the root text-stack functionality that lets this happen is necessary for all scripts; Arabic just ends up pushing the right buttons in the right order to cause a crash.
Why does any bug result in a crash rather than just random unexpected behavior? There's a limit to how much it's practical to isolate things. For example, Chrome uses different processes to render different web pages, so a crash in one renderer shouldn't affect the others, but it doesn't parse HTML in one process, compute layout in another, execute JavaScript in another, and so on.
I read that it was related to text-boxes so I figured that could be a potentially hidden culprit. I haven't heard of web pages crashing due to embedded text but if so then I retract my hypothesis.
The 2nd sentence of the article says "text box", but I'm not sure what the "other places" are:
"Basically, if you put this string in any system text box (and other places), it crashes that process."
From the article's source article: "iMessages, [...] Facebook Messenger, WhatsApp, Gmail, and Outlook for iOS [...] can become disabled once a message is received". "It might be difficult to fix and delete the problem message".
From the article: "I’ve been testing it by copy-pasting characters into Spotlight so I don’t end up crashing my browser". "I can cause this crash to happen more reliably in browsers by clicking on the string".
So, not text-box specific and happening on display for a wide variety of applications, including web pages.
My guess would be that it's some aspect of measuring the text that is causing the crash: when you click in an editable text box, there is code to track down where the cursor should be placed. This is done by measuring various sub-strings of the whole line.
If measuring the sub-strings gives surprising results (sub-strings being visibly longer for example), this could cause the algorithm to fail in any number of interesting ways: for example if a binary search is used to locate the cursor position, it could break the invariants of the binary search.
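For illustration, a sketch in Rust of how such a search can go wrong (hypothetical code, not Apple's): the search assumes prefix widths grow with prefix length, which complex scripts can violate:

    // Find the prefix length whose rendered width first reaches click_x,
    // assuming widths[] is sorted. With complex text it may not be.
    fn caret_prefix_len(widths: &[f32], click_x: f32) -> usize {
        let (mut lo, mut hi) = (0, widths.len());
        while lo < hi {
            let mid = (lo + hi) / 2;
            if widths[mid] < click_x { lo = mid + 1 } else { hi = mid }
        }
        lo
    }

    fn main() {
        // A truncated string that renders *wider* than a longer one
        // breaks the sortedness invariant:
        let widths = [0.0, 10.0, 25.0, 18.0, 30.0];
        // The search still terminates, but "everything to the left is
        // narrower" is now false; downstream code that trusts the answer
        // to index other arrays can end up out of bounds.
        println!("caret at prefix length {}", caret_prefix_len(&widths, 20.0));
    }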
Well, the crash occurs for Spotlight without me clicking anything or having any cursors anywhere.
But yeah, this is one of my theories about it. One of the previous crashes had to do with an Arabic string which got longer when you truncated it, which made snipping it to display in a notification have bugs.
It's interesting to see it's causing a segfault; I'd expect measuring bugs to cause clean assertions or shitty rendering. Which is why I'm also wondering if it's actually a disagreement on the number of "characters" in the rendered things.
> If measuring the sub-strings gives surprising results (sub-strings being visibly longer for example), this could cause the algorithm to fail in any number of interesting ways: for example if a binary search is used to locate the cursor position, it could break the invariants of the binary search.
Cursor positions are based off of grapheme clusters -- there's a defined algorithm for that. Though different parts of the system may disagree on the specifics of the algorithm, causing such a crash.
However, that doesn't gel with the fact that it's only specific consonants causing this; all versions of UAX 29 do not consider any differences between Indic consonants within a single given script.
> Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts. Aksaras usually consist of a consonant, sometimes with an inherent vowel and sometimes followed by an explicit, dependent vowel whose rendering may end up on any side of the consonant letter base. Extended grapheme clusters include such simple combinations.
> However, aksaras may also include one or more additional prefixed consonants, typically with a virama (halant) character between each pair of consonants in the sequence. Such consonant cluster aksaras are not incorporated into the default rules for extended grapheme clusters, in part because not all such sequences are considered to be single “characters” by users. Indic scripts vary considerably in how they handle the rendering of such aksaras—in some cases stacking them up into combined forms known as consonant conjuncts, and in other cases stringing them out horizontally, with visible renditions of the halant on each consonant in the sequence. There is even greater variability in how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are handled for display in combinations in aksaras. So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful.
For example, in Chrome, we added an extra rule to not allow grapheme clusters to be split after Indic virama characters, but later had to modify the rule to not apply to Tamil viramas:
I don't know the exact cause of this crash, but I can see why Apple might be running into trouble with their logic for these languages. I suspect their algorithm for computing grapheme clusters has a bug causing an inconsistency somewhere.
However, given that some Brahmic scripts prefer explicit viramas (Malayalam, also Thai I think), this will probably be restricted to Brahmic scripts where joining is always preferred (even if not possible).
I'd been testing UAX 29 stuff out before and Apple seems to follow the spec. For example, Chrome and Firefox seem to do special handling for e.g. flag emoji (distinguishing between regional indicator pairs that render as a flag vs those which don't -- i.e the ones which don't correspond to a country code). But Apple follows the spec rigidly. In particular it does not consider joined consonants to form a single EGC.
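You can poke at the default segmentation yourself. A sketch using the unicode-segmentation crate (an assumption on my part -- any UAX #29 implementation would do); different tailorings may split this differently, which is exactly the kind of inconsistency that could bite:

    // Cargo.toml: unicode-segmentation = "1"
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        // Telugu ja + virama + nya + ZWNJ + vowel sign aa
        let s = "\u{0C1C}\u{0C4D}\u{0C1E}\u{200C}\u{0C3E}";
        // Print sizes only, to avoid feeding the string to a text stack.
        for (i, g) in s.graphemes(true).enumerate() {
            println!("EGC {}: {} code point(s)", i, g.chars().count());
        }
    }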
I cannot trigger this crash on iOS 10.3.2, despite repeated attempts. I can reliably crash friends' phones by iMessage for all friends on iOS 11.
That suggests to me that Apple made changes to CoreText, and did not perform adequate regression testing.
All software has bugs. I understand that. But I suspect a large part of power users' and developers' growing frustration with Apple is that they keep introducing severe, kernel-panicking, root-exposing bugs in software that previously did not exhibit the problematic behavior.
Honestly, how do you not have a stringent regression testing requirement for changes to the "Core" of the operating system?
> Honestly, how do you not have a stringent regression testing requirement for changes to the "Core" of the operating system?
By not managing expectations and deadlines properly. Especially if schedules are overridden by a manager who is not aware of the ramifications of changing something in the core.
Well, is it really bad to have bugs though? I think it can be good: it's free advertising, it gets people's attention for some time. Which is good. Even being hacked is good; you can always turn it into PR. Maybe Apple doesn't need this as much, but 99.9% of other companies do. Perhaps security consultancies should even provide a new kind of service, "fake hacking", which is really a PR campaign. You introduce a bug which is kind of severe but not really detrimental. It's better if it allows compromising a user. Publish tutorials, videos, tweet about it, create panic, make it cool to abuse the system. Give it a name, create a dedicated web page. And then you squash the bug, you fix it. People will remember that you fix stuff, so your company and product must be good because you care.
Ooh, so that's what Therac did? Or perhaps mangle the drive-by-wire software so that the stopping distance is just 10% longer - a few extra feet never killed anyone, eh? Or something non-life-threatening: silently truncate all passwords to 8 characters, not like anyone would abuse this to compromise the user (and it's the user's problem anyway, not the vendor's).
Irony aside: you are, perhaps unintentionally, omitting from your narrative any and all damage that would be caused by such a deliberate bug -- the vendor is usually the only one who can fix it, but not the only one who can exploit it. Also, what of unpatched devices, and of liability (you are introducing a backdoor, intentionally)? And realistically, your original change might introduce more holes than you bargained for, or the fix might. This is a horrible idea on so many levels, even discounting its inherent evil.
> Honestly, how do you not have a stringent regression testing requirement for changes to the "Core" of the operating system?
I have never had the fortune of working for a manager who prioritized quality over (sometimes even imagined) random deadlines.
Developers' complaints sound like gibberish to them ("Plus, they always complain, such perfectionists!"), and when the shit hits the fan, sometimes years later, well, it might not even be their problem -- they've been promoted by then, fucking up another unfortunate team.
My comment is slightly rhetorical. "Bad management" and/or "misplaced company values" are the answers, and they're exactly my point. Apple has lost touch with the perfectionism that made it so wonderful.
I think perhaps Tim Cook doesn't realize that Apple didn't create all this emotional loyalty from artists just by having pretty boxes and wallpaper, but by cultivating a kindred spirit with them. And to do that, you have to actually care about the quality of your work on a fundamental level.
Very interesting article. It gives a bit more perspective to the problem than the simplistic view that iOS has a problem displaying a certain character onscreen.
This particular bug shows the problem with multibyte characters (terminology may be wrong, perhaps "multi-glyph" is better), where certain parts of the character become left- or right-associative based on context.
Not directly related to the crash, but I have a question about Telugu and similar scripts: How do their speakers think about the structure of the script? Do they consider each vowel and consonant a separate "thing" that just happens to get written as a complex grapheme, or is the grapheme the unit you think about and it just happens to be made up of smaller parts? I.e https://en.wikipedia.org/wiki/Telugu_script#Consonant_Conjun...
Also, do you learn the large number of graphemes separately? Or are they the "obvious" way to write the consonant and vowel? Or is there a set of rules you learn?
- You normally think of clusters as single "letters". The word for them in Marathi is "joined letter". This notion of "letter" may not be totally in line with what English speakers would expect.
- You still think of clusters as having component parts that make it up. However, क्ष/ज्ञ and sometimes त्र are thought of as their own "fundamental" consonants even though they're clusters. I think folks think of things like क्र as being both -- a letter of its own, that is built up from component letters.
- Most of the clusters in Devanagari are predictable. So for any given pair you usually know how to make a "half form" for the first one and join it on, or you know how to stack them. Usually whilst reading it's obvious because you see the components; and whilst writing it's usually fine to try what makes sense.
- For unpredictable clusters, you just eventually get used to them. No different from learning that some words are spelled weirdly in English, really. Though I very much dislike ह्म , ह्य, and द्य since they're kinda confusing.
I think Bengali has a lot more unpredictable conjuncts; I stumbled across this a lot when I tried to learn the script a year ago. I suspect folks just get used to those.
As someone who can read both Devanagari and English/Latin script, I'm just curious, when you read Devanagari does the text size need to be larger for good comprehension?
There seems to be more subtlety in the word structure, like tiny little flicks which seem to have significant meaning. While Latin-like scripts have characters like the comma/full stop, and modifiers like the dot on the i, these are far rarer than Devanagari's modifiers from what I have seen.
Another related question is, because Devanagari's information density seems higher, does that mean shorter text for the equivalent information?
PS - Sorry if anything I said was offensive, that wasn't my intent. I am just an ignorant idiot wanting to learn someone's experience with a different language script, nothing more.
Not really, I can read at normal text sizes. But it is somewhat annoying (much like how tiny English can be readable but annoying to read), and I used to have my default font size bumped up by one in my browser. I currently have it bumped up quite a bit for Chinese because I really have trouble with the "tiny little flicks" problem in Chinese.
The consonants of Devanagari are easy to tell apart; they are pretty different. In most cases the "similar" ones differ basically by an extra line (प/ष, ब/व) or a little loop (य/थ, ट/ढ). (The line/loop has no semantic meaning; these letters are just random letters that look similar.) The only super annoying pair is घ vs ध -- the thing up top is a loop in the second one, but depending on the font/handwriting it can be rather unclear.
The vowels are also easy to tell apart. There are 12 main ones, and they're made up from some really basic components which are quite distinguishable.
So you can easily distinguish consonant+vowel combos.
Consonant clusters can get tricky; like I said, there are a bunch of ambiguous ones. Fortunately you end up realizing which is which and then it's nbd. But, like, द्म and ह्म can be infuriatingly similar and you just get it from context.
Bear in mind, at one stage you start sight-reading words, so the actual details of the word matter less.
> Another related question is, because Devanagari's information density seems higher, does that mean shorter text for the equivalent information?
Visually? Yeah. Words can be really small. But it ends up roughly being the same number of code points (and probably more bytes in UTF-8).
If you use Firefox or Chrome you can change default font sizes in preferences on a per-language basis. You can also change the default font used. It's pretty useful, lets me replace shitty system fonts with better ones and also bump up the size for Chinese.
It does not apply to text using absolute font sizes, however; just text that uses things like font-size:medium (or inherits the document default font size, which is also medium).
I'm no linguist, but Malayalam has a more predictable stacking rule than Devanagari, more so in the modern alphabet/sandhi scheme than the old. For instance, one can barely tell apart ह्म and ह्य, but their Malayalam equivalents are pretty distinct: ഹ്മ (old scheme) and ഹ്മ (modern scheme) for the first, and ഹ്യ for the second.
To understand Telugu and similar scripts, one has to think in terms of syllables. Only when we look at these scripts through the Roman alphabet do we see the discussion of consonant and vowel graphemes in the script.
/peɪ/ is a syllable in English, but can be represented by these grapheme clusters: <pay>, <pei>, <pai>. The similar syllable is represented in Telugu or Devanagari this way: పే (Telugu script) or पे (Hindi/Devanagari script); there are phonetic differences in the way this syllable is represented: (a) /p/ is aspirated in English, but not in the Telugu/Devanagari syllable; (b) the diphthong /eɪ/ is realized as a monophthong in Telugu/Devanagari scripts.
The <ja, virama, nya> sequence plays a major role in Sanskrit sandhi. In Sanskrit, in assimilatory contexts, a voiced palatal affricate /dʒ/ + a palatal nasal /ñ/ + some open vowel leads to this sequence <ja, virama, nya>. Since this assimilation occurs in many contexts, the resultant grapheme is taught as part of the alphabet in this special set of clusters: kSa, tra, jna
( క్ష త్ర జ్ఞ) or (क्ष त्र ज्ञ). In fact, there was a poet in Telugu, whose name was ksEtrajna (combo of these three graphemes with ligatures). https://en.wikipedia.org/wiki/Kshetrajna (in general) https://en.wikipedia.org/wiki/Kshetrayya (Telugu poet)
I can't speak for Telugu; however, [Devanagari](https://en.wikipedia.org/wiki/Devanagari) is very similar and used throughout northern India. Each vowel and consonant is considered a separate entity. Each consonant has a pure form and a combined form with every vowel. There are also combinations of two or more consonants that can themselves be combined with each vowel!
For learning, you just learn the sounds separately for each vowel and consonant. So you can read & pronounce anything written in Devanagari script, but not understand it if it's a different language :)
An interesting debate that may be closer to home for English speakers: is ñ a letter? The Spanish alphabet considers it to be a different letter than n, but é is not a different letter than e. https://en.m.wikipedia.org/wiki/Ñ
Telugu is my native language. Here's how I learnt it.
There are two types of letters. One type is used as an addendum to the other type. The first type are called vowels and the second consonants (I'd say that's a mischaracterization, but whatever). In this[1] picture, the top half of the alphabet is the addendum type, and the bottom half is the 'primary' type.
So how it goes is that one of the bottom always comes with one of the top. Most often it tends to be the very first vowel, which is pronounced 'a:', and that is the default way letters are learnt (pronunciation chart[2] for each of the vowels).
So each of the consonants (the bottom part) can be used with any one on the top. Let's take the very first consonant. It is pronounced 'ka' (because by default, consonants are accompanied by the first 'vowel', which is 'a:'). Now, let's take a look at the bottom part of picture [2]. That same consonant is pronounced differently when accompanied by each of the vowels. For example: kaa, kee, kuu, etc. The first half of the pronunciation stays the same, but the second half is replaced with whatever vowel's symbol is attached as an addendum. Also notice that the center part of the letter hardly changes visually between changes in vowels.
Not to throw a lot at once, but there's another part. Each of the consonants can again be used as an addendum to any other consonant (which always comes with a vowel). Take a look at this[3]. Notice the little tick at the top you saw in [2] is still there? That picture contains the letter 'ka' with each of the consonants being tacked on. The vowel is always still there; that's because, as I said, consonants always come with a vowel attached. The consonant addendum is optional though.
Congratulations, you can now pronounce (almost all of) Telugu (and almost any Devanagari script using this same play book). It is because these scripts are deterministic, and there is a single pronunciation unlike English and many other languages.
> Congratulations, you can now pronounce (almost all of) Telugu (and almost any Devanagari script using this same play book). It is because these scripts are deterministic, and there is a single pronunciation unlike English and many other languages.
Weeeellllll, not exactly. For example, spoken Marathi differs considerably from written Marathi, with vowels turning into other vowels, etc. My favorite example is the word लहान which is spelled "lahaan" but is often pronounced "lhaan" OR "laahaan".
Almost all Indic languages experience schwa deletion -- you will say the word "kanpur" with no vowel between the n and p even though it is spelled "kanapur". Anusvaras have a range of pronunciations that depend on the word being used -- they can nasalize a vowel, or add a nasal consonant like n or m.
However, these scripts are definitely way more phonetically consistent than English. These differences are minor.
I would say the reason is due to people changing the pronunciation over time rather than it being part of the actual grammar.
Take for example Hyderabad. Looking at the Hindi spelling, it should be pronounced Hy-der-aa-bad. But it is pronounced more like Hai-dra-bad.
What I'm saying is that people made those changes which we got so used to in our day to day lives that most now regard them as the 'correct' pronunciation, but it isn't how they were supposed to be.
This isn't any different from what happens to pronunciation in English. The difference between English and Indic languages is simply that this has happened a lot more with English (also English kinda gets its words from all over the place)
Pronunciation changes over time. Things which may have been highly phonetic may lose this property over time.
As a Telugu speaker, let me try answering your question.
The squiggly character you see has two characters joined: the top and the bottom.
The top by itself can be used separately, so it would be called "ja".
The bottom, if used by itself, would be called "na".
So basically the grapheme you see is made by joining two consonants, ja and na... so it basically becomes "jna" (spelt gna).
There are specific rules, like in English, but like any language we have to learn these rules... e.g. I know that
I can add k + nowledge and that makes something in English...
But what if I add ra + no: "rnoledge"? Yes, you can add it, but that sounds weird; not sure if there is any rule which says you cannot add r + na. Same with Telugu.
I’m curious too. Koreans, for example, think in an alphabet similar to ours (actually fewer letters), but it forms into a grapheme. I don’t think that is the case here, but don’t know for sure.
Not Korean, but the Korean case is a bit different. Korean syllable blocks are viewed as syllable blocks -- a new concept -- whereas Indic consonant clusters are simultaneously viewed as letters of their own and as fundamentally composed of more letters.
I might be wrong, or I may not be getting your meaning, but I do think consonants can be "mixed". Say విక్రమ. But I think we are considering mixing differently.
Fonts are hard. I don't think CoreText is worse than others, most vendors seem to have a history of horrors, and at least it's not rendering fonts in a kernel driver like Windows did(!?) - see for example the Project Zero series of "one font vulnerability to rule them all" - https://googleprojectzero.blogspot.com/2015/07/one-font-vuln...
Has anyone yet traced this to a specific syscall? Seems like the perfect opportunity for whipping up a fuzzer to spit arbitrary unicode into the system and see what else crashes.
Would that even be possible on iOS for non Apple employees?
My idea would be just to create a program that continually creates Unicode strings and displays them on screen, with a pause before creating and displaying the next. Record the screen and see if anything exciting happens. It would never end; if there was a SETI@Home-style program for it, with incentives for running it, that would be cool.
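A back-of-the-envelope sketch in Rust of the generator half of that idea (the render call is a placeholder for whatever display path you can reach, not a real API):

    fn render(s: &str) {
        // Placeholder: hand the string to the text stack under test.
        let _ = s;
    }

    fn main() {
        // Tiny xorshift PRNG so the sketch needs no dependencies.
        let mut state: u64 = 0x2545F491_4F6CDD1D;
        let mut next = move || {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state
        };
        for _ in 0..1_000 {
            // char::from_u32 rejects surrogates/out-of-range values for us.
            let s: String = (0..5)
                .filter_map(|_| char::from_u32((next() % 0x110000) as u32))
                .collect();
            render(&s);
        }
    }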
Given that Apple does not design their OS to work on anything but their hardware, the pain I would incur trying to set up either option is just not worth it. Unless someone has published a stable and updated pre-built VM for macOS, it just seems like nothing but trouble.
Not so long ago I was pleasantly surprised by how easy it was to setup a Hackintosh VM in VirtualBox --- all I did was create a VM with the default settings, add the all-important SMCDeviceKey, insert a completely stock El Capitan ISO, and it booted up and installed on the first try. All on hardware that was as non-Apple as it gets (a ThinkPad.)
The only thing that I found somewhat confusing and perhaps a bit un-Apple was the fact that the installer would not prompt me to partition and format the HDD first, but was perfectly happy to let me try to select and then fail to install onto the install media itself (with an odd "not enough space" message); I had to use the Disk Utility to do that before going back to the installer. I've always wondered why --- even Windows' installer includes partitioning and formatting as one of its steps.
You should look up if your hardware has been used for Hackintosh by anyone else. I lucked out and my PC has nearly identical specs to this /r/hackintosh post and my install went flawlessly:
https://www.reddit.com/r/hackintosh/comments/2mgq9e/everythi...
Most likely none of this involves syscalls directly. Are there any syscalls that concern themselves with unicode, besides perhaps filesystem related syscalls?
Yeah I'm being a bit loose in my terminology, by "syscall" I really just mean tracing it to some specific function call(s) in a vendor-provided library (as I assume the problem must be on Apple's end) to expedite the process of fuzzing.
People interested in this bug may also be interested in "An Introduction to Writing Systems & Unicode." Part 3[1] in particular discusses complex scripts.
I went to the Apple Store in Burlington, MA yesterday (2/16/2018). None of the 'geniuses' there knew about this bug or how to fix it. They reminded me of Microsoft drones following a troubleshooting script: 1) reboot the phone, 2) reset the phone, 3) do a DFU restore.
After they did this, my phone would not recognize its SIM card. You see, my phone originally had AT&T service. Then I switched to Sprint some months back. I guess the DFU restore ended up relocking my phone to AT&T. At least that's what the Apple geniuses said. Then they said I had to go to a Sprint store.
This is the first time I was disappointed with Apple's human help. I remember past experiences going in there with a frayed power supply cord -- they gave me a new one. On other occasions they replaced my phone when they couldn't fix it. They never spent more than 30 minutes trying to fix a problem, and always fixed it. Their knowledge was deep. This time their opinions were confabulatory -- for example, one guy said 'your phone does not have the antennae to work with Sprint' EVEN THOUGH I reminded him I walked into the store with a working iPhone using Sprint. Another one said that Sprint jailbroke my phone and that's how they got it to work. WRONG. Another one said that the phone was 'soft unlocked' and relocked itself after the DFU restore. I felt sad for Apple, because I know that's not what they were about.
Then I suggested they replace my phone; the manager lady's eyes bugged out, her voice became stern, and she said, 'There is no scenario where you walk out of here with a different phone.' This is the first time they wasted 5 hours of my time and actually caused me to have more trouble than what I walked in with.
The broader question is, I think, how can any textual sequence cause a crash. Text should be regarded very skeptically by the text renderer. What could this be doing that it invokes a crash? Where else is this team not being careful? It reeks of process problems.
Latin text gives people a misleading idea as to how simple text is. For Latin, each character is generally an independent unit with independent metrics isolated from its environment, with a small set of exceptions to this rule (ligatures).
East Asian ideographs bring up interesting questions about what constitutes a character, with Unicode "solving" the problem by saying "every distinct rendering is a distinct character," necessitating somewhere in the region of 80,000 characters or so once all of them get added. Even more difficult are scripts like Korean Hangul or Egyptian and Mayan glyphs, which are composed from a relatively small set of independent units but laid out in blocks and sub-blocks themselves composed in linear text. Unicode has both precomposed Hangul characters and the individual Jamo radicals, but it has currently punted on being able to accurately represent Egyptian or Mayan text in the first place (although they do appear to be revisiting that decision).
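As a concrete taste of the block composition: precomposed Hangul syllables map arithmetically from jamo indices (this mapping is defined in the Unicode standard). A minimal sketch in Rust:

    fn main() {
        // Leading ㄱ (index 0), vowel ㅏ (index 0), trailing ㄴ (index 4):
        let (l, v, t) = (0u32, 0u32, 4u32);
        let syllable = 0xAC00 + (l * 21 + v) * 28 + t;
        // Prints 간 (U+AC04): one block built from three jamo.
        println!("{}", char::from_u32(syllable).unwrap());
    }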
Scripts like Arabic differ from Latin in that glyphs changing shape according to their surrounding context is the norm rather than the exception. However, in Arabic, you largely break this down into an initial/medial/final form, with some characters inducing a shift from medial back to initial. Indic scripts go far beyond this by needing to treat entire consonant clusters as single rendered glyphs.
The end result is that it's very easy to find that invariants one expects to exist, if one is used to Latin text, are violated in other languages. Some concepts of text metrics might not even exist in the first place. As speculated elsewhere, it's probable that the crash happens because someone insufficiently versed in scripts is asserting an invariant that doesn't actually exist. It does not appear to be a rendering issue, but rather one in a slightly higher-level operation on top of that.
Finding these sorts of issues pretty much requires extensive fuzzing with known problematic scenarios. The fault is caused by "I didn't know this exists", which is both a very reasonable situation (few UI experts are well-versed in the complexities of foreign scripts) and very hard to solve from a process perspective.
> Indic scripts go far beyond this by needing to treat entire consonant clusters as single rendered glyphs.
FWIW, Arabic does this too, just that most default Arabic fonts don't. https://www.google.com/get/noto/#nastaliq-aran contains a bunch of specialized ligatures (and is overall a very complicated font).
I always say: There's a reason a lot of the folks working on font shaping are Persian/Arabic speakers :)
Yeah, the common ones are lam-alif showing لا , and alif-lam-lam-heh showing الله .
But for example خ/ح/چ at the end of a word often form cool ligatures, and the dots on ب/پ/ت often go to interesting places, and سے forms a cool ligature where it uses the other form of the bari yeh and the س forms little teeth marks up top. (Some of these are Urdu-specific; I can read regular Arabic but I have more experience trying to read Urdu calligraphy)
One great way of dealing with this at a software engineering level is to stop operating on arrays of characters, and instead add functionality on a case-by-case basis.
For example, you have a wrapper that only lets you iterate forwards. Once you hit languages where you need to "move back", you have to explicitly code in the functionality. And in theory you can do it in a safe way.
Sure, some might argue you'll end back at arrays. But I believe that if you encode the traversals in a specific way, you'll at least end up at bounds-safe arrays.
For example, if you use an array as a queue, but some other part of the code doesn't, you're gonna have problems. But if you wrap your thing as a queue, no other part of the code will be able to pierce the veil.
Though I don't know how well C abstractions let you do this.
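A sketch of the forward-only wrapper idea (hypothetical, in Rust for brevity): the cursor can only advance, so any "move back" capability has to be added deliberately, and all raw indexing stays behind the wrapper where it can be bounds-checked:

    struct ForwardCursor<'a> {
        rest: std::str::Chars<'a>,
    }

    impl<'a> ForwardCursor<'a> {
        fn new(s: &'a str) -> Self {
            Self { rest: s.chars() }
        }
        // The only movement offered: one code point forward.
        fn advance(&mut self) -> Option<char> {
            self.rest.next()
        }
    }

    fn main() {
        let mut cur = ForwardCursor::new("జ్ఞా");
        while let Some(c) = cur.advance() {
            println!("U+{:04X}", c as u32);
        }
    }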
So far, if I understand correctly, nobody knows of a sequence of characters you could write down in any language that would trigger a crash when encoded in Unicode in a straightforward way. That suggests that the invariant being assumed may come, ironically, from a deep understanding of the scripts rather than ignorance.
In Bengali and Oriya specifically, a ZWNJ can be used to force a different vowel form when used before a vowel (e.g. রু vs রু), however this bug seems to apply to vowels for which there is only one form
This seems to say that the ZWNJ has a meaning before vowels that have different forms, but the crash happens with vowels that only have one form, where the ZWNJ has no effect. Maybe I am misreading?
I'm saying that this crash _also_ applies to vowels with one form.
রু was the original Bengali crash, and that has two forms. I'm saying it's less likely to be related to the zwnj-vowel interaction because it also occurs for vowels where such interaction doesn't exist.
A good example of test sequences causing weird behavior/crashes are RTL languages (like Hebrew and Arabic) and CJK characters (Chinese/Japanese/Korean), which have UI implications as well.
Displaying language is full of corner cases. Imagine having to render English, Japanese, and Arabic in the same code paths, and inline with each other.
Text display code is not so simple as it seems; don't fall into the trap of "how hard could it really be".
I'm not saying it isn't hard - I acknowledge that getting it right is a huge task. But the cost of failure should be rendering incorrectly, not crashing or memory corruption.
There was a similar bug about a year ago as well. I'm surprised more fuzzing, static analysis, and general testing hasn't gone into this area since there have been issues found there already. Maybe they have done this extra work and just didn't catch this one. It's hard to know. Still feels like the way it fails is unacceptable.
My question was, why was there not a unit test for this? It seems like it would be trivial to step through every character combination for each language they support and make sure it doesn't cause a crash.
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112. This crashing sequence is five characters long.
Your unit tests would have to go through 1.71650179e30 sequences to be guaranteed to catch this one. At a test rate of 1 millisecond per sequence, that's just 4×10^9 times the age of the universe, according to Wolfram Alpha.
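Spelling out the arithmetic (taking the age of the universe as roughly 13.8 Gyr, about 4.35 × 10^17 s):

    N = 1{,}114{,}112^{5} \approx 1.7165 \times 10^{30}
    T = N \times 10^{-3}\,\mathrm{s} \approx 1.7165 \times 10^{27}\,\mathrm{s}
      \approx 3.9 \times 10^{9} \, t_{\mathrm{universe}}, \qquad
      t_{\mathrm{universe}} \approx 4.35 \times 10^{17}\,\mathrm{s}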
Rendering Indic scripts with AAT fonts involves a series of finite state machines that are stored in the individual font. So don't forget to multiply by the number of different fonts that each need to be tested.
> Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112.
Allows for 17 planes, but only a small portion of those are actually used. According to Wikipedia[1], Unicode currently has 148944 codepoints + 128k private-use ones (which might, or might not, make sense to include in unit tests). So your time estimate is off by a mere 5 orders of magnitude.
Are there just enough people using iOS that these sorts of bugs can be found by mistake, or is someone fuzzing CoreText? Perhaps that can be applied to provide some kind of test coverage? Even if it’s not complete?
This sequence begins the Telugu word for "knowledge" so maybe someone texted that to someone and it went viral from there. This is, of course, only speculation.
It does not include the ZWNJ; that somehow snuck in. Most keyboards don't support directly inputting a ZWNJ, but may support it in specific combinations. For example, my Marathi keyboard supports typing eyelash rephs (e.g. in र्‍क), which involves a ZWJ.
However, I'm not aware of any such thing in Telugu aside from explicit virama display, which rarely exists in input methods (and wouldn't put the ZWNJ in the position shown here, though that could have happened after editing).
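For reference, here's the widely circulated crashing sequence decomposed into its five code points (a Python sketch, with the string written as escapes so nothing actually renders; this is the sequence as publicly reported, so treat it as an assumption):

    import unicodedata

    crasher = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"
    for ch in crasher:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0C1C TELUGU LETTER JA
    # U+0C4D TELUGU SIGN VIRAMA
    # U+0C1E TELUGU LETTER NYA
    # U+200C ZERO WIDTH NON-JOINER
    # U+0C3E TELUGU VOWEL SIGN AA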
I don't know if I agree with that. The sequence here is something like 5 or 6 characters. There are some tens of thousands of assigned Unicode characters, which suggests something like 10,000^6 combinations to test. Fuzzing might be more fruitful, but even then I'm skeptical it can find obscure crashes. There's no replacement for defensive programming.
Concolic execution, as a form of whitebox fuzzing, is a very effective way to generate these kinds of crashes, especially when the crash is triggered by the input stream alone rather than by first maneuvering the program into a very specific state.
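Even plain blackbox fuzzing with a biased alphabet would likely turn up this class of bug. A Python sketch (not concolic, just random generation; shape_text is a hypothetical wrapper around the renderer under test):

    import random

    # Bias the alphabet toward the pieces implicated here: the Telugu
    # block plus the ZWNJ/ZWJ joiners.
    ALPHABET = [chr(c) for c in range(0x0C00, 0x0C80)] + ["\u200C", "\u200D"]

    def fuzz(shape_text, iterations=100_000):
        for _ in range(iterations):
            s = "".join(random.choices(ALPHABET, k=random.randint(1, 8)))
            try:
                shape_text(s)  # hypothetical call into the shaping engine
            except Exception:
                print("crasher candidate:", [f"U+{ord(c):04X}" for c in s])

In practice a native crash takes down the whole process rather than raising an exception, so the loop would run under a supervisor that respawns it and logs the last input.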
Why does everyone assume only this one specific character combination causes the issue? Generalists can't generalize, I guess. iOS/macOS are notoriously bad at Unicode parsing... These have been problems for YEARS, and there are many, many known character combinations that cause these types of issues. You seriously think every character has unique code to process it? Of course not... there are general subsystems that process all of these things, and those can easily be tested. Literally, copy/pasting the Arabic word for "hummus" (حُمُّص) about 20 times will crash macOS/iOS. It's likely a mixture of RTL and LTR along w/ BOM and other Unicode weirdness that causes these issues. That is something that is trivial to test for. For added effect, just shovel ح̷̵̷ُ̧̛̟͙̙̠͖̺̞̟̬͖͙̯̭̞̘͚̻̱͉̦̲̑ͫ͐ͬ̏̽ͨ̐̃̃́ͬ͑̃͛̍̃ͭ̄ͬ̊͗̆̇́͘͜͜͞
م̶̸̸ُّ̢̧̨͓̮̣̺̤̟͓͕̯̯̬͉͍̥̥̹͉͉̠̣̰̜̻͓͖̮̫͎̯͙͇̳͛ͩͨ͒ͦ̏̊͒ͩ̅̅̑ͤ̋ͮͩ̔̒͆̔̂͐ͧ͒̐ͭ̎̕̚͜͜͞͞͠
ص̑ͨͦ̆͗̌̔ͬͬ̈̌̑͏̧̙̰̩̭̜̮̺͚̼̙͉̱̭͉͖̤͞
concatenated a few times into any iOS or macOS application (chat apps work great), and you can crash even the most recent "fixed" versions of iOS/macOS, still, to this day, because these bugs have never been fixed at their core. iOS/macOS Unicode support is awful, and will remain awful, if the past is any indication of the future.
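(For scale: each visible letter in those strings carries a pile of combining marks, every one its own code point. A quick Python illustration:

    import unicodedata

    zalgo = "\u062D\u0651\u064F" + "\u0300" * 20  # HAH + shadda + damma + 20 stacked marks
    print(len(zalgo), sum(1 for c in zalgo if unicodedata.combining(c)))
    # 23 code points, 22 of them combining -- all attached to one base letter

The mark counts here are arbitrary; the point is that the shaper has to stack an unbounded number of marks on a single base.)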
An article that claims to "pick apart" the issue, yet mentions "unicode" only once in the article body, is pretty sad... and no mention of byte order marks or endianness. Some dissection...
I don't say "unicode" because I'm not certain this is specifically a Unicode-handling problem rather than a font stack bug (which would occur with other encodings too). One of the previous crashes, for example, had nothing to do with Unicode -- it had to do with the fact that some Arabic strings get larger, visually, when you shorten them. Folks these days say "unicode" when talking about anything relevant to non-Latin text, which munges up the issue, which is why I specifically avoid saying that word.
BOM or endianness don't seem to be relevant to this bug.
The problem is not with the characters themselves, though; it's with how CoreText processes and parses the Unicode text.
Don't get me wrong, I know fonts can even be malicious... but given the history[0] here, with these Unicode[1] issues... I think it's pretty safe to say the issue is not with a specific font, per se, but with CoreText and Unicode parsing. For example, when I ran OS X on my old MacBook Pro, I used open-source fonts in my terminal, and this Unicode parsing bug still happened.
In 2015 an Apple spokesperson had this to say: "We are aware of an iMessage issue caused by a specific series of unicode characters and we will make a fix available in a software update."
> BOM or endianness don't seem to be relevant to this bug.
Yet on your blog you allude to just that with the left/right comments, though, to be fair, you state that you really don't know the problem:
> I don’t really have one guess as to what’s going on here – I’d love to see what people think – but my current guess is that the “affinity” of the virama to the left instead of the right confuses the algorithm that handles ZWNJs after viramas into thinking the ZWNJ applies to the virama (it doesn’t, there’s a consonant in between), and this leads to some numbers not matching up and causing a buffer overflow or something.
This is claimed to be a dissection of the issue, but there is not even a stack trace present, and yet you joke about that...
> Yes, I could attach a debugger to the crashing process and investigate that instead, but that’s no fun
Nah, you should do that... and you'll likely see that it's CoreText being the same old piece of shit as usual. If it were a font problem, then loading that font on a different system that uses a different rendering engine should reproduce the same problem. It doesn't; I tried that years ago.
Sure, I would posit that this is potentially a whole different bug, but... given the history and the repeated failed attempts to fix this entire class of issues... it's safe to say that iOS and macOS do not handle Unicode very well.
Googling, "If you tabulate all stars visible down to magnitude 6.5, thought to be the faintest stars still visible to the unaided eye, the entire sky contains some 9,000 stars. Since you can only see half the sky at any time, that means there are as many as 4,500 stars visible in your sky tonight."
That's a great explanation (though quite a bit of it was too much for me to understand).
The only correlation I can think of is that the Kannada script keyboard and full support for rendering the script in the UI didn't exist until iOS 11.0. So it is crashing for Indic scripts that were supported pre-iOS 11. Probably Apple tweaked something in iOS 11 for the languages already supported and broke something.
Does anyone have a link to a thorough reverse-engineering of the bug from someone decompiling/source-diving?
I tried replicating it on High Sierra out of curiosity (a single text file with the broken character, opened in TextEdit to cause a crash) and noticed that the crash doesn't consistently happen the same way each time (I got 8 crashes in a calloc call in CoreText's OTL::CoverageBitmap::Reset, 4 crashes in a malloc in Foundation with no CoreText in the stack trace, plus 10 KERN_INVALID_ADDRESS errors from various places, as well as one time it opened successfully with no immediate problems). Seems like a memory corruption error somewhere earlier, but I'm not sure how to figure out where -- I tried rooting around in CoreText with Hopper for a bit but didn't get very far.
(I should probably try this myself, but…) The equivalent programs on other systems are HarfBuzz (open-source) and Uniscribe / DirectWrite (Windows). Did you consider checking what they do with the equivalent text (presumably they don't crash)?
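One low-effort way to do that comparison, sketched in Python (this assumes HarfBuzz's hb-shape CLI tool is installed and a Telugu-capable font file is on hand -- the font name here is a placeholder):

    import subprocess

    crasher = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"
    out = subprocess.run(
        ["hb-shape", "NotoSansTelugu-Regular.ttf", crasher],
        capture_output=True, text=True,
    )
    print(out.stdout)  # glyph stream, something like [glyph=cluster+advance|...]

If HarfBuzz shapes it cleanly, that further points at CoreText rather than at the text itself.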
Not sure why I'm not able to trigger this crash on any Apple devices in the office. I saw on Twitter that it was messing with iPhones even when used as an SSID, but I'm unable to replicate that here.
Oh, gotcha. Yeah, not sure auto-correct could cause that. I initially read that it was only related to text entry and thought auto-correct was likely, since it was affecting multiple apps, indicating an OS issue.