Here's a basic explanation of the diacritic notion as it applies to Asian scripts.
Thai belongs to a family of scripts known as abugidas. Abugidas include pretty much all South Asian and many Southeast Asian scripts, for example Burmese, Cambodian, Dai, Lao, Thai, etc. They all pretty much derive from Brahmi, which was the proto Indian script. You can see an example of Brahmi over here: http://en.wikipedia.org/wiki/Brahmi
Abugidas are based upon combining multiple glyphs into syllables, often allowing glyphs above, below, to the left and to the right of the initial consonant, and often including a closing consonant. Most glyphs tend to be consonants, though some are vowels, and others can be special marks indicating tone or other notions. Often short vowels are omitted entirely (much as in Modern Standard Arabic).
In old times, such scripts were handled with wacky font-hacks. However, with Unicode, there are some super complex algorithms that make glyphs combine both visually (when typesetting) and logically (when saving/searching/etc). You can actually type a character and a diacritic and it can sometimes automatically combine to form a single character, if such a beast exists, not just visually but when saving to disk.
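That "combine when saving to disk" behavior is Unicode normalization. Here's a minimal sketch in Python using the stdlib's `unicodedata`: a base letter plus a combining diacritic (two codepoints) can be folded into a single precomposed codepoint with NFC, when such a precomposed character exists.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two separate codepoints
decomposed = "e\u0301"

# NFC ("canonical composition") folds them into one precomposed character,
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, when such a beast exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1
print(composed)         # 'é'
```

Both forms display identically; normalization is what makes them compare equal when saving and searching.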
What makes it even more confusing is that South Asian scripts in particular have mega-combo characters, where whole chunks of glyphs sort of fold into flowing shorthand symbols. In the case of Sanskrit, I believe loads of these were used historically but few are used these days.
I think that's a fair pontification - corrections welcome!
I wouldn't call them "mega-combo characters" ;-) Typically, these are at most two or three letters written together, and they are very much used today in Sanskrit, Hindi and other regional variants such as Bengali etc.
Of course you also have multiple words that have been combined into one large compound word, by way of appropriate linguistic rules of combining sounds. This is similar to long compound words in German.
This is the term I'm familiar with. They are similar to ligatures in the Latin script, e.g. & (e+t), œ (o+e), German ß (ſ (long s) + s), etc. But conjuncts in Devanagari are much more numerous and widely used, probably because the individual consonant forms are fairly complex.
I have learned the Thai alphabet, and that "mega-combo character" comment just made my day. Reading Thai pretty much feels like reading regexps. Anditdoesntmakeitanyeasierthattheydontusespaces
Actually Thai only has roughly twice the number of characters that we have in Roman scripts - excepting tones (a real pain) it is possible to learn pretty quickly. Lao by contrast has fewer, but has a rather tricky plethora of vowel combinations for a myriad of hard-to-distinguish eww, ieww, ooh, iuooh type sounds. :) Cambodian has no tones and is my pick for the one to go for if you are keen on an easy starter.
Tangential tidbit: I sent a copy of The Cambodian System of Writing (http://pratyeka.org/csw/) to TPB's anakata while he was in solitary confinement, to help him stave off boredom. No idea if he ever read it, though his mother assures me it arrived.
Besides ก (0xe01), the rest of the Thai characters (0xe02-0xe5b) are: letters ขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะาๅ; alphabetic non-spacing marks ัิีึืฺุูํ, which follow the preceding character and render as stacked (as shown by the OP); five "marks" เแโใไ (with Unicode's Logical Order Exception), which precede the following character; diacritic non-spacing marks ็่้๊๋์๎, which also render as stacked; the modifier letter ๆ; the digits 0-9 ๐๑๒๓๔๕๖๗๘๙; the currency symbol baht ฿; and the punctuation symbols ๏๚๛.
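You can verify those classifications straight from the Unicode database; a quick sketch in Python printing the name, general category, and combining class for a sample of each group (letter, non-spacing mark, preposed vowel, digit, currency symbol):

```python
import unicodedata

# One representative from each group in the Thai block (U+0E00-U+0E7F)
samples = ["\u0e01", "\u0e47", "\u0e40", "\u0e50", "\u0e3f"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):35s} "
          f"category={unicodedata.category(ch)} "
          f"combining={unicodedata.combining(ch)}")
```

The "non-spacing" groups show up with general category Mn, which is exactly what a renderer keys on to stack them over the preceding base glyph.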
Thai looks to be a pretty gr๏๏vy language, like many other natural languages - perhaps some programming languages will catch up in their enhanced use of lexical tokens one day, instead of just relying on grammar, long English names packed into name hierarchies, and multi-ASCII symbols.
I think I would steer new learners of SE Asian scripts away from Cambodian as a first language to learn. Cambodian, while it is a beautiful script and has more-or-less regular pronunciation, also has a few odd exceptions, and complex vowel pronunciation rules. The consonants are divided into two groups, and many of the vowels are pronounced differently in the first group than in the second. That said, after learning the Thai script, Lao and Cambodian were not too hard. Either way, they are all great languages to learn.
Actually I am almost certain Thai and Lao have those consonant divisions as well, in fact I believe 3 or 4. If I am not mistaken they are still taught, and are part of the tone system and/or can affect unwritten vowel selection. More certainly, the consonant classes somehow stem from the need to preserve the pronunciation of Pali, a Middle Indian prakrit language (with features not present in these SEA countries' modern languages) that is used as the liturgical language of Theravadin ("older school") Buddhism. See http://pali.pratyeka.org/ for more info on that.
The worst thing about "not using spaces", when it comes to computers, is that it's nearly impossible to do word-breaking/line-breaking without relying on a dictionary[1], which makes it very hard to convince software developers who aren't using the system text engine to add support for line breaking[2].
The suffering continues to this date: Android, for instance, still lacks proper support (it breaks by character instead of by word), and a few apps lack Thai word-breaking support entirely (Twitter, which only breaks on spaces).
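To give a feel for why a dictionary is needed: with no spaces, the only way to find word boundaries is to match substrings against a lexicon. Here's a toy maximal-matching breaker in Python; the tiny lexicon is a made-up example, and real engines (e.g. ICU's Thai BreakIterator) use large dictionaries plus statistical fallbacks for unknown words.

```python
# Hypothetical tiny lexicon, just for illustration
DICT = {"สวัสดี", "ครับ", "ไทย", "ภาษา"}

def word_break(text, lexicon):
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word that matches; emit unknown chars singly."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no match: emit one character alone
            i += 1
    return words

print(word_break("สวัสดีครับ", DICT))  # ['สวัสดี', 'ครับ']
```

Greedy matching already fails on genuinely ambiguous sequences, which is why production line-breakers go well beyond this sketch.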
It's actually relevant here. For people less familiar with some aspects of Unicode, this is a neat example, along the same lines as the OP. Just don't try to select it!
It's displayed as boxes here, not the actual zalgo text (although if I paste it somewhere else it looks correct there), even though my encoding is set to UTF, and I can't select individual boxes. The whole page kept blinking, but that was a separate issue: with the layout of the comments, when I moved over certain borders it kept selecting and deselecting the whole page, which caused it to blink.
Chrome Windows 7 ends up rendering them all as boxes. The lack of whitespace prevents line breaking. The problem is that it forces the container out to 1400px wide, rather than responding to browser width.
I really don't like trying to read 1400px long lines of text.
(to me it looks like some kind of particle accelerator experiment happening in your browser, where ก็็็็็ emits some kind of unknown radiation. After closer examination [zoom to +300%]: maybe it just shows the escaping life spirits of the toppled latin small letter «u» after being shot in right side.)
It's actually more correct on your browser. It's a stack of diacritics. The rasterizer above was printing them on top of each other, which is (well, seems likely to be) typographically incorrect. The problem of course is that the implemented rules should disallow unused-in-the-real-world combinations of diacritics but don't.
I have tested with Firefox (latest version) on Windows XP and 7, and I think we can conclude the problem comes from Windows XP, as it shows the same thing as you, whereas it gives a whole big stack on Windows 7.
I actually get a different rendering for some reason. I get all the extra diacritics to the right, above empty circles. That certainly explains why I didn't quite understand the problem.
That's not really a Thai character, right? It's way too many bytes! It must be an intentional repetition of stacking diacritics. Some of the ones in that Google result page are 21 bytes.
Something I wondered once: If one were to have a go at sanitizing Unicode input (e.g., for a forum), what would be a sensible limit on the number of diacritics to allow, without interfering with languages that need them?
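One plausible approach (a sketch, not a vetted sanitizer): normalize first so precomposed forms don't count against the limit, then cap the run of combining marks per base character. The limit of 4 below is an arbitrary guess; real-world stacks (Thai vowel + tone, Vietnamese after decomposition) rarely need more than 2-3 marks on one base.

```python
import unicodedata

def cap_marks(text, max_marks=4):
    """Drop combining marks beyond max_marks per base character.
    max_marks=4 is a guess at a safe limit, not a researched value."""
    out, run = [], 0
    for ch in unicodedata.normalize("NFC", text):  # fold precomposed forms first
        if unicodedata.category(ch).startswith("M"):  # Mn/Mc/Me: combining marks
            run += 1
            if run > max_marks:
                continue  # excess mark: silently drop
        else:
            run = 0
        out.append(ch)
    return "".join(out)

print(cap_marks("\u0e01" + "\u0e47" * 20))  # base + at most 4 marks survive
```

Note that legitimate text like 'Spin̈al Tap' passes through untouched, since one mark per base is well under any sane cap.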
Indeed, pasting from the Google input and investigating, it's a single U+0E01 ก (THAI CHARACTER KO KAI) with 20 U+0E47 ็ (THAI CHARACTER MAITAIKHU), a diacritic.
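Easy to reconstruct and count the bytes yourself; a quick check in Python (both codepoints sit in the three-byte UTF-8 range, so 21 codepoints come out to 63 bytes):

```python
import unicodedata

zalgo = "\u0e01" + "\u0e47" * 20   # KO KAI + 20 x MAITAIKHU

print(len(zalgo))                   # 21 codepoints
print(len(zalgo.encode("utf-8")))   # 63 bytes (3 bytes per codepoint in UTF-8)
for ch in zalgo[:2]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```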
This character used to be (or maybe still is) a very popular way of trolling people on facebook. Flooding chat window with those funny letters seemed to crash the browser after a while.
Once had this dude sign up for our page management... we had a lot of assumptions about plain or at least sane text that had to get updated. https://www.facebook.com/glitchr
The problem seems to be restricted to Windows - apparently it looks normal on Ubuntu and Mac. (Here: Mac 10.6, works fine on Chrome, Firefox and Safari.)
So I guess it's a platform issue.
In cases like this it is useful to post a short description of the link. Like this: http://stackoverflow.com/a/1732454 (the famous "don't parse HTML with regex" cry for sanity, semi-relevant because of zalgo text.) FTFY
What on earth is going on here? 🔴҈҈҈҈̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚
That's a pretty fun one. It's a sequence of the following codepoints: {LARGE RED CIRCLE} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} and then 66 repetitions of {COMBINING LEFT ANGLE ABOVE} {COMBINING CYRILLIC MILLIONS SIGN}
Funny you mention this. We are working on support for this right now (today). EAWebKit, used by SimCity, didn't originally have Thai support, and it's being implemented now. I can tell you what it will look like though, as I just took a screenshot: http://i.imgur.com/ZFtGP87.png. It's conventional that repeated Thai decorators stick with the base glyph, though that character is invalid Thai. We might fix it nevertheless.
Hmm, that's interesting. At work I looked at this on my Ubuntu machine with the Chromium browser. It didn't look particularly special because all the diacritic marks were drawn on top of each other in the same location above the letter.
I come home and look at it on my Windows machine with Chrome, and now I see the big stack of diacritic marks that I assume everyone's making a fuss about. I assume it's something to do with the way that the system's installed font lays out the marks in question.
Yeah I was completely baffled by this until I looked in Google Image Search. I ran this on my Windows box and sure enough it looks crazy with little 'springs' going everywhere. Under Chrome, Firefox and Safari on OS X it looks normal though, just foreign text.
I was going to ask about this a week ago. An "Anonymous" twitter account posted it last week and the letters overlaid 3 or 4 tweets above it. Had no idea what it was.
Technically "correct," you mean; if there were a Unicode character that filled the whole screen with black, it wouldn't matter that it was technically "correct". Usability correctness is more important.
If stacking diacritics are a legitimate part of a language and they change the meaning of words or characters, then it's more important that the content be correct than the "usability."
In fact, if a person can't read it or reads it incorrectly because parts of characters are hidden, then it's not very usable.
Hypothesizing about a Unicode character that fills the screen with black is a nonsensical straw man, because it makes no sense in "real world" written languages, so there would never be a Unicode character for it.
False; diacritics commonly used in the real world, such as ´¨`, fit inside the same space as the characters, while this successive chain of diacritics is never used in real-world text except in a very few obscure cases. Besides, if the implementers really believe that displaying obscure characters is more important than usability, the text-rendering implementation should reserve the line-height required to display the character correctly.
After trying the CSS suggested here on the Google page inside Chrome, the problematic characters no longer run too far up like they did originally, but they get clipped at the top of the first line of the paragraph, rather than at the line the character is actually on.
Doh, you're right. For some reason I had it in my head that the block itself wrapped onto multiple lines but I must have been thinking of something else.
It seems like there isn't really any CSS only solution for this without wrapping every character in its own element like jQueryIsAwesome suggested.
It's a huge hammer and a considerably larger webpage footprint just so that a type of character isn't allowed to run amok, something that's not going to happen anyway in the vast majority of cases.
We have very different definitions of "huge hammer", in any decent modern browser (Chrome, Firefox, IE9) it takes less than 10ms to apply the mentioned code in 20 texts (using the linked 10 Google results as an example).
It also puts every single special character into its own span. That may negatively affect programs that trust HTML for copy-paste, for example Open Office, Microsoft Word (I think?), some IM clients, some text editors, etc. Doesn't seem worth it in the general case.
That said, if those sorts of things bother an individual, they could run that on the page themselves, I suppose, so it's good for that :)
10ms is a huge amount to spend for such an obscure fix. How many obscure corner cases do you think Google webpages have to account for? A lot more than 100, I would bet. So rough methods like these would be horrible for performance.
You're the one who made the 10ms claim, not me. If indeed it is faster than you claimed, then maybe it would be ok. Still probably not worth engineering time for a problem of this magnitude.
It corrupts Unicode data because it splits on code points instead of grapheme clusters. When performed on 'Spin̈al Tap', it splits the base character U+006E (LATIN SMALL LETTER N) from the combining character U+0308 (COMBINING DIAERESIS) and results in the string 'Spin<span class="s_char">̈</span>al Tap', which contains the valid Unicode grapheme cluster '>̈'! If you were to split on grapheme clusters instead, the result would be 'Spi<span class="s_char">n̈</span>al Tap'. However, I still wouldn't support that solution because it could negatively affect text segmentation used by search engine indexing and natural language processing tools.
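The fix for that corruption is to split on something cluster-shaped rather than on codepoints. Python's stdlib has no UAX #29 segmenter, so this is a deliberately simplified approximation (it only attaches Mn/Mc/Me combining marks to the preceding base; full grapheme clustering also handles ZWJ sequences, Hangul jamo, etc.):

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clustering: glue each combining mark
    (general category Mn/Mc/Me) onto the preceding base character."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

name = "Spin\u0308al Tap"  # 'Spin̈al Tap': the n carries a combining diaeresis
print(clusters(name))      # the n and its diaeresis stay in one cluster

# Wrapping per cluster keeps the diaeresis off the markup's '>'
spans = "".join(
    f'<span class="s_char">{c}</span>' if len(c) > 1 else c
    for c in clusters(name)
)
print(spans)  # 'Spi<span class="s_char">n̈</span>al Tap'
```

Even done correctly, the segmentation-index and NLP objections above still stand.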
See that: yet another issue with characters in file/directory names that aren't printable ASCII.
This kind of stuff is precisely the reason we make sure that every filename we create uses only a subset of ASCII (and no spaces, of course). In our source code, in our builds, in the desktop app we're serving, etc.
Unicode characters entered by users should go in one place: the DB.
I smiled the other day when I read the build script for Chromium: it clearly specified that the source directory must not contain any spaces in its name.