Love hotels and Unicode

davidw · on April 3, 2012

I loved the idea of the slides interspersed with text. Just posting slides is usually lame, because you lose out on the actual talk, which contains most/much of the actual information.

pmjordan · on April 3, 2012

This really ought to be the standard way to post presentations online. Unfortunately, it's more work than uploading to the awful slideshare.

_delirium · on April 3, 2012

Yeah, it's an interesting format I've recently started experimenting with for some of my own talks (though only preliminarily). It seems sort of halfway between the "throw up slides" non-solution, and the traditional academic solution, which is to accompany a talk with a written paper. Then the paper would serve as the "durable" version of the talk suitable for archiving, laying out the same material but in a way more suited to text, and usually a bit more formally, with more details (e.g. to enable someone to actually reproduce the work).

The slide interspersed with text format could be seen as a more heavily illustrated, easy-reading sort of paper. More similar to the talk than a traditional accompanying paper would be, but not just the slides. But I think it's still non-trivial work to make a good version, which is why people often just throw up the slides, because that's almost literally no additional work.

You basically have to take the same source material and think how you would write it up into a blog post, which is not 100% the same as thinking of how to give it as a talk. Though I guess a first-cut solution could be to just record the talk and type up a transcript in between the slides.

pmjordan · on April 3, 2012

You could probably do a lot worse to prepare for a talk than writing such a text+slides version of it, then distilling key words as notes from the prose and then practicing and giving the talk based on those.

corbet · on April 3, 2012

A lot more work. Like a lot of people, I don't read some kind of static text to an audience; even with the same slides I give a different talk every time. The only way to put together a page like this one would be to simply transcribe the talk. If somebody has the time to do that, that's cool; I sure don't.

pmjordan · on April 3, 2012

I don't know if it really would be that more work if you planned it from the start. Granted, I'm not very good or experienced at presentations and public speaking; so when I do I tend to spend a lot of time preparing. This preparation invariably involves practicing the talk a lot and writing down key words to remind me of points I must not forget. I suspect that writing the talk out at some point during that process would help just as much as (if not more than) spending the same time practicing the talk a few times. I'll certainly try this next time.

abdelazer · on April 3, 2012

Agreed. In the same vein: http://nedbatchelder.com/text/unipain.html

bgruber · on April 3, 2012

i had never before read about the unicode flags, and the rather (politically and technically) brilliant end run around the problem. fascinating.

jerf · on April 3, 2012

I am torn between it being brilliant, versus it should have been taken as a sign that adding all those symbols was just a waste of time. I sort of feel like Unicode 5 is jumping the shark here. At the point where you're arguing about how to refer to the color of the hair on one of your "graphemes", may I humbly suggest that the goal of creating a complete global encoding scheme is apparently done and the committee ought to disband. Unicode has apparently ceased being about being a universal encoding for human text and expanded its mission into becoming a universal icon catalog. This is just silly. Next up in Unicode 7, we deprecate the face icons in favor of a series of combining characters that allow to mix and match hairstyles, face colors, eyes, noses, etc., to create arbitrary faces, why not.

greggman · on April 4, 2012

Or you could consider that 140million Japanese (not sure about other countries) have been using emoji every day for the last 10+ years on their cellphones.

They've become part of the language and it would be culturally insensitive to give them a big "fuck you, us westerner's don't need your damn icons, either keep those to yourself and hence don't be able to inter-operate with everyone else or stop using them even though they've been a big part of your culture for 10+ years"

jerf · on April 4, 2012

You're trying to make me feel bad, but you're doing it with a false dilemma: Either we stick emoji in Unicode or the Japanese are screwed and we told them to "fuck" off. This is a false dilemma, and exactly the sort of thinking that worries me as we fuzzily expand the charter of Unicode beyond a universal grapheme repository. I say there's plenty of third options, many of which are better choices than jamming it into the international all-purpose standard.

Unicode is supposed to be the central hub for all graphemes, but I would certainly like to see some argument that emoji are actually graphemes. For one thing it's quite bizarre that suddenly Unicode appears to be specifying colors as well as shapes, which is one bright line I'd be concerned about. (There may have been colors before but they would be much more exceptional, and I'm not aware of any.) Unicode was ambitious as the all-grapheme repository, it's simply a guaranteed failure if it tries to become the repository of all vaguely iconic/smilyish/little pictures in the world.

aidenn0 · on April 4, 2012

Japanese do mix emoji with kanji and kana. Are emoji any less graphemes than the mor pictographic han characters? I don't think the colors are normative though, any more than the exact shape is. A different representation of a love hotel ought to be fine.

jerf · on April 4, 2012

The colors I'm referring to are the ones in the names of the code points, such as the ones the Germans complained about It doesn't get much more normative than that.

aidenn0 · on April 5, 2012

Ah, that grapheme doesn't actually have color IIRC, but the hair is not filled, thus giving the impression of a fair-haired person (rather than dark haired)

sk5t · on April 4, 2012

Should every fad or handset feature adopted by 2% or more of the global population be incorporated into Unicode?

TazeTSchnitzel · on April 4, 2012

Sort of. Unicode is designed to unify all character sets, eventually. 2% of the world's population using a character set (granted, not traditional characters here but it still counts) is enough.

maaku · on April 3, 2012

There is still important work to be done with respect to code point unification. Notably, encoding ancient or minority scripts that don't have national governments advocating on their behalf.

Unfortunately, entrenched business and political interests have made encoding flags and emoticons a higher priority.

wvenable · on April 4, 2012

> we deprecate the face icons in favor of a series of combining characters that allow to mix and match hairstyles, face colors, eyes, noses, etc., to create arbitrary faces, why not.

Encoding Mii's in Unicode, brilliant! Nintendo should really try to get a spot on the committee.

Michiel · on April 3, 2012

Unicode is probably the closest we are today to world peace. If we can get all this somewhat right, including the flags, the noodles and the 'person with blond hair', I think there is hope...

stephengillie · on April 4, 2012

Wholeheartedly agreed. We've gone from killing each other over hair and skin color, to arguing about them in committee.

vorg · on April 4, 2012

> Every character, or to be more exact, every "grapheme", is assigned a Unicode code point.

Every character is assigned a Unicode code point. The Unicode consortium defines a grapheme as a "user perceived character", usually made up of one Unicode code point, but sometimes two or more. A base character can be followed by one or more non-spacing marks, together forming a "grapheme", the most common of which usually have a "canonical mapping" to a single character, but need not.

guard-of-terra · on April 3, 2012

Minor point: I see copy-paste from wikipedia about ISO-8859-5. It's unfortunate since nobody ever used ISO-8859-5. They probably should change it to ISO-8858-something-else in the wikipedia article.

pmjordan · on April 3, 2012

It also entirely glosses over the fact that before the ISO-8859 standards, there were the horrendous code pages in DOS and numerous other encodings on other platforms, which made things hard even for Europeans, let alone languages with a non-latin-derived alphabet.

guard-of-terra · on April 3, 2012

And after ISO-8859, there were Windows (aka ANSI) code pages, also not matching perfectly (or sometimes at all) with either DOS or ISO-8859.

yuhong · on April 3, 2012

Most SBCS codepages other than 437 as used today was introduced in 1987 with DOS 3.3 which was after ISO 8859.

derleth · on April 3, 2012

And before that, you get into encodings like EBCDIC, RAD50, SIXBIT, FIELDATA, and even more failed schemes now largely forgotten.

Why should a brief overview go back even to the pre-ISO-8859 days except to mention ASCII? None of them are directly relevant: The world we're dealing with now on the Web begins with ASCII, moves through a Pre-Unicode Period, and finishes up in the Land of Unicode, where it's at least possible to do things Right. All history tells a narrative; when it comes to character encodings, that's a good default unless you really think your audience cares about why FORTRAN was spelled that way back in the Before Time.

Tom Jennings has an interesting history:

http://www.wps.com/projects/codes/

Someone · on April 3, 2012

"The world we're dealing with now on the Web begins with ASCII"

I know nothing about the implementation of early web browsers/gopher/etc, but I doubt there ever was anything on the web that used ASCII. 7-bit email may have been around at e time, but I would guess Tim Berners Lee just used whatever character set his system used by default (corrections welcome; being snarky isn't the only reason I write this)

flomo · on April 3, 2012

It was a hotly debated topic whether the www should use 7-bit/mime or not.

derleth · on April 3, 2012

> I know nothing about the implementation of early web browsers/gopher/etc, but I doubt there ever was anything on the web that used ASCII.

All headers, HTTP, email, or otherwise, are 99% or more ASCII. HTML markup is over 99% ASCII for most documents, especially the complex ones.

ASCII is the only text encoding you can guarantee everything on the Web (and the Internet in general, really) knows how to speak. Finally, guess what all valid UTF-8 codepoints in the range U+00 to U+7F inclusive are compatible with: ASCII.

TazeTSchnitzel · on April 4, 2012

ASCII in fact is the completely safe text encoding for HTML - and thanks to HTML entities, you do not lose any international character support. You can have a Unicode-using HTML document encoded in ASCII - it's just quite big.

Someone · on April 4, 2012

I know that, but "over 99% ASCII" = "not ASCII". For many users, UTF8 is over 99% ASCII, but it is not ASCII.

derleth · on April 4, 2012

> I know that, but "over 99% ASCII" = "not ASCII"

No, that's not what I meant. I meant that all of the essential bits are ASCII, all of the software that generates those important pieces as to know ASCII, and it's entirely possible for software that speaks only ASCII to handle it as long as the filenames (the main source of non-ASCII characters) being served are also ASCII.

Read the HTTP specification sometime.

mavroprovato · on April 3, 2012

Mirror from Google cache, but unfortunately text only:

http://webcache.googleusercontent.com/search?q=cache:http://...

icebraining · on April 3, 2012

The Coral cache has images: http://www.reigndesign.com.nyud.net/blog/love-hotels-and-uni...

brown9-2 · on April 3, 2012

Does anyone happen to know what the arguments for big-endianness versus little-endianness in the Unicode format were?

I've never really understood the advantage of one over the other, but this section on Wikipedia http://en.wikipedia.org/wiki/Endianness#Optimization helps explain that there are optimizations that can be made at a hardware level when performing arithmetic on little-endianness values.

Is the argument for using LE in Unicode similar?

ajross · on April 3, 2012

If you're using, say, UCS2 (not UTF16, lest we get too confused), it's awfully nice if the wide character (i.e. a C short) for "A" can be equal to the literal 'A' and not 0x4100. Making that work depends on the endianness of the host architecture, not just the data.

jonshea · on April 3, 2012

http://www.ietf.org/rfc/ien/ien137.txt

sbierwagen · on April 4, 2012

  ASCII was actually invented in the US in the 1960s, as a 
  standardised way of encoding text on a computer. ASCII 
  defined 128 characters, that's a character for half of the 
  256 possible bytes in an 8-bit computer system.

Uh? Wouldn't it be easier to just say that it's a 7-bit coding system? And what does he mean by "256 possible bytes in an 8-bit computer system"?

hardy263 · on April 4, 2012

"7 bits" versus "half of 8 bits" are two slightly different things. One has a padding, the other does not. So the file size for a 7 bit encoding would be slightly smaller than an 8 bit one.

TazeTSchnitzel · on April 4, 2012

Because there's 256 possible byte values. Like there's 65536 possible short values. (Using short to mean 2 bytes here, I'm aware it varies)

davvid · on April 5, 2012

The one thing he forgot to say is, "If you have to choose an encoding, use utf-8."

Someone who doesn't know the difference between utf-16, utf-8, etc. might not know which to use.

adavies42 · on April 3, 2012

i'd've noted that endianness is hardly unique to unicode.

akrifa · on April 3, 2012

This was a fantastic read. Thank you!

derleth · on April 3, 2012

When he says byte-order marks are optional, does he mean just in UTF-8 (where they are) or also in UTF-16 (where I strongly suspect they are not)?

(Yes, you can use heuristics to guess which endianness is in use. The problem is that while this is trivial for Western languages I don't even know how you'd begin when presented with arbitrary text using an East Asian or African written language.)

patio11 · on April 3, 2012

The problem is that while this is trivial for Western languages I don't even know how you'd begin when presented with arbitrary text using an East Asian or African written language

Not actually that hard.

Consider a document which is encoded in either a) ASCII like you know it or b) ASCII where the top 4 bits and bottom 4 bits are transposed. How would you tell the difference? Well, one can imagine creating a histogram of the bits for each half of the bytes and comparing them to expectations based on the distribution of bits in naturally occurring English text. The half with most of the entries in the 0x5, 0x6, and 0x7 is the upper order half.

If you don't know what naturally occurring e.g. Japanese looks like in Unicode code points, take this on faith: flipping the order does not give you a document which looks probably correct. (Also, crucially, Japanese with the order flipped doesn't resemble any sensible document in any language -- you end up with Unicode code points from a mishmash of unrelated pages.)

P.S. Why care about that algorithm? Here's a hypothetical: you're a forensic investigator or system administrator who, given a hard drive which has been damaged, need to extract as much information as possible from it. The BOM is very possibly not in the same undamaged sector which you are reading right now, and it may be impossible to stitch the sectors without first reading the text. How would you determine a) whether an arbitrary stream of bytes was likely a textual document, b) what the encoding was, c) what endianness it was, if appropriate, and d) what human language it was written in?

ajross · on April 3, 2012

Note that precisely such an encoding detection algorithm was specified as part of XML to avoid the absurdity of the BOM.

funkah · on April 3, 2012

Sounds hard to me.

chc · on April 3, 2012

How about this simplified version:

1. Try both byte orders

2. If one produces valid text and the other does not, choose that one (this will get you the correct answer almost every time, even if the source text is Chinese)

3. If both happen to produce valid text, use the one with the smallest number of scripts

(Note that this just determines byte order, while Patrick was talking about the more ambitious task of heuristically determining whether a random string of bytes is text and if so what encoding it is. My point is just that you really don't need to be told the order of the bytes in most cases.)

kijin · on April 3, 2012

Simple in theory, but hard enough in practice that companies like Microsoft screw it up from time to time.

Try saving a text file in Windows XP Notepad with the words "Bush hid the facts" and nothing else. Close it and open the file again. WTF Chinese characters! Conspiracy!

jerf · on April 3, 2012

That's not Microsoft "screwing it up", that's you not feeding the algorithm enough characters for it to be really sure. While that short string is below the threshold, the threshold is actually quite surprisingly small; if I remember correctly it's just over 100 bytes and any non-pathological input will be correctly identified with effectively 100% success.

chc · on April 3, 2012

That's a bug having to do with uncertain encoding (which is what I called "the more ambitious task"), not uncertain byte order.

btn · on April 3, 2012

The BOM is optional in UTF-16 too:

  The UTF-16 encoding scheme may or may not begin with a BOM.
  However, when there is no BOM, and in the absence of a 
  higher-level protocol, the byte order of the UTF-16 
  encoding scheme is big-endian.

(D98 in http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf)

prodigal_erik · on April 3, 2012

Oh jeez, that's just begging to get mangled by a transition between protocols, e.g., HTTP PUT followed by rsync (to a box which doesn't know the PUT was UTF-16LE).

pmjordan · on April 3, 2012

There's nothing stopping you from leaving off the BOM as far as I know. The BOM is pointless in UTF-8 (the encoding is inherently endian-independent) and if you can guarantee by convention that your UTF-16 text will be read in the same byte order as it was written, you also don't need it.

Another thing I noticed in the article: the encodings issue was historically even more complicated: Unicode previously only covered what is now the basic multilingual plane (BMP), which meant all code points could be encoded with a single 2-byte value, a.k.a. UCS-2. When more code points got added, that exceeded UCS-2's range, so that mutated into the variable-length UTF-16 encoding. Had Unicode been introduced into Windows and Java later, who knows if they'd ended up using UTF-16.