Every Unicode character shown once per frame as a 33-minute movie (netpoetic.com)
87 points by DanielKehoe on April 11, 2011 | hide | past | favorite | 47 comments



I'm surprised that I was able to perceive individual characters. At the beginning when the movie is displaying the alphabet it seems like I can see each letter. Is this really one character per frame (assuming playback at 24 frames per second)?

Edit:

49,571 characters divided by 25 frames per second comes out to 1,982.84 seconds, which is about 33.05 minutes. Taking into account the padding at the beginning and end of the video, this confirms that the video displays 1 character per frame at 25 frames per second.
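The frame math can be redone in a couple of lines (assuming the 49,571-character count from the comment above):

```python
# Sanity check of the frame arithmetic: characters / fps = seconds of video.
chars = 49_571
fps = 25
seconds = chars / fps      # 1982.84 s of character frames
minutes = seconds / 60     # ~33.05 min, matching the video's running time
print(seconds, round(minutes, 2))
```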


Some people have adapted this into a system they claim helps with speed reading [1], and while I don't particularly believe the speed-reading claims, it is true that you can read quite quickly with words flashing at fairly high rates.

There are various sites online that can help you play with this, but I can't get any of them to work on the machine I'm posting this on (no Java, zapreader.com doesn't work for unknown reasons).

[1]: http://en.wikipedia.org/wiki/Rapid_Serial_Visual_Presentatio...


I could buy the speed reading claims, perhaps at slightly slower rates. One of the cool things about language and our brains is you can miss letters and get order mixed up, and instantly guess what was meant with a very high rate of success.

Emxalpe: I persmue you sitll udnr snatd tis?


sliverstorm, I don't want to take this too far off topic, but what are the rates you're hearing of that you have trouble believing? I read at a little over 1,300 words per minute (and I keep getting faster). As to comprehension, without taking a test it's hard to say, but by reading a section of material for a few minutes, writing down everything I can remember without looking at the book, then re-reading to see what I've missed, I seem to get roughly 80% comprehension, which I find acceptable for most things (and it keeps getting better through practice).

Disclaimer: that depends on the material. I read at about half that rate (around 600 wpm) if it's extremely technical or dense, or a good deal faster for anything at or below a high school reading level.


What I have trouble believing is that anyone could, using Rapid Serial Visual Presentation (which was the subject of the comment I replied to) process characters (or words) that flash on the screen as fast as (or faster than) 25fps.

The problem with serial presentation is, of course, that you cannot skip words or process words in parallel. I suspect that when you attain 1,300 wpm, you are focusing your eyes on half a page or perhaps the entire page, and only skimming sentences and absorbing key words and ideas, rather than processing each word one at a time.


Ah, my apologies for misunderstanding your comment. That's what I get for attempting rational thought at one in the morning.

Before I learned to speed read the way I currently do, I tried learning using RSVP. In my personal experience I didn't find it very helpful, and my comprehension rates (when using RSVP) plummeted. I'm in complete agreement there.

As to my 1,300 wpm, you're partially correct. 1,200 wpm is just about the limit at which I can read word by word, line by line in a book, using my hand as a pacer. After 1,200, I start dropping conjunctions, pronouns, prepositions, and anything that can be inferred through context or isn't necessary to the meaning of the sentence (e.g., absorbing key words and ideas, but still more precise than half a page at a time).

Here's the book I learned from so you can judge the merits based upon something better than my attempted explanation: http://www.amazon.com/Breakthrough-Rapid-Reading-Peter-Kump/...


You can perceive them because they 'burn into' your retina and leave an after-image. Your eyes act as a queue or buffer of sorts; by the time your brain registers the character (especially characters outside the Latin alphabet and Arabic numerals, which your brain sees and recalls much less often), several other characters have already flashed by.

You don't notice this effect with continuous video, but 25 fps becomes more of a problem with high-speed action flicks and their ilk.


This is why there's a push for high-FPS video — 25 FPS is really too slow to look smooth without a lot of blurring or an audience that's very willing to overlook slight flicker.


I'm also surprised the codec handles this so well. I expected a lot more artifacts.


"Oh, neat, I found where the Asian characters begin!"

And then they never ended.


i personally think it's not quite right to call hieroglyphs characters and assign each of them a code.

i may be wrong, but i think there is a growing number of hieroglyphs, just as there is an infinite (growing) number of words; and every hieroglyph consists of more primitive parts. i think they should be coded so that a sequence of primitives forms a hieroglyph, the way letters form words in european languages


Sometimes I wonder what the computer industry would look like if it had been developed entirely in Japan, inside a protective bubble from the outside world. What would code look like? What would the input mechanism look like? It's a bit staggering to think of the complexity needed.

The standard keyboard has simply been shoehorned to accommodate the Japanese and Chinese character sets.


My guess is that touchpad-style inputs would have won out over keyboards.


That's an interesting idea, but there seems like there would be a tremendous amount of complexity in encoding how the various strokes or components fit together. In English, there are twenty-six letters that fit together in one linear order, delimited by spaces and a few punctuation marks. How would you write that "down-and-to-the-left diagonal stroke meets unfilled square somewhat above the midpoint of that square's right-hand side?"


The granularity doesn't necessarily need to get down to the stroke level. In fact, if you start playing the video at around the 6 minute mark, you'll see that many of the characters are composed of the same elements, known as radicals (see: http://en.wikipedia.org/wiki/Radical_(Chinese_character) ).

Unicode does have the concept of "combining characters" in which a string of characters is used to compose one glyph (see: http://en.wikipedia.org/wiki/Combining_character ), but currently they generally are only used for adding diacritics. All Chinese characters in the current Unicode standard are precomposed, but it's potentially not out of the question to encode them as a composite of two or more sub-characters. The downside to this is that each character would end up taking several more bytes to encode, but one advantage is that novel characters could be created by combining two or more existing characters, which currently cannot be done without explicitly adding the novel character to the Unicode standard (see: http://en.wikipedia.org/wiki/Precomposed_character#Chinese_c... ).
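The combining-character behaviour is easy to see from Python's standard library (a minimal sketch using `unicodedata`: NFC composes a base letter plus a diacritic into one precomposed code point, NFD decomposes it again):

```python
import unicodedata

# "é" can be one precomposed code point (U+00E9) or "e" plus a
# combining acute accent (U+0301); normalization maps between the two.
decomposed = "e\u0301"   # two code points
precomposed = "\u00e9"   # one code point
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
print(len(decomposed), len(precomposed))  # 2 1
```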

The input methods that fedd mentions are just that: input methods, which translate what the user types into encodings of precomposed characters. This is different from having the encodings themselves represent the composition of the characters.


This explains why that has not been done: http://news.ycombinator.com/item?id=2435708


it's implemented in several ways, i am late as always :)

http://en.wikipedia.org/wiki/Chinese_input_methods_for_compu...

(thanks chc for the link)


You really mean hieroglyphs? Or ideographs/ideograms?


The correct term is logogram or logograph. "Hieroglyph" more commonly refers to the Ancient Egyptian writing system, whereas logogram and logograph are more general terms. Ideogram/ideograph refer to a specific type of logogram in which the character represents an idea or concept directly, but only a small minority of Chinese characters are ideograms.

For more information, see: http://en.wikipedia.org/wiki/Logogram http://en.wikipedia.org/wiki/Chinese_character_classificatio...


hm, seems the terminology has changed since i was trying to invent the chinese keyboard in childhood. we used to call chinese characters hieroglyphs (characters that represent bigger notions like words or parts of words), as opposed to letters (characters that represent sounds). now i see in wikipedia that they are called sinographs and include ideograms.

my idea was to split the chinese characters into the most primitive strokes: horizontal, vertical and diagonal strokes, squares, commas, etc.

i don't know; they have keyboards in china, and they are not that big. do they let you input sounds and then choose from several chars that sound like the input, or do they implement my childish idea?


On western keyboards, Japanese is typed using an input method. Basically, you spell the words using a phonetic convention and the computer guesses what sinograph you meant: you type zitensha, press space, and it hopefully becomes 自転車 (bicycle). Because each of zi, ten, and sha could be the reading of any of multiple characters the computer might guess wrongly. It works pretty much all the time, though.
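A toy sketch of the lookup step described above; the two-entry dictionary here is made up for illustration, while real IMEs use large dictionaries plus context models to rank candidates:

```python
# Toy model of a phonetic input method: map a romanized reading to
# candidate precomposed characters. The dictionary is illustrative only.
CANDIDATES = {
    "zitensha": ["自転車"],     # bicycle, as in the example above
    "ki": ["木", "気", "器"],   # one reading can match many characters
}

def convert(reading: str) -> list[str]:
    """Return candidate spellings for a reading (echo it back if unknown)."""
    return CANDIDATES.get(reading, [reading])

print(convert("zitensha")[0])  # 自転車
```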

There are also Japanese keyboard layouts featuring all 51 syllables (besides the Latin alphabet). You just type the syllables; there's no phonetic convention, just the guessing game.

I don't know how it works for Chinese, though. It's probably the same, I guess. I know that on Mac OS X one can draw the characters with the finger on the touchpad.


thanks, i meant something like this from chc link:

http://en.wikipedia.org/wiki/Chinese_input_methods_for_compu...

besides the phonetic method you describe, you can enter graphical parts of symbols. so i thought unicode should encode those parts. but there is more than one shape-based method, along with several phonetic methods, so they encode all the possible ideograms instead, making the unicode chart that big


Both systems are used, though the shape-based methods use a more sophisticated granularity than simple strokes (for example, there is one component consisting of a diagonal stroke, three horizontal strokes and a square). Relevant Wikipedia article: http://en.wikipedia.org/wiki/Chinese_input_methods_for_compu...


> there is an growing number of hieroglyphs as there is an infinite (growing) number of words; and every h. consists of more primitive parts - i think they should be coded so that the sequence of primitives would form a hieroglyph like in european languages letters form words

The FAQ for Chinese and Japanese at the Unicode Consortium's website (http://unicode.org/faq/han_cjk.html) poses this very question:

> Q: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?

Their reply:

A: The Han ideographic script is largely compositional in nature. The overwhelming number of characters created over the centuries (and still being coined) are made by adjoining two or more old characters in simple geometric relationships. For example, the Cantonese-specific character U+55F0 嗰 was created by adjoining the two older characters, U+53E3 口 and U+500B 個, one next to the other.

The compositional nature of the script—and, more to the point, the fact that this compositional nature is well-known—means that over time tens of thousands of ideographs have been created, and these are currently encoded in Unicode by using one code point per ideograph. The result is that some 71,000 code points are consumed by ideographs in Unicode 5.0, nearly three-quarters of the characters encoded.

The compositional nature of the script makes it attractive to propose a compositional encoding model, such as can be used for Hangul. Such a mechanism would result in the savings of thousands of code points and relieve the IRG from the burden of having to examine potential candidates for encoding.

Unfortunately, there are some difficulties involved with a compositional model for Han.

First of all, while the rules for drawing composed Jamos as Hangul syllables are relatively straightforward, those for Han are surprisingly complex. To use U+55F0 嗰 as an example again, although it is built structurally out of two pieces, the left piece occupies far less than 50% of the character's horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and doesn't apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies into account, or the encoding model would have to provide more information than just the geometric relationship between the composing pieces. (This is the main reason why the existing Ideographic Description Sequence mechanism is inadequate even for drawing described ideographs.)

Even more difficult is the problem of normalization, which would be necessary for operations such as comparison or searching. A normalization algorithm would first have to parse the sequence of composing Han for validity, and then make sure that all substrings are normalized. It should also be able to recognize a "canonical" form for a sequence of composing Han. Thus, U+55F0 嗰 could be spelled using three pieces (U+53E3 口, U+4EBB 亻, U+56FA 固) as well as with two. Indeed, since U+4EBB 亻 is a well-known variant form of U+4EBA 人, it could be spelled using that character, as well. Providing a canonical representation would have to take these multiple spellings into account.

The open-ended nature of the script and possibilities for ambiguous spelling make it virtually impossible to guarantee that two characters made up by two different people would be treated as equivalent even if they look exactly the same and are intended to be equivalent.

Other computer processes such as machine-based translation or text-to-speech would probably have to skip such characters when they occur in plain text, because there is no simple, authoritative way for these processes to be able to determine even approximate definitions or pronunciations from the visual representation alone. Even if the data are available, the need to parse strings of variable length before looking them up creates complications.

Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.

While the number of encodable ideographs has proven far greater than Unicode had originally anticipated, the standard is in no danger of running out of room for them any time soon. 71,000 ideographs encoded in 17 years amounts to just over 4000 ideographs per year. At this rate, it would take nearly two hundred years to fill up the available space in Unicode with ideographs.

And while the number of unencoded but useful ideographs is larger than originally anticipated, it is also finite and probably smaller than the number of ideographs already encoded. The bulk of useful unencoded forms is likely to come from placenames, personal names, or characters needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded characters and would best be represented as such.

While it currently takes several years for the IRG to fully process proposed ideographs so that they can be encoded, steps are being taken to streamline this, and further steps will be possible in the future should they prove necessary. Indeed, the bulk of the work currently done by the IRG would still have to be done for composed ideographs in order to provide support for them beyond rendering.
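The Hangul model the FAQ contrasts with Han can be seen in Python's standard library: NFC normalization composes a sequence of conjoining Jamo into a single precomposed syllable (a small sketch using only `unicodedata`):

```python
import unicodedata

# Three conjoining Jamo (HIEUH U+1112, A U+1161, final NIEUN U+11AB)
# normalize under NFC to the single precomposed syllable 한 (U+D55C).
jamo = "\u1112\u1161\u11ab"
syllable = unicodedata.normalize("NFC", jamo)
print(syllable, hex(ord(syllable)))  # 한 0xd55c
```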


+1

What a great way of making yourself feel a bit more insignificant and smaller.


Or you could feel proud of yourself for speaking a language - a meme - that was sufficiently virulent to overtake most others.


Sadly, my 'meme' isn't there on the video. Helvetica doesn't like my language I guess.


No, it should make you feel proud -- understanding the world through an alphabet of 26 (or similarly few) letters.

(Speaking as an Asian.)


Time?


I clicked ahead randomly.


Fantastic!

What is the sound and how was it generated? It sounds like it might be narrow samples of some voice.


It is a computer voice saying each character, timestretched to a fraction of its original length, likely the fraction needed for the longest one to fit into one frame.

I've wasted a lot of time messing around with different sound sources in granular synthesis, and the timbre it produces is unmistakable.

You get a similar effect fast forwarding an audio book, depending on the software.

EDIT: maybe I spoke too soon; upon further inspection it seems to go up in pitch much too linearly with time. It's definitely a voice synthesis algorithm, though, or perhaps one of those synthesisers that use parts of those algorithms to make musical sounds.

EDIT2: I think it may be the voice algorithm reading out the character code, with the pitch set to the code * some constant. The jumps in pitch are from when there is a run of undefined codes. Higher numbers take longer to say, so the sound becomes more continuous as it goes on.
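The pitch-proportional-to-code-point guess can be sketched like this; the base frequency, scale factor, and sample rate are arbitrary assumptions, not values recovered from the actual video:

```python
import math

# Synthesize one frame's worth of a sine blip whose frequency rises
# linearly with the code point, as the comment above speculates.
SAMPLE_RATE = 25_000      # assumed; gives 1000 samples per 1/25 s frame
HZ_PER_CODEPOINT = 0.05   # arbitrary scale factor

def frame_tone(codepoint: int) -> list[float]:
    freq = 100 + HZ_PER_CODEPOINT * codepoint  # pitch climbs with the code
    n = SAMPLE_RATE // 25                      # samples in one video frame
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for t in range(n)]

samples = frame_tone(ord("A"))
print(len(samples))  # 1000
```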


It gets higher and higher pitched as the video goes on, so may have to do with the unicode values. But too jumpy to just be that.


You're probably right. Not all values have been assigned a glyph yet.


Sounds like random binary data - in fact, I bet the audio is generated the same way as the video.


It's not every Unicode character, just the ones in Helvetica.


I'm curious how the audio was generated. It seems to be somewhat dependent on the character being displayed, as evidenced by the jumps between different types of characters (basic letters vs. Japanese characters).



Only shows the Basic Multilingual Plane, when does the first sequel come out?


I was really hoping to see things ordered by closeness of one character's shape to another so that as it plays it looks like something organic growing and moving around. I'll buy a beer for the person who creates that video.


They say characters are the most important element of a story.


They're the most important element to a movie, especially this one.


Strangely beautiful.


It is, isn't it? I found myself thinking, "I'll just watch for a minute to see what this is", then getting lulled in until a distraction broke my reverie.


Where were you guys in 2009?

http://vimeo.com/7489601


This video would be a lot cooler with some good background music.


u vee double u ex wye zed now we know the alphabet



