The headline is technically correct, but quite misleading. The language in question, Ugaritic, was deciphered in 1932. Since it was only discovered in 1929, the decipherment can hardly be called an arduous process. The real significance of the computer program is captured by the paper itself:
> The key strength of our model lies in its ability to incorporate a range of linguistic intuitions in a statistical framework.
As someone working in NLP, this is quite remarkable, and I don't mean to detract from it; but it should be noted that the model (like the linguists) relied on the fact that Ugaritic is closely related to Hebrew. It would be impossible to use such a model to decipher Linear A, which is probably not related to any known language. (But the model could perhaps be used to check if linguists had overlooked any connection between Linear A and another known language.)
I happen to have taken a Ugaritic class in college, and after a semester I could read the texts without too much trouble. As a native speaker of Hebrew, I found the shared roots made it not very hard.
It used a novel alphabet (likely the first ever) that was a phonetic abjad (no vowels) derived from Akkadian cuneiform (in turn derived from Sumerian cuneiform).
As far as "few tablets" goes, I happen to have used VERY thick books full of transcriptions of Ugaritic tablets. One advantage of knowledge of Ugarit (the city) being lost is that its ruins lay mostly undisturbed until their modern rediscovery.
So, if we discover another language with close genetic ties to a known tongue and an alphabet that's different but similar, we'll have just the thing :) Of course, this is better than what we had before; my jest isn't meant to detract!
> We applied our model to a corpus of Ugaritic, an ancient Semitic language discovered in 1928. Ugaritic was manually deciphered in 1932, using knowledge of Hebrew, a related language.
They specifically used a well-known language to be able to test the effectiveness of their approach.
Pretty cool nevertheless, but hardly "a dead language that mystified linguists" :)
> The computer program relies on a few basic assumptions in order to make intuitive guesses about the language's structure. Most importantly, the lost language has to be closely related to a known, deciphered language, which in the case of Ugaritic is Hebrew. Second, the alphabets of the two languages need to share some consistent correlations between the individual letters or symbols. There should also be recognizable cognates of words between the two languages, and words that have prefixes or suffixes in one language (like verbs that end in "-ing" or "-ed" in English) should show the same features in the other language.
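The "consistent correlations between the individual letters" assumption can be illustrated with a toy sketch. This is not the paper's method (which is a Bayesian model jointly inferring letter mappings and cognates); it's just the crudest version of the intuition: if the two scripts map onto each other consistently, symbol frequencies in related corpora should line up, so matching symbols by frequency rank gives a rough first guess at the letter correspondence. The function name and the toy data are my own invention.

```python
from collections import Counter

def frequency_alignment(lost_texts, known_texts):
    """Crude first-pass guess at a letter mapping: pair symbols of
    the lost script with letters of the known language by matching
    frequency ranks. (Toy illustration only -- the actual model
    infers the mapping statistically, jointly with cognates.)"""
    lost_freq = Counter("".join(lost_texts))
    known_freq = Counter("".join(known_texts))
    lost_ranked = [c for c, _ in lost_freq.most_common()]
    known_ranked = [c for c, _ in known_freq.most_common()]
    return dict(zip(lost_ranked, known_ranked))

# Toy example: the digits stand in for unknown symbols of the
# lost script; the letters are the known language's alphabet.
lost = ["121", "321", "11"]
known = ["aba", "cba", "aa"]
print(frequency_alignment(lost, known))  # {'1': 'a', '2': 'b', '3': 'c'}
```

On real data a pure frequency match would be far too noisy, which is presumably why the model treats the mapping as a latent variable to be refined against the cognate evidence rather than fixing it up front.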
So... it's a statistical decoder ring? Impressively effective, and a distinct possibility for accelerating the decoding of newly discovered languages, but that doesn't sound like much more than a Markov chain attached to a diff tool.
Also:
> The lost language of Ugaritic was last spoken 3,500 years ago. It survives on just a few tablets, and linguists could only translate it with years of hard work and plenty of luck. A computer deciphered it in hours.
That's not "mystifying" to linguists, that's a mildly tough nut to crack. Mystified them for a while, certainly - but so do many small-sample-size languages.
edit: have not read the "original paper", this is all based off the article.
edit2: a brief skim of the original paper implies it finds cognates by statistically-similar use of morphemes. It appears they've got their whole algorithm in there, if anyone cares to investigate more deeply. So I wasn't too far off at least, and the article's writer did a decent job explaining how it worked. Much better than newspapers typically manage :)
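To make the "decoder ring attached to a diff tool" quip concrete: once you have a hypothesised letter mapping, you can transliterate each lost-script word into the known alphabet and flag near-matches as cognate candidates. This is a deliberately naive sketch of the intuition, not the paper's algorithm (which scores morpheme-level similarity inside a Bayesian model); the function names, the edit-distance threshold, and the use of Levenshtein distance here are all my own simplifications.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cognate_candidates(lost_words, known_words, mapping, max_dist=1):
    """Transliterate each lost-script word through a hypothesised
    letter mapping, then flag known-language words within a small
    edit distance as possible cognates."""
    pairs = []
    for w in lost_words:
        mapped = "".join(mapping.get(c, "?") for c in w)
        for k in known_words:
            if edit_distance(mapped, k) <= max_dist:
                pairs.append((w, k))
    return pairs
```

The real model goes one step further and lets the cognate matches feed back into the letter mapping, rather than treating the mapping as fixed.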
> The surviving Ugaritic texts tell the stories of a Canaanite religion that is similar but not identical to that recorded in the Old Testament, providing Bible scholars a unique opportunity to examine how the Bible and ancient Israelite culture developed in relation to its neighbors.
I missed it the first time through too. No idea what was in the data set, however; nothing mentions that.