The headline is technically correct, but quite misleading. The language in question, Ugaritic, was deciphered in 1932. Since it was only discovered in 1929, the decipherment can hardly be called an arduous process. The real significance of the computer program is captured by the paper itself:
> The key strength of our model lies in its ability to incorporate a range of linguistic intuitions in a statistical framework.
As someone working in NLP, this is quite remarkable, and I don't mean to detract from it; but it should be noted that the model (like the linguists) relied on the fact that Ugaritic is closely related to Hebrew. It would be impossible to use such a model to decipher Linear A, which is probably not related to any known language. (But the model could perhaps be used to check if linguists had overlooked any connection between Linear A and another known language.)
I happen to have taken a Ugaritic class in college, and after a semester I could read the texts without too much trouble. As a native speaker of Hebrew, I found the shared roots made it not very hard.
It used a novel alphabet (likely the first ever) that was a phonetic abjad (no vowels) derived from Akkadian cuneiform (in turn derived from Sumerian cuneiform).
As far as "few tablets" goes, I happen to have used VERY thick books full of transcriptions of Ugaritic tablets. One advantage of knowledge of Ugarit (the city) being lost is that its ruins lay mostly undisturbed until their modern rediscovery.
So, if we discover another language with close genetic ties to a known tongue and an alphabet that's different but similar, we'll have just the thing :) Of course, this is better than what we had before; my jest isn't meant to detract!
> We applied our model to a corpus of Ugaritic, an ancient Semitic language discovered in 1928. Ugaritic was manually deciphered in 1932, using knowledge of Hebrew, a related language.
They specifically used a well-known language to be able to test the effectiveness of their approach.
Pretty cool nevertheless, but hardly "a dead language that mystified linguists" :)
> The computer program relies on a few basic assumptions in order to make intuitive guesses about the language's structure. Most importantly, the lost language has to be closely related to a known, deciphered language, which in the case of Ugaritic is Hebrew. Second, the alphabets of the two languages need to share some consistent correlations between the individual letters or symbols. There should also be recognizable cognates of words between the two languages, and words that have prefixes or suffixes in one language (like verbs that end in "-ing" or "-ed" in English) should show the same features in the other language.
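The "consistent correlations between the individual letters" assumption can be illustrated with a toy sketch. This is not the paper's method (which is a Bayesian model jointly inferring letter mappings and cognates); it's just the crudest version of the intuition: if the two scripts map onto each other consistently, symbol frequencies in related corpora should line up, so matching symbols by frequency rank gives a rough first guess at the letter correspondence. The function name and the toy data are my own invention.

```python
from collections import Counter

def frequency_alignment(lost_texts, known_texts):
    """Crude first-pass guess at a letter mapping: pair symbols of
    the lost script with letters of the known language by matching
    frequency ranks. (Toy illustration only -- the actual model
    infers the mapping statistically, jointly with cognates.)"""
    lost_freq = Counter("".join(lost_texts))
    known_freq = Counter("".join(known_texts))
    lost_ranked = [c for c, _ in lost_freq.most_common()]
    known_ranked = [c for c, _ in known_freq.most_common()]
    return dict(zip(lost_ranked, known_ranked))

# Toy example: the digits stand in for unknown symbols of the
# lost script; the letters are the known language's alphabet.
lost = ["121", "321", "11"]
known = ["aba", "cba", "aa"]
print(frequency_alignment(lost, known))  # {'1': 'a', '2': 'b', '3': 'c'}
```

On real data a pure frequency match would be far too noisy, which is presumably why the model treats the mapping as a latent variable to be refined against the cognate evidence rather than fixing it up front.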
So... it's a statistical decoder ring? Impressively effective, and a distinct possibility for accelerating the decoding of newly discovered languages, but that doesn't sound like much more than a Markov chain attached to a diff tool.
Also:
> The lost language of Ugaritic was last spoken 3,500 years ago. It survives on just a few tablets, and linguists could only translate it with years of hard work and plenty of luck. A computer deciphered it in hours.
That's not "mystifying" to linguists, that's a mildly tough nut to crack. Mystified them for a while, certainly - but so do many small-sample-size languages.
edit: have not read the "original paper", this is all based off the article.
edit2: a brief skim of the original paper implies it finds cognates by statistically-similar use of morphemes. It appears they've got their whole algorithm in there, if anyone cares to investigate more deeply. So I wasn't too far off at least, and the article's writer did a decent job explaining how it worked. Much better than newspapers typically manage :)
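To make the "decoder ring attached to a diff tool" quip concrete: once you have a hypothesised letter mapping, you can transliterate each lost-script word into the known alphabet and flag near-matches as cognate candidates. This is a deliberately naive sketch of the intuition, not the paper's algorithm (which scores morpheme-level similarity inside a Bayesian model); the function names, the edit-distance threshold, and the use of Levenshtein distance here are all my own simplifications.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cognate_candidates(lost_words, known_words, mapping, max_dist=1):
    """Transliterate each lost-script word through a hypothesised
    letter mapping, then flag known-language words within a small
    edit distance as possible cognates."""
    pairs = []
    for w in lost_words:
        mapped = "".join(mapping.get(c, "?") for c in w)
        for k in known_words:
            if edit_distance(mapped, k) <= max_dist:
                pairs.append((w, k))
    return pairs
```

The real model goes one step further and lets the cognate matches feed back into the letter mapping, rather than treating the mapping as fixed.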
> The surviving Ugaritic texts tell the stories of a Canaanite religion that is similar but not identical to that recorded in the Old Testament, providing Bible scholars a unique opportunity to examine how the Bible and ancient Israelite culture developed in relation to its neighbors.
I missed it the first time through too. No idea what was in the data set, however; nothing mentions that.