Spoken language is pretty diverse. The data set we've been using for spoken language understanding is a corpus of telephone conversations between strangers about an assigned topic. Naturally, this contains a lot of ums and uhs!
But spoken language about technical topics and difficult ideas also tends to be disfluent.
For a long time, computational linguists used algorithms that performed reasonably on written language but poorly on spoken language. Specifically, these algorithms weren't linear-time in the length of the input, so the input had to be pre-segmented: you had to run another model before the parser and then live with whatever errors that model introduced.
Now that we've got linear-time algorithms that do everything jointly, it's not so bad. We just need to get a bit better at using intonation features and at parsing lattice input, so that we can deal with recognition errors.
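To make the lattice idea concrete, here's a minimal sketch of my own (not from our system) of a word lattice as a directed acyclic graph whose edges carry competing recognizer hypotheses; the `WordLattice` class and the scores are made up purely for illustration.

```python
from collections import defaultdict

class WordLattice:
    """A toy word lattice: a DAG whose edges hold alternative recognizer hypotheses."""

    def __init__(self):
        # edges[start_node] -> list of (word, acoustic_score, end_node)
        self.edges = defaultdict(list)

    def add_edge(self, start, word, score, end):
        self.edges[start].append((word, score, end))

    def paths(self, start, end):
        """Enumerate every hypothesis (word sequence, total score) from start to end."""
        if start == end:
            yield [], 0.0
            return
        for word, score, nxt in self.edges[start]:
            for words, total in self.paths(nxt, end):
                yield [word] + words, score + total

# Two competing hypotheses for the same stretch of audio:
# "I um I want" vs. "I am I want" -- a lattice-aware parser sees both.
lat = WordLattice()
lat.add_edge(0, "I", -0.1, 1)
lat.add_edge(1, "um", -1.2, 2)
lat.add_edge(1, "am", -1.5, 2)
lat.add_edge(2, "I", -0.1, 3)
lat.add_edge(3, "want", -0.3, 4)

for words, score in lat.paths(0, 4):
    print(" ".join(words), round(score, 2))
```

The point of feeding the parser a lattice rather than a single transcript is that it can pick whichever hypothesis parses best, instead of being stuck with the recognizer's one-best guess.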
Mostly, it's a relief not to have to deal with the really tortured language often found in journalese. News reports are written really weirdly.