Spoken language is pretty diverse. The data set we've been using for spoken language understanding is a corpus of telephone conversations between strangers about an assigned topic. Naturally, this contains a lot of ums and uhs!
But spoken language about technical topics and difficult ideas also tends to be disfluent.
For a long time, computational linguists used algorithms that performed reasonably on written language but poorly on spoken language. Specifically, these algorithms weren't linear-time in the length of the input, so the input had to be pre-segmented: you had to run another model before the parser and then live with whatever errors that model introduced.
Now that we've got linear-time algorithms that do everything jointly, it's not so bad. We just need to get a bit better at using intonation features and at parsing lattice input, so that we can deal with recognition errors.
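To make the lattice idea concrete, here's a minimal sketch of my own (not from our system) of a word lattice as a directed acyclic graph whose edges carry competing recognizer hypotheses; the `WordLattice` class and the scores are made up purely for illustration.

```python
from collections import defaultdict

class WordLattice:
    """A toy word lattice: a DAG whose edges hold alternative recognizer hypotheses."""

    def __init__(self):
        # edges[start_node] -> list of (word, acoustic_score, end_node)
        self.edges = defaultdict(list)

    def add_edge(self, start, word, score, end):
        self.edges[start].append((word, score, end))

    def paths(self, start, end):
        """Enumerate every hypothesis (word sequence, total score) from start to end."""
        if start == end:
            yield [], 0.0
            return
        for word, score, nxt in self.edges[start]:
            for words, total in self.paths(nxt, end):
                yield [word] + words, score + total

# Two competing hypotheses for the same stretch of audio:
# "I um I want" vs. "I am I want" -- a lattice-aware parser sees both.
lat = WordLattice()
lat.add_edge(0, "I", -0.1, 1)
lat.add_edge(1, "um", -1.2, 2)
lat.add_edge(1, "am", -1.5, 2)
lat.add_edge(2, "I", -0.1, 3)
lat.add_edge(3, "want", -0.3, 4)

for words, score in lat.paths(0, 4):
    print(" ".join(words), round(score, 2))
```

The point of feeding the parser a lattice rather than a single transcript is that it can pick whichever hypothesis parses best, instead of being stuck with the recognizer's one-best guess.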
Mostly, it's a relief not to have to deal with the really tortured language often found in journalese. News reports are written really weirdly.