Legitimate question: Why isn't this being done with software? The Speech-to-Text...

hga · on Feb 13, 2015

Echoing jamesbrownuhh, the vicious PR for this lawsuit at http://nad.org/news/2015/2/nad-sues-harvard-and-mit-discrimi... specifically calls out what appears to be nonsense YouTube auto-captioning ... and two of the examples are of nothingburger visits by Lady Gaga and Obama. It's unclear jamesbrownuhh's generosity would extend to properly captioning video of every vacuous celebrity's visit.

Worse, you've got accept that throwing this into the legal arena adds costs way beyond just proper captioning. Any settlement that will make the plaintiffs happy will require the establishment of a ADA enforcement unit at these institutions, and resultant red tape for anyone in the communities to publish anything with audio. Which going forward will have a clear chilling effect; we're not just talking about a lot of Harvard and MIT potentially going dark, but continuing in that mode except for the most important things that are worth the extra captioning and legal effort.

jamesbrownuhh · on Feb 13, 2015

Speech to text is simply not a solved problem. The technology gets better all the time, enormous strides have been made, but voice recognition is something that the human brain does well and computers generally do not, at this time.

In terms of accuracy, once you start falling beneath the high-nineties percentage threshold, it becomes increasingly hard for humans to make sense of the text. Dropping to 66% accuracy means that one word in three is wrong - it's almost impossible to make sense of something that badly degraded.

Here's a nice example, here are the 'Automatic Captions' - provided by YouTube's voice recognition technology - of an MIT YouTube video. This is a 'best of all possible worlds' example - there is a single speaker, whose words are pronounced slowly and clearly, with no background noise or undue slurring or interference. You'll see that this is good, but nevertheless in a 90 second video there are 20+ errors (not all of them obvious or easy to decipher.)

"0:00 the use of microneedles in the gastrointestinal tract 0:03 presents a unique opportunity to enable the oral delivery of large molecules 0:08 again sewing that are currently limited to injection 0:11 and adjustable capsules such as the one shown could be imagine 0:14 it would contain a reservoir to house the therapeutic payload 0:18 and have a pH responsive coating to cover the neil's 0:21 allowing for easy ingestion 0:23 after ingestion the bill would pass through the stomach and into the 0:26 intestine 0:27 there because I'm a dissolved revealing the microneedles 0:30 the pair started motion in the tissue would compress the reservoir 0:34 expelling the drug out the needles and into the tissue 0:37 insulin injections were tested in the GI tract a pig's 0:40 as a result injection a small bowl can be seen in the tissue 0:44 this small injection result in a robust drop in the animal's blood glucose 0:49 that superior to the effect elicited by traditional subcutaneous injection 0:54 oil administration as expected has no effect 1:01 the safety impasse 1:02 manga vice was also tested in pics the model device was placed her in over two 1:08 into the stomach the pics once in the stomach 1:11 it was released 1:16 by look 1:17 radio graphic 1:18 image to the pill shown here the progress on the bill through the animal 1:22 can be tracked by serial X-rayed 1:25 the pill was found to be safe and well-tolerated"

As you can see, this is very, very good in the circumstances - but it's still not quite there.

This whole thing is a shame, because MIT do provide captioning on a good number of their YouTube videos, which deserves credit. It seems righteously unfair for NAD to pick on an uncaptioned video and then point out isolated errors in YouTube's (not MIT's) transcriptions of same, a point which I hope is made in MIT's defence.

There are 263 videos on MIT's YouTube channel, of which 91 are captioned. Most of the videos are only a few minutes long, so getting that channel 100% captioned would take maybe a day or two, if that - I'm almost tempted to donate a weekend of my time to it (assuming that MIT would even take the finished results, which is of course by no means certain.)

Obviously that's just YouTube - the wider selection of courses is another matter. But it illustrates why things like this aren't always solvable with software. It can certainly help, and save the humans some time, but it's not a "fit and forget" solution, and isn't likely to be for quite some time yet.