Legitimate question: Why isn't this being done with software? The Speech-to-Text problem has been around for a long time, and it seems like there are a lot of people who are financially motivated to solve it. If the best solutions on the market, or ideally a combination of the best solutions, can't provide a decent baseline transcript, then why aren't people tripping over themselves to solve this problem?
It seems like an 80% solution would be good enough. Hell, even a 66% solution seems like a good compromise or starting point. If an automatically generated transcript can convey at least two-thirds of the information from a lecture for a one-time or small incremental cost, then I don't see why both parties wouldn't be OK with it. Those with disabilities would have to do some extra work to look up garbled words or ideas that don't translate well to text, but it would be within the bounds of reason (say, a 1-hour lecture would now take 2 hours to parse). The organizations producing the content would most likely have to pay for speech-to-text software, either several thousand dollars per year per class or $X per lecture, but they would still come out cheaper than paying someone per minute to do the transcription. It isn't a win-win situation, but more of an equitable lose-lose.
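For what it's worth, here's a rough back-of-the-envelope comparison in Python. Every figure is a purely hypothetical placeholder - I don't know what vendors actually charge for either option - so it's only a sketch of the shape of the argument:

    # Back-of-the-envelope comparison of human captioning (billed per minute of
    # audio) versus automated speech-to-text (billed per lecture). Every number
    # here is an illustrative assumption, not a real quote from any vendor.

    LECTURE_MINUTES = 60           # assumed length of one lecture
    LECTURES_PER_CLASS = 30        # assumed number of lectures in one class

    HUMAN_RATE_PER_MINUTE = 2.50   # assumed $/minute for professional captioning
    ASR_COST_PER_LECTURE = 5.00    # assumed flat $/lecture for automated captions

    human_cost = HUMAN_RATE_PER_MINUTE * LECTURE_MINUTES * LECTURES_PER_CLASS
    asr_cost = ASR_COST_PER_LECTURE * LECTURES_PER_CLASS

    print(f"Human captioning, per class:     ${human_cost:,.2f}")  # $4,500.00
    print(f"Automated captioning, per class: ${asr_cost:,.2f}")    # $150.00

Even if those made-up numbers are off by a wide margin, the per-minute human rate dominates, which is the whole case for accepting an imperfect transcript.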
They say a fair deal has been reached when both sides in a negotiation are a little bit unhappy. A software solution would seem to do that without ignoring the rights of the disabled or placing prohibitive costs on the content producers. And it would set a precedent going forward: Content producers must make an effort to accommodate those with disabilities, but the disabled should be willing to make some extra effort themselves. Asking an elderly woman in a wheelchair to lift herself over a sidewalk curb is not reasonable. Asking the same person to spend an extra 30 minutes to decipher an unclear transcript might be.
Echoing jamesbrownuhh, the vicious PR for this lawsuit at http://nad.org/news/2015/2/nad-sues-harvard-and-mit-discrimi... specifically calls out what appears to be nonsense YouTube auto-captioning ... and two of the examples are of nothingburger visits by Lady Gaga and Obama. It's unclear whether jamesbrownuhh's generosity would extend to properly captioning video of every vacuous celebrity's visit.
Worse, you've got to accept that throwing this into the legal arena adds costs way beyond just proper captioning. Any settlement that will make the plaintiffs happy will require the establishment of an ADA enforcement unit at these institutions, and resultant red tape for anyone in those communities who wants to publish anything with audio. That will have a clear chilling effect going forward; we're not just talking about a lot of Harvard and MIT content potentially going dark, but about both institutions continuing in that mode, publishing only the most important things that are worth the extra captioning and legal effort.
Speech to text is simply not a solved problem. The technology gets better all the time, enormous strides have been made, but voice recognition is something that the human brain does well and computers generally do not, at this time.
In terms of accuracy, once you start falling beneath the high-nineties percentage threshold, it becomes increasingly hard for humans to make sense of the text. Dropping to 66% accuracy means that one word in three is wrong - it's almost impossible to make sense of something that badly degraded.
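To make that concrete, here's a minimal sketch (my own illustration, nothing to do with YouTube's actual system) that degrades a sentence to a given word-level accuracy. Real recognisers substitute plausible-sounding wrong words rather than obvious placeholders, which arguably makes the errors even harder to spot:

    import random

    def degrade(sentence, accuracy, seed=0):
        # Keep each word with probability `accuracy`; replace the rest with a
        # placeholder, mimicking a transcript where those words were misheard.
        rng = random.Random(seed)
        return " ".join(w if rng.random() < accuracy else "???"
                        for w in sentence.split())

    original = ("the use of microneedles in the gastrointestinal tract presents "
                "a unique opportunity to enable the oral delivery of large molecules")

    for acc in (0.95, 0.80, 0.66):
        print(f"{int(acc * 100)}%: {degrade(original, acc)}")

At 95% the sentence is still readable; at 66% you're guessing at every third word, and in a real transcript those gaps are filled with wrong-but-plausible words instead of obvious markers.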
Here's a nice example: the 'Automatic Captions' - provided by YouTube's voice recognition technology - for an MIT YouTube video. This is a 'best of all possible worlds' example - there is a single speaker, whose words are pronounced slowly and clearly, with no background noise or undue slurring or interference. You'll see that this is good, but nevertheless, in a 90-second video there are 20+ errors (not all of them obvious or easy to decipher).
"0:00 the use of microneedles in the gastrointestinal tract
0:03 presents a unique opportunity to enable the oral delivery of large molecules
0:08 again sewing that are currently limited to injection
0:11 and adjustable capsules such as the one shown could be imagine
0:14 it would contain a reservoir to house the therapeutic payload
0:18 and have a pH responsive coating to cover the neil's
0:21 allowing for easy ingestion
0:23 after ingestion the bill would pass through the stomach and into the
0:26 intestine
0:27 there because I'm a dissolved revealing the microneedles
0:30 the pair started motion in the tissue would compress the reservoir
0:34 expelling the drug out the needles and into the tissue
0:37 insulin injections were tested in the GI tract a pig's
0:40 as a result injection a small bowl can be seen in the tissue
0:44 this small injection result in a robust drop in the animal's blood glucose
0:49 that superior to the effect elicited by traditional subcutaneous injection
0:54 oil administration as expected has no effect
1:01 the safety impasse
1:02 manga vice was also tested in pics the model device was placed her in over two
1:08 into the stomach the pics once in the stomach
1:11 it was released
1:16 by look
1:17 radio graphic
1:18 image to the pill shown here the progress on the bill through the animal
1:22 can be tracked by serial X-rayed
1:25 the pill was found to be safe and well-tolerated"
As you can see, this is very, very good in the circumstances - but it's still not quite there.
This whole thing is a shame, because MIT do provide captioning on a good number of their YouTube videos, which deserves credit. It seems decidedly unfair for NAD to pick on an uncaptioned video and then point out isolated errors in YouTube's (not MIT's) transcriptions of same, a point which I hope is made in MIT's defence.
There are 263 videos on MIT's YouTube channel, of which 91 are captioned. Most of the videos are only a few minutes long, so getting that channel 100% captioned would take maybe a day or two, if that - I'm almost tempted to donate a weekend of my time to it (assuming that MIT would even take the finished results, which is of course by no means certain).
Obviously that's just YouTube - the wider selection of courses is another matter. But it illustrates why things like this aren't always solvable with software. It can certainly help, and save the humans some time, but it's not a "fit and forget" solution, and isn't likely to be for quite some time yet.