Your next conference should have real-time captioning (lkuper.github.io)
136 points by oskarth on June 1, 2014 | 41 comments



It's hard to overstate the power and value of having a text transcript of any kind of spoken material like this. It unlocks a world of possibilities - not just for search and video indexing, both massively useful in themselves, but also for making such talks genuinely digitally accessible, so that they can be read and understood by people who cannot hear, or by a worldwide audience of people who don't necessarily have English as a first or second language.

Caption all the things. There are so many benefits that it's just daft not to.


English is not my native language; I had to learn it myself from written material. I could read and write just fine, but there was no way I could understand spoken English. I'd sometimes start watching some tech video, or listen to a talk, and try to catch a few words here and there. Captioning would have been a blessing back then.

Actually, it was. I discovered American TV shows on the internet (well... I had no legal way to access them). I started watching the first season of 24 with subtitles; they were like training wheels for ~20 episodes. Then I managed to ditch them and started just listening (with a finger on a shortcut to jump 10 seconds back).

So, please have text transcripts if possible. They make a huge difference for non-native English speakers.


Oh yeah, I've learned a lot of my English from TV shows with subtitles. Some countries (Russia) always dub English shows and movies, while others, like Romania, pretty much never dub and almost always just use subtitles. I think that accounts for a higher percentage of young Romanians speaking better English.


I was recently in Romania and was blown away by how casually people spoke English! A slightly American accent, with pacing and rhythm similar to a native speaker's.

Contrast that with German English speakers, who likely had high-quality English classes throughout their education, yet speak with a distinct accent and a rhythm closer to that of the German language.

I don't know if there are technical reasons for this, but anecdotally, Romanians have said they learned a lot of their English from watching subtitled TV and movies.

Someone else in this thread mentioned the show 24. It's fascinating that, indirectly, watching pirated copies of a typical Hollywood entertainment television show (which academic and intellectual communities may find "crude") can open up a wealth of knowledge and culture (English-language-only resources) for people who, potentially, wouldn't have had the educational opportunities otherwise.

It makes my spine tingle when stories like this flip your understanding and perspective of things like the Hollywood Entertainment Machine. Things have value in unexpected and fascinating ways.


In China, you can buy the scripts to Friends with the Chinese translation printed below as a learning resource.

I knew a guy at university who had an OK understanding of the English language, with a fairly strong Chinese accent, but he would occasionally say strange informal slang words that no one has said since Friends aired.


That is likely because Germany dubs all of its TV shows (and I think most of western Europe does the same). No country in eastern Europe that I know of dubs its shows for television; we just caption them.

Hence, people speak good English with a slightly American accent. We're exposed to it pretty much all our lives.


> It makes my spine tingle when stories like this flip your understanding and perspective of things like the Hollywood Entertainment Machine.

No doubt. I'm not sure whether they ever made a conscious choice of "we'll teach our people English better" or just took the easier route. But watching American shows in English is really invaluable for learning American conversational English. A huge part of it is the cultural references and sayings that just never translate directly - things like "Don't put the cart before the horse", or even silly movie and pop-culture memes like "Hasta la vista, baby" (Terminator 2). That is stuff one has very little chance of learning in school (grammar and vocabulary always come first).


Agreed. The only question is whether it should be real-time or post-processed. There was real-time transcription at DEFCON last year (my first time at that con), and while it was occasionally humorous, and occasionally confusing, it was almost always useful as a fallback when you couldn't quite hear the speaker, particularly during Q&A when the discussion wasn't well mic'd.

So I would say transcription is a must-have, and even real-time transcription is a feature that a large portion of attendees will benefit from, not just those who can't hear well, unless your conference space is fairly small or your sound system is truly excellent.

I guess it's also worth considering why the sound systems for these conferences are so epically bad. For example, what's with passing around microphones when we are all already carrying our own personal mic?


Real-time transcription can be excellent for distributed meetings as well. We do this on some teams at Mozilla (e.g., Servo and Rust) for all of our group meetings, which allows people to participate who couldn't make it, don't dial in to the video conference, have a poor internet connection, or for whom English is a second (and sometimes third) language.

Though I'll be the first to admit that I'm in awe of the skills of their stenographer; I'm not nearly that fast, and rely on other meeting participants to fix up my typos (especially in code) in our collaborative editor, https://etherpad.mozilla.org/


I first encountered steno'd captions in person at Google's TGIF and it was awesome even for the non-hearing-impaired. You get a few seconds of scrollback in case you miss a word, which is easy to do at large gatherings.

The only thing that seems to be missing from the !!Con transcripts is timestamps. I'm not familiar with Plover, so it might already be a feature, but being able to output one of the standard subtitle formats (SubRip, Timed Text, SSA) would make remuxing an MP4/MKV of each talk with subtitles much easier.
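For what it's worth, going from a timestamped transcript to SubRip is almost trivial. Here's a rough Python sketch, assuming you somehow had (start, end, text) entries in seconds (the example captions below are made up, just to show the format):

    def to_srt_time(seconds):
        # SubRip timestamps look like HH:MM:SS,mmm (comma before the milliseconds)
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3600000)
        m, ms = divmod(ms, 60000)
        s, ms = divmod(ms, 1000)
        return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

    def write_srt(captions, path):
        # captions: list of (start_seconds, end_seconds, text) tuples
        with open(path, "w") as f:
            for i, (start, end, text) in enumerate(captions, 1):
                f.write("%d\n%s --> %s\n%s\n\n"
                        % (i, to_srt_time(start), to_srt_time(end), text))

    write_srt([(0.0, 2.5, "Welcome to !!Con."),
               (2.5, 6.0, "Our captions are being written live by a stenographer.")],
              "talk01.srt")

Once you have the .srt, mkvmerge can mux it into an MKV alongside the video without re-encoding anything.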


If you have a good transcript, you can temporarily upload the video to YouTube (mark it as private if you like; it only needs to be accessible to you), then in the "Captions and Subtitles" options, upload the plain text transcript.

YouTube will sync the transcript to the audio (removing the ambiguity of having to guess what is being said, since you're telling it - so it knows which words to listen out for), and you can download the resulting automatically timed file as an SBV or SRT file.

It's not 100% perfect, and is something of a hack, but it usually works pretty well. :)


http://amara.readthedocs.org/en/latest/index.html (Amara: Create Captions and Subtitles) might be of interest to you.


Thank you! Those docs foxed me completely but I did take a look at Amara's main site - an interesting approach. Not a magic bullet but I love the principle of opening up tasks like these to the crowd.


Can anyone offer pointers to efficient algorithms for achieving this? It seems like the naive way would be to run voice recognition on the video and line up the matches with the transcript to get anchor points, and then interpolate between those points to build an index from word -> timestamp?
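Roughly, something like this naive anchor-and-interpolate sketch (purely illustrative - a real aligner would use dynamic programming / forced alignment rather than exact word matches):

    # transcript: list of words from the human transcript
    # asr: list of (word, time_in_seconds) pairs from a speech recognizer
    def align(transcript, asr):
        # 1. Index the times at which the recognizer heard each word.
        asr_times = {}
        for word, t in asr:
            asr_times.setdefault(word.lower(), []).append(t)

        # 2. Treat transcript words the recognizer heard exactly once as anchors.
        #    (Assumes anchors come out in increasing time order; a real
        #    implementation would filter out-of-order or low-confidence matches.)
        anchors = []
        for i, word in enumerate(transcript):
            times = asr_times.get(word.lower(), [])
            if len(times) == 1:
                anchors.append((i, times[0]))

        # 3. Linearly interpolate timestamps for the words between anchors.
        timestamps = [None] * len(transcript)
        for (i0, t0), (i1, t1) in zip(anchors, anchors[1:]):
            span = max(i1 - i0, 1)
            for i in range(i0, i1 + 1):
                timestamps[i] = t0 + (t1 - t0) * (i - i0) / float(span)

        return list(zip(transcript, timestamps))  # (word, approximate time) pairs

Words before the first anchor and after the last one come back as None here; in practice you'd clamp those to the nearest anchor or to the start/end of the video.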


We work with large volumes of subtitles here, and that's basically what we are aiming to do. There are a couple of commercial solutions that still do a poor job for a hefty price tag, if that's the only thing you need from them. They're not profitable compared with human syncing unless you have to sync more than a couple of hundred hours of content.


The BBC developed an "Assisted Subtitling" system which on paper is pretty fancy - accepts scripts in most common formats, automatically determines optimum colours for each speaker, processes video to find shot changes, uses voice recognition to spot the dialogue at the right points, and turns out an almost-complete subtitle file that just needs a look over by a human to ensure it's sensible.

Better still, it's now open source - although sadly the voice recognition part of it relies on a closed source commercial product which (at first glance) might not even still be available.

Dangerous in the wrong hands, but interesting: http://subtitling.sourceforge.net/


This would be hugely welcome at conferences for me! I'm hearing impaired, and while I can usually hear the speakers (with hearing aids), I sometimes struggle to understand the actual words they are saying (poor speech discrimination). This seems like something that could really increase the quality of conferences for everyone without a lot of expense. Kudos to !!Con for doing this.


I am in the exact same situation. Would love it.


Same here! I would love it, and it would make me actually attend conferences.


Even though my hearing is fine, I'd love to see something like this!

Sometimes the acoustics of the room are bad, or there's background noise, or maybe you just missed a word and need to have it repeated. All of these problems would be solved by this system.

This kind of thing would benefit all conference goers.


Are there any speech-to-text systems out there that could do this reliably, at, say, 80% accuracy?


Look at any con video on YouTube and turn on the "English Automatic Captions" - they're generated by some pretty good voice recognition software. But as you'll see, the results are a long way from perfect.

My favourite example of misrecognition is one of Travis Goodspeed's talks, where YouTube's voice recognition output "Geek women are expensive, but not prohibitively so."

Voice recognition is OK but it falls a long way short of a usable level of accuracy, and even the accuracy it can muster goes right down the toilet if there is any background noise or music, or if the speaker is in any way unclear (accents, rushed or slurred speech, etc). There's a long way to go before you can just get a usable transcript of speech automatically.

Quite a lot of voice recognition engines seem to have been trained on thousands of hours of C-SPAN or Meet The Press or something, because when recognition conditions get challenging, some engines start to degenerate into outputting nonsense like "congress Muslims Kenya capitol great today Cheney".

There is no substitute for a human pair of ears and a lightning-fast means of text entry like a steno keyboard - nice to see Plover getting a mention in the source article too.


This is far beyond what a computer could do. For instance, see this transcript:

https://github.com/hausdorff/bangbangcon.github.io/blob/gh-p...

For instance, she'll separate out conversations:

>> Can you move the mic closer to your mouth? >> Yes. Is this better? Is this better? Okay. I will talk like this, then. >> You can move the mic. >> Like... >> Take it off the stand and hold it up to your face.

She can also figure out when something is an acronym (like LARP), make sure everything is capitalized correctly (Python, Ruby), separate out what's being said into paragraphs when the speaker starts talking about something new, and a ton of other things.


From the FAQ on Mirabai Knight's website: http://stenoknight.com/FAQ.html#cartspeech

"Automatic speech recognition is not currently a substitute for human transcription, because computers are unable to use context or meaning to distinguish between similar words and phrases and are not able to recognize or correct errors, leading to faulty output. The best automatic speech recognition boasts that it's 80% to 90% accurate, but that means that, at best, one out of ten words will be wrong or missing, which results in a semantic accuracy rate that's often far lower than 90%, depending on which word it is."

(this is a subset of the answer to "Will speech recognition make CART obsolete?")


The thing in that category that I have found useful is a speech recognition system trained on the speaker, using the same headset it was trained on. (This was with IBM ViaVoice almost 10 years ago with maybe 30 minutes of training; Dragon is said to be more accurate.) It was substantially worse than human, but good enough to usually get the meaning across, which YouTube's auto-captions mostly don't.


I feel like all college lectures would become immediately more valuable if this could be done accurately. There were so many lectures where I missed a detail, lost track of what the professor was explaining, and then zoned out for the rest of the class.


For simple English (and other major languages), I would assume so. But with all the jargon, nonstandard language, and acronyms that are used at tech conferences (and, to a lesser extent, in other fields)... I would expect significant decreases in accuracy.


80% accuracy is unacceptable for a deaf or hearing impaired person to use. It really needs to be 95-98% or so for a good understanding of the entire thing.


If you actually repeat the speech into the microphone, you can go quite a bit higher.


This is used quite a lot in the UK for captioning live news broadcasts - it's called respeaking and relies on a speaker basically repeating the speech in a flat monotone voice, such that some voice recognition software can more easily make it out.

It works to some degree, and has the advantage for the subtitling companies that respeakers are easier to train and don't need to be paid as much as a proper stenographer. The disadvantage is that the output is much slower and subject to a rather greater delay. There is still nothing that beats steno - but respeaking is cheaper and people don't complain enough about the inaccuracies.


In Italy it was the other way around - steno was cheaper for some reason, but there were issues with foreign words / names / etc. I used to do this stuff for a while and I saw the most incredible things go on air on both sides...


I'd like to explore whether there are any automated solutions.

What software/SDKs have you used?


Even the best automated solutions rely on recognising someone who is speaking clearly and precisely, ensuring that every word is well spaced and completely clear.

There are no automated solutions that can universally do a good job of transcribing natural speech from people who aren't specifically "speaking to be recognised", if that makes sense.

Maybe one day, but not yet. It's a problem waiting to be solved, so the reward for the first who can really crack it will be substantial.


Nuance's SDK, but unfortunately "automated" is out of the question with that...


I work with CaptionAccess, which provides the type of services covered in the article. One reason there isn't more captioning is that people associate captioning with disabled people and accessibility.

The article began with people looking into captioning as a means of meeting an accessibility need (a speaker losing a hearing aid) and ended with the discovery that there are many other benefits beyond accessibility. It was a great read and everyone won. Many planners never get to that point.

The lack of widespread real-time captioning isn't a technical or cost issue, it's an education issue.


It really helps with accessibility! Captioning helps elsewhere, too, and there are other methods - we've gotten crowdsourced transcripts of the Metafilter Podcast by using Fanscribed[1], and it's been appreciated.

[1] https://www.fanscribed.com


A little more commentary on the difference between realtime stenography and automated speech recognition: http://blog.stenoknight.com/2012/05/cart-problem-solving-ser...


But how would the people creating the transcripts deal with all the technical lingo at, say, programming conferences? The lexicon seems highly domain-specific and can get quite obscure at times... That being said, this seems like it would indeed be a GREAT addition to any conference.


Anyone know any London/UK based people who can do this?


Contact your local organisations for people with hearing impairment and ask them for details of "speech to text" people.

There is an oddity in the UK where a person with a hearing impairment can get speech-to-text paid for, but they are the client and they get to control what happens to the text. It would be useful if speech-to-text paid for by the state and used in public meetings were automatically given to the meeting as a whole, as well as to the person as an individual.


You should also have real-time audience feedback powered by MeetingPulse. MeetingPulse: Make your events interactive! http://meet.ps

(I just thought that if you make the plug outrageously shameless it will pass for a joke, while still being a plug.)

Jokes and plugs aside, it can produce a graph of the audience's per-moment sentiment, which can be overlaid on the transcript/video, so you know how they felt at each moment of the talk.



