Reaching new records in speech recognition (ibm.com)
123 points by igravious on March 11, 2017 | 37 comments



I'm responsible for the Watson Speech JS SDK, which is aimed at making the speech services easy to use in web apps.

Code is at https://github.com/watson-developer-cloud/speech-javascript-...

Simple demos at http://watson-speech.mybluemix.net/

More complex demo at https://speech-to-text-demo.mybluemix.net/

I'm going to be out for part of this evening, but feel free to ask me questions and I'll answer them as I'm available.


What's up with the lack of https and encryption in IBM's cloud? I'm sure your machine learning stuff is great and all, but the lack of proper encryption and security measures makes it a total no-go. How can you expect anybody to take your cloud services seriously when you do http in 2017? I'm asking as someone whose boss was talked into trying out the ML stuff on Bluemix by your sales people.


Can you give me an example?

http://stream.watsonplatform.net/ (the domain that the speech APIs use) redirects to https. Ditto for http://gateway.watsonplatform.net/ which is what most of the other APIs use.

Both of the linked demos also redirect to https if you try an http URL.


Yes, most of your links (just like many of IBM's cloud-related web services). If anything, you, IBM, and any potential customers should be seriously concerned that they begin with http:// in the first place.


How does pricing[1] for this work? I see $0.02 / min of Speech to Text. Is 1 minute a minimum transaction size, or can you send six transactions that together add up to 1 minute of recognition time?

1: https://www.ibm.com/watson/developercloud/speech-to-text.htm...


The total usage is rounded up to the nearest minute at the end of each month, and the bill is based on that. So in your six-transaction example you'd be billed for 1 minute, as long as all six were within the same month.
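To make that concrete, here's a quick sketch of the rounding as described above (just an illustration based on this thread's numbers, not an official billing formula):

```python
# Rough sketch of the billing described above (assumed: per-month rounding,
# $0.02/min list price); not an official billing formula.
import math

PRICE_PER_MIN = 0.02  # USD, from the pricing page

def monthly_bill(transaction_seconds):
    """Sum all transactions for the month, then round the total up to a whole minute."""
    total_seconds = sum(transaction_seconds)
    billable_minutes = math.ceil(total_seconds / 60)
    return billable_minutes * PRICE_PER_MIN

# Six 10-second transactions in one month -> 60 s -> billed as 1 minute -> $0.02
print(monthly_bill([10] * 6))
```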


Thanks! How much have you tested in environments with a lot of background noise? (in car, around kids or construction)

Is it also built for those situations, or mainly focused on accuracy in low ambient noise contexts?


I haven't done much high noise testing, but I am aware that it has better accuracy with low background noise.

I've been told that cell phones make a good mic in noisy environments FWIW.


How well would it work with technical or specific jargon? (Anatomy and medical terminology, in our specific use-case)

Is there a way to feed some kind of (text) dictionary to aid recognition? Or does it also need audio samples to learn from?


Yes! You can upload a corpus of text that the service will learn new words (and their context) from, and/or you can give it specific words and their pronunciations. No audio samples are needed; the customization works on top of the existing language models.

More details: https://www.ibm.com/watson/developercloud/doc/speech-to-text...
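For anyone curious what that flow looks like against the REST API, here's a rough sketch; the endpoint paths, field names, and base model name are my recollection of the docs linked above and may not be exact, so double-check them before using:

```python
# Hedged sketch of the customization flow described above, using the REST API
# directly. Endpoint paths, field names, and the base model name are assumptions
# from memory of the docs linked above -- verify against the current docs.
import requests

BASE = "https://stream.watsonplatform.net/speech-to-text/api/v1"
AUTH = ("username", "password")  # placeholder service credentials

# 1. Create a custom language model on top of an existing base model.
resp = requests.post(
    f"{BASE}/customizations",
    auth=AUTH,
    json={"name": "medical-terms", "base_model_name": "en-US_BroadbandModel"},
)
custom_id = resp.json()["customization_id"]

# 2. Upload a plain-text corpus; the service extracts out-of-vocabulary words
#    (and their context) from it -- no audio needed.
with open("anatomy_notes.txt", "rb") as corpus:
    requests.post(
        f"{BASE}/customizations/{custom_id}/corpora/anatomy",
        auth=AUTH,
        data=corpus,
    )

# 3. Kick off training once the corpus has been processed.
requests.post(f"{BASE}/customizations/{custom_id}/train", auth=AUTH)
```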


If you use Kaldi you can mix in any kind of domain-specific text; it usually improves accuracy significantly, particularly for technical domains. You don't need audio for that.


How accurate is the speaker diarization? How does it compare to LIUM?


Speaker Labels/diarization is still a beta feature, so it's a bit limited right now, but it's pretty accurate for two-person conversations of decent length. (The beta was primarily trained for phone conversations.)

I've never tried LIUM, so I can't comment there.


We have been a pretty large user of this feature within Watson for the last 6 months. While it is pretty good, it lacks the ability to take external inputs such as stereo recordings with channel markers. I've been working on migrating our solution to VoiceBase, which in my opinion has a much more robust solution than IBM with respect to speaker diarization specifically, because it includes a feature for channel markers. The result is a conversational transcript that is much easier to read. Prior to this we used the LIUM project to attempt diarization on single-channel recordings, with mixed results. Without a doubt, speech to text has rapidly improved in the last 12 months.


Why migrate to another service in 2017 when open-source toolkits like Kaldi give you better results, more features, and no vendor lock-in?


Cool - well, we are hiring, so if you'd like to do this, reach out. We have lots of neat projects like this going on all the time.


As someone unfamiliar with the terminology: are the speakers isolated on separate tracks, or is there a mix on each channel, with the system distinguishing speakers by differences in relative volume? The latter seems tremendously valuable, if difficult to accomplish.


For the evaluation in the paper, speakers are on separate channels (mono, it's a telephone conversation, after all). Generally there are solutions for separating speakers on a single channel that can work fairly well (assuming your training data is similar to the target domain) if you know the number of speakers beforehand, but it's tremendously hard if you don't (think transcription of large meetings).


In fairness, sorting out speakers in a conference call is hard for a human.


I wish they had clarified whether the claim that humans have a 5.1% error rate is based on "listen to this sentence once and transcribe it" or "study this recording however you like and transcribe it."

edit: They talk about this in the arxiv paper:

>The transcription protocol that was agreed upon was to have three independent transcribers provide transcripts which were quality checked by a fourth senior transcriber. All four transcribers are native US English speakers and were selected based on the quality of their work on past transcription projects.

>...The transcription time was estimated at 12-14 times realtime (xRT) for the first pass for Transcribers 1-3 and an additional 1.7-2xRT for the second quality checking pass (by Transcriber 4). Both passes involved listening to the audio multiple times: around 3-4 times for the first pass and 1-2 times for the second.


For anyone wondering what the recordings in the HUB5 2000 eval data (the test data) sound like: https://catalog.ldc.upenn.edu/desc/addenda/LDC2002S09.wav


God. Transcribing that would be mind-numbing. Glad computers are getting better at this.


I spent a hack weekend recently using this API as part of a project to help with collaborating on usability testing reviews.

You upload a video; it strips out the audio, pushes it to Watson for transcription, converts the result to a caption/subtitle track, and then lets people comment on the discussion (like Google Docs).

I plan to polish it up a little and open source it soon.
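For anyone wanting to try something similar, a minimal sketch of the audio-extraction-plus-transcription step might look like the following (assumes ffmpeg is installed; the endpoint path and parameters are simplified and should be checked against the current docs):

```python
# Rough sketch of the pipeline described above: strip the audio with ffmpeg,
# then send it to the Watson Speech to Text REST endpoint. The endpoint and
# credentials handling here are assumptions and may not match the current API.
import subprocess
import requests

def transcribe_video(video_path, username, password):
    audio_path = "audio.ogg"
    # Extract mono 16 kHz Ogg/Opus audio from the uploaded video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000",
         "-c:a", "libopus", audio_path],
        check=True,
    )
    with open(audio_path, "rb") as audio:
        resp = requests.post(
            "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize",
            auth=(username, password),
            headers={"Content-Type": "audio/ogg;codecs=opus"},
            params={"timestamps": "true"},
            data=audio,
        )
    resp.raise_for_status()
    # Join the best alternative of each result chunk into one transcript.
    results = resp.json()["results"]
    return " ".join(r["alternatives"][0]["transcript"].strip() for r in results)
```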


Yes, please publish this!

Additionally, if you're using the WebVTT format for subtitles, I'd be interested in merging that code into the appropriate SDK. They're all on GitHub if you'd like to send us a PR: https://github.com/watson-developer-cloud/ (I'm on the team that is responsible for these.)
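For reference, a rough sketch of that conversion could look like the following; the JSON field names (results, alternatives, timestamps) are assumptions based on the Speech to Text output format with timestamps enabled, so verify them against the actual SDK results:

```python
# Hedged sketch: convert a Speech to Text result (with timestamps=true) into
# WebVTT. Assumes the JSON shape results[*].alternatives[0] with "transcript"
# and "timestamps" entries of the form [word, start_sec, end_sec].

def _vtt_time(seconds):
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def to_webvtt(stt_result):
    cues = ["WEBVTT", ""]
    for result in stt_result["results"]:
        alt = result["alternatives"][0]
        words = alt.get("timestamps", [])
        if not words:
            continue
        start = words[0][1]   # start time of the first word in the chunk
        end = words[-1][2]    # end time of the last word
        cues.append(f"{_vtt_time(start)} --> {_vtt_time(end)}")
        cues.append(alt["transcript"].strip())
        cues.append("")
    return "\n".join(cues)
```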


Please do!


It would be great to see one of these frameworks offered as an offline solution. Right now the only options are WSR/Sphinx4, which have significant accuracy issues, and Nuance's products, which are extremely unfriendly to developers.


Mozilla's DeepSpeech has a native client https://github.com/mozilla/DeepSpeech


Kaldi is state of the art. A bit harder to use admittedly.


It's great to see the improvements in this area. The voice recognition (if not the natural language processing) in systems like Amazon's Echo are pretty decent at this point for basic commands.

That said, I've tried the computer speech-to-text systems for transcribing interviews, and even with just one person talking they're nowhere near good enough for me to use. Even budget human transcription (e.g. CastingWords) is so much better that it's not worth my time to use a machine-based system.


Tl;dr: they stapled a 6-layer bidirectional LSTM to WaveNet. Good to see IBM admit that deep learning is better than "Watson".

Also, with WaveNet in the mix, there's no way this is used in production.


Watson is a made up marketing concept for IBM's collection of fragmentary services, almost all of which run on deep learning of some sort.


Also, these are cost-prohibitive to use widely in apps or without limitations. Currently, Google Speech to Text is relatively cheaper at $1 for 166 messages. Source: https://cloud.google.com/speech/pricing


Cost prohibitive? IBM seems to be ~16 hours free, then 2 cents/minute. Google is 1 hour free, then 2.4 cents/minute.

I can't see anything that says how IBM's minutes are counted, though - whether every message is rounded up to 1 minute or not. Edit: based on a comment from someone at IBM in this thread, individual requests aren't rounded up to a minute; only the total time at the end of the month is rounded. I can't see how they'd be more expensive than Google.
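A back-of-the-envelope comparison using the rates quoted in this thread (free tiers and per-minute prices as described here; they may have changed):

```python
# Compare monthly cost at a few usage levels, using the figures above:
# IBM ~16 hours free then $0.02/min, Google 1 hour free then $0.024/min.
def monthly_cost(minutes, free_minutes, price_per_minute):
    return max(0, minutes - free_minutes) * price_per_minute

for minutes in (100, 1000, 10000):
    ibm = monthly_cost(minutes, free_minutes=16 * 60, price_per_minute=0.02)
    google = monthly_cost(minutes, free_minutes=60, price_per_minute=0.024)
    print(f"{minutes:>6} min/month  IBM ${ibm:8.2f}   Google ${google:8.2f}")
```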


Interesting. I worked for a company that did retail speech recognition in the late 90s; I should look up how much that cost in comparison to see how the economics are shaking out.


We'd be interested in that cost comparison!


Found something:

* $29.95 to register

* $9.95 per month subscription which entitles you to $14.00 worth of free transcription per month

* $3.50 per page (double-spaced, 225 words) for any pages in excess of your $14 allocation

The $14 allocation covers four 225-word pages (900 words), so the $9.95 subscription works out to roughly $0.011 per word for the first 900 words, then about $0.016 per word ($3.50 / 225) beyond that. Ish.
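A quick sanity check of that math, using the rates listed in that pricing sheet:

```python
# Per-word rates implied by the late-90s pricing quoted above.
words_included = 4 * 225            # $14 allocation = four 225-word pages
base_rate = 9.95 / words_included   # ~$0.011 per word within the allocation
overage_rate = 3.50 / 225           # ~$0.016 per word beyond it
print(f"included: ${base_rate:.3f}/word, overage: ${overage_rate:.3f}/word")
```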


Ahh, interesting. I like the subscription model from the business's perspective. Seems like it would deter low-volume users from signing up, though, right?



