What's up with the lack of https and encryption in IBM's cloud? I'm sure your machine learning stuff is great and all, but the lack of proper encryption and security measures makes it a total no-go. How can you expect anybody to take your cloud services seriously when you do http in 2017? I'm asking as someone whose boss was talked into trying out ML stuff on Bluemix by your sales people.
Yes, most of your links do (just like many of IBM's cloud-related web services). If anything, you, IBM, and any potential customers should be seriously concerned that they begin with http:// in the first place.
How does pricing[1] for this work? I see $0.02/min for Speech to Text. Is 1 minute the minimum transaction size, or can you send over 6 transactions which in total add up to 1 minute of recognition time?
The final total is rounded up to the nearest minute at the end of each month, and the bill is based on that. So, in your 6-transaction example, you'd be billed for 1 minute as long as all 6 were within the same month.
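In other words, the rounding happens once per month on the total, not per transaction. A minimal sketch of that billing rule (the rate is from the pricing page upthread; the function itself is just my illustration):

```python
import math

PRICE_PER_MINUTE = 0.02  # $0.02/min, per the pricing page

def monthly_bill(transaction_seconds):
    """Sum all transactions for the month, then round the total
    up to the nearest minute before applying the per-minute rate."""
    total_seconds = sum(transaction_seconds)
    billed_minutes = math.ceil(total_seconds / 60)
    return billed_minutes * PRICE_PER_MINUTE

# Six 10-second transactions: 60 s total -> 1 billed minute -> $0.02
print(monthly_bill([10] * 6))
```

Note that a 61-second total would round up to 2 billed minutes, since the rounding is to the nearest whole minute.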
Yes! You can upload a corpus of text that the service will learn new words (and their context) from, and/or you can tell it specific words and their pronunciation. No audio samples are needed; the customization works on the existing language models.
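To give a rough idea of what that looks like, here's a sketch of the JSON payloads involved in creating a custom model and adding a word with a pronunciation. The field names (`base_model_name`, `sounds_like`) and model name are my assumptions from memory, so check them against the current API reference before using:

```python
# Sketch of the language-model customization payloads.
# Field names below are assumptions; verify against the API docs.

def create_model_payload(name, base_model="en-US_BroadbandModel"):
    # Body for creating a custom model on top of an existing base model
    return {"name": name, "base_model_name": base_model}

def add_word_payload(word, pronunciation):
    # Custom words can carry a "sounds-like" pronunciation; no audio needed
    return {"word": word, "sounds_like": [pronunciation]}

model = create_model_payload("call-center-model")
word = add_word_payload("Bluemix", "blue mix")
print(model, word)
```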
If you use Kaldi you can mix in any kind of domain-specific text; it usually improves accuracy significantly, particularly for technical domains. You do not need audio for that.
Speaker Labels/diarization is still a beta feature, so it's a bit limited right now, but it's pretty accurate for two-person conversations of decent length. (The beta was primarily trained for phone conversations.)
We have been a pretty large user of this feature within Watson for the last 6 months. While it is pretty good, it lacks the ability to take external inputs such as stereo recordings with channel markers. I've been working on migrating our solution to VoiceBase, which in my opinion has a much more robust offering than IBM with respect to speaker diarization, specifically because it includes a feature for channel markers. The result is a conversational transcription that is much easier to read. Prior to this we used the LIUM project to attempt diarization on a single-channel recording, with mixed results. Without a doubt, speech to text has rapidly improved in the last 12 months.
As someone unfamiliar with the terminology, are the speakers isolated in single tracks, or is there a mix on each channel and due to the differences in relative volume, the system is able to distinguish speakers? The latter seems tremendously valuable if difficult to accomplish.
For the evaluation in the paper, speakers are on separate channels (mono, it's a telephone conversation, after all). Generally there are solutions for separating speakers on a single channel that can work fairly well (assuming your training data is similar to the target domain) if you know the number of speakers beforehand, but it's tremendously hard if you don't (think transcription of large meetings).
I wish they clarified whether this claim that humans have a 5.1% error rate is in "listen to this sentence once and transcribe it" or "study this recording however you like and transcribe it."
edit: They talk about this in the arxiv paper:
> The transcription protocol that was agreed upon was to have three independent transcribers provide transcripts which were quality checked by a fourth senior transcriber. All four transcribers are native US English speakers and were selected based on the quality of their work on past transcription projects.
>...The transcription time was estimated at 12-14 times realtime (xRT) for the first pass for Transcribers 1-3 and an additional 1.7-2xRT for the second quality checking pass (by Transcriber 4). Both passes involved listening to the audio multiple times: around 3-4 times for the first pass and 1-2 times for the second.
I spent a hack weekend recently using this API as part of a project to help with collaborating on usability testing reviews.
Upload a video, it strips out the audio, pushes to Watson for transcription, converts the result to a caption/subtitle track, and then allows people to comment on the discussion (like Google Docs).
I plan to polish it up a little and open source it soon.
Additionally, if you're using WebVTT format subtitles, I'd be interested in merging that code into the appropriate SDK. They're all on GitHub if you'd like to send us a PR: https://github.com/watson-developer-cloud/ (I'm on the team that is responsible for these.)
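For anyone curious what that transcription-to-WebVTT step involves, here's a minimal sketch. The input shape (a transcript plus per-word `[word, start, end]` timestamps per result) is an assumption about what the recognition JSON looks like, so adjust the field names to your actual output:

```python
def to_vtt_time(seconds):
    """Format seconds as a WebVTT timestamp: HH:MM:SS.mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def results_to_webvtt(results):
    """Build a WebVTT document: one cue per recognition result.
    Assumes each result has a transcript and [word, start, end]
    timestamps, roughly the shape returned with timestamps enabled."""
    cues = ["WEBVTT", ""]
    for r in results:
        alt = r["alternatives"][0]
        ts = alt["timestamps"]
        start, end = ts[0][1], ts[-1][2]
        cues.append(f"{to_vtt_time(start)} --> {to_vtt_time(end)}")
        cues.append(alt["transcript"].strip())
        cues.append("")
    return "\n".join(cues)

sample = [{"alternatives": [{"transcript": "hello world ",
                             "timestamps": [["hello", 0.0, 0.4],
                                            ["world", 0.5, 1.1]]}]}]
print(results_to_webvtt(sample))
```

One cue per result is a simplification; a real subtitle track would also split long results into shorter cues so lines stay readable.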
It would be great to see one of these frameworks offered as an offline solution. Right now the only options are WSR/Sphinx4, which have significant accuracy issues, and Nuance's products, which are extremely unfriendly to developers.
It's great to see the improvements in this area. The voice recognition (if not the natural language processing) in systems like Amazon's Echo is pretty decent at this point for basic commands.
That said, I've tried the computer speech-to-text systems for transcribing interviews, and even with just one person talking they're nowhere near good enough for me to use. Even budget human transcription (e.g. CastingWords) is just so much better that it's not worth my time to use a machine-based system.
Also, these are cost-prohibitive to use widely in apps or without limitations. Currently, Google Speech to Text is relatively cheap at $1 for 166 messages. Source: https://cloud.google.com/speech/pricing
Cost prohibitive? IBM seems to be ~16 hours free then 2 cents/minute. Google is 1 hour free then 2.4 cents/minute.
I can't see anything that says how IBM's minutes are counted, though, i.e. whether every message is rounded up to 1 minute or not. Edit: based on a comment from someone at IBM in this thread, individual messages are not rounded up to a minute; only the total time is rounded at the end of the month. I can't see how they'd be more expensive than Google.
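Back-of-the-envelope, using the figures from this thread (IBM ~16 free hours then $0.02/min, Google 1 free hour then $0.024/min) and ignoring rounding granularity, which is my simplifying assumption:

```python
def cost_usd(minutes, free_minutes, rate_per_min):
    """Cost after subtracting the monthly free tier."""
    return max(0, minutes - free_minutes) * rate_per_min

# Example workload: 2,000 minutes of audio in a month.
ibm = cost_usd(2000, 16 * 60, 0.02)    # ~$20.80
google = cost_usd(2000, 60, 0.024)     # ~$46.56
print(ibm, google)
```

At that volume, IBM comes out at less than half of Google's price with these numbers; the gap narrows for very small workloads that stay inside the free tiers.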
Interesting. I worked for a company that did retail speech recognition in the late 90s, I should look up how much that cost in comparison to see how the economics are shaking out.
Ahh, interesting. I like the subscription model from the perspective of the business. Seems like it would deter low-volume users from signing up, though, right?
Code is at https://github.com/watson-developer-cloud/speech-javascript-...
Simple demos at http://watson-speech.mybluemix.net/
More complex demo at https://speech-to-text-demo.mybluemix.net/
I'm going to be out some this evening, but feel free to ask me questions and I'll answer them as I'm available.