KTH and Wikipedia develop first crowdsourced speech engine (kth.se)
154 points by conductor on March 15, 2016 | 23 comments



Interesting stuff. As a side note, I've always thought that if speech synthesis got good enough we could make much more immersive gaming worlds. Playing games such as Fallout, The Witcher, or GTA, you keep hearing the same 10 sentences no matter where you go, and it really breaks the immersion once it becomes obvious how wooden the NPCs are.

So if the game instead contained thousands of written lines, a speech synthesizer could choose one at random and play it, and by sheer probability you'd be unlikely to hear the same thing twice.

With all these recent advances in neural nets, could speech synthesis be made to sound more human by giving it a huge set of training examples? You could have 10-20 different voice actors, each training a different NN by reading through a dictionary out loud, and at the end you could synthesize any sentence with one of the NNs and it would sound like the voice actor actually said it.


By the time we have generative neural networks capable of replicating human voice with emotion and nuance (way more difficult than neutrally reading a text on Wikipedia), I think it's fair to assume we'll also have decent "thought vector" networks that – much like how neural networks can turn words or even sentences into vectors and back (translation) – can turn the meaning a character wants to convey into multiple sentences arranged in unique ways. Basically taking your example a bit further.


> replicating human voice with emotion and nuance (way more difficult than neutrally reading a text on Wikipedia)

How much of that difficulty could be skipped with a bit of basic markup, to indicate things like emotion (<sad>, <happy>), tone (<sarcastic>, <forceful>) etc? That is, how much of the difficulty is inferring the emotion/tone/etc from the text, as opposed to expressing it in the generated speech?
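To make the idea concrete, here is a minimal sketch of how such markup could drive a synthesizer. The tag names are the hypothetical ones from this comment, and synthesize() is an assumed stand-in for a real engine's call (real-world SSML offers <prosody> and <emphasis> rather than emotion tags):

    # The <sad>/<forceful> tags and the synthesize() callback are assumptions
    # for illustration, not any real engine's API.
    import re

    def speak(markup, synthesize):
        """Split a marked-up line into (emotion, text) chunks and voice each one."""
        for tag, text in re.findall(r"<(\w+)>(.*?)</\1>", markup, re.S):
            synthesize(text.strip(), emotion=tag)

    # Stub "engine" that just prints what a real TTS engine would receive.
    speak("<sad>I lost everything.</sad> <forceful>Leave me alone!</forceful>",
          lambda text, emotion: print(f"[{emotion}] {text}"))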


An algorithm understanding enough about the text to infer the correct emotional inflection to give a speech may edge into the category of strong AI, but I would have guessed that it would be easier to create a neural network that, given a text spoken in one voice, could transform it into another with the correct stress, intonation, etc. Perhaps even that's a more difficult task than I assumed, although speech generation seems to receive a lot less academic and industrial attention than speech recognition and understanding.


Yeah, it's a bit more difficult of a task than you've assumed.

Speech synthesis receives a lot of attention, but it's hard, so you rarely hear any news about it. People are throwing DNNs at it at the moment, but nothing earth shattering has come of it (yet). I have a couple of 'naturalness' filters that use DNNs and about 30% of the time, they drop all of their tones and I end up with an angry whisper as output. I don't work late too often.


For people interested in how hard it is, I recently read this [1] NYT article comparing the synthetic speech options that IBM tested for Watson in the Jeopardy competition.

[1] http://www.nytimes.com/2016/02/15/technology/creating-a-comp...


I'm not sure those two are related. A 'thought vector' seems far more advanced than emulating emotion in voice, no? Though you raise a good point about the text itself not carrying any emotional weight. A solution could be to let writers tag words as 'emotional' or 'angry' in a visual editor, which the NN would then use as hints when synthesizing.


Many games approximate a reasonable thought vector already. For example, Mount & Blade: Warband is a fantastic but relatively dated game (2010) which creates a world of hundreds of characters, each of which dynamically alters its relationship with the player along a single axis based on complex events. When pressed for information on subjects such as who within an alliance should be allocated newly conquered lands, they will make a decision and explain themselves by virtue of their relationship with others and the player. Whilst simplistic, it's actually extremely immersive already... and that was 2010. A follow-up, Mount & Blade: Bannerlord, is due for release this year... looking forward to wasting hundreds of hours more examining the improvements!


For this application you'd probably want to have the input sentences in something more than just plain English. E.g. using the phonetic alphabet.

But are you sure voice actors are the rate limiting step here? If the limiting step is instead "people who come up with the sentences", which it might actually be, speech synthesis solves nothing.


Well, you can write down sentences whenever they pop into your mind, but you have to schedule a voice actor for a particular day and timeslot in a recording studio. Also, the modding community would be free to add its own content easily if it were as simple as editing a text file.


Regarding the last point: I was assuming your idea was to precompute a large corpus of sentences using a (probably proprietary) NN running on some hefty hardware. I'm not sure how it would help the modding community.


I meant the NN speech synthesizer would ship with the game and translate lines from a text file at runtime (a parallel process could keep a queue of generated, ready-to-say phrases going in the background).
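Roughly, a sketch of that runtime loop, assuming a synthesize() function provided by whatever NN engine ships with the game (the function name and queue depth are invented for illustration):

    # Sketch of the runtime idea: a worker thread keeps a small queue of
    # ready-to-play audio ahead of the game. synthesize() is an assumed
    # stand-in for the NN engine that would ship with the game.
    import queue, random, threading

    def _worker(lines, ready, synthesize):
        while True:
            ready.put(synthesize(random.choice(lines)))  # blocks when the queue is full

    def start_speech_queue(path, synthesize, depth=8):
        lines = [l.strip() for l in open(path, encoding="utf-8") if l.strip()]
        ready = queue.Queue(maxsize=depth)
        threading.Thread(target=_worker, args=(lines, ready, synthesize),
                         daemon=True).start()
        return ready  # game code calls ready.get() when an NPC needs a line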


Does anyone know of any interesting startups in speech synthesis?


I've started looking at using the pretty impressive speech synthesis in Chrome on Android. Hopefully this work will feed into other platforms and browsers reaching the same state of usability, as well as directly benefiting Wikipedia.

I'm really interested in the higher level: how to manage navigation, how to represent things like tabular and graphical info, and how to get articles written with the spoken alternative in mind.

I guess the ultimate for me on Wikipedia is a more multimedia presentation format, where articles are text for people who want that, and more like the Hitchhiker's Guide to the Galaxy where that works better.


I'd like to imagine a future where Morgan Freeman is reading me the recipe off of Allrecipes.


I think this wouldn't actually be terribly difficult if a company hired Morgan Freeman for a few hours of studio time, similar to how you can get Mr. T directions on TomTom. There aren't terribly many phrases or ingredients used in most recipes, so the total set of required recordings is manageably small.
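As a sense of scale, a toy sketch of that prerecorded-phrase approach: each recipe step is assembled from a handful of clips. All file names here are invented for illustration.

    # Toy sketch of assembling a spoken recipe step from a small clip library.
    STEP_CLIPS = {"add": "add.wav", "stir": "stir.wav", "bake": "bake.wav"}
    UNIT_CLIPS = {"cup": "cup.wav", "tbsp": "tablespoon.wav"}

    def clips_for_step(verb, amount, unit, ingredient):
        """Return the ordered list of clips to play for one recipe step."""
        return [STEP_CLIPS[verb],
                f"numbers/{amount}.wav",          # one prerecorded clip per number
                UNIT_CLIPS[unit],
                f"ingredients/{ingredient}.wav"]

    print(clips_for_step("add", 2, "cup", "flour"))
    # ['add.wav', 'numbers/2.wav', 'cup.wav', 'ingredients/flour.wav']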


In The Congress, Robin Wright sells a digital version of herself to a studio.

https://en.wikipedia.org/wiki/The_Congress_%282013_film%29


BTW, some of the recent voices for the Festival speech synthesis engine are awesome.

Try "Nick - 2 (English RP)" and "Peter (English RP male)" at http://www.cstr.ed.ac.uk/projects/festival/morevoices.html

They are not publicly available, though.


It would be pretty amazing if they pull this off. It could mean education is revolutionised. A very time-consuming part of making educational videos is recording the dialogue and, more importantly, correcting errors. This is no small thing.


Speech out, not in.


Anyone know of an open TTS (speech synthesis) engine that provides duration timings for each spoken word? Thanks.


Any of them have this information available in some form (OpenMARY, Festival). For example, in Festival you can access the synthesized utterance markup with the utt.* functions, like this: (utt.save.segs (utt.synth (Utterance Text "Hello world")) "out.seg")
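If it helps, a hedged example of driving that from Python: write the Scheme above to a file, run Festival in batch mode, and read back the segment labels. The exact label format can vary between Festival versions, and the file names here are arbitrary.

    # Runs the Scheme snippet from the comment above in Festival's batch mode
    # and prints the resulting segment labels with their end times.
    import pathlib, subprocess

    scm = '(utt.save.segs (utt.synth (Utterance Text "Hello world")) "out.seg")'
    pathlib.Path("dump_segs.scm").write_text(scm)
    subprocess.run(["festival", "-b", "dump_segs.scm"], check=True)

    print(pathlib.Path("out.seg").read_text())  # label format varies by version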


I love <3 my university.



