WaveNet launches in the Google Assistant (deepmind.com)
463 points by stablemap on Oct 4, 2017 | 117 comments



I'm most interested in its potential for audiobooks, especially for the vast array of old and less-common books that don't have human-made audiobook equivalents. I find myself constrained by the limits of audiobook choices, which tend toward best-seller lists or pop-sci. Current attempts to use text-to-speech to generate audiobooks result in something that's frankly unlistenable. If TTS could get to a "good enough" point for audiobooks, that would open up a huge range of less-common content.


I've used "Voice Dream" for this for years. It's brilliant. iOS/Android text to speech app:

http://www.voicedream.com/reader/

It has so many voices available in the voice store, and I've had no trouble reading books with it. I'm constantly amazed by the quality of some of these voices when played at high speed... they take breaths, express emotion, everything. My only problem is that they occasionally mispronounce brands, names & acronyms: quite amusing to hear "Sharr-E-poynt" instead of "SharePoint" or "Amd" instead of "A.M.D", or an incorrect pronunciation of a character's name in a book... I can't really see how WaveNet would fix that, but I look forward to seeing these voices turn up in Voice Dream.


Tried the 3 demos they linked to; the new Google one is so much better. Let's hope they follow suit and improve.


Shameless Plug: https://auditus.cc

Uses Amazon Polly for the speech gen


Neat! I'll have to look into this.


One of my co-workers told me in 2012 that he was doing exactly this. He used an ebook reader to download free ebooks from Gutenberg and then the IVONA text to speech engine for Android to listen to them on his drives. He had already finished a few classics like Treasure Island this way.

I'm sure things have improved significantly since 2012 so what you're looking for is probably easily done.


If your co-worker could share how he did this, it would be appreciated (specifically, what apps/code needs to be run).


Not the coworker, but I've been doing this for a couple years. If you're on iOS, the simplest option is Voice Dream Reader [1].

It can read .epub files directly, along with text files and webpages, and integrates with Dropbox, Pocket, Gutenberg, etc.

On Android, you can get Voice Dream Reader (though it has fewer features than on iOS) or @Voice Aloud Reader, which is free [2].

[1] http://www.voicedream.com/reader/

[2] http://www.hyperionics.com/atVoice/


It also works reasonably well for academic papers as PDFs. Equations get butchered, of course, but overall it's not bad (you can set PDF margins so that headers/footers are not read aloud on every page). I use it to proof-read my own writing; it helps you spot things that spelling/grammar checkers miss.


I have been doing this on Android for a few years using Ivona and FBReader with the FBReader TTS+ plugin. I got Ivona (beta) from the Play Store before they were bought by Amazon; it looks like the Ivona voices have since been pulled and are no longer available.


For Android take a look at this: https://www.youtube.com/watch?v=MAHg31ucDhY


Google's "cloud-based" TTS (not the local one, which was terrible) on the Play Books app was pretty good even years ago.

However, it was unusable because it would stop as soon as the screen would turn-off. Other ebook readers' read-aloud features worked with the screen off. So thanks for nothing Google!


It used to stop; it's been well-behaved for at least a year. I use it daily!


That's not my experience, as I can turn off the screen and still read the book. I tested it just now.


Android is different for each person


For old books, I've liked https://www.librivox.org/, which has free recordings of public domain books read by volunteers.


This was kind of the target of the 2016 Blizzard Challenge (http://festvox.org/blizzard/blizzard2016.html), as the training data was children's books (clean audio, 'reading voice' prosody, etc).

Some of the voices that came out of that were incredible.


Are any of those voices published? Would be interesting to try rendering audio books with them.


I'm not sure - the exact voices themselves generally aren't published, but the papers are detailed enough that you could recreate the system if you wanted.


You should check out https://getspeechify.com/ - I use it all the time for exactly this. It also has a great team behind it


Wow, hadn’t thought about that possibility. Too bad for professional narrators of Audible (and other) audio books. I am a happy customer of Audible, and have some favorite narrators. Sad to think that they will lose work.

I am the tech manager for a machine learning team and potentially eliminating jobs is the bad side of my field. That said, I think AI, in some sense of the term, will also help us be more efficient. I imagine everyone having effective assistants that help us in our work, help with communications and scheduling, etc.


Yeah, I was thinking about exactly this. I'm actually going to try to reach out to the team to see if they're going to be looking into this. Google Books on the rise! Hopefully!

Also music generation - the original post (September of last year) had some really interesting music created by training the system on Chopin or something.


I don't have time at the moment to search it out, but in a previous speech-synth thread there's an in-depth discussion of this. In short, people were pessimistic on WaveNet for audiobooks because audiobooks are all about knowing the story and speaking accordingly.


But it will only get better and if humans can learn it...


I didn't say I was pessimistic on it :)


The ancient Kindle Keyboard can do text-to-speech for most Amazon ebooks with zero hassle (I think authors can disable it). 'Unlistenable' is in the ear of the beholder! My only gripe, personally, is having a single "narrator" for all content.


I agree "unlistenable" is a personal judgement, but there are very rare cases when buying an audiobook, even with a terrible reader, is not a better value/cost ratio than the original Kindle TTS, it was convenient, but not great.


Love this for company documents as well on my commute.


I am wondering what their baseline is. They call it "Current Best Non-WaveNet". Quite frankly, Apple's most recent deep-learning-based speech synthesis sounds superior, but there aren't enough samples for a proper comparison: https://machinelearning.apple.com/2017/08/06/siri-voices.htm...


It could just be a matter of opinion, but I prefer both Google's unit-selection synthesis and their WaveNet synthesis. The prosody in Apple's latest method is still annoying: nowhere near as good as the Google models of 2015 and 2016, and not remotely comparable to the WaveNet models.

Apple's change in voice talent is an improvement though, and they may have more units than before, which is helpful. I believe their model also works offline, which is a huge plus (though I think Google's prior model works offline as well).


> prosody

I learned a useful new word, thank you!


I think the voice for the samples in your link still has the problems they talk about in that article.

There are noticeable blips in the speech that sound unnatural, particularly when certain sound combinations are used.

The very first sample with "Bruce Frederick" is clearly off. The intonation and timing between the end of "Bruce" and the beginning of "Frederick" is... mechanical.

There's a similar problem in the OP's link with the non-WaveNet English Voice 1 when it says "WaveNet".

Those issues are much less apparent in the WaveNet voices: timing problems are less noticeable, intonation problems are less noticeable.

Frankly, the voices there sound VERY good compared to anything I've heard.

That said, I completely agree that there aren't enough samples there to make any real judgement.


I think I read "commercial" somewhere in there. So it'd be "the best you can buy", though not necessarily "the best other competing companies use" (i.e. Apple).

Still, they picked one that makes theirs look vastly superior.


Wow, I hear a huge improvement in the Japanese model: the difference between a robot in person and a young woman on the phone.


It's a huge improvement, but still nowhere near the state of the art for Japanese TTS which has always been ahead of English (since the phonetics are much simpler).

Listen to the samples here for example: http://voicetext.jp/samplevoice/


Mm, I see what you mean. It seems like voicetext.jp has more "correct" prosody, and it seems to reproduce vocal fry correctly; I'm noticing that's missing from the WaveNet sample now. There's still some work to be done on filtering with voicetext, though: I hear pretty glaring artifacts between the units, whereas WaveNet doesn't produce any such artifacts.


The Japanese wasn't so good.

I like the sound of Risa at http://voicetext.jp better. Try it with the Japanese sentence Google used in their example. If someone knows of a better one, it would be great to find out.

一つのウエーブネットのみで、多数の事なら、話者の音声波形極めて正確モデルができます。 (Roughly: "With just a single WaveNet, the speech waveforms of many speakers can be modeled extremely accurately.")


This was what I noticed too. I don't really have an ear for Japanese, but it sounded as natural as when I've heard it spoken by others. Huge, huge improvement.


This is interesting - I don't speak much Japanese but that's the one where the difference in quality was most apparent IMHO.


I've always wondered why companies don't just take all the closed-captioned TV streams and use them as training data for their voice models. It seems like that would create a much more natural-sounding voice model (at least as far as what humans are accustomed to).


I think a better data source would be professionally produced audiobooks. Better, clearer voices, nearly perfect "transcription", enormous supply already recorded on a wide range of topics potentially available from one or two sources, etc.


Even if there is no background noise present, the quality is nowhere near that of professional studio recordings, and that would be very noticeable in the output.

Also, for traditional systems you need a lot of data from one speaker only, they can't take advantage of other speakers' recordings (although WaveNet does that now).

And "TV natural" might not be the style of natural you want from a TTS system.


What happens if the CC isn't entirely in sync with the video or audio?


Strictly speaking, the text is never 'entirely in sync' because spoken words inherently blur together and are seamless; individual letters in the text do not start and end at precise intervals. This is one of the things that makes speech recognition so hard: letters, syllables, and words do not really exist as discrete things on the raw audio level. So this problem exists for any speech transcription dataset. To provide a loss function, then, you would use something like CTC: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2... Fortunately, NNs are good at handling noisy data, and in practice they work very well for speech recognition/transcription.
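A toy sketch of how this looks with PyTorch's built-in CTC loss (all shapes and values below are made up); note that you only pass sequence lengths, never per-frame timings:

    import torch
    import torch.nn as nn

    # CTC marginalizes over every possible alignment of the short target
    # sequence to the long frame sequence, so no timing labels are needed.
    T, N, C = 50, 1, 20   # audio frames, batch size, classes (0 = blank)
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
    targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # 10 target labels
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()   # gradients flow despite the unknown alignment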


Recordings are force-aligned to the transcriptions anyway (using essentially a speech recognition system) to obtain phone-level alignments. You don't need explicit timing information beforehand.


I still don't understand the difference between Google Assistant and Google Now (or whatever), other than when I accidentally launch Assistant instead of Now, my commands are never understood.


I don't understand the difference either other than Assistant seems to be replacing Now. The issue, I'm finding, is that Google Assistant requires that you turn on the recording of usage data and history to your account, such as Web & App activity, Location History, Voice & Audio activity. All recorded and tied to your account. I don't believe this was the case with Google Now.

Google, to their credit, normally allows you to erase all this data and turn recording off, but with Google Assistant, they require it all to be on and recording. I avoid using it because of this restriction.


IIRC one could use Google Now with voice recordings off and the extra-spying setting of "web and app activity" off (though web and app activity had to be on for many features). All this must now be on for the assistant, as you say.

I did try out Google Now for a time due to having an Android Wear watch and found that as time went on more functionality required the turning on of these settings. One particular oddity for a time was that "OK Google, Navigate Home" would produce only a complaint about web and app activity being off, but "OK Google, navigate to $HOME_STREET, $HOME_TOWN" was fine.

I've abandoned Android Wear because of this.


Consider the information that Google Now was providing you before. It was almost certainly using this data.


Ui


To be honest, it still obviously sounds machine-generated. I guess it's a slight improvement, but the examples shown don't include any challenging words or phrases. We've been able to generate adequate-sounding speech for simple phrases for quite some time. I bet this was really fun to work on, though.


If you ever listen to people try to record sound that's clear and precise, they actually sound fairly robotic. See this Google 20% Project where they explore Google Assistant's voice creation: https://youtu.be/qnGNfz7JiZ8?t=5m23s

WaveNet is probably modeling the source data very well. It sounds like they just need more data with emotion and inflection, rather than source data that is optimized for monotone delivery and precision.


Great video!

If you work in radio or voiceover you learn really quickly that the voice is so much more complex than people give it credit for. Subtle changes in delivery, timing, inflection, syllabic emphasis, pauses etc... make a massive difference.

Anyone can "talk like a robot" but speaking naturally is way more dynamical than just making the sounds of the words transition smoothly.

I'm not sure if it's way easier or way harder than we're doing it now.


The improved voices sound exactly like the audio from CBTs I've taken in the past. They could possibly have fooled me.


What is CBT?


CBT is commonly understood to mean Cognitive Behavioral Therapy, but in this case I believe the OP means computer-based training.


Computer-Based Training


Perhaps I'm just less demanding, but I was pretty blown away. Perhaps it sounds robotic, but it's coming from the same voice recordings. The main difference was how perfectly the words blend together. Now I'm just waiting for Stephen Fry samples.


The Japanese one is miles ahead of the previous one, however.


I guess the hope is that even if it doesn't sound much better, this will open new doors, such as being able to iterate faster on new languages, new voices and new speaking styles (singing, whispering, etc).

EDIT: A big example of that is the Japanese example at the bottom. English has had far more effort put into the old model, so it's already pretty good. But the difference between the old and new Japanese voice is really striking, and they were most likely able to make the new one much faster.


I thought it was just me. I remember the first WaveNet demo sounded significantly more natural than the non-WaveNet stuff, but now they sound almost the same. The main difference is that with WaveNet you don't hear that robotic tone at the end of a phrase that's typical of old TTS technologies. But the way the phrase is spoken still sounds like a machine said it, rather than a human.

I wonder what compromises they made to improve the performance by 1,000x. There have to be some.


This would be great for games with lots of speaking characters and other NPCs.

You could vary the voices so much.


I hope somebody uses this to immortalize Sir David Attenborough or the sadly departed Don LaFontaine.


Oh man, I can already see the court cases where a certain robot's voice sounds a little too similar to a deceased human's and ends up owing royalty payments.


Heh, in my original post I wanted to follow up that paragraph with a question: Is someone's voice their IP?

Would e.g. Don LaFontaine's family go all J.R.R. Tolkien and forbid the use of his voice in certain contexts?

Then again, for now, it seems that it's easy to get away with synthesizing voices of popular characters as long as you don't use copyrighted names:

https://acapela-box.com/AcaBox/index.php

(choose English(USA) - "Little Creature". Borked on Firefox but works in Chrome)

"Little Creature" my ass.

Apparently, names are IP, voices not so much. But we can't have nice things because like you said - eventually, someone's going to figure out how to file an effective lawsuit.


I don't see it as being very different than using someone's face to promote something.

For example, should a lifelike digital model (or even a still image) of an actor's face still generate royalties for the actor's family after their death? Both cases are using a unique attribute of the person to promote something.


I don't think voices are copyrightable, but the recordings used to analyze them are, and so the resulting voice might be considered a derivative work.


I can't tell if that's Yoda or Cookie Monster.


https://lyrebird.ai/ and Adobe VoCo are two products which can take a small sample of a person's voice and then generate any words/sentences in that voice.


So, when can we expect an open source release of the tech and model?


This is Google, so they'll release a paper and then someone will clone it as an Apache project. Or maybe Linux Foundation these days.


I sure hope so, although I am extremely angered that they don’t release the model.

Google is only having success with this because they are getting training data from the public for free; they should also return the model to the public for free (at least for noncommercial use).


And their Web ranking model and their spam filtering model? Face it, Google's gonna be Google.

(And I don't think WaveNet is even based on public training data.)


Well, we have to have some ideals, so we can work towards them.

Public Money, Public Code should also extend to Public Data, Public Code.


Never, and it fucking sucks.


Does anyone have experience/success training WaveNet on their own?


Do you know any higher quality speech synthesis software?


Nope.


These things are highly nuanced.

The amount of tweaking and finessing that goes on under the hood for specific scenarios can change outcomes quite a lot.

And from a product perspective, it can be taken quite far. For example, most of the 'common' things Siri says are not synthesized; they're literally recordings of the voice-over artist. The more arcane stuff is synthesized.

It's always comparing apples to oranges to bananas unless you really know what they're doing, even then it's hard.


I would love something like this for computer notifications, or any sort of automated system notification. For a concrete example, my RC controller has the ability to do voice prompts (e.g. "landing gear down"), and it would be great if I could get a high-quality voice to speak all this.

Hell, I'd settle for an API where I could send text and get high-quality voice back. Maybe I can somehow hack the Google Assistant app to do it...


I think AWS Polly (https://aws.amazon.com/blogs/aws/polly-text-to-speech-in-47-...) can do this. I'm not sure of the quality compared to Google's or Apple's TTS though.
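For what it's worth, the call is about this simple; a rough boto3 sketch (the voice and region here are just examples):

    import boto3

    # Send text to Polly and save the returned MP3 stream.
    polly = boto3.client("polly", region_name="us-east-1")
    response = polly.synthesize_speech(
        Text="Landing gear down.",
        OutputFormat="mp3",
        VoiceId="Joanna",   # example voice; others are listed in the docs
    )
    with open("notification.mp3", "wb") as f:
        f.write(response["AudioStream"].read())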


Apparently it's semi-decent, thank you!

https://aws.amazon.com/polly/


I was hoping that the voice in the video was created by Polly. That would've been amazing but the samples they provide sound quite robotic.


Yeah, they do, unfortunately :/ Hopefully there will be some improvement they make or some other API that will be able to provide this.


If the texts are fixed, why not just pay someone to record them? You can get 100 words for $5.


Mainly because that price is over the "can be bothered" threshold. I'm not going to find someone on Fiverr and contract them to record a few words just so I can have aural notifications for finished downloads, for example, but I might write a simple script to take the text and store an MP3 of the audio if it only tacks a cent onto my AWS bill.


While the 100x speedup sounds impressive, a raw speed number without details regarding hardware is pretty meaningless. I'm guessing they got the 20x realtime speed from running it on their new TPU hardware, which they say can do 180 Tflops. That means you would need 9 Tflops of computing power to run this in realtime -- still pretty far away from running on a phone, or PC for that matter.


They literally said it's launching in the Google Assistant (i.e. running on phones) now.

In fact that's the title of the article.


He also missed the speedup by a factor of 10; it's 1000x faster now :) I doubt it's just more hardware; they must have optimized the networks seriously.


Of course they did, I was merely challenging the implied message that it is now 'fast enough', that the speed problem is solved.


Does TTS run on the phone or the cloud?


It runs on the phone AFAIK, while speech recognition, on the other hand, is backed by stuff Google runs in the cloud.

Edit: I might be wrong; at the end of one paragraph they say it runs on Google's TPU cloud infrastructure, though it isn't clear to me whether they use that just for training.

Edit 2: I just tried it on my phone. At least stuff like asking it to "Turn on WiFi" works without an internet connection, and yields a TTS response.


> I just tried it on my phone. At least stuff like asking it to "Turn on WiFi" works without an internet connection, and yields a TTS response.

But this is the status quo. You would not expect Google to disable offline TTS just for slightly improved quality. The real question is: is it running WaveNet offline, or the previous version of its TTS engine?


Today's announcement is cloud-only. We also support an older algorithm for offline use that's less computationally intensive.


Is the offline one improving also? Google Maps often falls back to it (much more often than necessary for some reason) and it sounds completely different and far worse.


It runs in the cloud.


Google TTS supports both. Maps prefers the cloud one so that you're not at the mercy of whatever TTS your manufacturer has installed (Samsung uses their own, right?), but will fall back to the local one if there's no connectivity. I can't remember if Google TTS implements that or if it's custom to Maps.


I'm interested in using similar generative adversarial networks to reduce artifacting in video streams. For example, highly compressed streams tend to show blocking artifacts on dark scenes, gradients, and static that could be smoothed in the decoder.

I haven't actually done much about it yet, but I'm interested.


Adding matching of the Laplacian to the optimization could be really helpful for this. I found this paper the other day and really enjoyed it: https://arxiv.org/abs/1707.01253
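I haven't read the paper closely, but a minimal sketch of a Laplacian-matching term (PyTorch, assuming grayscale frames and a standard 3x3 discrete Laplacian; the weighting is a made-up hyperparameter) might look like:

    import torch
    import torch.nn.functional as F

    # Penalize differences between the edge responses (Laplacians) of the
    # decoded frame and the reference frame, alongside a plain MSE term.
    KERNEL = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

    def laplacian(img):            # img: (N, 1, H, W) grayscale frames
        return F.conv2d(img, KERNEL, padding=1)

    def loss(decoded, reference, lam=0.1):
        recon = F.mse_loss(decoded, reference)
        edges = F.mse_loss(laplacian(decoded), laplacian(reference))
        return recon + lam * edges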


Is there an API for this, like Amazon Polly?


I was wondering about that too. Google already has speech recognition APIs. I don't understand why they don't provide TTS.


Awesome, I guess. It's hard to know, when everything AI/Assistant-related is currently focused on the US market.


Maybe I'm imagining it, but don't the WaveNet versions sound very formal and a little depressed? As if it were a reflection of our society. Listen to them again: very formal, lifeless, and depressed.


Wow. I can't believe they got such a significant speed increase.


Hopefully there will be enough detail on how they accomplished this when they release their paper.


It has been known for some time that WaveNet can be greatly sped up, and exponentially so. See https://arxiv.org/abs/1611.09482 (Fast Wavenet) for an example.
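The trick is essentially caching: naive generation recomputes the whole receptive field for every sample, while keeping a per-layer queue of past activations makes each sample O(layers). A rough sketch of the idea, with toy scalar weights standing in for the real convolutions:

    from collections import deque
    import math

    class CachedDilatedLayer:
        # Each layer remembers its inputs from `dilation` steps ago,
        # so nothing in the receptive field is ever recomputed.
        def __init__(self, dilation, w_cur=0.5, w_past=0.5):
            self.queue = deque([0.0] * dilation, maxlen=dilation)
            self.w_cur, self.w_past = w_cur, w_past

        def step(self, x):
            past = self.queue[0]        # input from `dilation` steps back
            self.queue.append(x)        # evicts the oldest entry
            return math.tanh(self.w_cur * x + self.w_past * past)

    layers = [CachedDilatedLayer(2 ** i) for i in range(10)]
    sample = 0.0
    for t in range(16000):              # one second of 16 kHz audio
        h = sample
        for layer in layers:            # O(layers) work per sample
            h = layer.step(h)
        sample = h                      # autoregressive feedback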


For those of you who would like to see how Microsoft Cognitive Services' TTS fares compared to Google's TTS:

We launched a bot which gives voice summaries of web content on Messenger, Slack, Telegram & Twitter, with an in-line audio player on the first three. It's great for sharing audio summaries with our visually impaired friends.

Check it out here: https://larynx.io/#larynxBot


Can I get this on macOS as a system voice?


Apple’s iOS 11 voice is pretty amazing too. I wonder what they are using.


Beautiful, but certain voices still seem very close to how computerized they sounded before... except Japanese.

I'd like to be able to record my voice, have it transcribed to text, and then compare how it sounds via both engines vs. my own voice/cadence.


Any idea when Assistant for iOS will be released outside the US?


Wow. I hope they publish details on the optimisations.


Is there an open-source implementation on GitHub?


There are two reachable from here: https://arxiv.org/abs/1611.09482


Oh, how much I wish Google paid Morgan Freeman or David Attenborough to have their voices as an option.


Wonder -- in all seriousness -- if the BBC will do Attenborough. If you're British and you've grown up with his documentaries, all other nature voices simply sound wrong.


I think the logical next step for this is voice recognition.



