WaveNet: A Generative Model for Raw Audio (deepmind.com)
627 points by benanne on Sept 8, 2016 | 145 comments



The music examples are utterly fascinating. It sounds insanely natural.

The only thing I can hear that sounds unnatural is the way the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.

To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.

Regardless, the piano sounds completely natural to me; I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!

There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.


I can hear some distortion in the piano notes - which may be an audio compression artefact, or it may be the output of the resynthesis process.

If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.

Piano music is very idiomatic, so you'll capture some typical piano gestures that way.

But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines).

Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.

NN synthesis could be an interesting thing though. If you trained an NN on *sounds* at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.

Even if the output isn't very realistic, you'd still get something unusual and interesting.


The samples are uncompressed WAV files, so everything you hear is a direct result of the synthesis process. Some of the distortion is a result of the 16 kHz sample rate; it's not 44.1 kHz CD quality.


It's quantized to just 256 values though, which could be causing some of the distortion.
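For reference, the 256-value quantization the paper describes is a µ-law companding transform (µ = 255) followed by uniform quantization. A minimal numpy sketch of the encode/decode round trip, which is where that quantization noise would come from:

  import numpy as np

  def mu_law_encode(x, mu=255):
      """Compand raw audio in [-1, 1] and quantize to 256 levels (0..255)."""
      y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
      return ((y + 1) / 2 * mu + 0.5).astype(np.int32)           # quantize

  def mu_law_decode(q, mu=255):
      """Invert the quantization back to approximate waveform values."""
      y = 2 * (q.astype(np.float32) / mu) - 1
      return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

  audio = np.sin(np.linspace(0, 100, 16000))   # dummy 1-second signal at 16 kHz
  codes = mu_law_encode(audio)                  # integers in [0, 255]
  approx = mu_law_decode(codes)                 # the residual error is the quantization distortion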


It shot me forward to a time when people just click a button to generate the music they want to listen to. If you really like a generation, you save it and share it. It wouldn't have all of the other aspects we derive from human-produced music, like soul/emotion (because we know it's coming from a human, not because of how it sounds), but it would be a cool application idea anyway.


Have you tried https://www.jukedeck.com ? AI composed music at the touch of a button.



This reminds me of the Library of Babel short story.


I agree, the samples sound very natural. I do ask myself, though, how similar they are to the data used for training, as it would be trivial to rearrange individual pieces of a large training set in ways that sound good (especially if a human selects the good samples for presentation afterwards).

What I'd really like to see therefore is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.


A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another nice property of the model is that it seems to be extremely resilient to overfitting, based on these measurements.
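Concretely, measuring overfitting here just means comparing average negative log-likelihood on training audio and on held-out audio. A minimal sketch, assuming a hypothetical model.predict_proba(prefix) that returns the 256-way distribution over the next sample:

  import numpy as np

  def mean_nll(model, sequences):
      """Average negative log-likelihood (nats per sample) for an autoregressive model.
      `model.predict_proba(prefix)` is an assumed interface returning a (256,) distribution."""
      total, count = 0.0, 0
      for seq in sequences:
          for t in range(1, len(seq)):
              p = model.predict_proba(seq[:t])
              total -= np.log(p[seq[t]] + 1e-12)
              count += 1
      return total / count

  # gap = mean_nll(model, held_out) - mean_nll(model, train)
  # A small gap suggests little overfitting; these likelihoods are exact, not bounds.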


Good point! Are (some of) the chords completely made up, for example, or is it only using chords it has heard before?


Filtering out certain notes from a piano chord can be done by e.g. Melodyne, but that seems far from what's necessary to generate speech, so it would surprise me if WaveNet could do that.


Decades ago, I was testing an LPC-10 vocoder. I discovered many new and strange sounds by playing with the input mike, such as blowing into it or rubbing it. As with the LPC-10, I wonder about the untapped musical possibilities this allows.


That seems completely tractable by simply adding a bit of the right reverb to the generated sample, more or less "in post".


Good point! Just train it with recordings that have no reverberation, and add it later.


It's quite difficult to have no reverberation at all, but not too hard to keep it to a minimum. But reverb plus reverb equals reverb, so it's just a matter of finding one that sounds good.

It'd also be interesting to know if this technique could solve the "de-reverberation" problem.


This can be used to implement seamless voice performance transfer from one speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.

3. Record raw audio from the source speaker.

Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that—in principle— could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.

To do that, usually you convert all numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and what not).

4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options. A rough sketch follows below.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.

Result: The resulting raw audio will have the same overall performance and speech as the source speaker, but rendered completely naturally in the target speaker's voice.
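A rough sketch of what step 4 could look like using a derivative-free black-box optimizer instead of autodiff. Everything here is hypothetical: source_wavenet, target_wavenet, their generate() method, and the feature dimensions are stand-ins, not real APIs:

  import numpy as np
  from scipy.optimize import minimize

  def recover_inputs(source_wavenet, recording, n_features, n_frames):
      """Step 4 as a black-box search: find the feature sequence whose rendering
      by the source speaker's WaveNet best matches the recording."""
      def loss(flat):
          audio = source_wavenet.generate(flat.reshape(n_frames, n_features))  # assumed interface
          n = min(len(audio), len(recording))
          return float(np.mean((audio[:n] - recording[:n]) ** 2))

      x0 = np.zeros(n_frames * n_features)                 # or a smarter initial guess
      result = minimize(loss, x0, method="Nelder-Mead")    # derivative-free optimizer
      return result.x.reshape(n_frames, n_features)

  # Step 5 (illustrative only; dimensions are made up):
  # features = recover_inputs(source_wavenet, recording, n_features=60, n_frames=500)
  # transferred = target_wavenet.generate(features)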


Another fun fact: this actually happens with (cell) phone calls.

You don't send your speech over the line; instead you send some parameters, which are then, at the receiving end, fed into a white(-ish) noise generator to reconstruct the speech.

Edit: not by using a neural net or deep learning, of course.


In case anyone is wondering, the technique is called linear predictive coding.
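In rough terms, LPC fits an all-pole filter to each short frame, transmits the filter coefficients (plus pitch/gain/voicing info), and the receiver regenerates the frame by driving that filter with noise or a pulse train. A toy numpy/scipy sketch of the idea, not any particular codec:

  import numpy as np
  from scipy.linalg import solve_toeplitz
  from scipy.signal import lfilter

  def lpc_coefficients(frame, order=10):
      """Fit an all-pole model to one frame via the autocorrelation method."""
      r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
      return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # a_1..a_order

  def resynthesize(a, n_samples, gain=1.0):
      """Receiver side: excite the all-pole filter 1 / (1 - sum a_k z^-k) with noise."""
      excitation = np.random.randn(n_samples) * gain
      return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)

  frame = np.random.randn(240)                 # stand-in for a 30 ms frame at 8 kHz
  coeffs = lpc_coefficients(frame, order=10)   # these few numbers go over the wire
  reconstructed = resynthesize(coeffs, 240)    # unvoiced-style reconstruction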


Predictive coding. Linear is a specific variant of it used in older codecs.


What's the difference in bandwidth?


> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"

Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).

I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse, and inference is solved.


This also makes me think of "inverse problems", in the context of mathematics, physics.

E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.

The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.

Inverse problems are typically much harder to deal with, and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation aka introducing strong structural assumptions about what the expected solution should be like. This can be quite reasonable from a Bayesian perspective.

https://en.wikipedia.org/wiki/Inverse_problem#Mathematical_c...


Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters, not the inputs. What they did was train the same model with multiple reader voices while using one of the inputs to keep track of which voice the model was currently being trained on. So the model can switch between different voices, but only between those it was trained on.

"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."

Am I missing something?


These are the "inputs" I'm talking about recovering (from the link):

"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."

The raw audio from Step 3 was (in principle) generated by that input on a properly trained WaveNet. We need to recover that so we can transfer it to the target WaveNet.

How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.


Oh, pair this with facial mapping[1] and you pretty much have an "impersonate any famous person" system.

[1] http://www.graphics.stanford.edu/~niessner/thies2016face.htm...


Yup, I work in virtual filmmaking and there are tons of way to use this stuff.

I give us 10-15 years before it's not possible to trust anything you see or hear that's recorded.


Really? I haven't trusted anything recorded in years.


Speech production is incredibly hard to fake at the moment.


> Speech production is incredibly hard to fake at the moment.

Sound-alikes have been used in the music industry since forever.


Or transmitted from one place to another :(


Basically the same idea as style transfer with image algorithms. Looking forward to Abraham Lincoln reading audiobooks to me.


That would require audio recordings of Abraham Lincoln's voice. Not sure recording technology existed back then.


Audio quality does leave something to be desired. https://vimeo.com/47987691


Lincoln died before Edison invented the phonograph. That's a hoax.


Lincoln died in 1865, but the oldest recordings are from the 1860s. The video is definitely a hoax (http://www.firstsounds.org/research/others/lincoln.php), but it's at least theoretically possible his voice could have been recorded. In fact I believe there are some even older recordings from the 1850s, but I don't think those have been successfully recovered yet.

These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.


Ah I stand corrected, thanks.


It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer text from Trump to Clinton, first run speech-to-text on Trump speech and then give that to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.


> It seems like you're using WaveNet to do speech-to-text

I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).

In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.


I see. I still think it's easier to apply DeepMind's feature transform on text rather than to try to invert a neural network. Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?

Text -> features -> TrumpWaveNet -> Trump saying your text


> Armed with a network trained on Trump and DeepMind's feature transform from text to network inputs, you should be able to make him say whatever you want, right?

Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.


Sounds like a very fancy way to do compression with a massive custom dictionary.


Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.


Nitpicking: surjective functions do not relate to uniqueness of outputs; you'd rather talk about non-injective functions. I agree with your point, though.

(surjective != non-injective, in the same way that non-increasing != decreasing)


Wonder if there are any implications here for breaking (MitM) ZRTP protocol.

https://en.wikipedia.org/wiki/ZRTP

At some point, to authenticate, both parties verify a short message by reading it to each other.

However, the NSA already tried to MitM that about 10 years ago using voice synthesis. It was deemed inadequate at the time. I wonder if TTS improvements like these change that game and make it a more plausible scenario.


This will make private in person key exchange way more important. Especially as the attack vector is so cheap (software).


The samples sound amazing. These causal convolutions look like a great idea; I'll have to re-read a few times. All the previous generative audio from raw audio samples I've heard (using LSTMs) has been super noisy. These are crystal clear.

Dilated convolutions are already implemented in TF; I look forward to someone implementing this paper and publishing the code.


I did a review of PixelCNN as part of my summer internship; it covers a bit about how careful masking can be used to create a chain of conditional probabilities [0], which AFAIK is exactly how this "causal convolution" works (it can't have dependencies on the 'future'). The PixelCNN and PixelRNN papers also cover this in a fair bit of detail. Ishaan Gulrajani's code is also a great implementation reference for PixelCNN / masking [1].

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...
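For intuition, the "causal" part just means the convolution only looks into the past, which you can also get by left-padding before a dilated convolution. A minimal numpy sketch of that shift/pad view (not the masking-based implementations referenced above):

  import numpy as np

  def causal_dilated_conv(x, weights, dilation):
      """1-D causal convolution: output at time t depends only on x[t], x[t-d], x[t-2d], ...
      x: (T,) signal; weights: (K,) filter taps; dilation: gap between taps."""
      k = len(weights)
      pad = (k - 1) * dilation
      x_padded = np.concatenate([np.zeros(pad), x])   # left-pad so nothing leaks from the future
      out = np.zeros_like(x)
      for t in range(len(x)):
          taps = x_padded[t + pad - np.arange(k) * dilation]   # x[t], x[t-d], x[t-2d], ...
          out[t] = np.dot(weights, taps)
      return out

  x = np.random.randn(32)
  y = causal_dilated_conv(x, np.array([0.5, 0.3, 0.2]), dilation=4)
  # Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially.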


Heh, just read it! Very useful, will have to go through in detail


What's really intriguing is the part in their article where they explain the "babbling" of wavenet, when they train the network without the text input.

That sounds just like a small kid imitating a foreign (or their own) language. My kids are growing up bilingual and I heard them attempt something similar when they were really small. I guess it's like listening in on their neural network modelling the sounds of the new language.


To my Australian English ears, the babbling sounded vaguely Scandinavian.


Indeed. I was surprised by that as well. Sounded like a Dutch speaker with a muffled voice behind a screen.


Might just be that English is fairly close to German and the like, but as English speakers it doesn't sound like English to us because we know English, so it gets mapped to a similar but different language.


Confirms my thought that Dutch sounds like unintelligible babbling :)


Especially funny as the main authors are Dutch.


Ah. Perhaps it was trained on Dutch speakers, not English.


That would explain it. Would be interesting to hear babbling trained with other languages and accents.


To my German ears it definitely sounded English, not Dutch, like a very hard-to-understand dialect.


So when I get the AI from one place, train it with the voices of hundreds of people from dozens of other sources, and then have it read a book from Project Gutenberg to an mp3... who owns the mechanical rights to that recording?


> who owns the mechanical rights to that recording?

The monkey who shot the picture. https://en.wikipedia.org/wiki/Monkey_selfie


Good point ... I am pretty sure there are a thousand Audible products waiting to be launched.


Every single person who had rights on the sources for audio you used.

For the same reason, Google training neural networks with user data is legally very doubtful – they changed the ToS, but also used data collected before the ToS change for that.


>Every single person who had rights on the sources for audio you used.

What if my 'AI' was a human who learned to speak by being trained with the voices of hundreds of people from dozens of other sources? What's the difference?

Those waters seem muddy. I think that'd be an interesting copyright case, don't think it's self evident.


So if I remix just 200 songs together, the result is not copyright protected anymore?


No, it's not like remixing. It's more like listening to 200 songs and then writing one that sounds just like them.

More like turning the songs into series of numbers (say 44100 of these numbers per second) and then using an AI to predict which number comes next, to make a song that sounds something like the 200. The result is not possible without ingesting the 200 songs, but the 200 songs are not "contained" in the net and then sampled to produce the result, like stitching together a recording from other recordings by copying little bits.

The hairs split too fine at the bottom for our current legal system to really handle. That's why it's interesting.


In the US legal system, that'd still be a derivative work.

This might be an interesting read for you: http://ansuz.sooke.bc.ca/entry/23


LibriVox


Any suggestions on where to start learning how to implement this? I understand some of the high level concepts (and took an intro AI class years ago - probably not terribly useful), but some of them are very much over my head (e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would be useful as I attempt to understand it. (EDIT: section numbers refer to their paper)


Best advice is to wait for a version to pop up on github. It's hard to implement such a paper as a beginner.


Well, I think since we now have frameworks for doing this kind of stuff (TensorFlow and similar), the barrier to entry is much, much lower. Also, the computing power required to build the models can be found in commodity GPUs.

On a hunch I'd say an absolute beginner may be able to get good results with these tools, just not as quickly as experts in the field who already know how to use them properly. That's why I'm going to wait for something to pop up on GitHub: I have zero practical experience with these things, but I can read these papers comfortably without needing to look up every other term.

There are a number of applications I'd like to throw at deep learning to see how it performs. Most notably I'd like to see how well a deep learning system can extract features from speckle images. At the moment you have to average out the speckles from ultrasound or OCT images before you can feed them to a feature recognition system. Unfortunately this kind of averaging eliminates certain information you might want to process further down the line.


Agreed, there's a lot of breadth here. I'm coming from the opposite end, with some experience in "manual" concatenative speech synthesis and very little in the ML area; you'd need to be cross-disciplined from the get-go.


https://github.com/ibab/tensorflow-wavenet - looks like they're starting to show up.


Is it possible to use the "deep dream" methods with a network trained for audio such as this? I wonder what that would sound like, e.g., beginning with a speech signal and enhancing with a network trained for music or vice versa.


We tried this, but with less success than WaveNet. https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2...


There is a link to examples at the end


Interesting! So if I understand correctly, much of the noise in the generated audio is due to the noise in the learned filters?

I assume some regularization is added to the weights during training, say L1 or L2? If this is the case, this is essentially equivalent to assuming the weight values are distributed i.i.d. Laplacian or Gaussian. It seems you could learn less noisy filters by using a prior that assumes dependency between values within each filter, thereby enforcing smoothness or piecewise smoothness of each filter during training.
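For example, one simple way to encode such a dependency is to penalize differences between neighbouring filter taps (roughly a Gaussian prior on increments) instead of, or on top of, plain L2 on the taps themselves. A small sketch:

  import numpy as np

  def smoothness_penalty(filters, weight=1e-3):
      """Penalize squared differences between adjacent taps of each filter.
      filters: array of shape (n_filters, filter_length)."""
      diffs = np.diff(filters, axis=1)
      return weight * np.sum(diffs ** 2)

  # Added to the training loss, this favors smooth / piecewise-smooth filters,
  # unlike plain L2, which treats taps as independent Gaussian draws.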


Yes. Working on some different regularization techniques.


The piano stuff already seemed like 'dream music', as did the 'babble' examples. I found myself terribly frustrated by how short all those examples were. I wanted lots more :)


Do they say how much time the generation takes?

Is this insanely slow to train but extremely fast to do generation?


"After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."

So it looks like generation is a slow process.
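In pseudo-Python, that sampling loop looks roughly like the following; model.next_distribution is a hypothetical stand-in for a full forward pass of the network:

  import numpy as np

  def generate(model, seed, n_samples):
      """Autoregressive sampling: draw one value from the predicted distribution,
      append it, and repeat. This sequential loop is why generation is slow."""
      samples = list(seed)
      for _ in range(n_samples):
          probs = model.next_distribution(samples)   # assumed: (256,) softmax output
          value = np.random.choice(256, p=probs)     # sample, don't argmax
          samples.append(value)
      return np.array(samples)

  # Each of the 16,000 samples needed for one second of audio requires a full
  # forward pass conditioned on everything generated so far.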


Relatively speaking, training is fast (due to parallelism / masking, so you don't have to sample during training), but during generation sampling is a sequential process. They talk about it a bit in the previous papers for PixelCNN and PixelRNN.


According to 3rd hand reports I've heard (apply copious amounts of salt), it may take 1 hour of CPU time to generate 1 second of speech.


I was wondering the same. They don't mention anything about how long it took or on what kind of system. Even for a first beta it would give us some ballpark idea of how slow it is -- because it's clearly slow; they just don't say how slow exactly, so it's probably bad.


Please please please someone please share an IPython notebook with something working already :)


I have some IPython notebooks for speech analysis using a Chinese corpus. I used them for a tutorial on machine learning with Python, and unfortunately they are still a bit incomplete, but maybe you'll find them useful nevertheless (no deep learning involved, though). What I do in the tutorial is start from a WAV file and then go through all the steps required for analyzing the data (using a "traditional" approach), i.e. generate the Mel-cepstrum coefficients of the segmented audio data and then train a model to distinguish individual words. Word segmentation is another topic that I touch on a bit, and where we can also use machine learning to improve the results.

Here's a version with very simple speech training data (basically just different syllables with different tones):

https://github.com/adewes/machine-learning-chinese/blob/mast...

More complex speech training data (from a real-world Chinese speech corpus [not included but downloadable]):

https://github.com/adewes/machine-learning-chinese/blob/mast...

There are other parts of the tutorial that deal with Chinese text and character recognition as well, if you're interested:

https://github.com/adewes/machine-learning-chinese

For part 2 I also train a simple neural network with lasagne (a Python library for deep learning), and I plan to add more deep learning content and do a clean write-up of the whole thing as soon as I have some more time.


Thanks! will take a look.


It takes 90 minutes to synthesize 1 second. Sorry, no laptop version yet.

https://twitter.com/hardmaru/status/773968758519902208


I second that!


This is incredible. I'd be worried if I were a professional audiobook reader :)


I worked for Audible for five years, and this exact conversation was had often in my division (ACX.com - Audible's "Audiobook Creation Exchange".)

Audible brought ACX together in order to bolster its catalog. The company-wide initiative was called PTTM ('pedal to the metal') and ACX was Audible's secret weapon to gain an enormous competitive foothold over the rest of the audiobook industry. Because we paid amateurs dirt-cheap rates to record horrible, self-published crap (to which Amazon, Audible's parent company had the exclusive rights), Audible was able to bolster its numbers substantially in a short period of time.

The dirty not-so-secret behind this strategy was: nobody bought these particular audiobooks. These audio titles were not really made to be "purchased," but rather to bulk up Audible's bottom line. We knew that the ACX titles were not popular, because the amateur narrators' acting talents and audio production skills were remarkably subpar.

Neural nets may be able to narrow the gap between the pros and the lowest-common-denominator to the point where they can become the next "ACX," but frankly, it won't matter to audiobook listeners, because audiobook listeners don't buy "ACX" audiobooks. Books, even in audio form, are a major intellectual and temporal commitment (not to mention -- they tend to be pricey.) Customers will always want to buy the human-narrated version of a book - the professional production of a book. If that stops being offered, Audible will anger a lot of customers and I think Bezos has better shit to worry about than his puny audiobooks subsidiary.

Despite that, user-generated content is a secret weapon that a lot of websites wield effectively - including HN - but it is beginning to lose its effectiveness. Indeed, the next generation of cost-slashing-while-polluting-the-quality-of-your-catalog will belong to the neural nets. They may be able to get better sales than ACX titles do today with AI-generated audio content, but the actors are going nowhere.


I've listened to some LibriVox recordings of public domain works, notably A Princess of Mars. The price was right at the time, though the quality was, as you say, remarkably subpar. If I could have had a neural net read me the book instead of having to deal with narrators changing every chapter, that would have been preferable.

That said, I have money now, so give me Todd McLaren narrating Altered Carbon for the cost of an Audible Credit every time.


I wouldn't. The results they offer are excellent, but the missing points they need to achieve human level are related to producing the correct intonation, which requires accurate understanding of the material. That is still at least ten years in the future, I expect.


Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.


There's 0 chance of effective intonation and tone without understanding of the material.


I think your use of the term "understanding" is very unhelpful here. It's better to think about what you need to condition on to predict correctly.

In fact most intonation decisions are pretty local, within a sentence or two. The most important thing are given/new contrasts, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.


The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.

On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which has been the subject of the novel since page 1, you will imply urgency.


With enough labor, you could annotate enough sentences to cover a lot of dialogue cases. Sentences like "'Stop!', he said angrily/dryly/mockingly" are probably fairly common. You'd be modeling the next most probable inflection given the previous words and selected tones.

What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as well as sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).


And who says it can't understand the material? There have been recurrent networks trained that can translate between languages, or predict the next word in a sentence, at remarkable accuracy. Combined with wavenet this could be quite effective.


There could be cases where the intonation depends on things entirely outside of the book; if, say, a politician does something in the writing that is far from what we would expect them to do in today's world.


How about we allow annotation of text with prosody cues? Mark the words you want stressed. We already use question and exclamation marks.


I'd love that. Writing is a poor representation of language. It'd be nice to bring it up a notch. Here's a suggestion in a paper I wrote on better second language acquisition. https://www.researchgate.net/publication/261022308_BETTER_SE...


Like traditional audio books can capture perfectly what you're referring to...


They can, though?


I don't see why many aspects of intonation couldn't be taught the same way ...


I think the point is that different parts of the story need different intonation patterns (reading a scary part vs a boring part, etc.).

So in theory, it could be achieved by having multiple training sets (for the different intonation styles), along with analysis of the text to direct which part of the text needs what intonation. You might even be able to blend intonations.


Or just pay MTurk workers to annotate texts with intonation cues.

I kinda doubt that would be profitable relative to just hiring readers, but in general you don't need to replace workers completely to cannibalize some of their wages/jobs.


Or treat it as part of the original author's job. When you write a piece of music you add tempo and intensity metadata to the score, so why not do the same when writing a novel?


Or the author could just add that information to the text. This way there's no need to "understand" it.


There have been significant advances in sentiment analysis too. Trading bots use sentiment analysis as part of the input for their time-series prediction algorithms. I would not say 10 years.


What about auto-tuning? I can do a pretty good reading-with-intention but I don't have the melt-your-brain-rich tones of Stephen Fry or Ian McKellen.


That is so exciting for me. I love listening to audiobooks when I'm walking my dog, or driving, or something boring that doesn't need my brain but does need my arms.

The issue is the selection is so much smaller than the selection of books.


Indeed. It also sounds like it could be trained to correctly read math or code, two things that require enough expertise to pronounce properly that most text-to-speech engines fail miserably at them.

Something like:

  a(b+c)
"a times the quantity b plus c"

If read with proper inflection, this would be a vast improvement and could open up all sorts of technical material to people for whom audio learning is preferred.

I think back to the first math teacher I had whose pronunciation of the notation was precise and unambiguous enough that one didn't really have to be watching the board. This is a rare gift; it is possible in many areas of math, yet few teachers master it (or realize how helpful it is).


I'm an audiobook junkie and as far as professional narrators go, I think it'd be hard to replace a high-end performance with something computer generated and end up with the level of quality offered by the likes of a great narrator like Scott Brick. I mention him by name because it was him that made me realize how important good quality narration is. I had purchased a book at an airport bookstore on a whim and while waiting for a plane was so disgusted with the poor quality of the writing that I actually threw the book out[0]. Years later, I had grabbed an audio book by an author I hadn't heard of simply because it was read by Scott Brick and recommended to "Read Next". Two hours in and I realized the book I had been enjoying so much was the same terrible book I had thrown out years before[1].

While I don't doubt it'll be possible for a computer to match it with enough input data (both in voice and human adjustment), it'll probably be a while before we're there, and when we are, it'll likely require a lot of adjustment on the part of a professional. A big part of narration is knowing when and where a part of the story requires additional voice acting (and understanding what is required). A machine-generated narration would have to understand the story sufficiently to be able to do that correctly. They might be able to get the audio to sound as good as it would if I narrated it, but someone with talent in the area is going to be hard to match.

All of that aside, it's getting pretty close to "good enough". When it reaches that point, my hope is that more books will have audio versions available[2], and in all likelihood some books that would be narrated by a person today will instead be narrated by technology, limiting human narration to only the top x% of books.

[0] I always resell books or donate them. This book was so bad that the half-hour it took from my life felt like a tragedy. I threw it out to prevent someone from experiencing its awfulness -- even for free.

[1] I realized it was the same book at the point a story was told that I had only read in the first book (and found mildly humorous). The reason I hated the other book was that it was written in the first person as a New York cop. I couldn't form a mental picture and the character was entirely unbelievable and one dimensional. When narrated properly, that problem was eliminated.

[2] I "speed read" (not gimmicky ... scan/skimming) and consume a ton of text. I've been doing it for 20 years or so and find it difficult to read word-for-word as is required for enjoyment of fiction, so to "force" it, I stick with audio books for fiction and love them.


I too greatly appreciate highly skilled readers. It's another layer of creativity and inspiration in addition to the text, and when done well adds a lot to the book.


I only fell in love with the voice of a single audiobook narrator. I checked, and yes, he was Scott Brick. I think he adds about 50% on top of the value of the written book by his interpretation.


He's incredible - some people complain that he's a little bit of a slow reader, but audio book apps usually have a speed option. He enunciates well and adds a depth of feeling to the work that can take a book that's average up several notches.

He's also the only narrator that I can name[0].

[0] Who isn't well known for other things -- Douglas Adams narrated his entire series, and some actors are also regular audio book narrators, but Scott Brick is purely a narrator (or at least, was when I last looked).


How much data does a model take up? I wonder if this would work for compression? Train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally store deltas for egregious differences.)


It would be a slow (but very efficient information-wise - only have to send text which itself can be compressed!) decompression process with current models / hardware due to sequential relationships in generation.

I am sure people will start trying to speed this up, as it could be a game changer in that space with a fast enough implementation. Google also has a lot of great engineers with direct motivation to get it working on phones, and a history of porting recent research into the Android speech pipeline.

The results speak for themselves - step 1 is almost always "make it work" after all, and this works amazingly well! Step 2 or 3 is "make it fast", depending who you ask.


We've known for decades that neural networks are really good at image and video compression. But as far as I know, this has never been used in practice, because the compression and decompression times are ridiculous. I imagine this would be even more true for audio.


The Magic Pony guys (who sold to Twitter) have patents and implementations of a super-resolution CNN for realtime video.

http://www.cv-foundation.org/openaccess/content_cvpr_2016/pa...


Wow! I'd been playing around with machine learning and audio, and this blows even my hilariously far-future fantasies of speech generation out of the water. I guess when you're DeepMind, you have both the brainpower and resources to tackle sound right at the waveform level, and rely on how increasingly-magical your NNs seem to rebuild everything else you need. Really amazing stuff.


I'm guessing DeepMind has already done this (or is already doing it), but conditioning on a video is the obvious next step. It would be incredibly interesting to see how accurate it can get at generating the audio for a movie. Though I imagine for really great results they'll need to mix in an adversarial network.


Oh yes, extract voice and intonation from one language, and then synthesize it in another language -> we get automated dubbing. Could also possibly try to lipsync.


Wow. I badly want to try this out with music, but I've taken little more than baby steps with neural networks in the past: am I stuck waiting for someone else to reimplement the stuff in the paper?

IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...


Re-implementation will be hard. Several people (including me) have been working on related architectures, but there are a few extra tricks in WaveNet that seem to make all the difference, on top of what I assume is "monster scale training, tons of data".

The core ideas from this can be seen in PixelRNN and PixelCNN, and there are discussions and implementations for the basic concepts of those out there [0][1]. Not to mention the fact that conditioning is very interesting / tricky in this model, at least as I read it. I am sure there are many ways to do it wrong, and getting it right is crucial to having high quality results in conditional synthesis.

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...
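For what it's worth, the conditioning described in the paper enters through the gated activation unit: z = tanh(W_f * x + V_f h) ⊙ σ(W_g * x + V_g h), where h is e.g. a one-hot speaker ID (global conditioning). A minimal numpy sketch of that gate, with hypothetical channel counts:

  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  def gated_unit(conv_filter, conv_gate, h, V_f, V_g):
      """Gated activation with global conditioning:
      z = tanh(W_f*x + V_f.h) * sigmoid(W_g*x + V_g.h).
      conv_filter, conv_gate: (channels, time) outputs of two causal dilated convolutions.
      h: conditioning vector, e.g. a one-hot speaker ID of shape (n_speakers,).
      V_f, V_g: (channels, n_speakers) learned projections, broadcast over time."""
      cond_f = (V_f @ h)[:, None]
      cond_g = (V_g @ h)[:, None]
      return np.tanh(conv_filter + cond_f) * sigmoid(conv_gate + cond_g)

  # Example with 16 channels, 100 time steps, 109 speakers (109 as quoted from the paper):
  channels, T, n_speakers = 16, 100, 109
  h = np.eye(n_speakers)[7]   # speaker #7 as a one-hot vector
  z = gated_unit(np.random.randn(channels, T), np.random.randn(channels, T),
                 h, np.random.randn(channels, n_speakers), np.random.randn(channels, n_speakers))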


Is there any usable example code out there I can play with? I don't care if it sounds noisy and weird, it's all grist for the sampler anyway.


And to think of all those Hollywood SF movies where the robot could reason and act quite well but spoke in a tin voice. How wrong they got it. We can simulate high-quality voices, but we can't have our reasoning, walking robots.


Depending on what you mean by 'reasoning, walking robots', not really, not yet... but every few weeks or months another amazing deep learning/NN thing comes out in a different domain. So these types of techniques seem to have very broad application.

Of course, if you mean 'walking' in a literal sense, there are a number of impressive walking robots such as Atlas https://www.youtube.com/watch?v=rVlhMGQgDkY, HRP-2 https://www.youtube.com/watch?v=T6BSSWWV-60 or HRP-4C https://www.youtube.com/watch?v=YvbAqw0sk6M, etc. Also there are many types of useful reasoning systems. I am guessing you are thinking of language understanding and generation, but I believe these types of techniques are being applied quite impressively in that area also, from DeepMind or Watson https://www.youtube.com/watch?v=i-vMW_Ce51w etc.


"At Vanguard, my voice is my password..."


This is amazing. And it's not even a GAN. Presumably a GAN version of this would be even more natural — or maybe they tried that and it didn't work so they didn't put it in the paper?

Definitely the death knell for biometric word lists.


Please make it sound like Morgan Freeman.


Morgan Freeman +1


I hope this shows up as a TTS option for VoiceDream (http://www.voicedream.com/) soon! With the best voices they have to offer (currently, the ones from Ivona), I can suffer through a book if the subject is really interesting, but the way the samples sounded here, the WaveNet TTS could be quite pleasant to listen to.


Would delete this post if I could. Was a request to fix a broken link. Now fixed.


It seems fixed now.


So when does the album drop?


In case the above came across as an example of bad sarcasm, I'm very serious. I've a somewhat lazy interest in generative music, and found the snippets in the paper quite appealing.

Though, as was mentioned in a previous comment, due to copyright (attribution based on training data sources, blah blah) I might already have an answer. :(


“Is this Hiromi Uehara or WaveNet?”


I wonder how a hybrid model would sound, where the net generates parameters for a parametric synthesis algorithm (or a common speech codec) instead of samples, to reduce CPU costs.


The first to do semantic style transfer on audio gets a cookie!


When can we expect this to be used in Google's TTS engine?


Love the music part! Mmmh ... infinite jazz.


Finally a convincing Simlish generator!


Hope they can release some source code.

Wonder how many GPUs are required to hold this model.


I suppose it's impressive in a way, but when I looked into "smoothing out" text-to-speech audio a few years ago, it seemed fairly straightforward. I was left wondering why it hadn't been done already, but alas, most engineers at these companies are either politicking know-nothing idiots or are constantly being roadblocked, preventing them from making any real advancements.



