Voder Speech Synthesizer (griffin.moe)
253 points by CyborgCabbage on July 18, 2023 | 42 comments



A short explanation as to how this works:

The voice can be modeled using two main components. The vocal cords are a periodic source of sound, which is then filtered by the mouth and tongue to produce vowel sounds [0]. The filter can be modeled as a set of band-pass filters, each of which lets through a specific band of frequencies — these are called ‘formants’ in acoustic phonetics. Different vowel sounds are produced by combining formants at different pitches in a systematic way [1]. You can hear this yourself by very slowly moving your mouth from saying an ‘eeeee’ sound to an ‘ooooo’ sound: if you listen carefully, you can hear one formant changing pitch while the others stay the same. (I like [2] as an intro to this kind of stuff.)

The ‘voder’ works by having one key for each band-pass filter, each covering a different frequency band. Pressing multiple keys adds the resulting sounds, producing an output sound with distinct formants. If you use the right formants, the resulting sound is very similar to that produced by a human mouth saying a specific vowel! Software such as the vowel editor in Praat [3] takes it further, by allowing selection of formants from a standard vowel chart.

[0] Consonantal sounds are a bit more complicated, since they tend to involve various different noise sources and transient disturbances of the sound. For instance, /ʃ/ (the ‘sh’ sound) is noise of a lower frequency than /s/. I can’t work out how Harper produced the difference between those two sounds in the video — it seems to be impossible to do this with the live demo. In fact, any sort of pitch control seems to be impossible in the demo.

[1] This is how overtone singing and throat singing work! Selectively amplifying one formant gives the impression that you’re singing that note at the same time as the ‘base’ pitch. In fact, if you do that, your vocal cords are producing a pitch plus all its overtones, while your mouth is enhancing one overtone and filtering out all the others.

[2] https://newt.phys.unsw.edu.au/jw/voice.html

[3] https://www.fon.hum.uva.nl/praat/ — probably also available from your favourite Linux distro!
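
A minimal sketch of the source-filter idea described above, using the Web Audio API. The two formant frequencies below are only ballpark values for an ‘ah’-like vowel, not carefully tuned ones:

```js
// A buzzy source (standing in for the vocal cords) feeding two band-pass
// filters at approximate formant frequencies.
const ctx = new AudioContext();

const source = ctx.createOscillator();
source.type = 'sawtooth';          // rich in overtones, like glottal pulses
source.frequency.value = 110;      // the perceived pitch of the voice

for (const formant of [700, 1100]) {   // rough F1 and F2 for an 'ah' vowel
  const bp = ctx.createBiquadFilter();
  bp.type = 'bandpass';
  bp.frequency.value = formant;
  bp.Q.value = 5;                      // narrow-ish band around the formant
  source.connect(bp);
  bp.connect(ctx.destination);
}

source.start();
```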


There's also a very nice simulation where you can play with the different parts of the vocal tract:

https://imaginary.github.io/pink-trombone/


I made a fork with a few more features; it might even work on your phone browser:

https://jmiskovic.github.io/voicebox


Thank you for this. I had a lot of fun scaring my cat in bed and it inspired me to become a late middle aged opera savant.


I actually am a late middle-aged opera savant but sadly I have no cat to scare


Unfortunately this webapp (along with the original Pink Trombone) produces super glitchy audio and consumes 95% of CPU on my Chrome v114.0.5735.198 running on Ubuntu 22.04 (which is running on my Thinkpad X220)


That's rather strange. The graphics part is lightweight (pre-rendering the background and then drawing a few shapes), but if you could shrink the browser window to very small dimensions and test, we could rule that part out.

The audio part is a bit more involved. The vocal tract is simulated in segments, each segment receiving, filtering and reflecting the soundwave energy. The algorithm is computationally heavy, but it ran well on my mediocre smartphone.

Maybe if stuttering is detected it could lower the number of tract segments, which also lowers the quality. Increasing the buffer size would probably also help with glitches but I don't think it would solve the high CPU utilization.
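
For the curious, a minimal sketch of that segment idea (illustrative only, not the actual Pink Trombone code): the tract is a chain of segments with cross-sectional areas, and at each junction between segments part of the travelling wave passes through while part is reflected back.

```js
// One step of a crude one-dimensional waveguide.
// right[i]/left[i] are the forward/backward travelling waves in segment i,
// areas[i] is that segment's cross-sectional area.
function stepTract(areas, right, left, glottalSample) {
  const n = areas.length;
  const newRight = new Float64Array(n);
  const newLeft = new Float64Array(n);
  newRight[0] = glottalSample + 0.75 * left[0];  // glottis end: partial reflection (made-up coefficient)
  newLeft[n - 1] = -0.85 * right[n - 1];         // lip end: mostly reflected back (made-up coefficient)
  for (let i = 1; i < n; ++i) {
    // Reflection coefficient set by the area change between neighbouring segments.
    const k = (areas[i - 1] - areas[i]) / (areas[i - 1] + areas[i]);
    const w = k * (right[i - 1] + left[i]);
    newRight[i] = right[i - 1] - w;   // transmitted forward, minus the reflected part
    newLeft[i - 1] = left[i] + w;     // transmitted backward, plus the reflected part
  }
  return [newRight, newLeft];         // the output sample can be taken from newRight[n - 1]
}
```

Lowering the number of segments shrinks that inner loop, which is why it trades audio quality for CPU time.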


Apparently the Voder had a pitch pedal:

https://imgz.org/i9TzhzWu/


Ah, that would explain it. Thanks for finding that image!


No problem! It's from a video linked in a thread below (the extended World's Fair presentation).


Someone was selling a vocoder on eBay, so they made a video of the vocoder describing its own selling features.

https://www.youtube.com/watch?v=5kc-bhOOLxE


I may be wrong, but a vocoder is an entirely different thing from a Voder.


This is quite off topic, but it reminded me of something I have been thinking about recently – perhaps at the limit all highly capable narrow AI systems must become generally intelligent.

I was thinking about the complexity of expression in TTS voice synthesizers recently and it struck me just how difficult a problem that is.

To be as expressive as a human, the AI model would need to fully "understand" the context of what is being said. Consider how a phrase like "I hate you" can be said in a loving way between friends sharing a joke at each other's expense, vs being said with anger or in sadness.

It got me wondering if all sufficiently complex problems require models to be generally intelligent – at least in the sense that they have deep, nuanced models of the world.

For example, perhaps for a self-driving car to be as "good" as a human it actually needs to be generally intelligent, in that it needs to understand that it's appropriate to drive differently in an emergency situation vs on a leisurely weekend drive through a scenic part of town. When driving through my city after 8PM on the weekend I tend to drive slower and more cautiously, because I know drunk people often walk out in front of my car – would a good self-driving car not need to understand these nuances of the world too?

This is interesting because it highlights just how important the element of human understanding is in accurately conveying expression in a voice synthesizer. While I'd argue modern voice synthesizers have been more intelligible than this for some time, the expressiveness of this machine has probably only recently been rivalled by state-of-the-art AI models.


Probably to some degree, but for your two examples I would argue that isn't necessary:

For TTS, the "tone" is something you should encode in the input rather than have TTS figure out. I can imagine ebook > LLM > annotated text with speakers, emotions etc > TTS. So the TTS can remain rather dumb.
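
Purely as an illustration, the intermediate annotation such a pipeline hands to the "dumb" TTS might look something like this (the field names are made up, not from any real system):

```js
// Hypothetical output of the LLM annotation step, consumed by a simple TTS.
const annotated = [
  { speaker: 'narrator', emotion: 'neutral', text: 'She looked at him and said,' },
  { speaker: 'friend',   emotion: 'playful', text: 'I hate you.' },  // the joking reading
];
```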

For the self-driving car, it shouldn't know cultural norms and be "more careful" sometimes. It should always know how much it sees and what stopping distance it can get with max braking and its reaction time, and adjust accordingly.
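
As a rough sketch of what "adjust accordingly" could mean (all numbers purely illustrative):

```js
// Total stopping distance = distance covered during the reaction time
// plus the braking distance under maximum deceleration.
function stoppingDistance(speedMs, reactionTimeS, maxDecelMs2) {
  return speedMs * reactionTimeS + (speedMs * speedMs) / (2 * maxDecelMs2);
}

// e.g. 50 km/h (~13.9 m/s), 0.1 s reaction time, 8 m/s^2 braking -> ~13.5 m,
// so the car would cap its speed wherever it cannot see that far ahead.
console.log(stoppingDistance(13.9, 0.1, 8).toFixed(1));
```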

Agreed on stuff like emergencies etc.


> For the self-driving car, it shouldn't know cultural norms and be "more careful" sometimes. It should always know how much it sees and what stopping distance it can get with max braking and its reaction time, and adjust accordingly.

I used to live next to two schools. In the morning before school, the pavement and road outside my house were always full of school kids on bikes. During this time I'd drive with the assumption that at any moment a bike could pull out in front of my car, because those kids were nuts and often did.

But to assume this generally, just to be safe, would be extremely inconvenient. In reality, if I see a group of bikers wearing lycra I will assume they're competent bikers. While I'll still drive carefully, I won't assume they're about to pull out in front of my car.

If self-driving cars operate with the assumption that every pedestrian is drunk and every bike on the road is a 12-year-old schoolboy, then no one will use them. Do self-driving cars try to do this currently? If I jaywalked in front of a Tesla, is it designed to always be able to stop in time?


I'd expect self driving cars to have much better sensors and reaction times than we do, and as a consequence they wouldn't need to choose between those risks and actually carrying people from one point to another.

But they will probably be way slower than human drivers on streets that run right alongside sidewalks and are full of pedestrians.


> I'd expect self driving cars to have much better sensors and reaction times than we do

That is never going to happen.


Words like "good", "better" and "should" always carry freight that's often worth unpacking. Here, "better" really needs a definition.

A CCD is better than human eyes inasmuch as it captures a whole field at once, rather than a narrow focus with a fuzzy periphery that must be pointed at an object to resolve it.

I'm sure we could find metrics where a 360-degree lidar is better than human eyes.

It's disingenuous to pretend that sensor quality is the whole story, of course.

Human drivers have notoriously variable reflexes. I once rear-ended someone because I was inattentive. I assert that the current gen has better reflexes than some percentile of real-world meat-drivers, and I suspect that the percentile is higher than 90. Human reflexes simply aren't that quick without significant priming.


Yes. In Iain M. Banks’s Culture, even the guns are generally intelligent.


I think our current gen AI is only 1 piece of the puzzle.

This gen understands how to put words together to satisfy the instruction it is given, but it has no volition of its own and no drive it arrived at through its own cognition.

I believe GAI will need to have multiple current-gen systems running simultaneously (in unison if not in harmony), simply to form a subconscious layer that a truly next-gen AI would then pick and choose from.


I was skeptical that you could even type an intelligible phonetic "She saw me" with only two phonemes let alone give it the rise and fall demonstrated.

I've played with the SP0256 speech synthesis IC and found constructing intelligible words challenging even with all the phonemes available on that silicon.

This extended video has me thinking it probably was legit though:

https://youtu.be/TsdOej_nC1M


Wolfgang von Kempelen (creator of the fake chess automaton known as the Turk) made a similar thing in the 18th century. [0] It had multiple reeds tuned to the same frequency - conceptually similar to the Voder. It might not be a coincidence that Bell Labs developed this, given that Bell himself had also made attempts to improve the design, which is how he ended up inventing the telephone.

[0] https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_spea...


The Voder was part of a much larger Bell Labs project, one that eventually developed into one of the first unbreakable encrypted telephony systems used in World War II.

https://99percentinvisible.org/episode/vox-ex-machina/


Re: the author of that -

Hey I know the person who made this!

Thanks for sharing, it really was a labor of love. I remember Griffin being super excited about how it turned out. They are really passionate about the world's fair!


The intonation is very good in a way that modern speech synthesizers don’t get quite right.


What do you mean? Text-to-speech systems from the past few years are indistinguishable from an actual human voice.


I agree that they got pretty good, but there’s still something that they get wrong: their intonation is a kind of passable average. If you want to be able to distinguish them from actual human speech, pay close attention to intonation/inflection. They’re still very usable, I’m not claiming otherwise.


I can't hear much difference in the Studio voice:

https://cloud.google.com/text-to-speech/docs/wavenet

I'm fairly sure I couldn't tell Studio voices and real people apart in a blind test.


It would be a good test. I don't think they are yet indistinguishable, for what it's worth.


I don’t think they’ll ever get indistinguishable, because humans have variability. Too much consistency and it starts having an artificial smell. Look at ChatGPT for example: you could read a perfectly written answer and yet kind of sense it was written by ChatGPT.


Another vocal-tract-model synth that showed up on HN a while ago: https://news.ycombinator.com/item?id=18912628


I have wanted to try one of these – the playable soft-synth is great


Something somebody told me was that it seems really amazing, but without the host prompting the listener with the phrase "she saw me", most of the time you wouldn't know what it was saying.

I heard a sample of "Say, good afternoon radio audience", after which the Voder produces something very similar, but listen to it without the prompt and you would have to guess what it meant.

A Derren Brown kind of trick :-)


I first heard the Voder as the first sample on the Klatt Record [1]. Unfortunately, there it's credited solely to Homer Dudley; neither Bell Telephone Laboratories nor women like Helen Harper who operated the machine were mentioned.

[1]: http://www.festvox.org/history/klatt.html


Odd indeed not to mention the artist playing the sample. Also odd that OP article about the instrument does not mention the inventor.


I've been interested in how these were actually played. If anyone has access to the material used to train the operators, I'd love to hear about it.

BTW, there was one fellow who built one, something I'd like to try someday. See his recreation here:

https://www.youtube.com/watch?v=gv9m0Z7mhXY


The recently released Soma Terra synthesizer contains a key-per-formant synthesis mode which operates like the Voder: https://somasynths.com/terra/ (Ctrl+F "Voder" in the manual)


This is a cool page. I like the interactive synthesizer, but the unvoiced noise is too sharp. It sounds like white noise rather than pink noise or similar, which would be more accurate to how humans sound.
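
As an illustration, in a Web Audio setup like the one quoted further down the thread, the white-noise buffer could be swapped for a pink-ish one using Paul Kellet's well-known approximation (makePinkNoiseBuffer is just an illustrative name):

```js
// Fills an AudioBuffer with approximately pink noise (~ -3 dB/octave),
// using Paul Kellet's filtered-white-noise approximation.
function makePinkNoiseBuffer(ctx, length) {
  const buffer = ctx.createBuffer(1, length, ctx.sampleRate);
  const data = buffer.getChannelData(0);
  let b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0, b5 = 0, b6 = 0;
  for (let i = 0; i < length; ++i) {
    const white = Math.random() * 2 - 1;
    b0 = 0.99886 * b0 + white * 0.0555179;
    b1 = 0.99332 * b1 + white * 0.0750759;
    b2 = 0.96900 * b2 + white * 0.1538520;
    b3 = 0.86650 * b3 + white * 0.3104856;
    b4 = 0.55000 * b4 + white * 0.5329522;
    b5 = -0.7616 * b5 - white * 0.0168980;
    data[i] = (b0 + b1 + b2 + b3 + b4 + b5 + b6 + white * 0.5362) * 0.11;  // scale to roughly ±1
    b6 = white * 0.115926;
  }
  return buffer;
}
```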


What is the best available open source option today for TTS?


Interesting! Thanks for sharing.

I have added this to my feature list for https://glicol.org

The source code looks fairly straightforward. Very cool

```js
function makeFormantNode(ctx, f1, f2) {
  const sinOsc = ctx.createOscillator();
  sinOsc.type = 'sawtooth';
  sinOsc.frequency.value = 110;
  sinOsc.start();

  const bandPass = ctx.createBiquadFilter();
  bandPass.type = 'bandpass';
  bandPass.frequency.value = (f1 + f2) / 2;
  bandPass.Q.value = ((f1 + f2) / 2) / (f2 - f1);

  const gainNode = ctx.createGain();
  gainNode.gain.value = 0.0;

  sinOsc.connect(bandPass);
  bandPass.connect(gainNode);
  gainNode.connect(ctx.destination);

  return {
    start() {
      gainNode.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
    },
    stop() {
      gainNode.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
    },
    panic() {
      gainNode.gain.cancelScheduledValues(ctx.currentTime);
      gainNode.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
    },
  };
}

function makeSibilanceNode(ctx) {
  const buffer = ctx.createBuffer(1, NOISE_BUFFER_SIZE, ctx.sampleRate);
  const data = buffer.getChannelData(0);
  for (let i = 0; i < NOISE_BUFFER_SIZE; ++i) {
    data[i] = Math.random();
  }

  const noise = ctx.createBufferSource();
  noise.buffer = buffer;
  noise.loop = true;

  const noiseFilter = ctx.createBiquadFilter();
  noiseFilter.type = 'bandpass';
  noiseFilter.frequency.value = 5000;
  noiseFilter.Q.value = 0.5;

  const noiseGain = ctx.createGain();
  noiseGain.gain.value = 0.0;

  noise.connect(noiseFilter);
  noiseFilter.connect(noiseGain);
  noiseGain.connect(ctx.destination);
  noise.start();

  return {
    start() {
      noiseGain.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
    },
    stop() {
      noiseGain.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
    },
    panic() {
      noiseGain.gain.cancelScheduledValues(ctx.currentTime);
      noiseGain.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
    },
  };
}

function initialize() {
  audioCtx = new (window.AudioContext || window.webkitAudioContext)();
  audioNodes['a'] = makeFormantNode(audioCtx, 0, 225);
  audioNodes['s'] = makeFormantNode(audioCtx, 225, 450);
  audioNodes['d'] = makeFormantNode(audioCtx, 450, 700);
  audioNodes['f'] = makeFormantNode(audioCtx, 700, 1000);
  audioNodes['v'] = makeFormantNode(audioCtx, 1000, 1400);
  audioNodes['b'] = makeFormantNode(audioCtx, 1400, 2000);
  audioNodes['h'] = makeFormantNode(audioCtx, 2000, 2700);
  audioNodes['j'] = makeFormantNode(audioCtx, 2700, 3800);
  audioNodes['k'] = makeFormantNode(audioCtx, 3800, 5400);
  audioNodes['l'] = makeFormantNode(audioCtx, 5400, 7500);
  audioNodes[' '] = makeSibilanceNode(audioCtx);
}
```


Just checked out GLICOL. It's quite cool! Is there MIDI support or any plans to add it?


only a woman could operate the machine, yet it was built to create a man's voice



