Aside from the seriously impressive WaveNet based results, I think the article doesn't do the codec itself enough justice.
I mean, low-bitrate speech codecs have been around for some time (hey, vocoders are the oldest kind of audio codecs in history!), and I grew skeptical when they started to compare with mp3 and opus.
But looking at this page Codec2 really holds its own when compared to AMBE and especially MELP, two of the most prominent ultra low-bandwidth speech codecs used today:
https://www.rowetel.com/?p=5520
The article failed to mention the original reason why Codec2 was invented.
In digital amateur radio communication, the most widely used codec today is AMBE. But AMBE is a proprietary codec, covered by patents, unhackable - the antithesis of amateur radio. Codec2 was born to bring freedom to digital amateur radio communication, and it's technically even better than AMBE.
Codec2 is also fully open source and patent-free, in contrast to virtually every other ultra-low-bitrate voice codec (which are proprietary and have expensive patent licensing attached). Its author, David Rowe, has a Patreon if you want to support the ongoing development of Codec2 and his SDR modems to enable its use in amateur radio: https://www.patreon.com/drowe67
Codec2 might be patent-free, but Codec2 with a WaveNet decoder isn't because WaveNet (convolutional neural networks for generating audio sequence data) is patented: https://patents.justia.com/patent/20180075343
When was it patented? When I was working with AI about 15 years ago, I was experimenting with convolutional NNs to generate audio. I wouldn't have expected this to be patentable, as it's such a friggin obvious thing to do. It is like patenting 2+2=4 once you discover numbers.
I am not a scientist, I was just very interested in that space, and it would have been a long way from my experiments to a scientific paper. Since patent law was created for the privileged to reap profits, I wouldn't stand a chance contesting that.
Question for IP experts: now that I have heard of the existence of WaveNet and a rough idea of how it works (training a neural network to decode low-bitrate speech data with as much fidelity as possible to the original), would I be prohibited from selling a similar product built with the same technique? How about if I had never heard of WaveNet and went about doing the same thing?
Yes, independent implementations of patented works are covered by the patent.
BUT: patents are far more specific than just "a neural network to decode low-bitrate speech data with as much fidelity as possible to the original". Starting with that goal, you are unlikely to recreate WaveNet's specific structure that is patented.
In fact, WaveNet describes a more general method to efficiently work with sound signals, somewhat comparable to convolutions for images. It's also not impossible to work with sound using alternative NN structures that are not patented, and they might actually perform better than WaveNet.
Speex and Opus bottom out around 6000-8000 bps. Codec2 starts at 3200 bps and goes down to 700 bps. The original target use for Codec2 is real-time transmission in the HF (shortwave) and VHF/UHF amateur radio bands where those are about as much as you can transmit within the same bandwidth as analog voice modes once you factor in error correction.
Having grown accustomed to MP3 artifacts, it's strange to hear artifacts that are natural, but just aren't quite right. More specifically, in the male voice sample "sold about seventy-seven", I received it as "sold about sethenty-seven".
Yes, and "certificates" sounds like "certiticates".
Reminds me of a story about a copying machine that had an image compression algorithm for scans which changed some numbers on the scanned page to make the compressed image smaller. (Can't remember where I read about that, must have been a couple of years ago on HN.)
And yes, I think this is a relevant comparison. As the entropy model becomes more sophisticated, errors are more likely to be plausible texts with different meaning, and less likely to be degraded in ways that human processing can intuitively detect and compensate for.
My understanding of this fault was that it was a bug in their implementation of JBIG2, not the actual compression? Linked article seems to support this.
I think it was just overly aggressive settings of compression parameters. I don't see any evidence that the jbig2 compressor was implemented incorrectly. Source: [1]
Right. JBIG2 supports lossless compression. I'm not very familiar with the bug, but it could have been a setting somewhere in the scanner/copier that switched it to lossy compression instead. Or lossy compression was on by default, or it was misconfigured in some other way (probably a bad idea for text documents).
No. The bug was in the "Scan to PDF" function. It happened on all quality settings. Copying (scanning+printing in one step, no PDF) was not affected.
I remember differently, but I don't want to pull up the source right now.
I did check some of the sources, but was not able to find the one I remember which had statistics on it.
The Xerox FAQ on it does lead me to consider that I might be confusing this with some other incident, though, as they claim that scanning is the only thing that is affected.
This is a big rabbit hole of issues I'd never even considered before. Should we be striving to hide our mistakes by making our best guess, or make a guess, that if wrong, is easy to detect?
The algorithm detected similar patterns and replaced them with references. This led to characters being changed into similar-looking characters that also appeared on the page.
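(Roughly how that class of compressor works - a toy sketch, not actual JBIG2: glyph bitmaps are matched against a dictionary of symbols already seen on the page, and a too-loose similarity threshold silently substitutes one character for another.)

    import numpy as np

    # Toy symbol-dictionary compressor. Each glyph bitmap is replaced by a
    # reference to the closest dictionary entry; if the threshold is too loose,
    # a '6' can come back out as an '8'. Assumes all glyphs are the same tile size.
    def compress_glyphs(glyphs, threshold=0.05):
        dictionary, refs = [], []
        for g in glyphs:                      # g: 2-D boolean bitmap
            best, best_err = None, 1.0
            for i, d in enumerate(dictionary):
                err = np.mean(g != d)         # fraction of differing pixels
                if err < best_err:
                    best, best_err = i, err
            if best is not None and best_err <= threshold:
                refs.append(best)             # reuse an existing symbol (lossy!)
            else:
                dictionary.append(g)
                refs.append(len(dictionary) - 1)
        return dictionary, refs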
If we're abandoning accurate reproduction of sound and just making up anything that sounds plausible, there's already a far more efficient codec: plain text.
Assuming 150 wpm and an average of 2 bytes per word (with lossless compression), we get about 40 bps (5 bytes per second), which makes 2400 bps look much less impressive. Add some markup for prosody and it will still be much lower.
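(Back-of-the-envelope, using exactly those assumptions:)

    words_per_minute = 150
    bytes_per_word = 2                  # assumed after lossless text compression
    bps = words_per_minute * bytes_per_word * 8 / 60
    print(bps)                          # 40.0 bits/s
    print(2400 / bps)                   # ~60x smaller than Codec2 at 2400 bps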
This codec also has the great advantage that you can turn off the speech synthesis and just read it, which is much more convenient than listening to a linear sound file.
If you have such a codec, it would be worth testing the word error rate on a long sample of audio. e.g. take a few hours of call centre recordings, pass them through each of {your codec, codec2}, and then have a human transcribe each of:
- the original recording
- the audio output from your proposed codec (which presumably does STT followed by TTS)
- the audio output from CODEC2 at 2048
Based on the current state of open source single-language STT models, I would imagine that CODEC2 would be much closer to the original. And if the input audio contains two or more languages, I cannot imagine the output of your codec will be useful at all.
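(If anyone tries this, a rough way to score it is word error rate on the human transcripts - a minimal sketch of word-level edit distance; proper toolkits exist, but this is the whole idea.)

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[-1][-1] / max(1, len(ref))

    # e.g. word_error_rate(transcript_of_original, transcript_of_codec2_output)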
Speech to text is certainly getting better but it makes mistakes. If the transcribed text was sent over the link and then a text to speech spoke at the other end you'd lose one of the great things about codec2 - the voice that comes out is recognisable as it sounds a bit like the person.
A few of us have a contact on Sunday mornings here in Eastern Australia and it's amazing how the ear gets used to the sound and it quickly becomes quite listenable and easy to understand.
Yeah, the main use case for codec2 right now is over ham radio. David Rowe, along with a few others, also developed a couple of modems and a GUI program[1]. On Sunday mornings, around 10AM, they do a broadcast of something from the WIA and answer callbacks.
What you might be able to do is use the text codec as the first pass, then augment it with Codec2 or so to capture the extra information (inflections, accent, etc.), for something in between the text-only bitrate and 700 bps.
One of the very few things I know about audio codecs is that they at least implicitly embody a "psychoacoustic model". The "psycho" is crucial because the human mind is the standard that tells us what we can afford to throw away.
So a codec that aggressively throws away data but still gets good results must somehow embody sophisticated facts about what human minds really care about. Hence "artifacts that are natural".
In the normal codec2 decoding it sounds like "seventy" but muffled and crunchy.
In the wavenet decoding, the voice sounds clearly higher quality and crisp, but the word sounds more like "suthenty". And not because the audio quality makes it ambiguous but it sounds like it's very deliberately pronouncing "suthenty".
It's as if in trying to enhance and crisp up the sound, it corrected in the wrong direction. It sounds like the compressed data that would otherwise code for a muffled and indistinct "seventy" was interpreted by wavenet but "misheard" in a sense. When wavenet reconstructs the speech, it confidently outputs a much clearer/crisper voice, except it locks onto the wrong speech sounds.
With the standard "muffled/crunchy" decoding, a listener can sort of "hear" this uncertainty. The speech sound is "clearly" indistinct, and we're prompted to do our own correction (in our heads), but also knowing it might be wrong. When the machine learning net does this correction for us, we don't get the additional information of how its guess is uncertain.
This is exactly the sort of artifact I'd expect with this kind of system. As soon as I heard the ridiculously good and crisp audio quality of the wavenet decoder, I knew that fidelity just isn't contained in the encoded bits; that's impossible. It's a great accomplishment and genuinely impressive, but it has to "make up" some of those details, in a sense very similar to image super-resolution algorithms.
I'm just thinking we should perhaps be careful to not get into a situation like the children's "telephone" game, if for some reason the speech gets re/de/re/encoded more than once. Which is of course bad practice, but even if it happens by accident, the wavenet will decode into confident and crisp audio, so it may be hard to notice if you don't expect it.
If audio is encoded and decoded a few times, it's possible that the wavenet will in fact amplify misheard speech sounds into radically different speech sounds, syllables or even words, changing the meaning. Kind of like the "deep dreaming" networks. Sounds like a particularly bad idea for encoding audio books, because small flourishes in wording really can matter.
Edit: I just realised that repeated re/de/re-encoding can in fact happen quite easily if this codec is ever implemented and used in real world phone networks. Many networks use different codecs and re-encoding just has to be done if something is to pass through a particular network.
But the whole thing is ridiculously cool regardless :) And I wonder if they can improve on this problem.
That is very impressive! I wonder if a WaveNet decoder could be built for phone calls, as those still sound awful. If it's possible to do this only on the decoder side you don't have to wait for your network to start supporting HD voice or VoLTE to get better quality audio!
The original WaveNet repeated a lot of computations; with caching/dynamic programming, it became a lot faster. Other optimizations were also doable. In any case, that was eventually made moot by using model distillation to train a wide flat (not deep) NN, which is 20x realtime: https://deepmind.com/blog/high-fidelity-speech-synthesis-wav... (This was necessary to make it cost-effective to deploy onto Google Assistant.)
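(For the curious, a toy numpy sketch of that caching idea - kernel-size-2 dilated causal convolutions where each layer keeps a small ring buffer of its past inputs, so generating one sample costs two matmuls per layer instead of recomputing the whole receptive field. Weights are random placeholders, nothing from the actual WaveNet model.)

    import numpy as np

    class FastDilatedStack:
        def __init__(self, dilations, channels):
            rng = np.random.default_rng(0)
            self.dilations = dilations
            self.kernels = [rng.standard_normal((2, channels, channels)) * 0.1
                            for _ in dilations]
            self.queues = [np.zeros((d, channels)) for d in dilations]  # per-layer caches
            self.heads = [0] * len(dilations)

        def step(self, x):
            """Advance one timestep; x is the current input vector of size `channels`."""
            h = x
            for i, d in enumerate(self.dilations):
                past = self.queues[i][self.heads[i]].copy()  # layer input from d steps ago
                self.queues[i][self.heads[i]] = h            # cache the current input
                self.heads[i] = (self.heads[i] + 1) % d
                h = np.tanh(past @ self.kernels[i][0] + h @ self.kernels[i][1])
            return h

    # e.g. stack = FastDilatedStack([1, 2, 4, 8], 32); y = stack.step(np.zeros(32))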
Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent. It makes a huge difference. Unfortunately the place where you really want good quality is call centres - it's often hard to hear people and half of the reason is the shitty POTS quality - and call centres will probably get HDVoice in about 40-50 years. Maybe.
Edit: nm should have read all of your comment before replying!
>Actually if you're lucky and make a phone call with HDVoice, or whatever they're calling it, the quality is excellent
Can confirm. I spend a lot of time in fringe reception areas, but every now and then I get a good, strong signal and the HD Voice kicks in between my iPhone and my wife's and it sounds like she's standing right next to me. It really is something to experience, especially if the previous phone call was over regular tech.
Back when AT&T was running the "You get what you pay for" ads to combat SPRINT and MCI, it had a service you could sign up for that would give your landline phone calls amazing quality.
Sadly, a majority of people would rather pay less for crap than more for quality; even back then.
Also why no one really appreciated ISDN over here in central Europe. Yes, there are ways to do better _now_, and it would have been trivial to support channels with better codecs by negotiating something different than u-law 8kHz PCM, but back then that resulted in rather good quality. The issue was that few people got ISDN phones, which resulted in them using analog outputs on an adapter device, which later got incorporated into the internet router, which at one point switched from ISDN to VoIP. And people plug a phone via an analog jack into the router, instead of using a VoIP-capable phone or even anything digital. While many do use DECT cordless phones, those rarely use the DECT hardware inside the router, and instead use the one in the charging dock, which itself connects via an analog, POTS-bandpass-filtered phone jack to the VoIP router.
Oh well, we will probably never get that kind of quality, which is only possible with QoS on the whole path, if there is any congestion. That is the one thing something like rocket.chat and discord can't provide.
Edit: the way to do this is to force quality upon people, wherever you won't drive them away with the cost this incurs. That way people will associate your brand as a whole with the quality, i.e., in that case, people will associate AT&T with quality, not AT&T premium.
Normal people do not even know what kind of plan they are on, except for about one hour before and after they sign the contract.
I am speaking as a German. Sure, larger businesses used proper ISDN, but your uncle/mom didn't. The best you could hope for there was only DECT compression, aka ADPCM 4bit/8kHz.
Sure they did. Adoption rate was about 30%, possibly even higher at its peak.
The DECT codec is useless if the call is transmitted through analog lines, as it gets converted to standard 3.4 kHz quality anyway. Except for in-house calls, of course.
The problem with land line voice quality is that people expect a land line phone / VoIP adapter to cost like 20 USD or Euro. At this price point you can't have fancy codecs and audio hardware that delivers a decent signal. Bad audio hardware with a good codec can actually decrease audio quality (the mic's noise no longer gets filtered as it is with 8kHz PCM).
I think it was AT&T that had a test number you could call to hear a higher quality phone call (I was pretty young at the time, so my memory is fuzzy). I remember it sounding very good, but that test number was the only time I remember hearing that quality over the POTS. The VoLTE and HD voice I occasionally get on my iPhone reminds me of that system.
I don't know what technology it is specifically, but it's a brand name they used for actual high-quality calls. Think 128 kbps MP3, rather than the standard cups-and-string quality.
He probably means Adaptive Multi-Rate Wideband (AMR-WB) [1], AKA G.722.2. There is a common misconception that it is VoLTE only, but actually it works pretty well on 3G too. It is night and day compared to legacy codecs.
Let's say we have Codec2 with WaveNet, and its 3.2 kbps now sounds similar to maybe 16 kbps EVS. (EVS being the codec used in VoLTE, which is slightly better than even Opus for speech.)
What "value" / "uses" does this bring us?
It can't be used for podcasts because, as shown, it isn't very good with music. And many podcasts have music in them.
While Codec2 with WaveNet can give a 2-4x reduction in bitrate, I can't think of an application that benefits from this immediately.
The other thing I keep having in my mind is convolutional neural networks applied to codecs in general: music, movies, etc. What sort of benefits would that bring us?
Maybe not too much for "us" with LTE and 128GB storage on our phones, but in cases of low bandwidth (think digital police radio), or when you have low storage availability, that's really awesome.
I've become almost entranced with the concept of comparing things to the size of a Floppy Disk. I'm actually planning to get a tattoo of one on my right forearm. I've been working on a large business management platform for the last couple of years and noticed that after investing $500k (salaries/etc) and building a huge amount of functionality, the frontend and backend codebases are still under 1.5mb. Pretty amazing.
When we had a 486 running Windows 95, I used to convert CDs to WAV for fun. The GSM 6.10 codec in Sound Recorder (22050 Hz, Mono, 4 KB/s) could fit about 1 song onto a floppy.
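(Checks out, roughly, assuming a 1.44 MB floppy:)

    floppy_kb = 1440              # assumed 1.44 MB floppy
    gsm_kb_per_second = 4         # Sound Recorder's GSM 6.10 at 22050 Hz mono
    print(floppy_kb / gsm_kb_per_second / 60)   # 6.0 minutes -> about one song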
Would be a fun experiment to use something like 3 or even 1 sine to get unintelligible speech, but then pair it with subtitles where each syllable of the text is animated synchronized with the speech. (Like the "follow the bouncing ball" song lyric animations.)
By pairing the audio with the text, you would almost certainly convince the listener that they can understand it.
> Sine-wave speech is an intelligible synthetic acoustic signal composed of three or four time-varying sinusoids. Together, these few sinusoids replicate the estimated frequency and amplitude pattern of the resonance peaks of a natural utterance (Remez et al., 1981). The intelligibility of sine-wave speech, stripped of the acoustic constituents of natural speech, cannot depend on simple recognition of familiar momentary acoustic correlates of phonemes. In consequence, proof of the intelligibility of such signals refutes many descriptions of speech perception that feature canonical acoustic cues to phonemes. The perception of the linguistic properties of sine-wave speech is said to depend instead on sensitivity to acoustic modulation independent of the elements composing the signal and their specific auditory effects.
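(Fun to play with: a minimal numpy sketch of that resynthesis, assuming you already have per-frame frequency/amplitude tracks for the three or four sinusoids - extracting those tracks from real speech is the hard part and is hand-waved here.)

    import numpy as np

    def sinewave_speech(freq_tracks, amp_tracks, frame_rate=100, sr=16000):
        """freq_tracks, amp_tracks: arrays of shape (n_frames, n_sines)."""
        hop = sr // frame_rate
        out = np.zeros(len(freq_tracks) * hop)
        for k in range(freq_tracks.shape[1]):
            # hold each frame's value for `hop` samples, then accumulate phase
            f = np.repeat(freq_tracks[:, k], hop)
            a = np.repeat(amp_tracks[:, k], hop)
            phase = 2 * np.pi * np.cumsum(f) / sr
            out += a * np.sin(phase)
        return out / max(1e-9, np.abs(out).max())   # normalize to avoid clipping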
To anyone who listens to this, I recommend rewinding to the segment starting at 1:23 a few times and not letting it reach the spoilers. After a few rounds, my brain adjusted to the distortion and I could make it out perfectly, without ever hearing the original.
Or what if you scrunched the audio down to a bandwidth beyond what was still intelligible, but still captured some semblance of the speaker's voice. Use the original audio to compute subtitles and store them alongside the audio. That's your file.
Then the player uses both as inputs to ai (some hand waving), which now has enough to put the pieces together and produce something intelligible again, in the speaker's voice.
Without the "intelligible" part, this makes me think of what the game Celeste does to give its characters voices without voice acting.
They make voice-like synth sounds, different for each character, that are about the length of the text they're saying. It adds prosody and intonation to the text-based dialogue of the game.
Basically turn the speaker’s voice into a “font”, and then render text with it.
Pretty sure it’s been done. Large initial delay while you get the whole font download, then basically just the text to be rendered and the occasional hint to the renderer
As far as I know, enough podcast apps require MP3 (and not even VBR!) that you have to use MP3, and you can't have multiple <enclosure>s, so how would you do this? A separate RSS feed for Opus, linked only on the website and not submitted to aggregators?
I highly doubt there are any devices that are capable of accessing the modern web, with all its JavaScript bloat, yet cannot decode a simple audio codec. Even when Apple was installing AAC hardware decoders, they were already almost obsolete by modern embedded CPU development (especially the rise of medium-power ARM SoCs). I highly doubt any devices released in the past 5 years have any sort of fixed-function audio decoder. Maybe an encoder, possibly some general-purpose DSPs, but not a format-specific decoder.
Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.
The N-Gage QD removed the MP3 decoder that was present in the original model. And you could install a software player, and it would struggle with bitrates above 128kbps :D
Modern phones can decode video in software (sucks for battery life, and framerate/resolution are more limited than with hardware, but it's possible). Audio is nothing for them.
> Yeah, the last time hardware audio decoders were relevant was like... back in the Nokia N-Gage days.
I guess it's only irrelevant if you feel overwhelmed by how long your phone can go on a charge. Plus, low-power/low-CPU requirements are an order of magnitude more critical in devices like smartwatches.
Opus is awesome for audiobooks at 24kbps (probably one could go even lower than this) and music at 96kbps. I don't hear any difference in quality. It makes a big difference for my mobile, which is limited to 128GB.
Man that's a big collection! Though to be fair the iPod 160GB came out like ten years ago, and I thought we'd have advanced a bit more in that department by now. (Like imagine an iPod but instead of the spinny hard drive it's all microSD! There's just no market for it I guess.)
Ignoring other issues, this will have rather poor power usage, which is especially relevant given how many people listen to podcasts on mobile devices.
All of the 2,000+ podcasts hosted on Podigee (a podcast hosting company mainly known in German-speaking countries) are distributed in Opus. But it is, and probably always will be, a rather niche distribution format. AAC had its moment, but MP3 is alive and kicking. Even Apple acknowledges its importance by adding support for MP3 chapter markers in iOS 12.
That moment is 15 years in with no signs of losing steam[1]. AAC effectively replaced MP3 for most online audio use cases, with podcasting as a notable exception[2]. And of course, AAC is the audio format for basically all online video distribution.
[1] Apple kicked off the transition in 2003 with the introduction of AAC-based digital music sales.
[2] Because podcasting is a decentralized medium, and the vast majority of podcasters don't know much (if anything) about media encoding.
Perceptions are probably influenced heavily by your own usage and places for consumption. I'm also in the camp of "AAC is very rare, if at all"...
Considering also that YouTube uses WebM, which very explicitly is only Vorbis or Opus for audio, "basically all online video distribution" must exclude the web's most popular video distribution site...
Every YouTube video has had AAC audio from the very beginning. Same goes for every Vimeo video, every Netflix video, every Hulu video, etc. Streaming audio services like Pandora use AAC too.
That's because AAC is the only format you can count on to work on all devices, and to be hardware decoded on all devices where battery life matters.
There are a lot of playback issues - VBR MP3 and older OS releases of both iOS and Android, never mind the car players and similar all contribute to the problem.
The post-show of this podcast talks about these and other issues in detail - Marco is on both sides of the issue as a podcast producer and podcast app developer: http://atp.fm/episodes/182
The Wavenet stuff sounds great, but I'm curious how big the model is. The audio files may be tiny, but you may need a huge neural network to decode them.
"The man behind it, David Rowe, is an electronic engineer currently living in South Australia. He started the project in September 2009, with the main aim of improving low-cost radio communication for people living in remote areas of the world. With this in mind, he set out to develop a codec that would significantly reduce file sizes and the bandwidth required when streaming."
What do you know, it's sort of like Pied Piper without the magical compression or cloud handwaving.
I've been reading David Rowe's blog [0] since 2008; there are some other really interesting projects and products on it. One of my favorites back then was his home-built electric car.
I noticed that when you first listen to the compressed audio you hear the unnaturalness of the voice and clicks (probably where one frame's ending doesn't match the next frame's start). But in a few seconds you adapt to it, and the voice sounds pretty clear.
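(If those clicks really are frame-boundary discontinuities, the generic fix is to overlap and crossfade adjacent frames - a sketch of the idea, not what Codec2 itself does:)

    import numpy as np

    def stitch_frames(frames, overlap=80):
        """frames: list of 1-D arrays that share `overlap` samples at each boundary."""
        fade_in = np.linspace(0.0, 1.0, overlap)
        fade_out = 1.0 - fade_in
        out = frames[0].astype(float).copy()
        for f in frames[1:]:
            f = f.astype(float)
            out[-overlap:] = out[-overlap:] * fade_out + f[:overlap] * fade_in
            out = np.concatenate([out, f[overlap:]])
        return out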
> However, where it starts to get more interesting is the work done by W. Bastiaan Kleijn from Cornell University Library.
The authors are not from Cornell. I think the author made this mistake because the paper is posted on arXiv, and that's what it says at the top of every page?
This is amazing! With this codec and enough processing power, you could do this bidirectionally and have enough bandwidth to stream a two way realtime voice chat using 2400bps modems over a standard analog phone line!!! ... Oh... Wait a minute...
The plain Codec2 decoder sounds like a TI-99/4A (and works on somewhat similar principles). If I hook a TI-99/4A to the WaveNet decoder, will it sound natural?
Side note: I'm still waiting for an open source, cheap way to do FreeDV/Codec2 on VHF either with a dongle that goes between a raspi/SBC or a laptop and a cheap ass radio like a baofeng, or an inexpensive radio with Codec2 support.
I think 2400B support is coming to the FreeDV GUI soon. I've seen some work done on that. That'll let you use a cheap FM radio and a laptop to get on the air with something codec2 based. I'm slowly chipping away at a TDMA mode for SDRs, but that's still probably a ways off.
Would be interesting to combine this Codec2 with LoRa modulation. Of course the latter is patented, but it combines both chirped and direct sequence spread spectrum to yield some very resilient modulation.
Works on iOS 11 for me but I had to press play and then wait a couple of seconds, press pause and then play again and wait another couple of seconds. Try that.