Opus 1.5 released: Opus gets a machine learning upgrade (opus-codec.org)
387 points by summm 10 months ago | 138 comments



The main limitation for such codecs is CPU/battery life, and I like how they sparsely applied ML here and there, combining it with the classic approach (non-ML algorithms) to get a better tradeoff of CPU vs. quality. E.g. for better low-bitrate support (LACE): "we went for a different approach: start with the tried-and-true postfilter idea and sprinkle just enough DNN magic on top of it." The key was not to feed raw audio samples to the NN: "The audio itself never goes through the DNN. The result is a small and very-low-complexity model (by DNN standards) that can run even on older phones."
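
For intuition, here's a rough toy sketch of that split (my own illustration with assumed names and frame sizes, not the actual LACE code): a tiny network looks only at compact per-frame features and outputs postfilter parameters, while plain DSP is what actually touches the audio samples.

    import numpy as np
    from scipy.signal import lfilter

    FRAME = 160  # assumed 10 ms frames at 16 kHz

    def tiny_dnn(features):
        """Stand-in for a small trained model: maps compact per-frame
        features to a postfilter gain and a one-pole emphasis coefficient."""
        gain = 1.0 + 0.1 * np.tanh(features.mean())   # placeholder heuristic
        pole = 0.3 * np.tanh(features.std())
        return gain, pole

    def enhance(decoded, features_per_frame):
        out = np.zeros_like(decoded)
        for i, feats in enumerate(features_per_frame):
            gain, pole = tiny_dnn(feats)               # the DNN sees features only
            frame = decoded[i * FRAME:(i + 1) * FRAME]
            # Classic DSP applies the parameters to the audio samples.
            out[i * FRAME:(i + 1) * FRAME] = gain * lfilter([1.0], [1.0, -pole], frame)
        return out

    # Example: 1 second of decoded 16 kHz audio with random stand-in features.
    decoded = np.random.randn(16000)
    features = [np.random.randn(20) for _ in range(len(decoded) // FRAME)]
    enhanced = enhance(decoded, features)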

Looks like the right direction for embedded algos, and it seems to be a pretty unexplored one compared to the current fashion of doing ML end-to-end.


It's a really smart application of ML: helping around the edges and not letting the ML algo invent phonemes or even whole words by accident. ML transcription has a similar trade-off of performing better on some benchmarks but also hallucinating results.


A nice story about Xerox discovering this issue in 2013, when their copiers were found to be slightly changing random numbers in copied documents:

https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...


I don't think machine learning was involved there at all. As I understand it, it was an issue of a specifically implemented feature (reusing a single picture of a glyph for all other instances to save space) being turned on in archive-grade settings, despite the manual stating otherwise.


https://youtu.be/zXXmhxbQ-hk Interesting yet funny CCC video about this


fwiw, in the ASR/speech transcription world, it looks like the reverse to me - in the past there was lots of custom non-ML code & separate ML models for audio modeling and language modeling, but current SOTA ASRs are all e2e, and that's what's used even in mobile applications, iiuc.

I still think the pendulum will swing back there again, to get even better battery life and larger models on mobile.


I'm using Opus as one of the main codecs in my peer-to-peer audio streaming library (https://git.iem.at/cm/aoo/ - still alpha), so this is very exciting news!

I'll definitely play around with these new ML features!


> peer-to-peer audio streaming library

Interesting :)


Ha, now that is what I'd call a surprise! The "group" concept has obviously been influenced by oscgroups. And of course I'm using oscpack :)

AOO has already been used successfully in a few art projects. It's also used under the hood by SonoBus. The Pd objects are already stable; I hope to finish the C/C++ API and add some code examples soon.

Any feedback from your side would of course be very appreciated.


I just want to mention that getting such good speech quality at 9kbps by using NoLACE is absolutely insane.


I was the lead dev for a major music streaming startup in 1999. I was working from home as they didn't have offices yet. My cable connection got cut and my only remaining Internet was 9600 bps through my Nokia 9000 serial port. I had to re-encode the whole music catalog as 8000 bps (8 kbps) WMA so I could stream it and continue testing all the production code.

The quality left a little to be desired...!


I wanted to see what it would sound like in comparison to a really early streaming audio codec, RealAudio 1.0:

    $ ffmpeg -i female_ref.wav -acodec real_144 female_ref.ra
And in case you can't play that format, I converted it back to wav and posted it: http://9ol.es/female_ref-ra.wav

This was sold as "14.4" audio, for 14.4 kb/s dialup in the mid-90s. The quality increase over those nearly 30 years, for what is actually fewer bytes, is really impressive.


I used to listen to avant-garde music in Opus from https://dir.xiph.org at 16 kb/s over a 2G connection, and it was usable once mplayer/mpv had cached the stream for nearly a minute.


I don't know any of the details, but maybe the CPUs of the time would have struggled to decode the stream in real time.


I find the interplay between audio codecs, speech synthesis, and speech recognition fascinating. Advancements in one usually result in advancements in the others.


I wonder: did they address common ML ethics questions? Specifically: Are the ML algorithms better/worse on male than on female speech? How about different languages or dialects? Are they specifically tuned for speech at all, or do they also work well for music or birdsong?

That said, the examples are impressive and I can't wait for this level of understandability to become standard in my calls.


Quoting from our paper, training was done using "205 hours of 16-kHz speech from a combination of TTS datasets including more than 900 speakers in 34 languages and dialects". Mostly tested with English, but part of the idea of releasing early (none of that is standardized) is for people to try it out and report any issues.

There are about equal numbers of male and female speakers, though codecs always have slight perceptual quality biases (in either direction) that depend on the pitch. Oh, and everything here is speech only.


This is an important question. However, I'd like to point out that similar biases can easily exist for non-ML, hand-tuned algorithms. Even in the latter case, test sets (and often even "training" and "validation" sets) are used for finding good parameters. Any of these can be a source of bias, as can the ears of the evaluators making these decisions.

It's true that bias questions often come up in the ML context because fundamentally these algorithms do not work without data, but _all_ algorithms are designed by people, and _many_ can involve data in setting their parameters. Both of these can be sources of bias. ML is more known for it, I believe, because the _inductive_ biases are weaker than in traditional algorithms, so the models are more prone to adopting biases present in the dataset.


As a notable example, the MP3 format was hand-tuned to vocals based on "Tom's Diner" (i.e. a female voice). It has been accused of being biased towards female vocals as a result.


Usually regular algorithms aren't generating data that pretends to be raw data. That's the significant difference here.


Can you precisely define what you mean by "generating" and "pretends", in such a way that this neural network does both these things, but a conventional modern audio codec doesn't?

"Pretends" is a problematic choice of words, because it anthropomorphizes the algorithm. It would be more accurate and less misleading to replace "pretends to be" with "approximates". But then it wouldn't serve your goal of (seeming to) establish a categorical difference between this approach and "regular algorithms", because that's what a regular algorithm does too.

I apologize, because the above might sound rude. It's not intended to be.


I was avoiding the word "approximate", because that implies a connection to the original raw data.

A generative model guesses what data should be filled in, based on what is present in its own model. This process is totally ignorant of the original (missing) data.

To contrast, a lossy codec works directly with the original data. It chooses what to throw out based on what the algorithm itself can best reproduce during playback. This is why you should never transcode from one lossy codec to another: the holes will no longer line up with the algorithm's hole-filling expectations.
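
A toy numeric illustration of the "holes no longer line up" point (my own sketch, nothing codec-specific): quantizing twice with mismatched step sizes adds error on top of what either step alone would cause.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, 100_000)       # stand-in for audio samples

    def quantize(signal, step):
        # Uniform quantizer: the "holes" are whatever falls between the steps.
        return np.round(signal / step) * step

    once = quantize(x, 0.10)                  # encode once
    twice = quantize(once, 0.07)              # transcode with a mismatched step

    print("mean error after one pass:   ", np.abs(x - once).mean())
    print("mean error after transcoding:", np.abs(x - twice).mean())
    # The second pass snaps to a grid that no longer matches the first one,
    # so the error grows instead of staying put.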


Not really. Any lossy codec is generating data that pretends to be close to the raw data.


Yes, but the holes were intentionally constructed such that the end result is predictable.

There is a difference between pretending to be the original raw data, and pretending to be whatever data will most likely fit.


And that's why Packet Loss Concealment is only used to fill in the occasional lost packet. That way, the occasional vowel or the end of a syllable can be bridged over. Other improvements, which are much less in make-samples-up territory, exist to prevent packet loss in the first place.


Yes, I agree that this use case is entirely reasonable. I also agree that it's reasonable to be concerned in the first place.


Why is the ethics question important? It is a new feature for an audio codec, not a new material to teach in your kids curriculum.


Because this gets deployed in the real world, affecting real people. Ethics don't exist only in kids' curricula.


I get your point, but the questioner wasn't being rude or angry, only curious. I think it's a valid question, too. While it isn't as important to be neutral in this instance as, say, a crime prediction model or a hiring model, it should be boilerplate to consider ML inputs for identity neutrality.


This is a great question! Here's a related failure case that I think illustrates the issue.

In my country, public restroom facilities replaced all the buttons and levers on faucets, towel dispensers, etc. with sensors that detect your hand under the faucet. Black people tell me they aren't able to easily use these restrooms. I was surprised when I heard this, but if you google this, it's apparently a thing.

Why does this happen? After all, the companies that made these products aren't obviously biased against black people (outwardly, anyway). So this sort of mistake must be easy to fall into, even for smart teams in good companies.

The answer ultimately boils down to ignorance. When we make hand detector sensors for faucets, we typically calibrate them with white people in mind. Of course different skin tones have different albedo and different reflectance properties, so sensors are less likely to fire. Some black folks have a workaround where they hold a (white) napkin in their hand to get the faucet to work.

How do we prevent this particular case from happening in the products we build? One approach is to ensure that the development teams for skin sensors have a wide variety of skin types. If the product development team had a black guy for example, he could say "hey, this doesn't work with my skin, we need to tune the threshold." Another approach is to ensure that different skin types are reflected in the data used to fit the skin statistical models we use. Today's push for "ethics in ML" is borne out of this second path as a direct desire to avoid these sorts of problems.

I like this handwashing example because it's immediately apparent to everyone. You don't have to "prioritize DEI programs" to understand the importance of making sure your skin detector works for all skin types. But, teams that already prioritize accessibility, user diversity, etc. are less likely to fall into these traps when conducting their ordinary business.

For this audio codec, I could imagine that voices outside the "standard English dialect" (e.g. thick accents, different voices) might take more bytes to encode the same signal. That would raise bandwidth requirements, worsen latency, and increase data costs for these users. If the codec is designed for a standard American audience, that's less of an issue, but codecs work best when they fit reasonably well for all kinds of human physiology.


What if it is a Pareto improvement: a bigger improvement for some dialects, but no worse than the earlier version for anyone? Should it be shelved, or tuned down so that every dialect sees gains by exactly the same percentage?


Here's a question that should have the same/similar answer: increasingly, job interviews are being handled over the internet. All other things being equal, people are likely to have a more positive response to candidates with a more pleasant voice. So if new ML-enhanced codecs become more common, we may find that some group X gets a just slightly worse quality score than others. Over enough samples, that would translate to a lower interview success rate for them.

Do you think we should keep using that codec, because overall we get a better sound quality across all groups? Do you feel the same as a member of group X?


I don't think it's a given that we shouldn't keep using that codec. For example, maybe the improvement is due to an open source hacker working in their spare time to make the world a better place. Do we tell them their contribution isn't welcome until it meets the community's benchmark for equity?

Your same argument can also be used to degrade the performance for all other groups, so that group X isn't unfairly disadvantaged. Or, it can even be used to argue that the performance for other groups should be degraded to be even worse than group X, to compensate for other factors that disadvantage group X.

This is argumentum ad absurdum, but it goes to show that the issue isn't as black and white as you seem to think it is.


A person creating a codec doesn't choose whether it's globally adopted. System implementors (like, for example, Slack) do. You don't have to tell the open source dev anything. You don't owe it to them to include their implementation.

And if their contribution was to the final system, sure, it's the owner's choice what the threshold for acceptable contribution is. In the same way they can set any other benchmark.

> Your same argument can also be used to degrade the performance for all other groups,

The context here was Pareto improvement. You're bringing up a different situation.


The grandparent provided an argument why we might not want to use an algorithm, even if it provided a Pareto improvement.

I suggested that the same argument could be used to say that we should actively degrade performance of the algorithm, in the name of equity. This is absurd, and illustrates that the GP argument is maybe not as strong as it appears.


The argument doesn't make sense in practice. We could discuss it as a philosophy exercise, but realistically, if the current result is better overall but biased against some group, you can just rebalance it and still get an overall better result compared to the status quo.

Changing codecs in practice takes years/decades, so you always have time to stop, think and tweak things.


One thing the small mom-and-pop hacker types can do is disclose where bias can enter the system or evaluate it on standard benchmarks so folks can get an idea where it works and where it fails. That was the intent behind the top-level comment asking about bias, I think.

If improving the codec is a matter of training on dataset A vs dataset B, that’s an easier change.


I would be very surprised if there were no improvement when the codec is biased towards particular dialects or other distinctive subsets of the data. And we could certainly be fine with some kinds of bias. Speech codecs are intended to transmit human speech, after all, not that of dogs, bats, or hypothetical extraterrestrials. On the other hand, a wider dataset might reduce overfitting and force the model to learn better.

If the codec is intended to work best for human voice in general, then it is simply not possible to define sensible subsets of the user base to optimize for. Curating an appropriate training set therefore has a technical impact on the performance of the codec. Realistically, I admit that the percentages of speech samples of different languages in such a dataset would be set according to the relative number of speakers. This is of course a very fuzzy number with many sources of systematic error (like what counts as one language, do non-native speakers count, which level of proficiency is considered relevant, etc.), and ultimately English is a bit more important since it is the de-facto international lingua franca of this era.

In short, a good training set is important unless one opines that certain subsets of humanity will never ever use the codec, which is equivalent to being blind to the reality that more and more parts of the world are getting access to the internet.


I get your point, but in the example used - and I can think of a couple of others that start with "X replaced all controls with touch/voice/ML" - the much bigger ethical question is why they did it in the first place. The new solution may or may not be biased differently than the old one, but it's usually inferior to the previous ones or to simpler alternatives.


Imagine you release a codec which is optimized for cis white male voices, while every other kind of voice has perceptibly lower fidelity (at low bitrates). That would not go well...


Yeah, imagine a low bitrate situation where only English speaking men can still communicate. That would create quite a power imbalance.


Meanwhile G.711 makes all dudes sound like disgruntled middle aged union workers


No offense meant, but Codec2 seems to be a bit affected by this problem.


[flagged]


What people have historically called a "gay lisp" is actually a hyper-articulation of /s/ or /z/, and as you might expect, /s/ and /z/ have a lot of high frequency sounds in them. Weird as it sounds there is a possible scenario where an audio codec does a worse job reproducing the audio content of gay male speech compared to straight male speech.



An ML model might be able to, even if you can’t.


Achieving the gaydar, do not give this technology to Saudi Arabia.


[flagged]


Questions of "which users can use my product and which can't" certainly encompass political views to be sure, but they also extend further beyond. I don't see why you feel the need to argue from the position of "woke devil's advocate" here.

For a politically-neutral example of why poor DEI practice can lead directly to worse products, see my sister comment about faucet sensors and black skin.



This is actually a very technical question, since it means the audio codec might simply not work as well in practice as it could and should.


As a person with a different native language/accent who has to deal with this on a regular basis - having assistants like Siri not understand what I want to say, even though native speakers don't have such a problem... Or, before the advent of UTF-8, websites and apps ignoring special characters used in my language.

I wouldn't consider this a matter of ethics, more one of technology limitations or ignorance.


How about adding a text "subtitle" stream to the mix? The encoder may use ML to perform speech-to-text. The decoder may then use the text, along with the audio surrounding the audio dropouts, to feed a conditional text-to-speech DNN. This way the network does not have to learn the harder problem of blindly interpolating across the dropouts from just the audio. The text stream is low bitrate, so it may have substantial redundancy in order to increase the likelihood that any given (text) message is received.


Actually, what we're doing with DRED isn't that far from what you're suggesting. The difference is that we keep more information about the voice/intonation and we don't need the latency that would otherwise be added by an ASR. In the end, the output is still synthesized from higher-level, efficiently compressed information.


Very cool. Seems like they addressed the problem of hallucination. It would be interesting to see an example of it hallucinating without redundancy and corrected with redundancy.


Isn't packet loss concealment (PLC) a form of hallucination? Not saying it's bad, just that it's still Making Shit Up™ in a statistically-credible way.


Well, there are different ways to make things up. We decided against using a pure generative model to avoid making up phonemes or words. Instead, we predict the expected acoustic features (using a regression loss), which means the model is able to continue a vowel. If unsure, it'll just pick the "middle point", which won't be something recognizable as a new word. That's in line with how traditional PLCs work; it just sounds better. The only generative part is the vocoder that reconstructs the waveform, but it's constrained to match the predicted spectrum so it can't hallucinate either.
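
A toy way to picture the distinction (my own sketch, not the Opus code): a generative model samples one specific plausible continuation, while a regression loss pulls the prediction toward the conditional mean, so an unsure model lands on a bland middle point instead of a crisp-but-wrong sound.

    import numpy as np

    rng = np.random.default_rng(1)

    # Pretend these are equally plausible acoustic-feature continuations the
    # model has learned for the current context (hypothetical values).
    plausible = rng.normal(size=(50, 20))

    def generative_fill():
        # Sampling picks one specific continuation, which may be a crisp but
        # wrong sound (the made-up-phoneme failure mode).
        return plausible[rng.integers(len(plausible))]

    def regression_fill():
        # A regression (e.g. L2) loss converges to the mean of the plausible
        # set: vague rather than wrong when the model is unsure.
        return plausible.mean(axis=0)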


Any demos of this to listen to? It sounds potentially really good.


There is a demo in the link shared by OP.


That's really cool. Congratulations on the release!


The PLC intentionally fades off after around 100 ms so as not to cause misleading hallucinations. It is really just about filling small gaps.


In a broader context, though, this happens all the time. You’d be surprised what people mishear in noisy conditions. (Or if they’re hard of hearing.) The only thing for it is to ask them to repeat back what they heard, when it matters.

It might be an interesting test to compare what people mishear with and without this kind of compensation.


As part of the packet loss challenge, there was an ASR word accuracy evaluation to see how PLC impacted intelligibility. See https://www.microsoft.com/en-us/research/academic-program/au...

The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.


Right, this is why proper radio calls in a lot of systems have mandatory read-back steps, so that we're sure two humans have achieved a shared understanding, regardless of how sure they are of what they heard. It not only matters whether you heard correctly, it also matters whether you understood correctly.

e.g. train driver asks for an "Up Fast" block. His train is sat on Down Fast, the Up Fast is adjacent, so then he can walk on the (now safe) railway track and inspect his train at track level, which is exactly what he, knowing the fault he's investigating, was taught to do.

Signaller hears "Up Fast" but thinks duh, stupid train driver forgot he's on Down Fast. He doesn't need a block, the signalling system knows the train is in the way and won't let the signaller route trains on that section. So the Up Fast line isn't made safe.

If they leave the call here, both think they've achieved understanding but actually there is no shared understanding and that's a safety critical mistake.

If they follow a read-back procedure they discover the mistake. "So I have my Up Fast block?" "You're stopped on Down Fast, you don't need an Up Fast block". "I know that, I need Up Fast. I want to walk along the track!" "Oh! I see now, I am filling out the paperwork for you to take Up Fast". Both humans now understand what's going on correctly.


To borrow from Joscha Bach: if you like the output, it's called creativity. If you don't, it's called a hallucination.


That sounds funny, but is it true? Certainly there's a bias that goes towards what you're quoting, but would you otherwise genuinely call the computer creative? Is that a positive aspect of a speech codec or of an information source?

Creative is when you ask a neural net to create a poem, or something else from "scratch" (meant to be unique). Hallucination is when you didn't ask it to make its answer up but to recite or rephrase things it has directly observed

That's my layman's understanding anyway, let me know if you agree


That's almost the same. You could say it's being creative by not following directions.

Creativity isn't well-defined. If you generate things at random, they are all unique. If you then filter them to remove all the bad output, the result could be just as "creative" as anything someone could write. (In principle. In practice, it's not that easy.)

And that's how evolution works. Many organisms have very "creative" designs. Filtering at scale, over a long enough period of time, is very powerful.

Generative models are sort of like that in that they often use a random number generator as input, and they could generate thousands of possible outputs. So it's not clear why this couldn't be just as creative as anything else, in principle.

The filtering step is often not that good, though. Sometimes it's done manually, and we call that cherry-picking.


Doesn't the context affect things much more than whether you like the particular results?

Either way, "creativity" in the playback of my voice call is just as bad.


I love that, what's it from? (My Google-fu failed.) Unexpected responses are often a joy when using AI in a creative context. https://www.cell.com/trends/neurosciences/abstract/S0166-223...


It was from one of his podcast appearances. Which doesn't narrow it down much, unfortunately. Most likely options:

https://www.youtube.com/watch?v=LgwjcqhkOA4

https://www.youtube.com/watch?v=sIKbp3KcS8A

https://www.youtube.com/watch?v=CcQMYNi9a2w


There's something darkly funny about the idea of Opus acting psychotic because it glitched out or was fed something really complex. But you could argue that transparent lossy compression at 80~320 kbps is a controlled, deliriant-like hallucination, given how only a rare few can tell it apart from lossless.


Does this new Opus version close the gap to xHE-AAC, which is (was?) superior at lower bitrates?


Depends on whether you're encoding speech or music.


Love how Opus 1.5 is now actually transparent at 16 kbps for voice, and 96 kbps still beats 192 kbps MP3. Meanwhile xHE-AAC still feels like it was farted out, since its 96~256 kbps range is legitimately worse than AAC-LC (Apple, FDK) at ~160 kbps.


What if there were a profile or setting that helps re-encode existing lossy formats without introducing too many more artifacts? A sizeable collection runs into this issue if you don't have (easily accessible) lossless masters.

I'd be very interested if I could move a variety of mp3s, aacs and vorbis to Opus if I knew additional quality loss was minimal.


The quality at 80% packet loss is incredible. It's straining to listen to, but still understandable.


That 90% loss demo is bonkers. Completely comprehensible after maybe a second.


Why the hell is Opus still not in Bluetooth? Well, I know - sweet, sweet license fees.

(aKKtually, there IS an Opus codec supported by Pixel phones - Google made it for VR/AR stuff. No one uses it; there's roughly one headphone with Opus support.)


As you already mention, it's already possible to use it. As for why hardware manufacturers don't actually use it, you can thank beautiful initiatives such as this: https://www.opuspool.com/ (previous HN discussion: https://news.ycombinator.com/item?id=33158475).


The BT SIG moves kind of slow and there's a really long tail of devices. Until there's a chip with native Opus support (that's as cheap as ones with AAC etc) you wouldn't get Opus support even if it was in the spec.

Realistically for most headphones people actually buy AAC (LC and HE) is more than good enough encoding quality for the audio the headphones can produce. Even if Opus was in the spec and Opus-supporting chips were common there would still be a hojillion Bluetooth devices in the wild that wouldn't support it.

It would be cool to have Opus in A2DP but it would take a BT SIG member that was really in love with it to get it in the profile.


They chose to make a totally new, inferior LC3 codec though.

Also, on my system (Android phone + BTR5/BTR15 Bluetooth DAC + Sennheiser HD600) all options sound really crappy compared to plain old USB, everything else being the same. LDAC at 990 kbps is less crappy, by sheer brute force. I suspect it's not only the codec but other co-factors as well (like mandatory DSP on the phone side).


"Inferior" is relative. The main focus of LC3 was, as the name suggests, complexity.

This is hearsay: Bluetooth SIG considered Opus but rejected it because it was computationally too expensive. This came out of the hearing aid group, where battery life and complexity are a major restriction.

So when you compare codecs in this space, the metric you want to look at is quality vs. CPU cycles. In that regard LC3 outperforms many contemporary codecs.

Regarding sound quality, it's simply a matter of setting the appropriate bitrate. So if Opus is transparent at 150 kbps and LC3 at 250 kbps, that's totally acceptable if it gives you more battery life.


Regarding complexity, do you have any hard numbers? I can't find anything more than handwaving.


I remember seeing published numbers based on instrumented code, but could not find it.

I did a quick test with the Google implementation (https://github.com/google/liblc3), which is about 2x faster than Opus. To be honest, I expected a bigger difference, though it's just a superficial test.

A few things that also might be of relevance why they picked one over the other:

  - suitability for DSPs
  - vendor buy-in
  - robustness
  - protocol/framing constraints
  - control


Thanks for checking it, appreciated

- Well, 2x is nothing to write home about.

- DSP compatibility was probably considered but never surfaced as a reason, so it's hard to guess the investigation results. Plus the pricing and availability of said DSP modules.

- Robustness - well, that's one of the primary features of Opus, battle tested by WebRTC, WhatsApp etc. (including packet loss concealment (PLC) and Low Bit-Rate Redundancy (LBRR) frames)

- Algorithmic delay for opus is low, much lower than older BT codecs, so that definitely wasn't a deal breaker

- The ability to make money out of a standard is definitely an important thing to have


If used in a small device like a hearing aid, a 2x factor can have a significant impact on battery life.

VoIP in general experiences full packet loss, meaning if a single bit flips, the entire packet is dropped. For radio links like Bluetooth it's possible to deal with some bit flips without throwing the entire packet away.

Until 1.5, Opus PLC was in my opinion its biggest weakness compared to other speech codecs like G.711 or G.722. A high compression ratio causes bit flips to be much more destructive.

As for making money, Bluetooth codecs have no license fees.


> For radio links like Bluetooth it's possible to deal with some bit flips without throwing the entire packet away.

Opus was intentionally designed so that the most important bits are in the front of the packet, which can be better protected by your modulation scheme (or simple FEC on the first few bits). See slide 46 of https://people.xiph.org/~tterribe/pubs/lca2009/celt.pdf#page... for some early results on the position-dependence of quality loss due to bit errors.
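
As a toy picture of that kind of unequal protection (my own sketch, not an actual Bluetooth or Opus mechanism): repeat only the first few bytes of each packet so bit flips there can be out-voted, and leave the less critical tail as-is.

    def protect(packet: bytes, n_front: int = 8) -> bytes:
        # 3x repetition "FEC" on the front bytes only.
        return packet[:n_front] * 3 + packet[n_front:]

    def recover(protected: bytes, n_front: int = 8) -> bytes:
        copies = [protected[i * n_front:(i + 1) * n_front] for i in range(3)]
        # Majority vote per byte position across the three copies.
        front = bytes(max(set(col), key=col.count) for col in zip(*copies))
        return front + protected[3 * n_front:]

    pkt = bytes(range(32))
    assert recover(protect(pkt)) == pkt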

It is obviously never going to be as robust as G.711, but it is not hopeless, either.


You can check out Google's version which I assume is bundled in Android: https://github.com/google/liblc3


I've got AirPods and a Beats headset so they both support AAC and to my ear sound great. Keep in mind I went to a lot of concerts in my 20s without earplugs so my hearing isn't necessarily the greatest anymore.

AFAIK Android's AAC quality isn't that great so aptX and LDAC are the only real high quality options for Android and headphones. It's a shame as a lot of streaming is actually AAC bitstreams and can be passed directly through to headphones with no intermediate lossy re-encode.

Like I said though, to get Opus support in A2DP a BT SIG member would really have to be in love with it. Qualcomm and Sony have put forward aptX and LDAC respectively in order to get licensing money on decoders. Since no one is going to get Opus royalties there's not much incentive for anyone to push for its inclusion in A2DP.


Opus isn’t patent free, and what’s worse it’s not particularly clear who owns what. The biggest patent pool is currently OpusPool but it’s not the only one.

https://www.opuspool.com/


No codec (or any other technical development, really - edit: except for 20+ years old stuff, and only if you don't add any, even "obvious" improvements) is known patent free, or clear on "who owns what."

Folks set up pools all the time, but somehow they never offer indemnification for completeness of the pool - because they can't.

See https://en.kangxin.com/html/2/218/219/220/11565.html for a few examples how the patent pool extortion scheme already went wrong in the past.


FWIW, submarine patents are long dead, so it is possible to feel assured that old enough stuff is patent free. Of course that denies a lot of important improvements, but due to diminishing returns and the ramp of tech development it's still ever more significant. A lot of key stuff is going to lose monopoly lock this decade.


You're right. I could still amend the post, so I added the 20+ years caveat. Thanks!


No one said that Opus is the only one suffering from licensing ambiguity, but comparing it to say AptX and its variants which do have a clear one stop shop for licensing (Qualcomm) it’s a much riskier venture especially when it comes to hardware.


A drive-by patent owner can show up on anything, and if they don't want to license to you, your entire product is bust.

Even if it's AptX and Qualcomm issues you a license in exchange for money. I wouldn't even bet on being able to claw back these license costs after being ordered to destroy your AptX-equipped product after it ran into somebody else's patent.

The risk that this happens is _exactly_ the same for Opus or AptX.


Making a patent troll is just a matter of putting up a press release and a web page.

I could claim to have a long list of patents against AptX. Anyone could.

Of course I'm not willing to disclose the list of patents at this time, but customers looking to be extorted may contact me privately.


FhG and Dolby did eventually put up a list of patents you are licensing from them.

It makes for some funny reading if you're familiar with the field. (This should not be construed as legal advice as to the validity of the pool)


At least in the U.S., anyone can look up all the patents a person/entity owns. So, your fraud wouldn’t get very far.

https://assignment.uspto.gov/patent/index.html#/patent/searc...


"I represent the holders of the patents in question" is simple enough. I wonder if it's fraud, if all you're putting out is an unverifiable claim on the net. The pool operators do that all the time.


Someone who has no basis to bring a patent infringement claim selling a settlement of such a claim to an alleged infringer is clearly fraud.

It’s like someone selling a deed to land they don’t own, or leasing a property they don’t own to a tenant.


The point isn't to sell a settlement. It's to publish "oh, we're totally serious that there are patents in our control. We won't tell you which ones, we don't tell you what we want from you. If you're interested, reach out to us."

Few people will even bother to try, and if so, you keep them at a distance with some random bullshit (communication can break down _sooo_ easily), but it certainly poisons the well by adding a layer of FUD to the tech you're targeting with your claims.

Standard fare of patent pool operators, and it's high time to reciprocate.


> Opus isn’t patent free

The existence of a patent pool does not mean there are valid patent claims against it. But yes, you may be technically correct by saying "patent free" rather than "not infringing on any valid patents". That said historically Opus has had claims against it by patents that looked valid but upon closer investigation didn't actually cover what the codec does.

Just looks like FUD to me. Meanwhile, the patent pools of competing technologies definitely still don't offer indemnification that they cover all patents, but have no problem paying a bunch of people to spew exactly this kind of FUD - they're the ones who tried to set up this "patent pool" to begin with!


This is game changing. When will H265 get a DL upgrade?


>That's why most codecs have packet loss concealment (PLC) that can fill in for missing packets with plausible audio that just extrapolates what was being said and avoids leaving a hole in the audio

...How far can ML PLC "hallucinate" audio? A sound, a syllable, a whole word, half a sentence?

Can I trust anymore what I hear?


What the PLC does is (vaguely) equivalent to momentarily freezing the image rather than showing a blank screen when packets are lost. If you're in the middle of a vowel, it'll continue the vowel (trying to follow the right energy) for about 100 ms before fading out. It's explicitly designed not to make up anything you didn't say -- for obvious reasons.
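
Roughly that hold-then-fade behaviour as a toy sketch (the frame size and fade length here are my assumptions, not the real implementation):

    import numpy as np

    FRAME = 320          # assumed 20 ms frames at 16 kHz
    FADE_FRAMES = 5      # roughly 100 ms of concealment before going silent

    def conceal(last_good_frame, n_lost):
        """Fill the n-th consecutive lost frame by repeating the last good
        one with a gain that fades to zero over about 100 ms."""
        if n_lost >= FADE_FRAMES:
            return np.zeros(FRAME)                 # fully faded out
        gain = 1.0 - n_lost / FADE_FRAMES          # simple linear fade
        return gain * last_good_frame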


Reassuring - thanks for clearing that up.


You never can when lossy compression is involved. It is commonly considered good practice to verify that the communication partner understood what was said, e.g., by restating, summarizing, asking for clarification, follow-up questions etc.


It can already fill in all gaps and create all sorts of audio, but it may sound muddy and metallic. Give it a year, and then you can't trust what you hear anymore. Checking sources is a good idea in either case.


Some people are hyping it as AGI on social media.


Sadly, I see it even on forums where one might think people have background in technology...


The tech background is likely IT and not AI. They used ChatGPT and thought it was conscious.


I get WH40k techpriest vibes reading these posts


Someone should add an ML decoder to JPEG


You can't do that much on the decoding side (apart from the equivalent of passing the normally decoded result through a low-strength img2img ML pass).

But the encoders are already there: https://medium.com/@migel_95925/supercharging-jpeg-with-mach... https://compression.ai/


You can more accurately invert the quantisation step


You can't do it more accurately. You can make up expected details which aren't encoded in the file. But that's explicitly less accurate.


If the encoders know what model the decoders will be running, they can improve accuracy. You could pretty easily make a codec that doesn't encode high resolution detail if the decoder NN will interpolate it correctly.
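
A toy sketch of that idea (hypothetical, not any real codec): if encoder and decoder share the same upscaling model, the encoder can keep only a low-res base plus the blocks the shared model reconstructs badly.

    import numpy as np

    def shared_model_upscale(small):
        # Stand-in for the NN both sides are assumed to run: nearest-neighbour 2x.
        return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

    def encode(image, tol=8.0, block=16):
        image = np.asarray(image, dtype=np.float32)
        small = image[::2, ::2]                     # cheap low-res base layer
        predicted = shared_model_upscale(small)
        patches = {}                                # only blocks the model gets wrong
        for y in range(0, image.shape[0], block):
            for x in range(0, image.shape[1], block):
                b = image[y:y + block, x:x + block]
                if np.abs(b - predicted[y:y + block, x:x + block]).mean() > tol:
                    patches[(y, x)] = b
        return small, patches

    def decode(small, patches, block=16):
        out = shared_model_upscale(small)
        for (y, x), b in patches.items():
            out[y:y + block, x:x + block] = b
        return out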


That's changing the encoder and sure, you could do that. But that's basically a new version of the format. It's not the JPEG we're using anymore + ML in decoder. It's JPEG-ML on both the encoder and decoder side. And with the speed that we adopt new image formats... That's going to take ages :(


That makes sense if the goal is lossless compression. Since JPEG is lossy, it is sufficient to consider the Pareto front between quality, compressed size, and encoding/decoding performance.


They’ll have my upvote just for writing ML instead of AI. Seriously, these are very exciting developments for audio compression.


This is something you really shouldn’t spend any cycles worrying about.


I'd just like to interject for a moment. What you're referring to as AI, is in fact, Machine Learning, or as I've recently taken to calling it, Machine Learning plus Traditional AI methods.


My point is very clearly that you should not spend any time or energy thinking about the terminology.


I know, lol, this is a famous quote by ganoo loonix enthusiast Richard Stallman.


Words have meaning. People spend cycles on it because it matters and I'm glad we do.


Good luck in your future endeavors!


Machine Learning is Artificial Intelligence. Just look at Wikipedia: https://en.wikipedia.org/wiki/Artificial_intelligence


Many people are annoyed by the recent influx of calling everything "AI".

Machine learning, statistical models, procedural generation, literally any usage of heuristics - they're all being called "AI" nowadays, which obfuscates the "boring" nature in favor of an exciting buzzword.

Selecting the quality of a video based on your download speed? That's "AI" now.


On the other hand, it means that you can assume anything mentioning AI is overhyped and probably isn't as great as they claim. That can be slightly useful at times.


I'm quite tired of this. Every snake oil shop now calls any algorithm "AI" to sound hip and sophisticated.


> Many people are annoyed by the recent influx of calling everything "AI".

Yes, that was the reason for my comment. :)


So are compilers and interpreters. The terminology changes, but since we still don't have a general, systematic, and precise definition of what "intelligence" means, the term AI is and always was ill-founded and a buzzword for investors. Sometimes, people get disillusioned, and that's how you get AI winters.


Machine Learning is a subset of AI


Two unrelated "Opus" releases today, and both use ML. The other one is a new model from Anthropic.


Isn't it a strange coincidence that this shows up on HN while Claude Opus is also announced today and is on HN front page? I mean, what are the odds of seeing the word "Opus" twice in a day on one internet page?


Not that strange when you consider what "opus" means: a product of work, with the connotation of being large and artistically important. It's Latin, so its phonemes are friendly to speakers of Romance languages, and it sounds very scientific-and-important to English-speaking ears. Basically the most generic name you can give your fine "work" in the Western world.


Thanks for the definition. I like the word! I just haven't come across it in a long time, and seeing it twice on HN frontpage is bizarre!


It's funny, I was expecting the article to be about the Opus music font and was trying to figure out how ML could be involved.


Well it was released today

Very likely a coincidence.

https://opus-codec.org/release/stable/2024/03/04/libopus-1_5...



