MLow: Meta's low bitrate audio codec (fb.com)
572 points by mikece 88 days ago | 197 comments



All these new, low-bitrate codecs are amazing, but ironically I suspect that they won't actually be very useful in most of the scenarios Meta is using them:

To keep latency low in real-time communications, the packet rate needs to be relatively high, and at some point the overhead of UDP, IP, and lower layers starts dominating over the actual payload.

As an example, consider (S)RTP (over UDP and IP): RTP adds at least 12 bytes of overhead (let's ignore the SRTP authentication tag for now); UDP adds 8 bytes, and IPv4 adds 20, for a total of 40. At a typical packet rate of 50 per second (for a serialization delay of 1/50 = 20 ms), that's 16 kbps of overhead alone!

It might still be acceptable to reduce the packet rate to 25 per second, which would cut this in half for an overhead of 8 kbps, but the overhead would still dominate the total transmission rate.
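
A back-of-the-envelope sketch of that overhead math (a toy illustration only, assuming the 40-byte RTP+UDP+IPv4 header figure above and ignoring SRTP tags and layer-2 framing; the function name is made up):

    # Per-packet header overhead: RTP 12 + UDP 8 + IPv4 20 = 40 bytes.
    HEADER_BYTES = 12 + 8 + 20

    def overhead_kbps(packets_per_second, header_bytes=HEADER_BYTES):
        return packets_per_second * header_bytes * 8 / 1000

    for pps in (50, 25, 10):
        print(f"{pps} pkt/s ({1000 // pps} ms frames): {overhead_kbps(pps):.1f} kbps of headers")
    # 50 pkt/s -> 16.0 kbps, 25 pkt/s -> 8.0 kbps, 10 pkt/s -> 3.2 kbps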

Where codecs like this can really shine, though, is circuit-switched communication (some satphones use bitrates of around 2 kbps, which currently sound awful!), or protocol-aware VoIP systems that can employ header compression such as that used by LTE and 5G in IMS (most of the 40 bytes per frame are extremely predictable).


Latency is the mind killer, but if available bandwidth is low, you save a ton of overhead by bundling 2-5 of your 20 ms samples into one packet. Enough that the codec savings start to make sense, even though 100 ms packets add a ton of latency. Fancier systems can adapt codecs and samples per packet based on current conditions. The one I work on uses a static codec and 60 ms of audio per packet, which isn't ideal, but allows us to run in low bandwidth much better than 20 ms per packet would.
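
To sketch that bundling tradeoff in numbers (the 6 kbps payload and 40-byte header figures are assumptions picked only for illustration):

    # Bundling N x 20 ms frames per packet trades packetization delay
    # for header overhead. Payload rate and header size are illustrative.
    PAYLOAD_KBPS = 6
    HEADER_BYTES = 40

    for frames_per_packet in (1, 2, 3, 5):
        packet_ms = frames_per_packet * 20
        pps = 1000 / packet_ms
        headers_kbps = pps * HEADER_BYTES * 8 / 1000
        print(f"{packet_ms:3.0f} ms/packet: {headers_kbps:4.1f} kbps headers, "
              f"{PAYLOAD_KBPS + headers_kbps:4.1f} kbps total")
    # 20 ms -> 22.0 kbps total; 100 ms -> 9.2 kbps total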

Edit to add: Meta can also afford to add a bit more sampling delay, because they've got very wide distribution of forwarding servers (they can do forwarding in their content appliances embedded in many ISPs), which reduces network delay vs competing services that have limited ability to host forwarding around the globe. Peer to peer doesn't always work and isn't always lower delay than going through a nearby forwarding server.


> you save a ton of overhead by bundling 2-5 of your 20ms samples.

You pay that price back when the packets are dropped. We use a lot of advanced codec managers for broadcast, and while they do combining like this, they also offer the ability to repeat frames within subsequent packets. So you may get a packet with frames [1,2,3] then a packet with frames [2,3,4].
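
A minimal sketch of that frame-repetition scheme (function names invented for illustration; real systems do this at the RTP payload level, e.g. redundant payloads or in-band FEC):

    # Each packet carries the newest frame plus the previous two, so any
    # single lost packet costs no frames at the receiver.
    def packetize_with_redundancy(frames, depth=3):
        return [frames[max(0, i - depth + 1):i + 1] for i in range(len(frames))]

    def recover(packets, lost_indices):
        got = set()
        for i, pkt in enumerate(packets):
            if i not in lost_indices:
                got.update(pkt)
        return sorted(got)

    packets = packetize_with_redundancy([1, 2, 3, 4, 5, 6, 7])
    print(packets[2:5])             # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(recover(packets, {3}))    # [1, 2, 3, 4, 5, 6, 7]: nothing lost

The obvious cost is that the payload bitrate roughly triples, which is exactly where a very low core bitrate buys you room.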

The best codecs actively monitor the connection and adjust all these parameters in real time for you.


I think this is likely incorrect based on how much voice/audio distribution Meta does today with Facebook (and Facebook Live), Instagram and WhatsApp - more so with WhatsApp voice messages and calling, given its considerable market share in countries with intermittent and low-reliability network connectivity. The fact that it is more packet-loss robust and jitter-robust means that you can rely on protocols that have less error correction, segmenting and receive-reply overhead as well.

I don't think it's unreasonable to assume this could reduce their total audio-sourced bandwidth consumption by a considerable amount while maintaining/improving reliability and perceived "quality".

Looking at a Wireshark capture of WhatsApp during an active call, there were around 380 UDP packets sent from source to recipient during a 1 minute call, and a handful of TCP packets to WhatsApp's servers. That would yield a transmission overhead of about 2.2 kbps.

Quick edit to clarify why this is: you can see the starting ptime (audio duration per packet) set to 20 ms here, but maxptime set to 150 ms, which the clients can/will use opportunistically to reduce the number of packets being sent, taking into consideration the latency between parties and the bandwidth available.

(image): https://www.twilio.com/content/dam/twilio-com/global/en/blog...


What part of that calculation is incorrect in your view?

> 380 UDP packets sent from source to recipient during a 1 minute call, and a handful of TCP packets to whatsapp's servers. That would yield a transmission overhead of about 2.2kbps.

That sounds like way too many packets! 380 packets per second, at 40 bytes of overhead per packet, would be roughly 120 kbps.

My calculation assumes just 50, and that’s already quite a high packet rate.

> you can rely on protocols that have less error correction

You could, but there's no way to get a regular smartphone IP stack running over Wi-Fi or mobile data to actually expose that capability to you. Even just getting the OS's UDP stack (to say nothing of middleboxes) to ignore UDP checksums and let you use those extra four bytes for data can be tricky.

Non-IP protocols, or even just IP or UDP header compression, are completely out of reach for an OTT application. (Networks might transparently do it; I'm pretty sure they'd still charge based on the gross data rate though, and as soon as the traffic leaves their core network, it'll be back to regular RTP over UDP over IP).

What they could do (and I suspect they might already be doing) is to compress RTP headers (or use something other than RTP) and/or pick even lower packet rates.

> I don't think it's unreasonable to assume this could reduce their total audio-sourced bandwidth consumption by a considerable amount while maintaining/improving reliability and perceived "quality".

I definitely don't agree on the latter assertion – packet loss resilience is a huge deal for perceived quality! I'm just a bit more pessimistic on the former, unless they do the other optimizations mentioned above.


I think you’re misreading OP, as he says 380 packets per minute, not per second. That would give you an overhead of 253 bytes per second, which sounds a lot more reasonable.


Yes, 380/min = ~6/s, which is a very long ptime of >100 ms; this can also be dynamic and change on the fly. It ultimately comes down to how big the packet can be before it gets split, which is a function of the MTU.

If you have 50ms of latency between parties, and you are sending 150ms segments, you'll have a perceived latency of ~200ms which is tolerable for voice conversations.

One other note is that this is ONLY for live voice communication like calling, where two parties need to hear and respond within a reasonable delay - for downloading of audio messages or audio on videos, including one-way livestreams for example, this ptime is irrelevant and you're not encapsulating with SRTP - that is just for VoIP-like live audio.

There is a reality in what OP posted, which is that there are diminishing returns in actual gains as you get lower in bitrate, but modern voice implementations in apps like WhatsApp use dynamic ptime and are very smart about adapting the voice stream to account for latency, packet loss and bandwidth.


In my personal experience, Whatsapp's calling is subpar compared to Facetime audio, Skype or VoWIFI even. Higher latency, lower sound quality and very sensitive to spotty connections.


Wow, that would be an extremely low packet rate indeed!

That would definitely increase the utility of low bitrate codecs by a lot, at the expense of some latency (which is probably ok, if the alternative is not having the call at all).


I read it as 380 packets over the whole call, which was a minute long, not 380 packets per second during 1 minute.


That's about 160 ms of audio per packet. That's a lot of latency to add before you even hit the network


Assuming continuous sound. You don’t need to send many packets for silence.


Voice activity detection and comfort noise have been available in VoIP since the very beginning, but now I wonder if there's some clever optimization that could be done based on a semantic understanding of conversational patterns:

During longer monologues, decrease packet rates; for interruptions, send a few early samples of the interrupter to notify the speaker, and at the same time make the (former) speaker's stack flush its cache to allow "acknowledgement" of the interruption through silence.

In other words, modulate the packet rate in proportion to the instantaneous interactivity of a dialogue, which allows spending the "overhead budget" where it matters most.
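
Purely as a sketch of the idea (the thresholds and the "interactivity" signal here are invented for illustration, not anything a real stack exposes):

    # Toy policy: mid-monologue, spend less on headers with big packets;
    # when the far end starts talking, drop back to small, low-latency ones.
    def pick_ptime_ms(seconds_since_turn_change, far_end_active):
        if far_end_active:
            return 20    # likely interruption: stay snappy
        if seconds_since_turn_change > 10:
            return 120   # monologue: ~8 packets/s instead of 50
        return 60

    print(pick_ptime_ms(30, False))  # 120
    print(pick_ptime_ms(2, True))    # 20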


pretty sure they said 380 packets total in the 1 minute call (~6-7/s)


Another interesting use case for these kinds of ultra-low bitrate voice compression systems is digital radio. AMBE+2 and similar common voice codecs used on radio systems sound pretty miserable and don't handle dropped packets nearly as gracefully as these newer codecs.


According to the blog post, this is all practical research that Meta is doing to improve their services. Maybe that is BS, but I kind of doubt that given that Meta is in fact one of the biggest providers of voice and video calls on low bandwidth devices.

What makes you think that they have somehow just been misleading themselves the whole time?


I don't know of any setups which would support muxing in exactly the way I am thinking, but another interesting use case is if you have multiple incoming audio streams which you don't want to be mixed by the server -- potentially because they are end-to-end encrypted -- and so a single packet can contain the data from multiple streams. Doing end-to-end encrypted audio calls is finally becoming pretty widespread, and I could see Facebook being in a good position for their products to do custom muxing.


This codec is for RTC comms - it supports 20 ms frames. They did mention it's launched in their calling products:

"We have already fully launched MLow to all Instagram and Messenger calls and are actively rolling it out on WhatsApp—and we’ve already seen incredible improvement in user engagement driven by better audio quality."


I’m not totally certain about your argument for the specific amount of overhead (if the receiver/sender are on mobile networks, maybe something happens to the packet headers for the first/last legs before the real internet). But doesn’t the OP already give an example where the low bit-rate codec is good: if you can compress things more then you have more of an opportunity to add forward error correction, which greatly improves the quality of calls on lossy connections. I wonder if smaller packets are less likely to be lost, and if there are important cases where multiple streams may be sent like group-calls.


webrtc engineer here...

You are right that Opus already covers most audio applications well. 20 kbps is pretty good, and 30 kbps is almost perfect.

Meta has tons of Indian and South Asian users with insanely poor networks. It's already using 60 ms/120 ms packets (16.6/8.3 packets per second) for those people, but some calls are still bad.

What this new codec does is to further make the call accessible by bringing down the bitrate, which seems to help their DAU.

Nitpick: measuring packet rate on your device doesn't usually work. I think Meta (as well as most other VoIP providers) is using DTX, so when the user isn't talking, the packet rate drops to 2/sec.


Why would you need 50 packets per second vs 10? Is 100ms not acceptable but 20ms is?


The default configuration for SIP used to be 20 ms; the rationale behind it was that most SIP was done on LANs and inter-campus WANs, which generally had high-bitrate, low-latency connectivity. The lower the packet time window, the sooner the recipient could "hear" your voice, and if there were packet loss, a dropped packet would have less of an impact - you'd only lose 20 ms of audio vs 100 ms. The same applies to high-bitrate but high-latency (3G, for example) connectivity - you want to take advantage of the bandwidth to mitigate some of the network-level latency that would impact the audio delay - being "wasteful" to ensure lower latency and higher packet loss tolerance.

Pointedly - if you had 75 ms of one-way latency (150 ms RTT) between two parties, and you used a 150 ms audio segment length (ptime), you'd be getting close to the 250 ms generally accepted max audio delay for smooth two-way communication: the recipient is hearing your first millisecond of audio 226 ms later at best. If any packet does get lost, the recipient loses 150 ms of your message vs 20 ms.

Modern voice apps and VoIP use dynamic ptime (usually via "maxptime", which specifies the highest/worst case) in their protocol for this reason - it allows the clients to optimize for all combinations of high/low bandwidth, high/low latency and high/low packet loss in real time, as network conditions can often change during the course of a call, especially while driving around or roaming between Wi-Fi and cellular.
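
A rough sketch of that delay budget (it ignores jitter buffers, codec look-ahead and playout delay, so real numbers would be somewhat higher; the 250 ms budget and the ptime choices are just the figures from above):

    # Mouth-to-ear delay is at least one full ptime of buffering on the
    # sender plus the one-way network latency.
    def mouth_to_ear_ms(ptime_ms, one_way_ms):
        return ptime_ms + one_way_ms

    def largest_usable_ptime(one_way_ms, budget_ms=250,
                             choices=(20, 40, 60, 100, 150)):
        ok = [p for p in choices if mouth_to_ear_ms(p, one_way_ms) <= budget_ms]
        return max(ok) if ok else min(choices)

    print(mouth_to_ear_ms(150, 75))   # 225 ms, brushing the ~250 ms ceiling
    print(largest_usable_ptime(75))   # 150
    print(largest_usable_ptime(200))  # 40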


> the rationale behind it was actually sourced in the fact that most SIP was done on LANs and inter-campus WAN which had generally high bitrate connectivity and low latency

In addition to that, early VoIP applications mostly used uncompressed G.711 audio, both for interoperability with circuit switched networks and because efficient voice compression codecs weren't yet available royalty-free.

G.711 is 64 kbps, so 12 kbps of overhead are less than 25% – not much point in cutting that down to, say, 10% at the expense of doubling effective latency in a LAN use case.


> Is 100ms not acceptable but 20ms is?

Yup pretty much. Doubling it for round-trip, 200 ms is a fifth of a second which is definitely noticeable in conversation.

40 ms is a twenty-fifth of a second, or approximately a single frame of a motion picture. That's not going to be noticeable in conversation at all.

Of course both of these are on top of other sources of latencies, too.


200ms is noticeable but in conversation it's still pretty good, and certainly way better than the average WhatsApp call which is on the order of 0.5-1s.


Anything over 40 ms will be noticeable. Just to give you an idea of how sensitive our ears are: there's a max distance certain instruments can sit away from each other in an orchestra pit before they start falling out of sync due to speed-of-sound delay.


To clarify this even further, as someone who professionally plays an instrument that is traditionally placed at the back of an orchestra, you absolutely cannot play by ear: you MUST play by watching a combination of the stick in the conductor’s hand and the bow of the first violinist and cellist at the front. If you play what sounds in sync to you, the conductor and audience will hear you too late; the round trip from the front to the back of a stage, plus the sound traveling through the brass tubes of your instrument, plus the trip from the rear of the stage to the first row of the audience simply takes so long that it will sound noticeably wrong. The same is true for the far sides of an orchestra pit underneath the stage of a musical or opera. It only takes 20 meters/yards to become an issue.


It is generally said that the lowest threshold for people to perceive time delays is around 10-15ms.

The speed of sound is roughly 343 meters per second, which means we can sense the delay difference from about 3.5-5 meters of separation.

Which 100% corresponds with what you are saying: 20 meters is a ~58 ms delay.

200 ms is about 70 meters, which would be like having a conversation between people using one of those accidental sound projection features that sometimes happen in large open buildings like sports stadiums.

People talk at a cadence of around 100-200 words per minute. I guess we could say that is 300-600 syllables per minute, so that is about 100-200 ms per syllable.

It all kinda lines up.
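
The arithmetic, for anyone who wants to play with it (just the numbers above restated; function names are made up):

    # Speed of sound ~343 m/s at room temperature.
    SPEED_OF_SOUND = 343.0

    def delay_ms(distance_m):
        return distance_m / SPEED_OF_SOUND * 1000

    def distance_m(delay):
        return SPEED_OF_SOUND * delay / 1000

    print(f"{delay_ms(20):.0f} ms")      # ~58 ms across 20 m of stage
    print(f"{distance_m(15):.1f} m")     # ~5.1 m for a 15 ms threshold
    print(f"{distance_m(200):.0f} m")    # ~69 m for a 200 ms delay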


Yes 100ms feels horrible. People constantly interrupting each other because they start talking at around the same time and then both say "you go first". Discord has decent latency and IMO it's a major reason behind their success


They do this to optimize user experience for 3G and 2G networks not LTE or 5G. A huge number of people are on those networks.


Having traveled widely in three of the four most populous countries I’m a bit dubious of this claim. Having decent 4G in the middle of nowhere in India or Indonesia is pretty standard these days. I haven’t been to Africa, but a quick search suggests more than two thirds of the population of the continent has 4G coverage, and more than 80% has 3G. Where are these huge numbers of people on 2G?


In Ukraine we used to have a lot of 2G devices: energy meters, terminals in grocery shops, panic buttons and anti-theft devices.


What skills/concepts would one need to acquire in order to fully grasp all that you've detailed here?


What you're saying is that TCP/IP sucks at 9600 baud and below; it has nothing to do with voice, really. This is a well-known fact, and solutions to the problem include CSLIP, VJ header compression in PPP, and CSP, as well as the LTE and 5G stuff you mention.


Is it just my perception or has Meta become cool again by sharing a ton of research and open source (or open weights) work?

Facebook's reputation was at the bottom, but now it seems like they made up for it.


I have the same impression.

Facebook the social network's reputation may not be shiny, but Meta the engineering company's reputation is pretty high, to my mind.

It's somewhat similar to IBM, which may not look stellar as a hardware or software solutions provider, but still has quite cool research and microelectronics branches.


research department != product department

Microsoft Research also puts out some really cool stuff, but that does not mean the "same" Microsoft can't show ads in their OS' start menu for people's constant enjoyment. I noticed this interesting discrepancy in Microsoft some years ago as a teenager; it does not surprise me at all that Facebook has a division doing cool things (zstandard, etc.) and a completely separate set of people working towards completely different goals simultaneously. Probably most companies larger than a couple hundred people have such departmental discrepancies


I racked my brain and couldn’t think of a single example of something released by Microsoft Research offhand, so I proceeded to visit their website. It can be found at https://www.microsoft.com/en-us/research/

I found exactly zero things that they have provided to the world. Literally none, and I really came in with an open mind. All I found was marketer drivel telling me how I should feel about the impact of the marketing copy I was currently reading. No concrete examples after fanning out 3 links deep into every article posted on that site. I must be missing something so can you point me to at least one example of work that’s come out of Microsoft Research on the same scale as Meta’s LLM models or ReactJS?


This is a surprisingly negative take; you should look a little further. Microsoft Research has done an incredible amount of high quality research. I've mostly read papers from them on programming languages; they have or did employ leading researchers behind C#, F#, TypeScript (of course), as well as Haskell (Simon Peyton Jones and Simon Marlow spent a long time there), F*, and Lean (built by Leonardo de Moura while at MSR). MSR's scope has been much broader than just languages, of course.

You could look at their blog: https://www.microsoft.com/en-us/research/blog/

Or their list of publications: https://www.microsoft.com/en-us/research/publications/

Here's a (no longer updated, apparently) list of awards given to researchers at Microsoft: https://www.microsoft.com/en-us/research/awards-archive/

I've heard in interviews that MSR's culture is not what it used to be (like Bell Labs, maybe), but over time they've funded a ton of highly influential research.


Leslie Lamport's TLA+ comes from MSR. You have likely directly benefited from this project, since it's been used to prove the correctness of many distributed systems, including pieces of AWS.

Z3 is a very popular open source SMT solver, also originating from MSR.

There are probably dozens of similar examples. You might not know about some of the work coming out of MSR, but it's probably impacting you indirectly.

It's similar to how we indirectly benefit from fundamental nuclear physics. Places like CERN have to solve engineering issues and those solutions trickle down to everyone.


That page is written for marketing. My experience with industry research labs is that, just like many places with somewhat misplaced incentive structures, a lot of the good research does not get marketed properly. E.g., the research marketing people will likely extol more of the recent GenAI advances, given that they're more eye-catching.

(Also, off the top of my head, DeepSpeed is by MSR, which is used a lot in large scale ML training.)


I would say that I have a very favorable opinion of Meta in terms of how they share their research and open source their software.

I would say that I have a very unfavorable opinion of Meta in terms of their commitment to privacy, security, and social responsibility.


Don't fall for it. They will find a way to let some powerful actors exploit the users for dimes.


They've had a proven history of open-sourcing systems they engineered to power their services. Cassandra and (Py)Torch come to mind.


The marketing is def working. I'm sure we'd be pretty depressed by some of the projects that didn't make the blog cut.


I don't think they made up for it. They are training AIs off of personal data. The open stuff is a desperate red herring.

https://www.theregister.com/2024/06/10/meta_ai_training/


> a desperate red herring.

Or you can recognise that ‘Meta’ isn’t a conscious entity, and that it’s perfectly likely that there are some people over there doing amazing open-source work, and different people over there making ethically dubious decisions when building their LLMs.


The people doing amazing open-source work are also making the ethically dubious decision of supporting the other group by working for FB.


Ah I think this is bullshit to be honest.

I’ve worked for dozens of organisations in my career. Large and small, competent and not. Usually not.

In that time did I make some ‘ethically correct’ choice to leave an enormous organisation because some other part of that organisation did something that wasn’t ethically perfect?

Never. Not one time.

And I’m an ethical person. I consider this stuff deeply. Here I am bothering to have this conversation with you.

But, what, every person who has an interesting job doing good things — remember, we’re talking about engineers developing a new audio codec — so those people who have interesting jobs doing good things with a great team are expected to look over there to some distant part of the org, to teams they’re barely aware of, let alone have spoken to, and they’re expected to quit their jobs because of the ‘ethically dubious’ stuff going on over there?

Sorry. Unrealistic, idealistic bullshit. I’ve never done it and neither have you.


Not the person you're responding to, but I guess I would just have to flip that back around and say really this is bullshit to be honest.

I guess I don't really feel that you can just say you're an ethical person and have it absolve you of the impact of your work.

It doesn't seem a stretch to say that the goals of Meta are propagated by the things Meta focuses work on, and even if one isn't at the forefront of stealing data, intruding on privacy, or maximizing engagement at all costs, that doesn't mean nothing they do will play a part in what those teams do.

At the end of the day, even accounting at the orphan crushing factory plays a part in the orphan crushing machine.


So this is one argument- another is that I'm impressed that they got their own LLM running and integrated into facebook messenger.

I ran across this interesting graphic recently:

https://www.theverge.com/2023/12/4/23987953/the-gpu-haves-an...

Suddenly Facebook is useful as a search engine.

Also interesting is that Meta AI is much faster than ChatGPT from the end user's point of view, but the results are not quite as good. Here is a comparison:

https://www.youtube.com/watch?v=1vLvLN5wxS0


Is there a peer reviewed paper associated with this, or just a blog post?


How the hell does releasing one audio codec undo years and years of privacy nightmare, being a willing bystander in an actual genocide, experimenting with the emotions of depressed kids, and collusion to depress wages?


You will need to provide citations on the last point as Facebook are widely known to have broken the gentleman's agreement between Apple and Google that was suppressing tech pay in the early 2010s.


OK sure even if they didn't do that we're still left with "knowingly abetted a genocide" which no amount of open source work can ever balance out.


context?



This article seems to not really mention the "knowingly" or the "abetted". If there are people killing other people, I wouldn't say that a communication method was to blame. In Scream, Sidney didn't sue the phone company who let the killer call her from inside the house. The idea that some news feed posts whipped people up into a killing frenzy just sounds absurd.

I wish the author could see that, and if the case is valid, to provide it, instead of some pretty tenuous claims of connection strung together to lead up to a demand for money.

I did try to go to the link that evidenced the "multiple" times Facebook was contacted in a 5 year period, but I couldn't get through. How many times was it, for anyone who can?


This is a very low-quality comment. Amnesty International published a substantial, well-researched, well-sourced study. Your comment is low-effort Internet skepticism based on ignorance and a straw-man argument.

Read the report, gather data, then criticize it.


I'm talking about the article, not the report, and I tried to make clear where I didn't know things. Not sure what else would be appropriate.


> If there are people killing other people, I wouldn't say that a communication method was to blame. In Scream, Sidney didn't sue the phone company who let the killer call her from inside the house. The idea that some news feed posts whipped people up into a killing frenzy just sounds absurd.

This is the core of your point and a comment on the idea itself, not the way the article portrayed it. I think it's fair to characterize your dismissal as glib.


The full report is linked to in the first paragraph. These points are all addressed in detail there.



It isn't just one audio codec. They also released, and continue to release, the best self-hostable large language model weights, and they have authored many open source projects that are staples today, such as zstandard, React and PyTorch.


they also did release LLM models, and zstd, and mold, and and and... a lot of stuff


React and PyTorch

compare that to Angular and TensorFlow, such a difference in culture


Easy vs tons-o-boilerplate.


zstd and mold are personal projects regardless of employer. That said, I didn’t know mold was written by a meta guy.


Zstd is a personal project? Surely it's not by accident in the Facebook GitHub organization? And that you need to sign a contract on code.facebook.com before they'll consider merging any contributions? That seems like an odd claim, unless it used to be a personal project and Facebook took it over

(https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md#co...)


All Google dev’s personal projects are under the Google account on GH for legal reasons. I assume the same for Facebook. I believe fb championed zstd and lets the dev work on it at work but it was a personal project iirc.


Don't forget React. The most popular frontend stack at the moment. Been that way for some time.

And GraphQL, Relay, Stylex...


>React

Ah yes, the <body id="app"></body> websites.


How is that specific to React? And who would use webapp technology for a website?


According to cheema33, React is the most "popular frontend stack", but I'm not allowed to ask why websites demand JS to display basic content... I suppose my reply could have been more _constructive_, e.g. asking "by whose count" or "how does popularity correspond to quality", but that's like playing chess by looking only one move ahead.

Look, I'm seeing an increase in blogs built with the implication that an _infinite_ number of visits requires fewer resources than one, and I just don't find that to be true.


Yeah, feels more like another reason to hate them for contributing to the enshittification of the web.


> mold

The linker? How is facebook involved in that?


You forgot selling the 2016 US elections to Putin for 100k[0]

Good luck undoing that by releasing codecs haha

[0] https://time.com/4930532/facebook-russian-accounts-2016-elec...


If $100k worth of internet ads can really sway an election boy are we in trouble.


The lack of any reference or comparison to Codec2 immediately leads me to question the real value and motivation of this work. The world doesn't need another IP-encumbered audio codec in this space.


They also don't compare with Lyra (https://github.com/google/lyra)


Or speex narrowband or others. I think the tendency to pick Opus is just because it has a newer date on it -- its design goals were not necessarily to optimize for low bitrate; Opus just happened to still sound OK when the knob was turned down that far.

One other point I intended to make, which is not reflected in many listening/comparison tests offered by these presentations: in the typical applications of low bitrate codecs, they absolutely must be able to degrade gracefully. We see MLow performing at 6 kbps here; how does it perform with 5% bit errors? Can it be tuned for lower bitrates like 3 kbps? A codec with a 6 kbps floor that garbles into nonsense with a single bit flip would be dead on arrival for most real world applications. If you have to double the bitrate with FEC to make it reliable, have you really designed a low bitrate codec? The only example we heard of MLow was 30% loss on a 14 kbps stream = 9.8 kbps. Getting 6 kbps through such a channel is a trivial exercise.


My understanding was Opus was specifically developed with the idea of replacing both Speex and Vorbis. "Better quality than Speex" is literally one of their selling points, so I'd be interested to hear more details.


This is why opus is kind of two different codecs bundled in one right? It switches between them depending on the content and the bitrate?


how often are significant numbers of bit errors a problem, or when does that come up?

they also address something similar in the body: "Here are two audio samples at 14 kbps with heavy 30 percent receiver-side packet loss."


> how often is significant numbers of bit errors a problem, or when does that come up?

It depends on the transport. If you are going over something like TCP, you will have a perfect bitstream or you will have nothing, so your codec doesn't have to tolerate bit errors or loss. If you are pushing raw bits over the air with GMSK modulation with no error correction, you'll have to tolerate a lot of errors.

In real world applications you almost always have to consider the tradeoffs on what things you want to leave to the codec and what things you want to leave to the transport layer.

At very low bitrates, the overhead required to create reliability and tolerance for errors or omissions becomes significant enough that the entire system performance matters a great deal. That is to say, the codec and transport have to be designed to be complementary to one another to achieve the best final result.

From the presentation, they show us MLow at 6 kbps and then again at 14 kbps with 30% packet loss (effective data rate 9.8 kbps). They do not say if the loss is random bit errors or entire packets, but let's not worry about that. Let's just assume the result of both of these is that you get final audio of about the same quality. This means that MLow has some mechanism to deal with errors on its own, but is using an obscenely high overhead rate (133%) to accomplish it. They also don't let us hear how it actually degrades when exposed to other types of transport errors. These numbers and apparent performance just aren't very good compared to other codecs/systems in this space.
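
For what it's worth, the arithmetic behind those two figures (a sketch treating the 30% loss as uniform random packet loss, which the blog post does not actually specify):

    # Goodput under uniform packet loss, and the overhead implied by
    # sending 14 kbps to protect what is nominally a 6 kbps stream.
    def goodput_kbps(sent_kbps, loss_fraction):
        return sent_kbps * (1 - loss_fraction)

    def overhead_pct(sent_kbps, core_kbps):
        return (sent_kbps - core_kbps) / core_kbps * 100

    print(f"{goodput_kbps(14, 0.30):.1f} kbps")  # 9.8 kbps actually delivered
    print(f"{overhead_pct(14, 6):.0f}%")         # 133% spent on resilience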


Nice. Google's soundstream already has some great quality. Some examples at 6kbps here: https://google-research.github.io/seanet/soundstream/example...


Lyra goes against the design goals of MLow though, by using machine learning techniques, and thus possibly not running on the devices that Meta is targeting with MLow.

Google claims SoundStream can run on low-end devices, though, so indeed I would like Meta to show that it still doesn't work well enough for their use case. Specifically, I would like to know if it's possible to get SoundStream running on very old Android versions on low-end devices, before a lot of the NN-related APIs in Android came around.


There's also the LPCNet Codec (2019), which does wideband speech at 1.6kb/s by using a recurrent neural network.

https://jmvalin.ca/demo/lpcnet_codec/


Does somebody know if this is better than whatever Google Meet is using? With choppy, near-unusable, slow Internet, Google Meet still fulfils its purpose on audio calls where all other competitors fail (tested e.g. while on a remote island in the Philippines with very bad internet). However, Google Meet's tech is not published anywhere AFAIK.


Can hardly try that out if this PR piece does not contain any code. We can judge it as well as you can from the couple examples they showed off


They also don't do a comparison with Pied Piper.


It might have a Weissman score in the fives, but I haven't seen a real-world implementation. Does it use middle-out compression?


MIDDLE OUT!


Only slightly OT:

ELI5: Why is a typical phone call today less intelligible than an 8 kHz 8-bit μ-law call with ADPCM from the '90s?

[edit]

s/sound worse/less intelligible/


Depends on your call; u-law has poor frequency response and reasonable dynamic range. Not great for music, but ok enough for voice, and it's very consistent. 90s calls were almost all circuit switched in the last mile, and multiplexed per sample on digital lines (T1 and up). This means very low latency and zero jitter; there would be a measurable but actually imperceptible delay versus an end-to-end analog circuit switched call; but digital sampling near the ends means there would be a lot less noise. Circuit switching also means you'd never get dropped samples --- the connection is made or it's not, although sometimes only one-way.

Modern calls are typically using 20 ms samples, over packet switched networks, so you're adding sampling delay, and jitter and jitter buffers. The codecs themselves have encode/decode delay, because they're doing more than a ADC/DAC with a logarithm. Most of the codecs are using significantly fewer bits for the samples than u-law, and that's not for free either.

HD Voice (g.722.2 AMR-Wide Band) has a much larger frequency pass band, and sounds much better than GSM or OPUS or most of these other low bandwidth codecs. There's still delay though; even if people will tell you 20-100ms delay is imperceptible, give someone an a/b call with 0 and 20 ms delay and they'll tell you the 0 ms delay call is better.


> and multiplexed per sample on digital lines (T1 and up).

Technically even on ISDN because you had channel bonding there. Although it's all still circuit switched. The timeslot within the channel group is reserved entirely for your use at a fixed bandwidth and has full setup and tear down that is coordinated out of band.


> HD Voice (g.722.2 AMR-Wide Band) has a much larger frequency pass band, and sounds much better than GSM or OPUS or most of these other low bandwidth codecs.

At what bitrate, for the comparison to Opus?

And is this Opus using LACE/NoLACE as introduced in version 1.5?

...and is Meta using it in their comparison? It makes a huge difference.


That is what I was wondering. They don't share what version of Opus they are comparing against, and that new version was a huge step forward.


Yeah, I probably shouldn't have included Opus; I'm past the edit window or I'd remove it with a note. I haven't done enough comparison with Opus to really declare that part, and I don't think the circumstances were even. But I'm guessing the good HD Voice calls are at full bandwidth of ~ 24 kbps, and I'm comparing with a product that was said to be using opus at 20 kbps. Opus at 32kbps sounds pretty reasonable. And carrier supported HD voice probably has prioritization and other things going on that mean less loss and probably less jitter. Really the big issue my ear has with Opus is when there's loss.

I don't think I've been on calls with Opus 1.5 with lace/no-lace, released 3 months ago, so no, I haven't compared it with HD voice that my carrier deployed a decade ago. Seems a reasonable thing for Meta to test with, but it might be too new to be included in their comparison as well.


> Really the big issue my ear has with Opus is when there's loss.

That would definitely complicate things. Going by the test results that got cited on Wikipedia, Opus has an advantage at 20-24, but that's easy enough to overwhelm.

And the Opus encoder got some other major improvements up through 2018, so I'd be interested in updated charts.

Oh and 1.5 also adds a better packet loss mechanism.


could you explain a little more, in a more ELI5 way, please?


Old way: two cups and string.

New way: chinese whispers.


Phone ear speakers are quieter than they used to be, so if the other person isn't talking clearly into the mic, you can't crank it up. I switched from a flip phone to an iPhone in 2013, huge difference. I had to immediately switch to using earbuds or speakerphone. Was in my teens at the time.


Hearing ability deteriorates with age.


Yes, but it doesn't deteriorate in such a way as to cause someone speaking to sound like gibberish and/or random medium-frequency tones, which happens in nearly every single cell phone conversation I have that lasts more than 5 minutes.

My experience is that phone calls nowadays alternate between a much wider-band (and thus often better sounding) experience and "WTF was that just now?"


packet switching drops packets, circuit switching drops attempted calls (all circuits are busy)

most 90s calls didn't use adpcm, just pcm. assuming you got confused there

they also didn't use radio; solid wire from your microphone to my receiver. radio (cellphones, wifi, cordless phones) also is inherently unreliable

old phones had sidetone, but many voip apps don't

finally, speakerphone use is widespread now, and it is incompatible with sidetone and adds a lot of audio multipath fading


Does decrease in intelligibility correlate with the instance count of concert seats in front of the loud speakers back in the oughts?


No mention of NoLACE makes the comparison samples a bit less useful: https://opus-codec.org/demo/opus-1.5/


This is really cool and I very very very much appreciate that xiph puts so much work into standardization. https://datatracker.ietf.org/wg/mlcodec/documents/

It would be nice if Meta donated this to the world so we have fewer anchors for patent trolls and can transition to the future we deserve.


That does sound very nice


Are they releasing this or is this just engineering braggadocio? I can't find any other references to MLow other than this blog post.

Facebook/Meta AI Research does cool stuff, and releases a substantial portion of it (I dislike Facebook but I can admit they are highly innovative in the AI space).


If you think about "implementing the algorithm in a product", it seems so. (From the article) "We are really excited about what we have accomplished in just the last two years—from developing a new codec to successfully shipping it to billions of users around the globe"


Honest question: why do we need to optimize for <10kbps? It's really impressive what they are able to achieve at 6kbps, but LTE already supports >32kbps and there we have AMR-WB or Opus (Opus even has in-band FEC at these bitrates so packet loss is not that catastrophic). Maybe it's useful in satellite direct-to-phone use-cases?


There’s a section (“Our motivation for building a new codec”) in the article that directly addresses this. Assuming you have >32 kbps bandwidth available is a bad assumption.


The best assumption would be that you either have a connection available or you don't.

Then, if it is available, what is the minimal data rate for connections that are available in general? If we do a statistical analysis of that, is it lower than 32 kbps? How significantly?

For some reason, I would assume that if you have a connection, it is faster than 2G these days.


The question isn't really the minimal PHY rate; it's about the goodput for a given reliability. Regardless of your radio, there will always be some point where someone is at the edge of a connection and goodput is less than the minimal PHY bandwidth. The call then turns choppy, or into a time-stretched robot you get every other syllable from. The less data you need to transmit, and the more FEC you can fit in the goodput, the better that situation becomes.

Not to mention "just because I have some minimal baseline of $x kbps doesn't mean I want $y to use all of it the entire time I'm on a call if it doesn't have to".


> For some reason, I would assume that if you have connection, it is faster than 2G these days.

That assumption does not hold for a sizable chunk of Meta's 3.98B-strong userbase. The list of countries that have switched off 2G is surprisingly short.


Now that you mention it, Wikipedia seems to have an interesting list about that. It seems like most will start switching it off by 2030.

https://en.wikipedia.org/wiki/2G


There exist a few billion people without LTE. Meta doesn’t only operate in the western world.


Are there really many situations where a 10kbps connection would actually be stable enough to be usable? Usually when you get these kinds of speeds it means the underlying connection is well and truly compromised, and any kind of real-time audio would fail anyway because you're drowning in a sea of packet loss and retransmissions.

Even in cases where you do get a stable 10kbps connection from upstream, how are you going to manage getting any usable traffic through it when everything nowadays wastes bandwidth and competes with you (just look at any iOS device's background network activity - and that's before running any apps which usually embed dozens of malicious SDKs all competing for bandwidth)?


Yes; backhaul connections in telephony applications are often very stable and are already capacity managed by tuning codec bandwidth. Say you are carrying 1000 calls with uLaw (64kbps * 1000) over a pair of links and one fails. Do you A) carry 500 calls on the remaining link B) stuff all calls onto the same link and drop 50% of the packets or C) Change to a 32kbps codec?

It seems you may be imagining the failure case where your "ISP is slow" or something like that due to congestion or packet loss -- as I posted elsewhere in the thread, the bandwidth is only one aspect of how a "low bitrate" codec may be expected to perform in a real world application. How such a codec degrades when faced with bit errors or even further reduced channel capacity is often more important in the real application. These issues are normally solved with things like FEC, which can be incorporated as part of the codec design itself or as part of the modem/encoding/modulation of the underlying transport.
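
The arithmetic for the link-failure example above (a sketch, payload only, ignoring IP/RTP headers and assuming each link had been carrying roughly half the load):

    # 1000 calls over a pair of links; one link fails.
    def aggregate_mbps(calls, codec_kbps):
        return calls * codec_kbps / 1000

    print(aggregate_mbps(1000, 64))  # 64.0 Mbps of uLaw: too much for one link
    print(aggregate_mbps(1000, 32))  # 32.0 Mbps at the lower rate fits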


Facebook Messenger and WhatsApp don't run over TDM though. If WhatsApp is only getting ~10 kbps, that's due to extreme congestion.


Yes; but what is your point? A congested network like you describe isn't ever going to reliably carry realtime communications anyway, due to latency and jitter. All you could reasonably do to 'punch through' that situation is use dirty tricks to give your client more than its fair share of network resources.

6kbps is 10x less data to transfer than 64kbps, so for all the async aspects of Messenger or WhatsApp there is still enormous benefit to smaller data.


> Are there really many situations where a 10kbps connection would actually be stable enough to be usable?

Yes there are. We ran on stable low bandwidth connections for a very long time before we had stable high bandwidth connections. A large part of the underdeveloped world has very low bandwidth, and use 5 - 10 Kbps voice channels.


> We ran on stable low bandwidth connections

Are you talking about the general "we" or your situation in particular? For the former, yes sure we started with dial-up, then DSL, etc, but back then software was built with these limitations in mind.

Constant background traffic for "product improvement" purposes would have been completely unthinkable 20 years ago; now it's the norm. All this crap (and associated TLS handshakes) quickly adds up if all you've got is kilobits per second.


> Are you talking about the general "we"

I assume the general-ish “we”, where it is general to the likes of you and I (and that zeroxfe). There are likely many in the world stuck at the end of connections run over tech that this “general subset” would consider archaic, and that zeroxfe was implying their connections, while slow, may be similarly stable to ours back then.

Also, a low bandwidth stable connection could be one of many multiplexed through a higher bandwidth stable connection.


Let's not move the goalposts here :-) The context is an audio codec, not heavyweight web applications, in response to your question "Are there really many situations where a 10kbps connection would actually be stable enough to be usable?" And I'm saying yes, in that context, there are many situations, like VoIP, where 10kbps is usable.

Nobody here would argue that 10kbps is usable today for the "typical" browser-based Internet use.


I don't know what you consider "stable enough", but the 30% packet loss demo in the article is pretty impressive.


>Are there really many situations where a 10kbps connection would actually be stable enough to be usable?

Scroll to this part of the article:

>Here are two audio samples at 14 kbps with heavy 30 percent receiver-side packet loss.


> Are there really many situations where a 10kbps connection would actually be stable enough to be usable?

Yes (most likely: that was an intuited “yes”, not one born of actually checking facts!). There are many places still running things over POTS rather than anything like (A)DSL; line quality issues could push that down low, and even if you have a stable 28 kbit/s you might want to do something with it at the same time as the audio comms.

Also, you may be trying to cram multiple channels over a relatively slow (but stable) link. Given the quality of the audio when calling some support lines I suspect this is very common.

Furthermore, you might find a much faster unstable connection with a packet-loss “correcting” transport layered on top effectively producing a stable connection of much lesser speed (though you might get periods of <10kbit here due to prolonged dropouts and/or have to institute an artificial delay if the resend latency is high).


i live in a third-world country, and simplifying a bit, my cellphone plan gives me 55 megabytes a day. i get charged if i go over. that's 2 hours of 64kbps talk time on jitsi but would be 12 hours at 10kbps
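
The arithmetic behind those figures (payload only; function name made up for illustration):

    # Talk time that fits in a daily data cap.
    def hours_of_talk(cap_megabytes, codec_kbps):
        return cap_megabytes * 8000 / codec_kbps / 3600

    print(f"{hours_of_talk(55, 64):.1f} h")  # ~1.9 h at 64 kbps
    print(f"{hours_of_talk(55, 10):.1f} h")  # ~12.2 h at 10 kbps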


Even in the Western world, you can appreciate low bandwidth apps when you are at a music festival or traveling through relative wilderness.


Or living in relative wilderness. Or living in a dead zone between moderately populated areas (raises hand).


Meta's use case are OTT applications on the Internet, which are usually billed per byte transmitted. Reducing the bitrate for the audio codec used lets people talk longer per month on the same data plan.

That said, returns are diminishing in that space due to the overhead of RTP, UDP and IP; see my other comment for details on that.


More than that, in developing countries, such as my own, Meta has peering agreements with telephony companies which allow said companies to offer basic plans where traffic to Meta applications (mostly whatsapp) is not billed. This would certainly reduce their costs immensely, considering that people use whatsapp as THE communications service.


It's useful.

AMBE currently has a stranglehold in this area and by any and every measurable metric, AMBE is terrible and should be burned in the deepest fires of hell and obliterated from all of history.


Internet connectivity tends to have a throughput vs latency curve.

If you need reliable low latency, as you want for a phone call, you get very little throughput.

Examples of such connections are wifi near the end of the range, or LTE connections with only one signal bar.

In those cases, a speedtest might say you have multiple megabits available, but you probably only have kilobits of bandwidth if you want reliable low latency.


Load ratios of > 0.5 are definitely achievable without entering Bufferbloat territory, and even more is possible using standing queue aware schedulers such as CoDel.

Also, Bufferbloat is usually not (only) caused by you, but by people sharing the same chokepoint as you in either or both directions. But if you're lucky, the router owning the chokepoint has at least some rudimentary per-flow or per-IP fair scheduler, in which case sending less yourself can indeed help.

Still, to have that effect result in a usable data rate of kilobits on a connection that can otherwise push megabits (disregarding queueing delay), the chokepoint would have to be severely overprovisioned and/or extremely poorly scheduled.


Yes, but it doesn't have to be. Have you looked into Dave Taht's crusade against buffers?


Correct buffer sizing isn't a good solution for Bufferbloat: The ideal size corresponds to the end-to-end bandwidth-delay product, but since one buffer can handle multiple flows with greatly varying latencies/delays, that number does not necessarily converge.

Queue-aware scheduling algorithms are much more effective, are readily available in Linux (the codel and fq_codel qdiscs, among others), and are slowly making their way into even consumer routers (or at least I hope).
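
For reference, the sizing rule being referred to (a sketch; a real bottleneck carries many flows with different RTTs, so no single value is right, which is the point):

    # Bandwidth-delay product: the buffer that "matches" a single flow.
    def bdp_bytes(bandwidth_mbps, rtt_ms):
        return bandwidth_mbps * 1e6 / 8 * rtt_ms / 1000

    print(bdp_bytes(10, 20))    # 25000.0 bytes for a 10 Mbps, 20 ms path
    print(bdp_bytes(10, 300))   # 375000.0 bytes if the RTT is 300 ms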


Perhaps you know this already (not really clear on what your comment is saying), but Dave Taht is one of the authors of FQ-CoDel, which is what the author of CoDel recommends using when available.


Maybe something like this would be helpful for Apple to implement voice messages over satellite. Also a LOT of people in developing countries use WhatsApp voice messages with slow network speeds or expensive data. It's too easy to forget how big an audience Meta has outside the western world


I'm assuming they'll just re-encode everything, for every user, to a lower bitrate using this codec.

So, with their huge user base they'll be saving a gazillion terabytes hourly, that's what I concluded from their "2 years in the making" announcement.


If you mean for storage, real time codecs are actually pretty inefficient for that use case because they don't get much use of temporal redundancy. Although I'm not actually aware of a non-real time audio codec specialised for voice. They probably exist in Cheltenham and Maryland but for Meta this likely doesn't make a big enough part of their storage costs to bother


It's not only about the end that's receiving, it's also the end that's transmitting 10kbps * thousands of users.


> why do we need to optimize for <10kbps?

Because some people have really slow internet


I wonder how it sounds compared to G.729.

I worked for a company 20 years ago that had a modified G.729 codec that could go below 8kbps but sounded decent. We used this for VoIP over dial-up Internet, talk about low bandwidth.

Turns out some of the more interesting bits were in the jitter buffer and ways to manage the buffer. Scrappy connections deliver packets when they can and there is an art to managing the difference between the network experience and the user experience. For communications, you really need to manage the user experience.


What is the license? I searched but could not find anything.


It appears to be entirely closed source at the moment. This is just the announcement of development of in-house tech. Nothing public yet that could even be licensed.


Maybe it's just me (or maybe I've invested too much money into headphones), but I actually liked the Opus sound better at 6 kbps. The MLow samples had these... harsh and unnatural artifacts, whereas the Opus sound (though sounding like it came from a tin-can-and-string telephone and lacked all top-end) at least was 'smooth'. But I'm pretty sure that's because they are demonstrating here the very edge of what their codec can do, at higher bitrates the choice would probably be a lot clearer.


Is it free and open source? Any plans on making it available for native webrtc in the browser?


Was hoping this would have a GitHub link ...


Nice technology, tho Opus adds that warm sound I love...


Where is the source code?


Sometimes it sounds great but there are moments I think I'm listening to a harp and not somebody's voice.


It's not exactly reasonable to expect super high fidelity audio at the bitrate constraints they're targeting here, and it certainly sounds a lot better than the Opus examples they're comparing against.


The more complicated the codec, the more fascinating the failure modes. I love watching a digital TV with a bad signal, because the motion tracking in the codec causes people to wear previous, glitched frames as a skin while they move.


Good observation, and probably part of what makes "glitchy" AI generated video so captivating to watch.


Look up datamoshing on youtube


Are they comparing against opus using nolace?

Because that makes all the difference!


Can I use this to make music?

A little bit of a tangent: a technique called Linear Predictive Coding, which was developed by telecoms in the sixties and seventies, has a calculated bandwidth of 2.5 kbit/s. The sound quality is not any good, and telephone companies of the time didn't use it for calls, but the paper I read describing the technique says the decoded speech is "understandable". LPC found its way into musical production, in a set of instruments called "vocoders" used to distort a singer's voice. There are, for example, variations of it in something called "Orange Vocoder IV".

So, now I'm wondering, can MLow be used to purposefully introduce interesting distortions in speech? Or even change a singer's voice color?



Can't we have an audio codec that first sends a model of the particular voice, and then starts to send bits corresponding to the actual speech?


This is actually an old idea, minus the AI angle (1930s). It’s what voders and vocoders were originally designed for, before Kraftwerk et al. found out you can use them to make cool robot voices.


You need a bunch of bandwidth upfront for that, which you might not have, and enough compute available at the other end to reconstruct it, which you really might not have.


Regarding your first point, how about reserving a small percentage of the bandwidth for a model that improves incrementally?


You're adding more complexity to both the transmitter and receiver. I'd be pretty pissed if I had to endure unintelligible speech for a few minutes until the model was downloaded enough to be able to hear my friend. I'd also be a little pissed if I had to store massive models for everyone in my call log. Also both devices need to be able to run this model. If you are regularly talking over a shit signal, you're probably not going to be able to afford the latest flagship phone that has the hardware necessary to run it (which is exactly what the article touches on). The ideal codec takes up almost no bandwidth, sounds like you're sitting next to the caller, and runs on the average budget smartphone/PC. The issue is that you aren't going to be able to get one of these things so you choose a codec that best balances complexity, quality, and bandwidth given the situation. Link quality improves? Improve your voice quality by increasing the codec bitrate or switching to another less complex one to save battery. If both devices are capable of running a ML codec, then use that to improve quality and fit within the given bandwidth.


This is amazing.

So you could have about 9 voice streams over a 56K modem.

Incredible.

Great for stuffing audio recordings into little devices like greeting card audio players and microcontrollers.


Doesn't seem to have any mention of what they're going to do with this patent wise.

One of the biggest selling points for Opus was that it's patent-free (and "good enough" audio-wise), so anyone and everyone was able to use it legally.


Does anyone happen to know if ChatGPT's voice feature uses audio compression similar to Opus? Especially the "heavy 30 percent receiver-side packet loss" example sounds a LOT like the experience I have sometimes.


How low can we go? Ultimately we just need to send "who's speaking" at the beginning and the content as text: speech runs around 150 words per minute (roughly 12 characters per second), and compressed text takes about 1.5 bits per character, so that's on the order of 20 bps. Plenty of room at the bottom.


Maybe SiriusXM can pick this up. The audio quality is generally awful, especially on news/talk channels like Bloomberg and CNBC. There is no depth or richness to the voices.


It actually comes down to the SiriusXM receiver being used. I've witnessed the built-in SiriusXM on the latest GM platform (a $100,000+ Cadillac) sounding like AM radio, then sat in an older Lexus a few minutes later and heard the exact same channel rendered at better-than-Apple-streaming quality...

The mobile XM receivers (iPod-like devices) that they used to sell also had very good quality, and I never noticed any shortcomings even with good headphones.

I think the "high"-quality stream is 256 kbps/16k, which is fairly high compared to most streaming services, which come in at around 128/160.


My old Sirius portable receiver sounded like garbage despite the marketing material saying "Crystal Clear". My 2006 Infiniti Sirius receiver didn't sound any better despite being a massive space heater in the trunk. The later cars I've used it in sound good, at least good enough to be as clear as FM radio or even HD radio. I think some channels are still at lower bitrates, the news channels for example, and they've always sounded bad. I've also read something about SiriusXM using terrestrial transmitters, which may improve the signal, whereas the satellite link may be lower bandwidth.


I am archiving some music at 40 kbps using Opus and the quality is pretty amazing. I think once things get above 20 kbps or so, all the codecs start sounding pretty good (relative to these very low bitrates).

I still prefer flac if possible.


Opus is fantastic!


That is a marked improvement compared to the other examples provided. Nice to see it also requires fewer compute resources for that higher-quality output.


Not sure of the timing, but is this what Meta researched when fighting audio latency for Virtual Meetings during the development of the Meta Quest 2?


Does someone know if we can expect to see similar improvements for video streaming in the near future?


Will the source code be available?


Could it be used for voice over satellite, ie Emergency SOS via satellite on iPhones?


iPhones use Globalstar, which theoretically supports voice bitrates of (I believe) 9.6 kbps, although only using dedicated satphones with large, external antennas.

Apple's current solution requires several seconds to transmit a location stamp of only a handful of bytes, so I think we're either some iPhone or some satellite upgrades away from real-time voice communication over that.

Starlink has demonstrated a direct-to-device video call already, though, so we seem to be quickly approaching that point! My strong suspicion is that Apple has bigger plans for Globalstar than just text messaging.


Starlink is in a better position, as their satellites are in low Earth orbit, roughly 30 times closer than geostationary. Since free-space path loss scales with the square of the distance, that translates to roughly a 1000x (30 dB) stronger signal on both sides.


Globalstar is LEO as well, although a bit higher (~1400 km) than Iridium (~780 km) and Starlink (below Iridium; various altitudes). In terms of SNR, they're very comparable.

Newer GEO direct-to-device satellites also have huge reflectors and often much higher transmit power levels that can compensate for the greater distance somewhat. Terrestar and Thuraya have had quite small phones available since the late 2000s already, and they're both (large) GEO.


Iridium and Globalstar aren't geostationary. They are LEO not much higher than Starlink.

Starlink is doing direct-to-cell. Talking to existing phones requires a large antenna. The bandwidth for each device is slow, not enough for mobile data, but better than Iridium. I think they recently showed off voice calls.


I am curious about encoding times compared with standard codecs using ffmpeg.


They claim it's 10% lower than Opus, specifically for the decoder IIRC, but since they talk about the 10-year-old hardware used to make millions of WhatsApp calls daily, the encoder can't be computationally complex either.

But, yeah, some actual data (if they're not willing to provide running code) would have been a welcome addition to this PR overview.


Looks (and sounds) cool.

What is the license?


Meta engineering really rocks. All other controversies aside.


They should all jump straight to VVC with LCEVC to enhance it.


Meta..


Please stop lossy compression.


Have you ever looked at the size of losslessly compressed video? It's huge. Lossy compression is the only practical way to store and stream video, since it's typically less than 1% of the size of uncompressed video, while lossless compression typically only gets down to about 50%. It's amazing how much information you can throw out of a video and barely be able to tell the difference.
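A quick back-of-envelope check of that ratio (the figures below, 1080p30 with 8-bit 4:2:0 for the raw stream and ~5 Mbps for the lossy one, are just assumed typical values):

    # Rough sanity check of the "less than 1%" claim for lossy video.
    width, height, fps = 1920, 1080, 30
    bytes_per_pixel = 1.5                                  # 8-bit 4:2:0 chroma subsampling
    raw_bps = width * height * bytes_per_pixel * fps * 8   # ~746 Mbps uncompressed
    lossy_bps = 5e6                                        # typical streaming bitrate (assumed)
    print(lossy_bps / raw_bps)                             # ~0.007, i.e. well under 1%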


Lossy compression has its practical uses. Under ideal circumstances nobody is going to stop you from using FLAC.


Surely all digital audio is lossy to a certain extent as you have a finite number of quantization levels and a finite sampling frequency.


It depends on what you consider lossy.

For time discretization, the Nyquist–Shannon sampling theorem[1] says a band-limited, continuous-time signal can be perfectly reconstructed from a discrete-time signal with a sufficiently high (but finite) sampling frequency. Human hearing is naturally band-limited to roughly 20 Hz to 20 kHz, and audio recordings typically use a sampling frequency sufficient to losslessly reconstruct that bandwidth.

For quantization, any recorded analog signal is a true signal plus some level of measurement noise. Quantization of a signal can also be thought of as adding noise to a continuous-amplitude signal[2]. If the "quantization noise" is much smaller than the measurement noise of the signal, then the discretization is effectively lossless. Typical audio formats have a maximum quantization error far, far smaller than the measurement error of typical audio recording hardware (and of the human ear).

So typical quantized, discrete-time audio formats can be considered lossless representations of the analog sound signals humans can hear (assuming proper capture and processing). On the other hand, no quantized or discretized signal can losslessly represent a non-band-limited, zero-noise audio signal. (A rough numerical check of the quantization-noise point is sketched after the footnotes.)

[1]: https://en.wikipedia.org/w/index.php?title=Nyquist%E2%80%93S...

[2]: https://en.wikipedia.org/wiki/Quantization_(signal_processin...
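As a rough numerical illustration of that quantization-noise argument (44.1 kHz / 16-bit are just the usual CD parameters, picked here for the example):

    import numpy as np

    # Quantize a full-scale 1 kHz sine to 16 bits and measure the error power.
    fs, bits = 44100, 16
    t = np.arange(fs) / fs                        # one second of samples
    x = np.sin(2 * np.pi * 1000 * t)              # "true" continuous-amplitude signal
    step = 2.0 / (2 ** bits)                      # quantizer step size over [-1, 1)
    xq = np.round(x / step) * step                # quantized signal
    snr_db = 10 * np.log10(np.mean(x**2) / np.mean((x - xq)**2))
    print(snr_db)                                 # ~98 dB, matching the 6.02*bits + 1.76 rule of thumb

In practice the noise floor of the microphone, preamp, and room is typically far above that roughly -98 dB quantization floor, which is the sense in which 16-bit quantization is "effectively lossless" for captured audio.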


That's great: now they can reach even more developing countries and do damage the way they did, for example, in Myanmar [1].

[1] https://www.amnesty.org/en/latest/news/2022/09/myanmar-faceb...



