
Very cool. Seems like they addressed the problem of hallucination. It would be interesting to see an example of it hallucinating without redundancy and being corrected with redundancy.



Isn't packet loss concealment (PLC) a form of hallucination? Not saying it's bad, just that it's still Making Shit Up™ in a statistically-credible way.


Well, there are different ways to make things up. We decided against using a pure generative model to avoid making up phonemes or words. Instead, we predict the expected acoustic features (using a regression loss), which means the model is able to continue a vowel. If unsure, it'll just pick the "middle point", which won't be something recognizable as a new word. That's in line with how traditional PLCs work; it just sounds better. The only generative part is the vocoder that reconstructs the waveform, but it's constrained to match the predicted spectrum, so it can't hallucinate either.
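For intuition, here's a toy numpy sketch (not the codec's actual code; the feature vectors are invented) of why a regression loss tends toward a "middle point" while a generative model commits to a specific, possibly wrong, continuation:

    import numpy as np

    # Toy illustration (not the actual Opus PLC model): with a regression loss,
    # an uncertain prediction collapses toward the "middle point" of the
    # plausible continuations, whereas a generative model commits to one of
    # them and can therefore invent a phoneme that was never there.

    rng = np.random.default_rng(0)

    # Pretend these are acoustic feature vectors for two equally plausible
    # continuations of a lost frame (numbers are made up).
    candidates = np.array([
        [1.0, 0.2, -0.5],   # e.g. the current vowel simply continues
        [-1.0, 0.8, 0.3],   # e.g. a different phoneme starts
    ])

    # The L2-optimal regression output under this uncertainty is the mean.
    regression_prediction = candidates.mean(axis=0)

    # A generative model instead samples one candidate and commits to it.
    generative_sample = candidates[rng.integers(len(candidates))]

    print("regression ->", regression_prediction)   # bland "middle point"
    print("generative ->", generative_sample)       # confident, possibly wrong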


Any demos of this to listen to? It sounds potentially really good.


There is a demo in the link shared by OP.


That's really cool. Congratulations on the release!


The PLC intentionally fades off after around 100 ms so as not to cause misleading hallucinations. It is really just about filling small gaps.
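A rough sketch of that behaviour, assuming 48 kHz audio and a made-up ramp length (the real PLC's fade shape and constants aren't specified here):

    import numpy as np

    # Sketch of "conceal for ~100 ms, then fade to silence". The ramp shape
    # and FADE_LENGTH_MS are assumptions for illustration, not Opus internals.

    SAMPLE_RATE = 48000      # samples per second
    FADE_START_MS = 100      # full-level concealment up to roughly 100 ms
    FADE_LENGTH_MS = 20      # assumed length of the ramp down to silence

    def concealment_gain(n_lost_samples: int) -> np.ndarray:
        """Per-sample gain for a concealed gap: 1.0 until FADE_START_MS,
        then a linear ramp to 0 over FADE_LENGTH_MS, then silence."""
        t_ms = np.arange(n_lost_samples) * 1000.0 / SAMPLE_RATE
        return np.clip(1.0 - (t_ms - FADE_START_MS) / FADE_LENGTH_MS, 0.0, 1.0)

    # A 200 ms gap: concealed audio plays at full level for the first ~100 ms,
    # then fades out instead of guessing further into the future.
    gap = np.ones(SAMPLE_RATE // 5)             # placeholder concealed audio
    print(concealment_gain(len(gap))[::2400])   # gain sampled every 50 ms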


In a broader context, though, this happens all the time. You’d be surprised what people mishear in noisy conditions. (Or if they’re hard of hearing.) The only thing for it is to ask them to repeat back what they heard, when it matters.

It might be an interesting test to compare what people mishear with and without this kind of compensation.


As part of the packet loss challenge, there was an ASR word accuracy evaluation to see how PLC impacted intelligibility. See https://www.microsoft.com/en-us/research/academic-program/au...

The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.
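For anyone curious what that kind of evaluation looks like, here's a minimal sketch: compute word accuracy (1 − WER) of ASR transcripts of zero-filled vs. concealed audio against a reference transcript. The transcripts below are invented placeholders; the actual challenge used a real recognizer on real degraded audio.

    # Minimal word-accuracy comparison. The transcripts are made-up placeholders;
    # in a real evaluation they would come from running ASR on the degraded audio.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Standard WER computed via word-level edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn the first i reference words into the first j
        # hypothesis words.
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution/match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    reference    = "please confirm the block on the up fast line"
    asr_zeros    = "please confirm the on the fast line"           # invented
    asr_with_plc = "please confirm the block on the up fast line"  # invented

    for name, hyp in [("zero fill", asr_zeros), ("PLC", asr_with_plc)]:
        print(f"{name}: word accuracy = {1 - word_error_rate(reference, hyp):.2f}")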


Right, this is why proper radio calls for a lot of systems have mandatory read-back steps, so that we're sure two humans have achieved a shared understanding regardless of how sure they are of what they heard. It not only matters whether you heard correctly, it also matters whether you understood correctly.

E.g. a train driver asks for an "Up Fast" block. His train is sat on Down Fast, the Up Fast is adjacent, so once the block is granted he can walk on the (now safe) railway track and inspect his train at track level, which is exactly what he, knowing the fault he's investigating, was taught to do.

Signaller hears "Up Fast" but thinks duh, stupid train driver forgot he's on Down Fast. He doesn't need a block, the signalling system knows the train is in the way and won't let the signaller route trains on that section. So the Up Fast line isn't made safe.

If they leave the call here, both think they've achieved understanding but actually there is no shared understanding and that's a safety critical mistake.

If they follow a read-back procedure they discover the mistake. "So I have my Up Fast block?" "You're stopped on Down Fast, you don't need an Up Fast block". "I know that, I need Up Fast. I want to walk along the track!" "Oh! I see now, I am filling out the paperwork for you to take Up Fast". Both humans now understand what's going on correctly.


To borrow from Joscha Bach: if you like the output, it's called creativity. If you don't, it's called a hallucination.


That sounds funny, but is it true? Certainly there's a bias in the direction your quote describes, but would you otherwise genuinely call the computer creative? Is that a positive aspect of a speech codec, or of an information source?

Creativity is when you ask a neural net to create a poem, or something else from "scratch" (meant to be unique). Hallucination is when you didn't ask it to make its answer up, but to recite or rephrase things it has directly observed.

That's my layman's understanding anyway, let me know if you agree


That's almost the same. You could say it's being creative by not following directions.

Creativity isn't well-defined. If you generate things at random, they are all unique. If you then filter them to remove all the bad output, the result could be just as "creative" as anything someone could write. (In principle. In practice, it's not that easy.)

And that's how evolution works. Many organisms have very "creative" designs. Filtering at scale, over a long enough period of time, is very powerful.

Generative models are sort of like that in that they often use a random number generator as input, and they could generate thousands of possible outputs. So it's not clear why this couldn't be just as creative as anything else, in principle.

The filtering step is often not that good, though. Sometimes it's done manually, and we call that cherry-picking.


Doesn't the context affect things much more than whether you like the particular results?

Either way, "creativity" in the playback of my voice call is just as bad.


I love that, what's it from? (My Google-fu failed.) Unexpected responses are often a joy when using AI in a creative context. https://www.cell.com/trends/neurosciences/abstract/S0166-223...


It was from one of his podcast appearances. Which doesn't narrow it down much, unfortunately. Most likely options:

https://www.youtube.com/watch?v=LgwjcqhkOA4

https://www.youtube.com/watch?v=sIKbp3KcS8A

https://www.youtube.com/watch?v=CcQMYNi9a2w


There's something darkly funny about Opus acting psychotic because it glitched out or was fed something really complex. But you could argue transparent lossy compression at 80–320 kbps is a controlled, deliriant-like hallucination, given how few people can tell it apart from lossless.



