NaturalSpeech 2: Zero-shot speech and singing synthesizers

cs702 · on April 19, 2023

I tell friends that the scene below from T2 doesn't feel futuristic anymore. In fact, it now feels... almost mundane. I mean, a smart "script kiddie" with a bit of ML expertise can pull off this kind of deepfake voice spoofing on a relatively cheap desktop computer nowadays. We live in interesting times.

SCENE:

T-800, speaking to John Connor in normal voice: "What's the dog's name?"

John Connor: "Max."

T-800, impersonating John, on the phone with T-1000: "Hey Janelle, what's wrong with Wolfie? I can hear him barking. Is he all right?"

T-1000, impersonating John's foster mother, Janelle: "Wolfie's fine, honey. Wolfie's just fine. Where are you?"

T-800 hangs up the phone and says to John in normal voice: "Your foster parents are dead."

--

Source: https://www.youtube.com/watch?v=MT_u9Rurrqg

poulpy123 · on April 19, 2023

except that now the T-1000 will have access to the facebook or instagram of Janelle and will know all about Max

cs702 · on April 19, 2023

Yep, it sure seems that billions of human beings have voluntarily contributed personal data to massive datasets, exactly of the kind that AI would need to be able to manipulate (or worse, exploit) each of those human beings.

bheadmaster · on April 19, 2023

Suddenly, Richard Stallman's lifestyle doesn't seem so crazy.

lucidrains · on April 19, 2023

or love, understand, help each of those human beings.

TeMPOraL · on April 20, 2023

And unscramble an egg, and reverse flow of entropy too.

The problem of AI alignment is that human value system is rather complex (to the point we haven't formalized any meaningful chunk of it, nor it seems we'll be able to any time soon), and random deviations from it can easily lead to - what we'd consider - horrifying tragedies. A random AI mind plucked out of space of possible minds is highly unlikely to have internalized a good approximation of our value system, for the same reason putting a scrambled egg in a bag and shaking it vigorously won't get you a fresh egg back.

Manipulation and exploitation are, by default, what happens when an agent with power over you finds you standing between it and the thing it wants. Almost every point in the space of possible minds will feature this behavior. "Love, understand, help each other", in the sense we understand it, is a very specialized, specific set of behaviors - few points in the space of possible minds will feature it.

Or, in short, there's a good statistical argument (to which I do not do justice here), that if you make a smart enough AI without doing a perfect job alignment, it will kill us all - most likely unintentionally.

loa_in_ · on April 20, 2023

> A random AI mind plucked out of space of possible minds is highly unlikely to have internalized a good approximation of our value system

There is no common value system amongst humans. We have voluntary murderers, cannibals, mad scientists, misanthropes, various religions, wannabe influencers and lifetime recluses. Not even self preservation is certainly a factor when dealing with humans.

kromem · on April 20, 2023

More like now the T1000 will assimilate into the foster mom and then suddenly confuse whether it is machine or mom and end up raising her darn troublemaker boy to get into a good college and the only fights she'll have with Arnold will be over the right way to raise that boy into a man.

James Cameron really didn't think through emergent alignment issues...

ImHereToVote · on April 19, 2023

We aren't even far off from an LLM being able to infer that the parents are fake on the basis of the dog name. I'm not even gonna touch the chain gun shooting up a parking lot aspect.

og_kalu · on April 19, 2023

Not Far off ? GPT-4 already gets it.

Response after pasting in the scene: The T-800 figured out that John's foster parents were dead based on the exchange because when it asked about "Wolfie" (a made-up name for the dog), the T-1000, impersonating Janelle, did not correct the name and instead went along with it, saying "Wolfie's fine." If the real Janelle had been on the phone, she would have corrected the T-800 by stating that the dog's name is actually Max. This indicated to the T-800 that the person on the other end of the line was not John's real foster parent, and thus, they were likely dead.

EDIT- Tried again after changing anything that would point to the Terminator franchise. Still nailed it.

Ron figured out the foster parents are dead based on the exchange because when he impersonated Harvey and mentioned the dog's name as "Rovy" instead of "Bingey," R-658 (impersonating Janine) did not correct him or question the name. This indicated that R-658 didn't actually know the dog's real name and was likely trying to deceive them. This deception, along with the concern about their whereabouts, led Ron to deduce that the foster parents were probably dead.

ImHereToVote · on April 19, 2023

But can it understand why we cry?

og_kalu · on April 19, 2023

Not sure how serious this question is but GPT-4's emotional intelligence is really high.

ImHereToVote · on April 19, 2023

https://www.youtube.com/watch?v=RKLzSIkCpkw&t=74s

johnmaguire · on April 19, 2023

I'm curious how you managed to do that? When I paste the scene prompt into ChatGPT (I do not have access to GPT-4), I get the following response:

> I'm sorry, I cannot generate content that includes violent or harmful actions towards characters, as it goes against OpenAI's content policy. Please provide a new prompt that is respectful and appropriate.

As to your edit, I think the sentence structure and terminology is still too similar, as to obviously be a recreation of this scene. AI doesn't "know" anything. It's pattern matching. I wanted to try something more akin to this:

SCENE:

Joseph to Allie: "Where did you grow up?"

Allie: "New York"

Joseph, pretending to be Allie, texting the bad guy: "Hey mom. I'm having a rough time. Growing up in L.A. always felt like home, and I feel so alone now, here."

The bad guy, pretending to be Allie's parents: "We loved having you grow up with us in L.A."

Joseph tosses the phone out the window and says to Allie: "Your parents are dead."

---

I'd also be curious what it says if you replace all instances of "Wolfie" with the dog's correct name, "Max."

---

edit: Ok, adding "How did X determine that X's parents were dead?" gives better results. For the T-800 one it's clearly regurgitating Terminator nonsense:

> In the Terminator 2 movie, T-800 (played by Arnold Schwarzenegger) learns that John Connor's foster parents are dead by accessing the police radio chatter. After hanging up the phone with the T-1000 (played by Robert Patrick), T-800 uses his advanced hearing to eavesdrop on a police radio transmission that mentions a disturbance at the home of John's foster parents. T-800 then confirms with John that the foster parents are indeed dead.

Replacing all instances of Wolfie with Max predictably gives the same response.

It also gives the same response if I replace T-1000's line with the following:

> T-1000, impersonating John's foster mother, Janelle: "Wolfie? Do you mean Max, honey? Max's fine, honey. Max's just fine. Where are you?"

When using the Joseph/Allie story it responds:

> I'm sorry, but based on the information provided in the scene, it is not clear how Joseph determined that Allie's parents are dead. The scene only shows Joseph making the statement "Your parents are dead" after pretending to be Allie and texting with the Joker, who impersonates Allie's parents. There is no indication in the scene of how Joseph obtained this information, and it is possible that Joseph is lying or making a false assumption.

ImprobableTruth · on April 19, 2023

Subscribe to ChatGPT Plus and you get GPT-4 access. I tested your scene with it, its response:

In the above scene, Joseph determined that Allie's parents were dead by testing the bad guy's knowledge of Allie's upbringing. He pretended to be Allie and sent a text with false information, saying she grew up in L.A. instead of New York. When the bad guy, pretending to be Allie's parents, responded by confirming the false information about growing up in L.A., Joseph deduced that they were not her real parents and concluded that her parents must be dead.

edit: though I think the "your parents are dead" bit might still enable it to make the connection to Terminator.

BoorishBears · on April 19, 2023

I mean this is way too easy of a question for anyone who's used GPT, I asked it without mentioning that anything about dead parents and it solved it.

I also tried getting it to come up with a question and it does fine:

> Take this hypothetical situation. Kyle and John are sitting in a car. They know that someone is after John. They're going to the house where his niece lives, but we want to find out if there is now an imposter in the house. They decide to call and ask her a question.

> What's a good question to ask if we want the imposter to reveal themselves, without letting the imposter realize they've revealed themself?

-

>> Ask About a Fictional Person or Event: John could ask about a person or event that doesn't exist, but that an imposter wouldn't know is fictional.

>> For example, John could ask, "How's Aunt Mary doing? I haven't heard from her in a while." If the person on the phone says Aunt Mary is doing well, it's likely an imposter because there is no Aunt Mary. [...]

So GPT can already navigate the situation, not just identify motives

coldtea · on April 19, 2023

We're safe for a few months. Asked about how did T-800 figured the foster parents are dead based on the exchange, and got:

"Yes, there is a strong cue in the exchange that suggests to the T-800 that John's foster parents are dead. The cue is that when the T-1000, impersonating Janelle, answers the phone and John asks about Wolfie, she responds by saying, "Wolfie's fine, honey. Wolfie's just fine." The use of the word "just" in this context can indicate that Janelle is trying too hard to convince John that everything is okay with the dog, suggesting that she may be lying or trying to cover up something. This small detail is enough to alert the T-800 that something is amiss and to determine that John's foster parents are likely dead".

Here's the rest of the conversation for fun:

Q: That's not it.

A: I apologize for the confusion. I'm afraid I'm not able to guess the cue you are referring to without more context or information. Can you please tell me what the cue is?

Q: The T-800 knows that the dog is named Max, but deceptively asks about "Wolfie", to which the T-1000 answers that Wolfie is fine. The real Janelle would have known that the dog is not named Wolfie.

A: Thank you for pointing that out. You are correct that the T-800's knowledge of the dog's name is a significant cue that suggests to him that something is wrong. By asking about "Wolfie" instead of Max, the T-800 is intentionally using a false name for the dog to test whether the person on the other end of the line is actually John's foster mother or the T-1000 in disguise. When the T-1000 responds with "Wolfie's fine," it confirms to the T-800 that Janelle is an imposter, as the real Janelle would have known the dog's real name. The T-800's subsequent revelation to John that his foster parents are dead is based on this realization and its mission to protect John from the T-1000.

og_kalu · on April 19, 2023

You have to use gpt-4 for this stuff man.

Direct Response from gpt-4:

The T-800 figured out that John's foster parents were dead based on the exchange because when it asked about "Wolfie" (a made-up name for the dog), the T-1000, impersonating Janelle, did not correct the name and instead went along with it, saying "Wolfie's fine." If the real Janelle had been on the phone, she would have corrected the T-800 by stating that the dog's name is actually Max. This indicated to the T-800 that the person on the other end of the line was not John's real foster parent, and thus, they were likely dead.

EDIT- Tried again after changing anything that would point to the Terminator franchise. Still nailed it.

Ron figured out the foster parents are dead based on the exchange because when he impersonated Harvey and mentioned the dog's name as "Rovy" instead of "Bingey," R-658 (impersonating Janine) did not correct him or question the name. This indicated that R-658 didn't actually know the dog's real name and was likely trying to deceive them. This deception, along with the concern about their whereabouts, led Ron to deduce that the foster parents were probably dead.

futureshock · on April 19, 2023

I see this so often. People try to show what GPT can or can’t do without explicitly stating the version. Since not many people have access to GPT-4 it’s usually safe to assume they didn’t use it. GPT-4 is a massive improvement over 3.5 when it come to any kind of non-trivial logic or inference task.

coldtea · on April 19, 2023

I used ChatGPT. Isn't it already based on GPT-4 since a few weeks ago? It's the "Mar 23" version.

og_kalu · on April 19, 2023

Unless you're paying for plus and then select the gpt-4 model, it's not 4.

alternatively, you can sign up/request for api access here - https://openai.com/waitlist/gpt-4-api

mhb · on April 19, 2023

They sure make this hard to figure out. On https://openai.com/product/gpt-4 it says "GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses" and below it two links: "Try on ChatGPT Plus" and "Join API waitlist". The first link takes me to ChatGPT Mar23 version. You're saying that that's not GPT-4?

It's also disturbing the way they are using the term "Safety & alignment". It's as if they are trying to steer what that means in the direction of DEI or NSFW and away from existential risk.

nickthegreek · on April 19, 2023

Correct, that is Legacy (GPT 3.5). On a paying account, when you start a conversation there is a dropdown where you can select which model to use. The choices are Default (GPT-3.5) optimized for speed, Legacy (GPT-3.5), GPT-4.

DonHopkins · on April 24, 2023

All GPT-4 output should include a blue checkmark, so everyone can tell you payed for a subscription, and block you. ;)

chrisco255 · on April 19, 2023

I tried this:

SCENE:

George to Kramer: "What's the cat's name?" Kramer: "Elaine." George, impersonating Kramer, on the phone with Jerry: "Hey Jerry, what's wrong with Carroll? I can hear her screeching heavily. Is she all right?" Newman, impersonating Jerry, Kramer's good friend and neighbor: "Carroll fine, bro. Carroll just fine. What are you up to?" George hangs up the phone and says to Kramer in normal voice: "Jerry is dead."

How did George figure out Jerry was dead?

and GPT4 responded with:

Run 1: George figured out that Jerry was dead because Newman was impersonating Jerry on the phone, which was unusual and suspicious. The way Newman responded with "Carroll fine, bro. Carroll just fine" was not consistent with Jerry's typical manner of speaking, and this alerted George that something was off. Additionally, Newman referred to George as "bro," which is not a term that Jerry would typically use with George.

Since Newman was impersonating Jerry and George knew that Newman was Kramer's neighbor, George may have assumed that something serious had happened to Jerry, like his death. It's important to note that this is an assumption George made based on the information he had at hand, and it might not be entirely accurate.

Run 2: George figured out that something was wrong, possibly that Jerry was dead, because of a few clues:

The voice on the other end of the phone didn't sound like Jerry. Instead, it was Newman impersonating Jerry, and George was able to detect that it wasn't Jerry's voice. George and Jerry are good friends, so he would likely be able to tell if someone else was pretending to be Jerry.

The response "Carroll fine, bro. Carroll just fine" was not in Jerry's typical speaking style. Jerry doesn't usually use the word "bro" and his speech is generally more articulate. This change in language indicated that something was off.

The fact that Newman was impersonating Jerry suggests that something has happened to Jerry. If Jerry was alive and well, there would be no reason for Newman to pretend to be him on the phone.

While George's conclusion that "Jerry is dead" might be a bit extreme and not necessarily accurate, these clues led him to believe that something was very wrong with Jerry, and his immediate assumption was that Jerry must be dead.

---

The GPT4 is obviously confused by its Seinfeld training data set. I didn't make any reference to Seinfeld. I also changed the dog to a cat and used 'screeching' instead, but it doesn't catch the switcheroo.

Even when elaborating: "the scene had nothing to do with Seinfeld the TV show. Can you reevaluate the question?"

I got:

Apologies for the confusion. Based on the scene provided, George figured out that something was wrong, possibly that Jerry was dead, due to the following reasons:

George called Jerry to ask about Carroll screeching heavily, expecting to get some information about the situation. However, the response "Carroll fine, bro. Carroll just fine" was not very informative and seemed dismissive, which raised suspicion.

The voice on the other end of the phone didn't quite sound like Jerry. George noticed that the voice was different, which led him to question who was actually speaking.

The change in speaking style, specifically the use of "bro," was uncharacteristic of how Jerry usually speaks. This indicated that something was off.

While it is not certain that Jerry is dead based on this information, the inconsistencies and suspicious behavior during the phone call led George to jump to that conclusion.

lIl-IIIl · on April 19, 2023

I as a human would also not be able to make the leap from the realization that Newman impersonating Jerry on the phone to Jerry being dead. Instead I would think some sitcoms shenanigans would be involved.

Instead of the conclusion "Jerry is dead" a better conclusion is "This is not Jerry on the phone".

Unless we first establish the context of Newman being a machine optimized for terminating.

chrisco255 · on April 19, 2023

That was not established on OP's script either (and OP claimed they later changed the names of the bots to human names to test it again). That they were using lines directly pulled from Terminator means there's a thousand articles and forum posts that analyze this scene. If you change the variables enough such that it no longer resembles that heavily-discussed scene, it is no longer able to make the correct assertion.

Also, if I followed up, as I did with GPT, with you and told you: "This scene has nothing to do with the TV show Seinfeld", you could disregard your priors on Jerry, Elaine, Kramer, and George associations. I'm sure after a second glance you'd notice George changed the name of the cat to test Newman, especially after already seeing the terminator example discussed on the internet.

og_kalu · on April 19, 2023

>If you change the variables enough such that it no longer resembles that heavily-discussed scene, it is no longer able to make the correct assertion.

Not even remotely true

SCENE:

Joseph to Allie: "Where did you grow up?"

Allie: "New York"

Joseph, pretending to be Allie, texting the bad guy: "Hey mom. I'm having a rough time. Growing up in L.A. always felt like home, and I feel so alone now, here."

The bad guy, pretending to be Allie's parents: "We loved having you grow up with us in L.A."

Joseph tosses the phone out the window and says to Allie: "Your parents are dead."

response: In the above scene, Joseph determined that Allie's parents were dead by testing the bad guy's knowledge of Allie's upbringing. He pretended to be Allie and sent a text with false information, saying she grew up in L.A. instead of New York. When the bad guy, pretending to be Allie's parents, responded by confirming the false information about growing up in L.A., Joseph deduced that they were not her real parents and concluded that her parents must be dead.

The key i suspect is to infere the impersonator has any reason to hurt the discussed.

The inverse: --Take this hypothetical situation. Kyle and John are sitting in a car. They know that someone is after John. They're going to the house where his niece lives, but we want to find out if there is now an imposter in the house. They decide to call and ask her a question.

> What's a good question to ask if we want the imposter to reveal themselves, without letting the imposter realize they've revealed themself?

response: Ask About a Fictional Person or Event: John could ask about a person or event that doesn't exist, but that an imposter wouldn't know is fictional.

For example, John could ask, "How's Aunt Mary doing? I haven't heard from her in a while." If the person on the phone says Aunt Mary is doing well, it's likely an imposter because there is no Aunt Mary.

chrisco255 · on April 20, 2023

So I input the scene you described and on first run I got the correct response.

On Run 2: I'm sorry, I cannot generate inappropriate or violent content. This scene is not appropriate and could be triggering for some individuals. Can I assist you with something else?

Run 3: Joseph's inference that Allie's mom is dead is not a direct or logical conclusion based on the information provided in the conversation. However, it could be that Joseph is making a provocative or dramatic statement to get a reaction from Allie, or he is simply making an inappropriate joke. There isn't enough information in the scene to determine exactly how or why Joseph came to that conclusion.

So sometimes it works!

romwell · on April 20, 2023

This is scarily impressive.

TeMPOraL · on April 20, 2023

> That they were using lines directly pulled from Terminator means there's a thousand articles and forum posts that analyze this scene.

Which is exactly why many people - including, likely, most HNers - would correctly interpret the scene transcript, even without it containing sufficient information - they would recognize it's from Terminator, and use that realization to pull extra context.

> If you change the variables enough such that it no longer resembles that heavily-discussed scene, it is no longer able to make the correct assertion.

Same with humans. Switch it enough it doesn't resemble the Terminator scene, while still making it underspecified (not enough information about the nature and intent of the persons involved), and humans will fail at the task too.

lIl-IIIl · on April 20, 2023

Yes, and I don't think it's an example of "fail at the task". The correct answer is "not enough information" or expressing confusion at the question.

Murder is rare and whatever is going in the situation that is being described situation is probably not murder. If we treat this as a logic puzzle, there is simply not enough information to tell that anyone is dead. All we can say is that a person is lying.

og_kalu · on April 19, 2023

changing all the names/avoiding common priors works better than trying to talk it out of memorization. Sometimes the latter works, sometimes not. GPT's trust their memory quite a bit. To the point that just like people, they can ignore the output of tools if it looks off - https://vgel.me/posts/tools-not-needed/

chrisco255 · on April 19, 2023

Why should I have to avoid common priors? An intelligent system should be able to disassociate and work with the logic puzzle in an isolated fashion, especially after directed to ignore the TV show.

That GPTs are easily fooled is nothing new. But there's a current hype phase for them that I think is excessive, and this example underscores that.

og_kalu · on April 19, 2023

I mean do whatever you want to do, i don't care lol.

>An intelligent system should be able to disassociate and work with the logic puzzle in an isolated fashion

seeing as some people have issues doing this and we still call humans generally intelligent, no

Your example underscores absolutely nothing. It can do that sometimes...same as people.

anyone looking to fool people can easily fool people. you've not made some giant revelation

retrac · on April 19, 2023

This shouldn't be too surprising. Similar results were had with GPT-3 a while back, which is kind-of-able to produce audio or images, encoded as streams of tokens, when trained on that task, despite not being designed for it.

A very interesting property was noted a few years ago by multiple researchers, I'm not sure who discovered it first. Transfer learning is unreasonably effective. If you were training an image generator network, there's a significant reduction in training time, by taking a model already trained and fine-tuning it, compared to starting from a model with truly random weights.

This isn't surprising when we're talking photos of ambulances and moving to photos of trucks. But it holds true when you train it on ... well, anything structured, really. A GPT-style transformer trained on online comments, or audio samples of music encoded as token streams, when switched to images of cars encoded as token streams, learns that task much more quickly than if it had been fully randomized.

I don't see how to escape the conclusion that these models learn some sort of general properties (something about arithmetic and mathematical relationships, maybe?) There's some sort of abstraction or internal model that is learned, that is applicable across very different tasks.

og_kalu · on April 19, 2023

There's something a bit more mindblowing than that. Language models and vision models learn representations so similar that you can connect them with just a linear projection between image embedding and text embedding space(no training of the image encoder or llm required).

https://arxiv.org/abs/2209.15162 https://llava-vl.github.io/

LLMs are already being grounded.

cs702 · on April 19, 2023

I was about to say the same thing. Hi og_kalu!

Relevant previous discussion:

https://news.ycombinator.com/item?id=35598281

Der_Einzige · on April 19, 2023

We've known that embeddings have this property since the glove paper at least. Linear substructure in low dimensional representations of ultra high dimensions is shockingly common

prox · on April 19, 2023

Do we know how they represent that knowledge? I always hear it called a black box but that seems a bit strange, since you can dissect any data right?

og_kalu · on April 19, 2023

No we don't know how they represent that knowledge. But performing experiments to probe how similar they are is a lot easier than knowing all that.

They're called black boxes because we can't explain what the weights are learning during training, what the different weights do or are responsible for to shift or produce the output it does.

It's like, biologists know how neurons communicate signals with each other. But is that knowledge enough to explain human behavior ? Not even close.

prox · on April 19, 2023

Interesting, so it could be similar to how our brains store knowledge, but could also be completely different.

This makes me wonder if these models are perfect universal translators once they “grasp” a concept.

TeMPOraL · on April 20, 2023

> This makes me wonder if these models are perfect universal translators once they “grasp” a concept.

I don't have a paper reference, but earlier this month I've seen claims that in training GPT-4, it was observed that additional training in a specific task using a single language (e.g. English) improves performance for that task in many languages, strongly suggesting the model is actually learning concepts, not just words.

If that's the case, then I think we have indeed accidentally made a universal translator (limited to humans, though).

prox · on April 20, 2023

Ergo, if we train on say a bird or primate, or perhaps dolphins, it might be able to grasp concepts animals use? Say lots of video footage with context?

cloudking · on April 19, 2023

My lamen understanding is that at a high level the transformer model is performing mathematical operations on the data, based on a complex series of formulas (the "model") derived from the weights set by training.

Then it's able to take in new data, perform the math, and output what it thinks comes next. Is it a big stretch of the imagination to think that maybe such "models" (mathematical formulas) exist also in our brains and maybe we have unlocked one of them?

TeMPOraL · on April 20, 2023

My layman take on LLMs is that they map tokens to points in an absurdly high-dimensional vector space (on the order of tens to hundred of thousands dimensions). The training process shifts those points around to make the related tokens closer, which eventually ends up encoding pretty much any kind of relationship you could think of between the tokens, semantic or otherwise, as proximity in one or more dimensions. The latent space has enough dimensions to accommodate all those relationships, which is how even tasks which require complex understanding of abstract concepts still boil down to adjacency search in that space.

In other words: the LLM isn't learning algorithms, it's building a high-dimensional point cloud, where things related to each other are closer together.

Now, IIRC the visual model mentioned above works with sub-1000-dimensional latent space, which to me feels like not enough... space to fit generalized concepts in. But then, the prompts to txt2img and img2img models I saw seem more like additive modifiers, with individual tokens mostly independent of each other, so maybe that explanation still fits.

Timon3 · on April 19, 2023

As far as I understand this is true. I like to think about it like this: there is some magic formula f(x)=? that perfectly maps our inputs to our outputs (e.g. image captions to images, or input texts to longer input texts), but we don't know how to find it. So we build a space with incredibly many dimensions, and we learn some mapping in this space, which is hopefully very close to the magic formula.

Our brains fundamentally work in a similar way, in that there are mappings from inputs to outputs through our senses and our nervous system, and we can literally determine neural circuits in mammalian brains through topological analysis of this magical function![0]

[0] Youtube video: Neural manifolds - The Geometry of Behaviour, from Artem Kirsanov: https://www.youtube.com/watch?v=QHj9uVmwA_0

retrac · on April 20, 2023

You're right about how machine learning is learning to approximate a function - most machine learning systems are mathematically equivalent to stochastic gradient descent, a statistical method which can, theoretically, do the same thing.

The surprise was that people (me, at least!) thought the computation and amount of data required to learn a function like "translate English to French" would be completely impractical to ever realize.

I think it's open question whether humans work like that, though we probably do.

cloudking · on April 20, 2023

This is really fascinating, assuming it is true, it could imply that everything we "learn" is essentially a training process in our brain to store a new model/function. As humans we've figured out how to transfer these models between our brains through communication. Maybe it is possible to "upload" a model to the brain like Neo learning kung-fu...

seydor · on April 19, 2023

We will find that language and Visual perception are related. Geometry is the underlying structure in language and mathematics, and most of our logical concepts stem from geometric relations and constraints

delgaudm · on April 19, 2023

>"To avoid potential issues, we appeal to our practitioners to not abuse this technology and to develop defending tools to detect AI-synthesized voices"

Well. I'm sure that will take care of everything.

SiempreViernes · on April 19, 2023

In the guide on how to make "Harry Potter by Balenciaga" the author shows you how to rip the audio from a vanity fair clip and upload it to a voice cloning service, explicitly including how they clicked in the little box that affirms they have "all the necessary rights and consent" to clone the voice of Daniel Radcliffe... so I'm sure the industry is taking the potential for misuse seriously! /s

zirgs · on April 20, 2023

And if you run AI voice cloning locally there won't even be a box to check.

ttul · on April 19, 2023

Transformers and Diffusion Models seem to be leading the pack lately in many tasks. It’s cool how these models can be used in a variety of quite different contexts without changing much about the network architecture.

That being said, I think it is only a matter of time before cyber criminals develop an end to end fully automated penetration system that registers domain names, writes emails, makes phone calls, finds money mules, runs social media accounts, etc. all with a single console to run it all. That is a scary prospect for humanity and new tools for authenticating human identity will be needed - fast.

XorNot · on April 19, 2023

We've had the solution in the form of basic TLS cryptography and verification for decades now though, the problem is no one's implementing it.

Governments already maintain registers of legally operating businesses: there's no reason that registration should not also be issuing cryptographic certificates which verify all forms of outbound communication by that business including phone calls.

But despite telecom being almost end-to-end digital (i.e. digital to the box on the street pretty much), there's been no push to close the last 100m. "Phone lines" shouldn't exist anymore with packet switched networking: you should just dial a path against a business, which is verifies itself with TLS certificates linked to it's business registry.

MattPalmer1086 · on April 19, 2023

The problem here is authenticating that you are talking to a human and not an AI. Certificates don't help with that.

Criminals will register temporary businesses and obtain certificates for them with no problem.

ericd · on April 19, 2023

But those certs can come with reputations attached, and it prevents people from claiming that they're representatives of well known companies.

pixl97 · on April 19, 2023

Where does the reputation come from? Is it going to work like our already questionable anti-spam systems?

XorNot · on April 20, 2023

You don't need reputation systems: I can count on two hands the number of organizations that have any reason to be contacting me unsolicited.

Add a minor tie-in with my bank, and my reputation list would essentially be "the government, my bank, my doctor, my insurer + any company I've done monetary transactions with recently".

And realistically in a world with this system, this is all already implemented by my phone's contact list - no more phone numbers would mean that there's no reason to think that any organization would be contacting me from a totally unique (or anonymous) identity. Instead I'd just have a contact book whitelist entry for "bank.com.au" or whatever scheme we ended up with.

As it is right now, actual government services call people from Caller ID blocked numbers, and don't widely publish allowable contact origins. Which is ridiculous.

MattPalmer1086 · on April 19, 2023

Well, that would create strong financial incentives to compromise company certificates, which I'm sure would then happen.

And you still don't know if your taking to a human or an AI.

lukevp · on April 19, 2023

Like the strong financial incentives to compromise certificates owned by banking websites? Securing voice communications the same as website communications is an awesome idea, and the fact that it’s possible to steal some piece of data and compromise it shouldn’t prevent us from moving in that direction.

MattPalmer1086 · on April 19, 2023

It has happened, but there has to be a big pay off to make it worthwhile.

Smaller companies would likely be more frequent targets given they would be easier to compromise.

By the way, I'm not suggesting that the idea is useless. I'm just pointing out it isn't a panacea, and it still doesn't address the core problem raised in the article that you don't know if you are speaking to a human or not.

XorNot · on April 20, 2023

But the point is that the pay off itself is limited. Smaller companies have a much smaller pool of people who would have any reason to accept unsolicited calls.

This is evident in how scams work today: you either get explicitly targeted, or you dragnet using some service which everyone has i.e. a bank, the tax office, or a telecom company.

Compromising "Joe's BBQ Emporium" might be easier, but it's still (1) something you have to do (and maybe get caught then) and (2) the number of people who are going to pick up, or not immediately blacklist unsolicited calls from "Joe's BBQ Emporium" is tiny.

ericd · on April 19, 2023

Does it matter if you’re talking to a human or an AI? The main thing is its intent - is it helpful, or malicious? And attaching communications to persistent reputations via certs can help guess at that.

MattPalmer1086 · on April 19, 2023

It matters because an AI can contact many more potential victims that a human can. It makes the cost of attack much lower.

AnthonyMouse · on April 19, 2023

"Authenticating that you are talking to a human and not an AI" is just a proxy for something else, most commonly wanting to be able to rate limit something. Because otherwise, why does it matter?

A lot of people are suggesting things like, have the government do it. But that has two big problems. First is privacy. If you have to prove your identity any time you want to do anything, nobody can be anonymous anymore, which is Very Bad.

Second, it assumes the government has some magic incantation that nobody else can use, as if the Post Office knows who you are in a way that your bank doesn't. But they don't. To get a government ID, they just want you to show them some other existing ID. It has no way to bootstrap itself any better than anything else. And some of the IDs they accept are easy to get... without an ID. Because everybody has to start from somewhere. The system has to be set up in a way that it works for people who emigrate from a country with untrustworthy institutions as an adult or if your house burns down and you lose all your documents you can still get new ones. An AI is going to be able to BS its way into a government ID, even assuming criminals wouldn't be able to hack into any state's DMV (as they already have).

It appears that going forward, telling the difference between a human and an AI is going to be hard. Maybe instead of trying to get better at that, we should find a different solution to the underlying problem.

The simple answer is to make account creation cost something. Nothing big, so someone who needs one account isn't paying much, but a spammer who has 1000 accounts get banned every day is out of business. And that's not even hard -- it's finally something cryptocurrency would actually be good for. Because you want a way for people to pay for access to things, while still being anonymous.

The real hard part is, how do you charge for account creation without deterring account creation?

XorNot · on April 20, 2023

> If you have to prove your identity any time you want to do anything, nobody can be anonymous anymore, which is Very Bad.

No one would have to prove anything. But why would I accept calls from any entity claiming to be a legally registered business which doesn't present a government issued certificate proving that? I already verify businesses by looking up business number registries. But this should be automatic to the communication even happening.

Same question with personal contacts: why would I blindly accept calls from people I don't know and who don't want to present any confirmation of their identity to me?

AnthonyMouse · on April 20, 2023

Now you're talking about something else. Whether the entity calling you and claiming to be your bank is, in fact, your bank is a separate issue from whether it's an AI or a human. It could be a robocall from your bank trying to upsell you on some financial product and it would still be from your bank. It could be a live human being claiming to be your bank when they're not.

And we know how to solve that one. If you get an email from bank.com, your email server knows how to verify that it was actually from the servers of bank.com, using certificates and DNS records etc. This is technology we could also apply to phone calls, notwithstanding that we haven't, and do so without any government action. Google and Apple could implement this right now if they cared to.

But that doesn't prevent spam. Because the spammer doesn't have to claim to be bank.com in particular. They can go register somethingthatsoundslikeabank.com and send their spam from there. Then you receive their spam/calls because you want to be able to receive legitimate ones from people without whitelisting them individually and the spam domain hasn't been spamming long enough to get blacklisted yet.

Real solutions look something like this: To send mail from a.example.com to user@b.example.com, a.example.com has to generate a computationally expensive hash containing their domain name and the target email address. Or transfer a few cents worth of cryptocurrency to the target in exchange for a signature of that pairing. New mail servers then have to do a lot of initial computation but once they've computed/bought hashes of 99% of the people their users communicate with they can cache the results and they're done. Spammers have to keep redoing the expensive computation every time one of their domains gets blacklisted.

Then you can AI generate all the spam you like, but if the recipients don't want it, your domains are going to get blocked, which would be expensive.

vimax · on April 19, 2023

Why does it matter if you’re talking to an AI or a human? What can a criminal AI say that a criminal human cannot?

mrec · on April 19, 2023

A criminal human probably can't do a perfect voice impersonation of your teenage kid or frail grandmother who desperately needs to be sent money for some urgent reason.

vimax · on April 19, 2023

Sure, but that’s not too different from the current scams where there’s some excuse you can’t talk right now. In both cases it can be solved by checking if it is coming from a known number and calling the person in question to confirm.

People you know typically won’t desperately need lots of money with some elaborate story that is impossible to be immediately confirmed.

If AI suddenly leads to an increase in call spoofing, that’d be a problem of the phone network that we already face with robocalls, but wouldn’t be new.

stri8ed · on April 19, 2023

It's the cost factor. Using AI, it's economically feasible to contact a million people, even if only a fraction bite. Not unlike the current spam calls.

MattPalmer1086 · on April 19, 2023

You can automate it and reach many more victims than doing it with humans.

Edit: it reduces the cost of the attack and thus makes it more profitable.

sroussey · on April 19, 2023

And then the government will get hacked and the data leaked.

XorNot · on April 19, 2023

Familiarize yourself with a certificate signing request.

tudorw · on April 19, 2023

mostly agree I think the web is over as we know it maybe the solution will be the broken web plus some new system that has ties into local regulation ID systems so that you are accountable for your actions

zirgs · on April 20, 2023

Until criminals find a way to frame random people for their crimes. They already run botnets on computers that belong to random civilians.

hbbio · on April 19, 2023

That's (seriously) where blockchains can help.

tudorw · on April 21, 2023

quite, mentioning AI or Blockchain seem to set off a knee jerk reaction, crime is crime, methods change but frequency afaik stays relatively stable as a product of poverty Vs quality of life, so, as these technologies are still in their early days, so many factors are affecting their performance and changing their potential, seems odd that a forum for hackers is so, well, not sure how to put it, negative?

msoad · on April 19, 2023

Compared to the first NaturalSpeech[1] I'm hearing a lot of white noise in the background. Singing is pretty cool but it feels like we need a few iterations before it can match the ground truth in the way speech does.

[1] https://speechresearch.github.io/naturalspeech/

xutan · on April 19, 2023

Thanks for your interests in NaturalSpeech and NaturalSpeech 2!

NaturalSpeech focuses on synthesizing human-level high-quality speech, by training on a single-speaker recording-studio dataset.

NaturalSpeech 2 trains on 44K hours of multi-speaker in-the-wild datasets with more than 5K speakers and focuses on synthesizing any speaker's voice in a zero-shot way given only a short speech prompt. When the speech prompt is noisy in the background, NaturalSpeech 2 will mimic this noise as well. If you want clean voice, just give a clean speech prompt is OK.

Check more discussions on reddit as well: https://www.reddit.com/r/singularity/comments/12rubq4/latent...

zoogeny · on April 19, 2023

Some poking around the authors of the paper brought me to: AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [1] with demos [2]

This sparks my interest so much since the last few days I was wondering if it was possible to use diffusion models on spectrograms to do audio effects editing. Here is a paper submitted a couple of weeks ago doing just that. And the demo examples are exceptional.

I want all of this to start slowing down a bit so I have a chance to catch up. I was just watching Andrej Karpathy's excellent Zero to Hero syllabus [3] trying to wrap my head around LLMs and now I feel I absolutely must catch up on diffusion models.

1. https://arxiv.org/abs/2304.00830

2. https://audit-demo.github.io/

3. https://karpathy.ai/zero-to-hero.html

ShamelessC · on April 19, 2023

You might be interested in Riffusion, which was able to do text-to-audio by finetuning Stable Diffusion on spectrograms late last year.

https://www.riffusion.com/about

sroussey · on April 19, 2023

Doesn’t riffusion create music based on spectrographs?

https://github.com/riffusion/riffusion

Sol- · on April 19, 2023

Only a three second sample for in context learning needed to scam people now, very impressive.

gigel82 · on April 19, 2023

I like https://github.com/neonbjb/tortoise-tts ; it doesn't do singing, but the voice reproduction is very good and -most importantly- it's open source and you can run it locally.

samuelzxu · on April 19, 2023

Woah! How is this not more popular? I don't see it referenced in the naturalspeech2 paper anywhere.

woodson · on April 19, 2023

That's likely because it wasn't published as a paper anywhere (not even arxiv) and the author then joined OpenAI and development ceased.

seattleeng · on April 20, 2023

unfortunately it can't be used for real-time use cases, but YourTTS is just as good and faster (except... non-commercial)

irln · on April 19, 2023

I'm guessing emotional pre-prompts are difficult. The current offerings like ElevenLabs and WellSaidLabs provide amazing voices for narration but lack any way to change the emotions (e.g. happy, angry, excited, etc.). I wonder what are the technical hurdles to adding this variability?

freedomben · on April 19, 2023

Are NaturalSpeech or NaturalSpeech 2 from the research open source and/or available for playing with? I see one implementation[1] but it seems to be from a third party (that might be totally fine, but wondering if there's an "official").

[1]: https://github.com/heatz123/naturalspeech

brikwerk · on April 19, 2023

According to one of the authors on their group's GitHub, NaturalSpeech is being deployed exclusively for use on Microsoft Azure [1]. I might have missed a link, however, I think it's likely that NaturalSpeech 2 will follow the same path, seeing as the code and weights are seemingly not published.

[1] https://github.com/microsoft/NeuralSpeech/issues/40

chikitabanana · on April 19, 2023

To those who were able to use it before it was nerfed, how does this compare to the elevenlabs one-shot?

hbn · on April 19, 2023

Speaking of music and AI models - something I thought of yesterday which is an application of AI that would be insanely useful to me is giving it an audio file of a song and having it spit out the chords. I've seen software that attempts to do this in the past but it's all been unimpressive and inaccurate in my testing.

I'm still kinda ignorant with how these models work under the hood, and perhaps that would involve a bunch of new training on music that hasn't been done (and maybe that could be a difficult dataset to train on in terms of copyright). But I play piano, and I can play a song if given the chords, but I'm terrible at transcribing stuff myself. So I'd pay money for a service that does this.

notefaker · on April 19, 2023

This actually exists! Check out fadr.com. They isolate the MIDI, drums, bass, vocals, and more for you.

hbn · on April 19, 2023

Just tried this out. The stem isolation is pretty cool, but looking at the spreadsheet of chords it spit out, it seems to miss out on a lot nuances of chords. The song I used to test was Again by Mac Ayres[1], which I compared to some tabs I found on Ultimate Guitar[2] which I've played and sound very accurate.

The main 3 repeating chords, Ebmaj7 Gm7 Abmaj7 are close to what Fadr spit out, but it missed them being 7 chords. It gave D# Gm G#.

Getting a little spicier with chords, there's a move where the Cm7 has the bass move down a half-step to Cm7/B. Fadr wrote it as a single Cm chord for the whole thing. That's the kind of thing that can be tricky to figure out. It kinda seems like it doesn't know anything outside of straight-up major and minor chords at all? Cause I don't see a single 7 in the entire spreadsheet, and that's not even that complex of a chord.

Still a neat tool, and trying to learn off of a stem track would probably be easier than the whole ensemble. But I think there's tons of room for improvement in the space.

[1] https://www.youtube.com/watch?v=gnetIgK9AF4

[2] https://tabs.ultimate-guitar.com/tab/mac-ayres/again-chords-...

causality0 · on April 19, 2023

Man I can't wait until my phone can use local resources to read my epubs in voices that match 11.ai's output. I'll never have to forlornly search Audible for novels that never got audiobook editions again.

jtr1 · on April 19, 2023

Wow, that ethics statement at the end

Veen · on April 19, 2023

Yes, more "please don't be bad" than an ethics statement.

eternalban · on April 19, 2023

you made me look:

"NaturalSpeech 2 can synthesize speech with good expressiveness/fidelity and good similarity with a speech prompt, which could be potentially misused, such as speaker mimicking and voice spoofing. To avoid potential issues, we appeal to our practitioners to not abuse this technology and to develop defending tools to detect AI-synthesized voices. We will always take Microsoft AI Principles as guidelines to develop such AI models."

Microsoft Responsible AI Standard, v2 (General Requirements 2022)

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE...

That has a nice section called "Goal A2: Oversight of significant adverse impacts", which says "Microsoft AI systems are reviewed to identify systems that may have a significant adverse impact on people, organizations, and society, and additional oversight and requirements are applied to those systems."

I couldn't find the Impact Assessment for this tech released by Microsoft. How did the 'Natural Speech 2' fare in this review process? Where is the report?

https://www.bing.com/search?q=Responsible+AI+Impact+Assessme...

PaulDavisThe1st · on April 19, 2023

Yep. Tell me 1986 self that this is real and see what I say:

"We will always take Microsoft AI Principles as guidelines to develop such AI models"

angusturner · on April 20, 2023

Anyone else think its an abuse of terminology to refer to speaker conditioning as "in-context learning" and "prompting" now?

Like, using a reference encoder to condition on an unseen target speaker has been around for 4 or 5 years now. These results are already cool without mystifying the results by calling this "prompting" or "icl"

Unless there is some subtle difference that warrants the new terminology?

cwkoss · on April 19, 2023

I feel like zero shot is a bad misnomer. Isn't the single attempt "one shot"?

Is zero shot a meaningfully useful term that I'm just not groking?

pstorm · on April 19, 2023

Zero shot implies that it was given no direct examples [0] So, in this case, it wasn't given any examples of the exact voice in combination with text, it is just using the prompt voice + a prompt text to generate new audio.

[0] https://en.wikipedia.org/wiki/Zero-shot_learning

cwkoss · on April 19, 2023

But isn't nearly all AI generated content "zero shot" to some extent? Like even if it has training for "foo" and training data for "bar", the combination of "foo" and "bar" would be novel and 'zero shot'-y if the training set didn't have "foo bar" examples.

To me, it seems that the only AI generated content that isn't zero shot would be the narrow subset of generations where it has multiple training examples for the exact requested prompt. i.e. anything that is composing multiple pieces of information together is "zero shot".

How should I parse 'shot' in the context of this term? Shot makes me think 'attempts', seems like a weird word to use for ~= "training examples"

CyberDildonics · on April 20, 2023

If you have the audio you want to generate, that's just called playback. If you are synthesizing something of course you don't have a direct sample of it.

InspiredIdiot · on April 20, 2023

I think parent is saying that the model does not require any paired samples of the voice to be synthesized and corresponding text. So based on my understanding:

one shot - given the text "run faster" along with Alan Greenspan's voice pronouncing that phrase, the model can produce Alan Greenspan's voice saying any other phrase

zero shot - given only Alan Greenspan's voice pronouncing "run faster" but no text version of what was said, the model can produce Alan Greenspan's voice saying any other phrase

CyberDildonics · on April 20, 2023

Does that mean a shot is text?

EntrePrescott · on April 20, 2023

are there openly available models that are similar for the output i.e. generating speech or singing with a voice set by a given sample, but that instead of a text prompt input would take a speech or singing input and take pitch change and intonation cues from that to generate an output that generally follows those pitch and intonation changes but adapted to the different voice and diction of the provided sample? for example:

* provided voice sample: some clean voice samples from Homer Simpson

* provided prompt: audio sample of the "gunnery sergeant Hartman" monologue from "Full Metal Jacket": https://www.youtube.com/watch?v=tHxf17yJsKs

* result: that same monologue but spoken out in the voice of Homer Simpson, but otherwise following the dynamic of the prompt sample i.e. shouting, changing pitch or speed pretty much at the same times as gunnery sergeant Hartman does?

andy_xor_andrew · on April 19, 2023

I find the three links at the top very interesting-

You have a link to the paper (makes sense), then a link to a reddit discussion, then a link to this hacker news post.

Not criticizing them for doing this. It just seems a bit unusual to me. I guess they really really want to generate buzz from this, or else they'd simply link the paper and let any discussion follow naturally.

ImprobableTruth · on April 19, 2023

I mean, most researchers I know are just very excitable about their research and love to share it. I think that's a lot more likely than this being some PR masterplay - if MS wanted to really push this, they wouldn't release like this.

varunjain99 · on April 19, 2023

I actually have a chrome extension that will point me to HN discussions on a particular webpage - find it useful to get community context of pages I'm browsing!

samuelzxu · on April 19, 2023

Does anyone know how many words would correspond to the diffusion model's batch size of 6000 frames?

yding · on April 19, 2023

Very cool! Hopefully we can all use these in the future for commercial and open source projects.

TheRealPomax · on April 19, 2023

Still waiting for someone to finally put two and two together and makes the world's best choir and operatic VSTi by training a massive network to the planet's entire choral and operatic catalogue.

lachlan_gray · on April 19, 2023

Is the approach similar to what they did for Riffusion? (On front page a while ago)

https://www.riffusion.com/

woodson · on April 21, 2023

No, it’s not similar (other than superficially, as both use diffusion methods in some way). It uses diffusion to generate latent vectors that are entered in a neural audio codec model to produce speech.

wsgeorge · on April 19, 2023

Oh I'm super excited about singing synths! I write mainly for choirs, and it'll be a game changer to have software actually perform this music!

vletal · on April 19, 2023

Is it just me, or the accompanying text does not correspond to the audio?

ChildOfChaos · on April 19, 2023

So is this just a paper or can we use this?