

Slightly better on the NYT Connections benchmark (27.9) than Claude 3 Opus (27.3) but massively improved over Claude 3 Sonnet (7.8).

GPT-4o 30.7

Claude 3.5 Sonnet 27.9

Claude 3 Opus 27.3

Llama 3 Instruct 70B 24.0

Gemini Pro 1.5 0514 22.3

Mistral Large 17.7

Qwen 2 Instruct 72B 15.6


It still fails to be the moderator of a WORDLE board. That is always the first test I do of these new models.
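For context, "moderating" a Wordle board means producing correct per-letter feedback for each guess, which is where models usually slip up on duplicate letters. A minimal sketch of the logic (hypothetical helper name, not from any library):

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """Return per-letter feedback: G=green, Y=yellow, -=gray.

    Handles duplicate letters the way Wordle does: greens are
    marked first, then yellows consume remaining letter counts.
    """
    guess, answer = guess.lower(), answer.lower()
    feedback = ["-"] * len(guess)
    remaining = Counter()

    # First pass: exact matches (green); tally unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1

    # Second pass: right letter, wrong spot (yellow), bounded by the
    # count of unmatched occurrences in the answer.
    for i, g in enumerate(guess):
        if feedback[i] == "-" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)
```

The two-pass structure matters: marking yellows in a single left-to-right pass over-counts repeated letters, which is exactly the mistake these models tend to make.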

Not only that, but it opens the project up to a trademark cease-and-desist letter and a forced rebrand. Perplexity would be obligated to send one in order to protect its trademark if it became aware of this. How are seemingly decent software developers so unaware of anything besides coding?


On my benchmark (NYT Connections), Phi-3 Small performs well (8.4) but Llama 3 8B Instruct is still better (12.3). Phi-3 Medium 4k is disappointing and often fails to properly follow the output format.


It probably is illegal in CA: https://repository.law.miami.edu/cgi/viewcontent.cgi?article...

"when voice is sufficient indicia of a celebrity's identity, the right of publicity protects against its imitation for commercial purposes without the celebrity's consent."


But why? Sounds like a violation of the voice actor's rights


Because it's meant to give the _appearance_ or _perception_ that a celebrity is involved. Their actions demonstrate they were both highly interested and had the expectation that the partnership was going to work out, with the express purpose of using the celebrity's identity for their own commercial purposes.

If they had just screened a bunch of voice actors and chosen the same one no one would care (legally or otherwise).


What OpenAI did here is beyond the pale. This is open and shut for me based on the actions surrounding the voice training.

I think a lot of people are wondering about a situation (which clearly doesn’t apply here) in which someone was falsely accused of impersonation based on an accidental similarity. I have more sympathy for that.

But that’s giving OpenAI far more than just the benefit of the doubt: there is no doubt in this case.


I think "beyond the pale" is a bit hyperbolic. The voice actor has publicity rights, too.


Sounds like one of those situations you'd have to prove intent.

(and given the timeline ScarJo laid out in her Twitter feed, I'd be inclined to vote to convict at the present moment)


> Sounds like one of those situations you'd have to prove intent.

The discovery process may help establish intent - especially any internal communication before and after the two(!) failed attempts to get her sign-off, as well as any notes shared with the people responsible for casting.


Not necessarily. Because this would be a civil matter, the burden of proof is a preponderance of the evidence. It's glaringly obvious that this voice is emulating the movie Her, and I suspect it wouldn't be hard to convince a jury.


I guess the Trump lookalike satire guy would not want to go to California then


Ah, so OpenAI does satire. That explains a lot.


I am guessing it's because you are trying to sell the voice as that actor's voice. I guess if the other voice became popular in its own right (became a celebrity), then there would be a case to be made.


Did you read the statement? They approached Scarlett twice, including two days before launch. Sam even said himself that Sky sounds like 'HER'.

This isn't actually complicated at all. OpenAI appropriated her likeness against her express wishes.


I'd be surprised if it was legal anywhere in the US, but this just puts the final nail into Sky's coffin.


On the NYT Connections benchmark, it scores 15.3:

GPT-4 turbo (gpt-4-0125-preview) 31.0

GPT-4o 30.7

GPT-4 turbo (gpt-4-turbo-2024-04-09) 29.7

GPT-4 turbo (gpt-4-1106-preview) 28.8

Claude 3 Opus 27.3

GPT-4 (0613) 26.1

Llama 3 Instruct 70B 24.0

Gemini Pro 1.5 19.9

Mistral Large 17.7

-----> Gemini 1.5 Flash 15.3

Mistral Medium 15.0

Gemini Pro 1.0 14.2

Llama 3 Instruct 8B 12.3

Mixtral-8x22B Instruct 12.2


So many high-performing, yet poorly-named OpenAI models in that list.


It doesn't improve on the NYT Connections leaderboard:

GPT-4 turbo (gpt-4-0125-preview) 31.0

GPT-4o 30.7

GPT-4 turbo (gpt-4-turbo-2024-04-09) 29.7

GPT-4 turbo (gpt-4-1106-preview) 28.8

Claude 3 Opus 27.3

GPT-4 (0613) 26.1

Llama 3 Instruct 70B 24.0

Gemini Pro 1.5 19.9

Mistral Large 17.7


No, Lmsys is just another very obviously flawed benchmark.


Flawed in some ways but still fairly hard to game and useful.


Please elaborate on this: how is it flawed?


It's horribly useless for most use cases since half of it is people probing for riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so you're not going to be able to use a side-by-side view to get the best result.

The "will it tell me how to make meth" stuff is a huge source of noise. You could argue it's digging for refusals, which can be annoying, and the benchmark claims to filter those out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".

Honestly, the fact that any closed-source model with guardrails can even place is a miracle; in a proper benchmark, the honest-to-goodness gap between most closed-source and open-source models would be so large it'd break most graphs.


This is so nonsensical it's hilarious, "corporate" models have always been at the top of the leaderboard.


Maybe it's just a more nuanced comment than you're used to. "Corporate" models are interspersed in a way that doesn't reflect their real-world performance.

There aren't nearly as many 3.5 level models as the leaderboard implies for example.


Why is the link to this blog spam instead of to the paper or a better article? Hossenfelder lacks qualifications in neuroscience and is often confidently inaccurate.


They claim insolvency? What are you talking about? They went through bankruptcy in 2019-2020, four years ago.

