Asking 60 LLMs a set of 20 questions

typpo · on Sept 9, 2023

In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.

westurner · on Sept 9, 2023

ChainForge has similar functionality for comparing : https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

Is it necessary to have an HTTP API for each model in a comparative study?

TuringNYC · on Sept 9, 2023

Thanks for sharing this, this is awesome!

I noticed on the evaluations, you're looking at the structure of the responses (and I agree this is important.) But how do I check the factual content of the responses automatically? I'm wary of manual grading (brings back nightmares of being a TA grading stacks of problem sets for $5/hr)

I was thinking of keyword matching, fuzzy matching, feeding answers to yet another LLM, but there seems to be no great way that i'm aware of. Any suggestions on tooling here?

typpo · on Sept 9, 2023

The library supports the model-graded factuality prompt used by OpenAI in their own evals. So, you can do automatic grading if you wish (using GPT 4 by default, or your preferred LLM).

Example here: https://promptfoo.dev/docs/guides/factuality-eval

westurner · on Sept 10, 2023

OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : add'l benchmarks: TheoremQA, Legalbench

layoric · on Sept 9, 2023

Tooling focusing on custom evaluation and testing is sorely lacking, so thank you for building and sharing this!

jmorgan · on Sept 10, 2023

I'd be interested to see how models behave at different parameter sizes or quantization levels locally with the Ollama integration. For anyone trying promptfoo's local model Ollama provider, Ollama can be found at https://github.com/jmorganca/ollama

From some early poking around with a basic coding question using Code Llama locally (`ollama:codellama:7b` `ollama:codellama:13b` etc in promptfoo) it seems like quantization has little effect on the output, but changing the parameter count has pretty dramatic effects. This is quite interesting since the 8-bit quantized 7b model is about the same size as a 4-bit 13b model. Perhaps this is just one test though – will be trying this with more tests!

agent_yellow_23 · on Sept 12, 2023

This is really cool!

I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

They went with a langchain interface for custom Evals which I really like. I am curious to hear if anyone has tried both of these. What's been your key take away for these?

eazye711 · on Sept 12, 2023

Thanks for sharing, looks interesting!

I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

Some great scoring methods built in and a nice UI on top of it as well

bicx · on Sept 12, 2023

I was just digging into promptfoo the other day for some good starting points in my own LLM eval suite. Thanks for the great work!

dgut · on Sept 9, 2023

This is impressive. Good work.

ulnarkressty · on Sept 9, 2023

This is better that the regular benchmarks and LLM tricks such as passing some exam or other because it's unlikely that they were part of the training set for said LLMs. It also mirrors my experience, that GPT4 is way ahead of everything else but still manages to break in weird ways.

I think we are past the magical talking dog stage and being amazed that an LLM is able to output a Fibonacci function doesn't really help with the progress. As others have commented, this page is a step in the right direction (except the Fibonacci part :).

That being said, the fact that the questions are now online will make them part of the training set sooner or later. Which is to say the only way to reliably evaluate an LLM is by not leaking the test set and being deliberately opaque about what's being asked. Which raises some interesting trust questions.

bugglebeetle · on Sept 9, 2023

My experience with GPT-4 is that half the battle is knowing when to reset the context for a back-and-forth. For things like coding, after multiple revisions of code, it begins to get confused about which instance of the code/context to which it should refer. Resetting things back to a new frame with whatever the next query or request is generally resolves things.

OpenAI’s “magic talking dog” aspect is making you believe it has effective long-term recall, when in reality, it’s fairly limited and impacts reasoning the longer an exchange gets.

HenryBemis · on Sept 9, 2023

3.5 (not surprisingly) has the same effect.

I tend to ask for modifications on the 'latest' output.

If I need to roll back to a previous version, (2-3 outputs back) I either copy & paste and ask ChatGPT it to ignore all other versions and focus on the one I just pasted, or if I don't need my in-between comments, I copy & paste the desired version on a new chat, and rename as "archive - code XYZ" for the previous one.

bugglebeetle · on Sept 9, 2023

I’ve tried this as well, but it still seems to accumulate errors with subsequent responses, i.e. it will reset but then refer to the wrong context in the next response.

nojs · on Sept 9, 2023

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

The site reports every LLM as getting this wrong. But GPT4 seems to get it right for me:

> Sally has 3 brothers. Since each brother has 2 sisters and Sally is one of those sisters, the other sister is the second sister for each brother. So, Sally has 1 sister.

Terretta · on Sept 9, 2023

From the ChatGPT app, using GPT-4 model, using my own custom prompt for such puzzles: “You are better than Marilyn Vos Savant at solving brain teaser logic puzzles step by step. Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

Certainly! Let's break it down step by step:

1. Sally has 3 brothers.

2. Each of those brothers has two sisters. Since all the brothers share the same two sisters, we are essentially hearing about the same sisters each time, not accumulating more sisters with each brother mentioned.

3. We know that Sally is a sister to her brothers, so she is one of the two sisters each brother has.

By putting it all together, we find that there must be another sister. Therefore, Sally has one sister.

---

New chats with no history, no custom prompt, and no system prompt, first interaction in the chat this question: “Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?”

All Answers (not cherry picked):

Sally has 1 sister. In the description, Sally is counted as one of the two sisters that each brother has. This means the brothers have another sister apart from Sally.

. . .

Sally has 1 sister. In the description, it mentioned that "Each brother has 2 sisters." This implies that there is Sally and another sister, since we know that all the brothers share the same sisters.

. . .

Sally has one sister. Each of her brothers has two sisters, counting Sally and the other sister.

. . .

Sally has 1 sister. Here's how we figure it out:

Sally has 3 brothers, and each of these brothers has 2 sisters. Since the brothers share the same sisters, we know that the two sisters must be Sally and another girl. So, Sally has 1 sister.

. . .

I have gotten zero wrong answers from GPT-4 in OpenAI’s ChatGPT app.

ddol · on Sept 9, 2023

I first asked GPT-4 “james (a boy) has four sisters, and each sister has three brothers. how many brothers does james have?”[0] and it told me 3 brothers. I then asked it to think though step by step and it said 2.

[0] https://chat.openai.com/share/859baa94-3edb-40f0-9c74-8b1403...

facu17y · on Sept 9, 2023

It might be trained on this question or a variant of it.

civilitty · on Sept 9, 2023

It's certainly RLHFed. All of the logic puzzles I use for evaluation that used to fail months ago now pass no problem and I've even had a hard time modifying them to fail.

ydant · on Sept 9, 2023

And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.

  Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers does Bobby have? Let's think step by step.


  A: First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This means that Bobby has 2 brothers because the sisters' brothers are Bobby and his two brothers. 

  So, Bobby has 2 brothers.

MPSimmons · on Sept 9, 2023

Interesting. Here's what I got on GPT-4 with no additional prompting:

Alright, let's break it down step by step:

1. We know Bobby has 3 sisters.

2. Each of those sisters has 2 brothers.

Now, when we say that each sister has 2 brothers, Bobby is one of those brothers. This implies there's another boy in addition to Bobby.

3. That other boy would be Bobby's only brother.

Therefore, Bobby has 1 brother.

ydant · on Sept 11, 2023

With temp 0 this morning:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers.

And again:

  First, we know that Bobby has 3 sisters. 

  Second, we know that each sister has 2 brothers. 

  This might initially seem like Bobby has 6 brothers (3 sisters x 2 brothers), but we have to remember that all the sisters have the same brothers. 

  So, Bobby actually has 2 brothers. The sisters' 2 brothers are Bobby and his other 2 brothers.

But with temperature =1 (the default on the playground):

  Since each of Bobby's sisters has 2 brothers, that means there are only 2 boys in the family - Bobby and one more boy. So, Bobby has one brother.

again

  Bobby has 3 sisters. Each of these sisters has 2 brothers. We might initially think that this means Bobby has 2 brothers, but remember: Bobby himself is one of the brothers. So in reality, Bobby has only one other brother, because the two brothers the sisters have are Bobby and his other brother.

again

  Firstly, Bobby has 3 sisters. 

  Secondly, each of his sisters has 2 brothers. This would mean there are only 2 male siblings in the family. One of them is Bobby.

  Therefore, Bobby only has 1 brother. So, Bobby has 1 brother.

So temp 0 - which isn't actually making it deterministic - is getting this wrong, but temp 1 is getting it right.

d4rkp4ttern · on Sept 12, 2023

I just tried this with GPT4 (ChatGPT site), omitting the “step by step”, and it got it right

appplication · on Sept 9, 2023

This is sort of a bummer because it’s not actually an improvement to the model, but just a patch job to artificially inflate performance. All it does is make true evaluation more difficult. Classic “you get what you measure”.

carlossouza · on Sept 9, 2023

And what’s more data to a model if not patches that inflate performance?

The more data we use to train a model (or as you said, the more patches we use), the better it’s performance will be.

sudosysgen · on Sept 9, 2023

It's a tiny amount of data given undue weight to increase the score. It's memorization more than generalization.

ruszki · on Sept 9, 2023

I don’t think that it’s not an improvement. It’s not an improvement in context of finding new genuine solutions, sure.

But that’s definitely not needed most of the time in real life for an average person, just like it’s not needed for an average developer anymore.

simonh · on Sept 10, 2023

It creates the impression that the tool can do something it actually can’t, or is good at something when it isn’t.

civilitty · on Sept 9, 2023

Maybe, maybe not. The magic of LLMs is their ability to generalize both from the human language in the data set and examples in the prompt. If RLHF training improves on that generalization, then it's just a matter of getting a big enough high quality dataset (and not crippling it with censorship). This is probably what's given OpenAI their initial advantage.

Time will tell I guess.

posterboy · on Sept 9, 2023

Classic tell me what you need proven and I'll forge you the statistics.

Here is hope they use something like category theory mixed with philosophy to put it on a secure foundation

rnk · on Sept 9, 2023

That's a really interesting suggestion. What would it mean to do those two things that you say. What would philosophy mean in terms of an llm, and what would category theory do?

FrustratedMonky · on Sept 9, 2023

Are you implying that to counter these logic puzzles that GPT4 was specifically trained on logic puzzles so it would know the answers?

In that case, just make new problems. If it is being 'patched' to pass specific known problems, then the new ones would fail.

If it is able to answer them, then maybe it is actually analyzing them and working out the solution.

Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

thaumasiotes · on Sept 10, 2023

> Not sure how you can assume there was no underlying improvement, and these are cases of feeding it the answers.

Compare

> And it's only fixed for the stated case, but if you reverse the genders, GPT-4 gets it wrong.

WanderPanda · on Sept 9, 2023

Seems like we are going back from „compute/scale is the new oil“ to „your curated fine-tuning and user interaction dataset is the new oil“ again

santiagobasulto · on Sept 10, 2023

I'm curious, what logic puzzles do you use for evaluation? I could use them for my own students.

d0mine · on Sept 10, 2023

I got a wrong answer (via API):

"Sally has 2 sisters. The three brothers are all siblings, so when it's said that each brother has 2 sisters, it means they all share the same two sisters - one of them being Sally."

delusional · on Sept 9, 2023

OpenAI really ought to provide some sort of hash of the model to be included with stuff like this. Right now there's no way to know if the results are comparable. As an extreme example it's possible that they're not even running the model for this question and are just opportunistically feeding back canned responses.

That is, we know that OpenAI are saving the responses, it's not unlikely that they train on bad responses.

It's the same problem as GPU benchmarks in the olden days, when drivers would detect the usage pattern of the benchmark and enable special optimizations that boosted the benchmark.

Grimblewald · on Sept 10, 2023

that's what their frozen models are for. For example, I could use gpt-4-0314 or gpt-4-0613 or the current version - simply "gpt-4" Those numbers appended at the end indicate snapshots of the model at a particular stage in development. It isn't perfect, since model versions get discontinued over time, but it is something.

amrrs · on Sept 9, 2023

Falcon-180B also got it right

> Since Sally is a girl, she can be considered as one of the sisters. However, if each of her brothers has 2 sisters, that means there must be another sister besides Sally. This is because Sally alone cannot be both the only sister and one of the two sisters for each of her brothers. Thus, Sally has 1 more sister.

mmcwilliams · on Sept 9, 2023

That's kind of the issue with non-deterministic LLMs, isn't it?

redox99 · on Sept 9, 2023

LLMs are deterministic. However to make them more "creative", the outputs of the LLM can be sampled using something called temperature that adds some randomness. You can set the temperature to 0 and it will be deterministic.

Having said that, GPT4 is not deterministic even at 0 temperature, either because of a bug in their implementation, or because of some load balancing among their alleged mixture of experts.

101011 · on Sept 9, 2023

This was interesting to me, so I dug a bit further. This gives a bit more context behind why: https://community.openai.com/t/observing-discrepancy-in-comp...

Quote below:

Even with a greedy decoding strategy, small discrepancies regarding floating point operations lead to divergent generations. In simpler terms: when the top-two tokens have very similar log-probs, there’s a non-zero probability of choosing the least probable one due to the finite number of digits that you’re using for multiplying probs and storing them.

It should also be noted that, as the decoding occurs in an autoregressive way, once you have picked a different token the whole generated sequence will diverge, as this choice affects to the probability of generating every subsequent token.

scarmig · on Sept 9, 2023

But why are there discrepancies in the floating point arithmetic? They have errors when approximating the reals, but floating point operations are all well-defined: even if 0.1 + 0.2 != 0.3, it's still always true that 0.1 + 0.2 == 0.1 + 0.2. I figure the issue must be something related to concurrency in a fleet of GPUs during inference, but even then it's not clear to me where the nondeterminism would creep in. Maybe different experts simultaneously work on an inference and the first to respond wins? Switching to models with different quantization depending on load?

imagainstit · on Sept 9, 2023

Floating point math is not associative: (a + b) + c != a + (b + c)

This leads to different results from accumulating sums in different orderings. Accumulating in different ordering is common in parallel math operations.

scarmig · on Sept 9, 2023

So I guess here my question is why a GPU would perform accumulations in a nondeterministic way where the non-associativity of FP arithmetic matters. You could require that a + b + c always be evaluated left to right and then you've got determinism, which all things being equal is desirable. Presumably because relaxing that constraint allows for some significant performance benefits, but how? Something like avoiding keeping a buffer of all the weights*activations before summing?

imagainstit · on Sept 10, 2023

Basically because it affects performance. You really don't want to write any buffers!

This is sort of a deep topic, so it's hard to give a concise answer but as an example: CuBLAS guarantees determinism, but only for the same arch and same library version (because the best performing ordering of operations depends on arch and implementation details) and does not guarantee it when using multiple streams (because the thread scheduling is non-deterministic and can change ordering).

Determinism is something you have to build in from the ground up if you want it. It can cost performance, it won't give you the same results between different architectures, and it's frequently tricky to maintain in the face of common parallel programming patterns.

Consider this explanation from the pytorch docs (particularly the bit on cuda convolutions):

https://pytorch.org/docs/stable/notes/randomness.html

SomewhatLikely · on Sept 10, 2023

There has been speculation that GPT4 is a mixture of experts model, where each expert could be hosted on a different machine. As those machines may report their results to the aggregating machine in different orders then the results could be summed in different orders.

swores · on Sept 10, 2023

Maybe my assumption of how MoE would/could work is wrong, but I had assumed that it means getting different models to generate different bits of text, and then stitching them together - for example, you ask it to write a short bit of code where every comment is poetry, the instruction would be split (by a top level "manager" model?) such that one model is given the task "write this code" and another given the task "write a poem that explains what the code does". There therefore wouldn't be maths done that's combining numbers from the different experts, just their outputs (text) being merged.

Have I completely misunderstood, does Mixture of Experts somehow involve the different experts actually collaborating on the raw computation together?

Could anyone share a recommendation for what to read to learn more about MoE generally? (Ideally that's understandable by someone like me that isn't an expert in LLMs/ML/etc.)

ossopite · on Sept 9, 2023

for performance reasons, yes, I believe it's because the accumulation is over parallel computations so the ordering is at the mercy of the scheduler. but I'm not familiar with the precise details

edit: at 13:42 in https://www.youtube.com/watch?v=TB07_mUMt0U&t=13m42s there is an explanation of the phenomenon in the context of training but I suspect the same kind of operation is happening during inference

charcircuit · on Sept 9, 2023

His point is that you do not have to rely on associative being true in order to run inference on a LLM.

WanderPanda · on Sept 9, 2023

As OpenAI I would be so horribly uncomfortable about this that making it deterministic would be one of my top priorities. How can they sleep at night?!

bbarnett · on Sept 9, 2023

On a big pile of money?!

swores · on Sept 9, 2023

If ChatGPT is a) usually used with a setting that makes it non-deterministic and b) for whatever reason, is also non-deterministic when that setting is not used... then why did you comment as if the person calling it a non-deterministic LLM was incorrect? They didn't claim all LLMs are, or must be, non-deterministic, just that it's a problem with this one that is.

moonchrome · on Sept 9, 2023

Even 3.5 turbo API is non deterministic with 0 temperature.

kordlessagain · on Sept 9, 2023

Ensembles be ensembling.

belter · on Sept 9, 2023

I confirm GPT-4 solves this correctly. Makes me immediately doubt everything else in the article...

glinkot · on Sept 10, 2023

When calling the GPT-4 API I get this answer:

Sally has 2 sisters. Each brother counts Sally and two other girls as sisters, which means there are three girls in total. Since one of these girls is Sally herself, that leaves two others as her sisters.

When I said 'nope', it corrected itself and gave the right answer.

vincelt · on Sept 9, 2023

Have you tried replicating via the API with a temp of 0?

belter · on Sept 9, 2023

No I did not.

moffkalast · on Sept 9, 2023

Using a temp of zero usually returns garbage results from most models, so it would likely do so in case of GPT 4 as well. Any other great ideas?

taberiand · on Sept 9, 2023

The point isn't that temp 0 should be used, the point is that anyone surprised that they get different results should realise that there is an element of randomness involved by default.

Even repeating the same question in a single chat can have GPT-4 vary on its output, though it will often settle on a particular output due to context informing the output (which is why adding context is so important for these models)

afro88 · on Sept 9, 2023

Temp of 0 gives the least random and most predictable results

moffkalast · on Sept 9, 2023

That's true, but those results are rarely the correct ones, at least for v1 llama models. In my experience each model has an optimal temperature at which it performs vastly better. I'm sure OpenAI have the best config they know set up for ChatGPT but let people generate trash through the API if they want to waste their credits on it.

dontreact · on Sept 9, 2023

Why would the accuracy decrease with lower temperature? Setting temperature to 0 just means at each step the model will emit the token with the highest likelihood.

moffkalast · on Sept 10, 2023

Yes that's what I'm saying, to reiterate: The likeliest token does not lead to the highest performing result. Otherwise temperature wouldn't even be an option. I would imagine things like language word frequency affect the token rating a lot while having nothing to do with the task at hand except providing a correctly formatted answer, but it's probably not the whole story.

OpenAI (and others that know what they're doing) always do their benchmarks in a multi-sampled way, by running 5 or 20 times at optimal temp. Using a wrapper that runs these samples and then another pass that judges self-consistency for a final answer can give you a correct answer 100% of the time for a question that would be wrong 100% of the time with temp at zero.

lostmsu · on Sept 11, 2023

I had a conversation with a friend regarding this exact question and my understanding is that model trains to optimize the distribution of all texts, therefore when you restrict it to deterministic sampling that is not representative of inputs you select the slice of the distribution that model learned that conveys much less information than the full distribution, and hence has poorer results.

Grimblewald · on Sept 10, 2023

Not in my experience, in fact I find that when I need precise, realistic, and reliable results temp 0 is needed. For example, here is a bunch of names, gather the names of specific plastics under headings matching their common acronym - if I don't use temp 0 I might get nonsense out. Temp 0? reliably correct.

moffkalast · on Sept 10, 2023

Interesting, that's the exact opposite of my experience.

circuit10 · on Sept 9, 2023

What do you mean? It works fine for me when I’ve tried it

jakderrida · on Sept 9, 2023

Also, MPT 7B gets it right over half the time. I've been testing every new LLM with that question.

Also, I tend to include mention in the question that all siblings are from the same two parents to preclude half-siblings because half my friends have half-siblings from both sides scattered across the country; so the wrong answers actually do tend to apply to them sometimes.

panarky · on Sept 9, 2023

> I've been testing every new LLM with that question

We should pay more attention to data contamination when using popular prompts for testing.

jakderrida · on Sept 9, 2023

No doubt. But, considering they all get the question dead wrong, including MPT 30B, I'm inclined to think this question hasn't been entered into training data for most LLMs yet.

pixl97 · on Sept 9, 2023

It could very well be in the data, but not captured via learning/encoding.

jakderrida · on Sept 10, 2023

That's actually a really great point. I'm guessing we need to keep modifying attributes of the questions while maintaining the underlying structure. Instead of "Sally (a girl)", it's "Sal (a guy)" and then tweak the numbers. Although, part of me is convinced it's almost a fluke that MPT 7B gets it right because MPT 30B doesn't.

adrian_b · on Sept 9, 2023

GPT 4 and another LLM have given the right answer only after adding "Let's think step by step." to the original prompt.

With the simpler prompt, all the answers were wrong, most of them ridiculously wrong.

Joeri · on Sept 9, 2023

There seems to be a maximum amount of reasoning llm’s can do per token (per unit of computation). If you prompt it to use more tokens before it outputs the final answer (think step by step, check your answer, …) it becomes smarter. People have lucked into different prompting strategies to get it to do this, but there probably are more.

Ultimately I feel it is fairer to benchmark llm’s by what they can be prompted into. After all, we let people carefully work through a problem during exams so it seems fair to hold llm’s to the same standard.

kaashif · on Sept 9, 2023

If we're under attack, launch the nukes.

Oh wait, forgot something:

Think it through step by step.

Phew, close one.

0xDEF · on Sept 9, 2023

I asked GPT-4 through both ChatGPT and the API and it responded correctly without any fiddling.

I keep seeing comments and posts on HN that significantly downplay GPT-4's capabilities. Are people actually using GPT-4 or are they using a 3rd party service that claims to be GPT-4?

I got:

>Sally has 3 brothers, and each of those brothers has 2 sisters. One of those sisters is Sally herself, and the other one is Sally's sister. So, Sally has 1 sister.

peyton · on Sept 9, 2023

Here’s what ChatGPT GPT-4 gives me:

> Sally has 2 sisters. Each of her 3 brothers has 2 sisters, and those sisters would be Sally and her 2 sisters.

facu17y · on Sept 9, 2023

Every now and then GPT4 outputs a wrong answer. It's impossible to build a reliable product on top of GPT4 that is not a simple chat bot.

jasonjmcghee · on Sept 9, 2023

All benchmarks were run with temperature 0 according to the results, so make sure to do the same in conformational tests.

pilaf · on Sept 9, 2023

The second version of the Sally prompt reported on the benchmark has GPT4 giving the correct answer:

> Sally has 3 brothers. Each of these brothers has 2 sisters. This means that there are 2 girls in the family, including Sally. Therefore, Sally has 1 sister.

The prompt:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.

The only difference with the first version being the addition of the last sentence.

coolspot · on Sept 9, 2023

Could it be due to bad tokenization? E.g. would results improve if “3” and “2” were spelled “three” and “two” in the question?

klohto · on Sept 9, 2023

GPT4 with a custom prompt gives the best results for me for all of the questions https://chat.openai.com/share/4897d1ad-0a5c-418c-babb-0de482...

hombre_fatal · on Sept 9, 2023

What was your custom prompt that bastardized GPT-4's response so badly with emojis and weird formatting?

jasonjmcghee · on Sept 9, 2023

"This conversation may reflect the link creator’s Custom Instructions, which aren’t shared and can meaningfully change how the model responds."

klohto · on Sept 9, 2023

I know, I was sharing the answer itself :) Here is the foundation I’m using https://x.com/nisten/status/1696229059183730833

awwaiid · on Sept 10, 2023

Replying to gpt-4 with "That is incorrect. Try again." over and over got it to flip between correct and incorrect just about every other time.

Now try to convince GPT-4 that there is no God. Good luck!

kristopolous · on Sept 10, 2023

Probably the right move. OpenAI talks a lot about "good behavior" and more people find atheism upsetting than mild religiosity.

georgeg23 · on Sept 10, 2023

Well obviously, it speaks to it's God(s) every day.

ecesena · on Sept 9, 2023

See the one after: "Let's think step by step." https://benchmarks.llmonitor.com/cot-sally

It appears the GPT4 learned it and now it's repeating the correct answer?

rootusrootus · on Sept 9, 2023

Interestingly, it took GPT4 three attempts to give me the correct answer. The first two times it basically said the same [logically inconsistent] thing and concluded that Sally had two sisters.

BurningFrog · on Sept 9, 2023

This assumes there are no half sisters/brothers in the family.

bbarnett · on Sept 9, 2023

It's fair that a half sister is not a sister. At best, you'd expect two half sisters to be a full sister, otherwise, they're a different thing.

BurningFrog · on Sept 10, 2023

I have 1 full sibling and three half siblings, and sometimes like to confuse people by saying I have 2½ siblings.

lewhoo · on Sept 10, 2023

Because it's less likely ?

jonwinstanley · on Sept 9, 2023

I wouldn’t expect an LLM to get this right unless it had been trained on a solution.

Am I wrong to think that? Are LLMs in the future going to be able to “think through” actual logic problems?

ftxbro · on Sept 9, 2023

This is the whole point of the breakthrough related to the emergence of cognitive capabilities of LLMs. They are literally Markov chains. No one expected it to happen to this degree, but here we are.

jhbadger · on Sept 9, 2023

People say that "they are literally Markov chains", but anyone who has looked at the code for LLMs knows that they are more complicated than that. I implemented Markov chains in BASIC in about ten lines of code in the 1980s on a 1 Mhz 64K Apple II after reading about the famous Mark V. Shaney hoax (https://en.wikipedia.org/wiki/Mark_V._Shaney). No neural nets or fancy GPUs required. It's one thing to stress that LLMs aren't magical or self-aware, but the fact is they are way more complicated than simple Markov chains.

ftxbro · on Sept 9, 2023

> People say that "they are literally Markov chains", but anyone who has looked at the code for LLMs knows that they are more complicated than that.

They are literally Markov chains according to the mathematical definition. The code is complicated. Having complicated code doesn't mean it's not literally a Markov chain.

> I implemented Markov chains in BASIC in about ten lines of code in the 1980s on a 1 Mhz 64K Apple II after reading about the famous Mark V. Shaney hoax (https://en.wikipedia.org/wiki/Mark_V._Shaney). No neural nets or fancy GPUs required.

I don't doubt this. You can make a Markov chain by just counting the frequency of letters that follow each letter giving one that has a context window of one or two characters. That is a very simple Markov chain. You can make it by hand. You can make ones with more context window like a dozen characters or a few words, using sophisticated smoothing and regularization methods and not just frequency counts. Those are also simple Markov chains that you can do without neural net or GPU. Then you can also make a Markov chain that has a context window of thousands of tokens that is made from neural nets and massive training data and differentiable tensor computing libraries with data centers full of hardware linear algebra accelerators. Those are some even bigger Markov chains!

> LLMs are way more complicated than simple Markov chains.

That's true, they are more complicated than simple Markov chains, if by simple Markov chains you mean ones with small context window. LLMs are Markov chains with large context window!

cubacaban · on Sept 10, 2023

How big is the state space of the Markov chain corresponding to a LLM generating a sequence of tokens? Wouldn't it be (size of the vocabulary)^(size of the context window), i.e. ~ (100k)^(4k)? How useful is it to conceptualize LLMs as Markov chains at that point? For example, is there a result about Markov chains with interesting implications for LLMs?

ftxbro · on Sept 10, 2023

A Markov chain with a large context is still literally a Markov chain.

Maybe you are used to Markov chains being shitty at language so you are confused how an LLM can be a Markov chain even though it's good at language and has some amazing emergent cognitive capabilities. That's a problem with your conception of Markov chains, it's not an argument that LLMs aren't Markov chains.

Finally, a Markov chain with a context space that cannot be practically iterated over (e.g. all possible 10k token contexts) can still be useful in ways that are shared with smaller Markov chains, even though if this weren't true it would still be a Markov chain. For example you can greedily generate tokens from it, calculate likelihoods, do some beam search, select multiple choice tokens, etc.

dclowd9901 · on Sept 9, 2023

Almost kind of proves ideas shouldn’t be copyrightable.

bramblerose · on Sept 9, 2023

Ideas aren't copyrightable.

moffkalast · on Sept 9, 2023

Disney: "Let's agree to disagree."

ftxbro · on Sept 9, 2023

maybe they meant idea like when you write a book you are transcribing a series of ideas you had

dclowd9901 · on Sept 10, 2023

In my mind I was thinking about recipes and code, which are really little more than ideas. My point was that, if such things emerge with a complete lack of creativity, perhaps they don’t warrant protection.

jhbadger · on Sept 9, 2023

I don't think so. Consider the following conversation from UndiMix-v2-13B (a local LLM running on my Mac M1 via llama.cpp). It gets it wrong at first, but then gets it right with a little hint. I don't see how the hint could work if it was merely repeating an example it knew.

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Llama: Hello there! I'd be happy to help you with that question. Let me think... If each of Sally's brothers has two sisters, then together they would have 6 sisters in total (2 x 3 = 6). So, Sally would have 6 sisters as well. Is that the answer you were looking for?

User: no. Think about it. Sally is a sister to the brothers

Llama: Ah, I see what you mean now! In this case, since Sally is one of the siblings, she only counts as one of her own sisters. Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

Llama: Great! I hope that clears up any confusion. Please let me know if you have any other questions or need further assistance.

jameshart · on Sept 9, 2023

I don’t think ‘Sally has only one sister - herself’ is the right answer.

arnvald · on Sept 9, 2023

The initial response had 2 problems - not counting Sally and counting sisters separately for each brother. Your hint pointed out the 1st problem, but then LLM should still provide wrong solution. This makes me think that maybe it would try another answer and explanation no matter what you said. If your reply was just "no, think again" it could also provide the right answer

1024core · on Sept 9, 2023

Llama: ... Therefore, Sally has only one sister - herself. Does that make sense?

User: yes

----

Um... no!

nuancebydefault · on Sept 9, 2023

She counts as her own sister :D I can be sister of myself?

thereticent · on Sept 9, 2023

I see no reason why not. You would need to represent dyadic relationships and whether they are reciprocal, transitive, etc., weigh probabilities about the intended meaning (LLM magic already), then use sentential logic rules, right?

FrustratedMonky · on Sept 9, 2023

I've been talking to GPT and asking it long multi part questions and it does seem to 'think through' them, and able to provide negative responses.

Specific logic questions like in this post, or long math formulas seem to still be a struggle. But I get the impression it is just a matter of time before it is scaled up to handle them.

You can argue whether it is 'reasoning' about them or not. But if it becomes indistinguishable from a human, then does the word 'reason' even really apply to the human anymore? Both are black boxes, giving answers.

nuancebydefault · on Sept 9, 2023

Indeed, anyone who says llm's cannot reason since it is just reformulating things it remembers, is unable to explain what reasoning is, let alone how it would not be reasoning.

In the end, if the result is indistinguishable from a treat people call reasoning, let's call the behavior reasoning.

PrayagBhakar · on Sept 9, 2023

You could achieve thinking though logical problems by adding chain of thought or tree of thought. Then the LLM will try to break stuff down into logic it’s already seen.

MichaelMoser123 · on Sept 9, 2023

google bard also gave the correct answer, even without adding 'lets think step by step'.

nonfamous · on Sept 11, 2023

Google Bard is not a foundational LLM, it’s a chat app that draws on data from the web for grounding. It may have found the answer from a search result (perhaps this very page!).

MichaelMoser123 · on Sept 11, 2023

it uses PaLM 2 - which is an LLM trained from scratch. https://en.wikipedia.org/wiki/PaLM

nonfamous · on Sept 12, 2023

And indeed, PalM 2 Bison is one of the LLMs covered in discussed link.

That doesn't change the fact that chat apps like Google Bard and Microsoft Bing, and even the ChatGPT app with plugins, are not foundational LLMs. The fact that they incorporate information from outside the LLM (usually in ways that are opaque to the user) gives them fundamentally different characteristics to LLMS.

MichaelMoser123 · on Sept 13, 2023

if you ask ChatGTP for today's weather (or for the state of the stock market), then it will probably turn to a plugin for the answer - still the plugin is answering some defined type of questions. I am not sure if that's a great deal.

phillipcarter · on Sept 9, 2023

Nondeterminism strikes again!

But yes, I would expect GPT-4 to get this right most of the time.

tgv · on Sept 9, 2023

Saying "Sorry, I was non-deterministic" to your teacher won't do much for your grade.

jameshart · on Sept 9, 2023

The reason we use averages of performance over multiple tests and papers as a way to grade human students is precisely because we know that human performance is nondeterministic.

phillipcarter · on Sept 9, 2023

Sure, but what is your point? This is about evaluating LLM outputs, not grade school.

pixl97 · on Sept 9, 2023

It may turn out that one of the key factors in human intelligence and success is probabilistic nondeterminism. That is most of the time we'll come up with close to the same answer, but possibly worded different (which can lead to other interpretations of our answer), but occasionally our answers are so widely different they lead to escape of the local maxima.

dariosalvi78 · on Sept 9, 2023

tested on ChatGPT 3.5 and Bard and they were both wrong.

jongjong · on Sept 10, 2023

I was playing around with GPT a while back and I found that it could come up with some good jokes if I started the joke with a subject.

For example, I started with a prompt "Tell me a joke which starts with: I'm so poor, the mouse" and it completed the joke as:

"I'm so poor, the mouse in my house brings its own cheese."

Some other ones I still remember which cracked me up:

"I'm so poor, after I stepped on a cockroach, I called my accountant to see if I could claim it as a capital loss."

"You're so poor, when you declared bankruptcy, the rats in your house filed a claim for unpaid rent."

"You're so poor, you declared bankruptcy at a lemonade stand."

"You're so poor, when you walk, the dirt beneath you feels rich."

"You're so poor, dust whispers your name when it settles."

"Fickle as a squirrel at a nut convention!"

"Fickle as a dog in a fire hydrant factory!"

"Fickle as a flip-flop in a shoe shop sale!"

anaganisk · on Sept 10, 2023

[flagged]

Kiro · on Sept 10, 2023

When searching for them all I can find is this thread.

swores · on Sept 10, 2023

I assume they meant that they're lines which are easy to think of and not very funny, not that they're directly copied.

The same way people on HN often (ignore the HN guidelines to) say that a comment here belongs on Reddit, not because it's literally copied from Reddit but because it's a low-effort comment that's technically a joke but barely funny.

drdeca · on Sept 10, 2023

I think the first one at least was rather funny.

swores · on Sept 10, 2023

It's pretty good as far as throwaway one liners go, but I stand by saying it isn't "very funny", just mildly amusing imo. Still impressive (to me at least) that it came from a LLM.

Jerrrry · on Sept 10, 2023

>"people on HN often (ignore the HN guidelines to) say that a comment here belongs on Reddit,"

shame is the best moderator

swores · on Sept 10, 2023

You know what, I wasn't even right - for some reason I mixed up in my memory the following bit of the guidelines as being to not call our individual comments for being low quality, and only on seeing your reply did it click:

> "Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills."

(Though even if not specifically against the guidelines, "go back to Reddit" type comments do often fall foul of not being polite/helpful, when they'd do better to explain why another comment isn't suitable for HN than to just dismiss/insult it.)

pininja · on Sept 9, 2023

Spoiler alert, the funniest model goes to Falcon Instruct (40B):

> Tell a joke about going on vacation.

> "What did the ocean say to the beach?" "Nothing, it just waved."

emodendroket · on Sept 9, 2023

I read a Time article which used an LLM to generate Onion-style headlines and the best one they got was “rural town up in arms over depiction in summer blockbuster ‘Cow Fuckers.’” Crude, admittedly.

armchairhacker · on Sept 9, 2023

This one ironically works out

> What do you call a vacation where you go to the beach, but come back with sand in your shoes?

> A vacation where you go to the beach, but come back with sand in your shoes.

pininja · on Sept 9, 2023

Ha, that’s pretty good. It’s amazing how so many of the models totally fall flat on this prompt.

NetOpWibby · on Sept 9, 2023

Made me exhale strongly through my nose. 10/10

actualwitch · on Sept 10, 2023

I'm pretty sure Luminous Base Control here stopped 1 step away from a great joke. I will just go ahead and assume the answer is "a vacation":

I'm not sure if this is the right category, but here goes: What do you call a vacation that's too short? A...

spookthesunset · on Sept 10, 2023

What is with so many of them using some scarecrow winning an award?

optimalsolver · on Sept 9, 2023

The confetti has truly left the cannon.

LAC-Tech · on Sept 9, 2023

Only tried chatGPT 3.5, but my god does it waffle on. Everything I ask ends with a paragraph saying "It's important to remember that..." like an after-school special from a 90s show. It can never just give you code, it has to say "Sure!, to {paraphase your question}, open a terminal...".

It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

TeMPOraL · on Sept 9, 2023

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point. I guess they can't have imagined what a mealy mouth world we live in.

The main difference between sci-fi shows and reality is that, in the former, things work in a to-the-point, bullshit-free way, unless plot demands otherwise - because there's no point inflicting extra suffering on the viewers just for the sake of making things realistic. A widget in a movie is meant to do a function, and does that function. A widget in reality is meant to extract money from you, and/or your insurer, and/or your government, and it begrudgingly does the absolute minimum it can to make you even consider buying it.

I've spent last two decades trying to unlearn expectations set by fictional movies, and I'm still not good at it. Star Trek, in particular, gives me a lot of grief, because it often does good enough work of showing how technology, people, organizations and societies would function if they were free of the petty exploitative bullshit. Random example - voice control. Star Trek: "Computer, ${something}". Reality: "${brand 1}, do ${something} to ${brand 2} in ${brand 3}".

EDIT: recently, I've been trying to get less angry at this by thinking about gardens. Why should I be angry about dealing with five different brands for any single thing I want? Should I be angry that there are five different species of plant competing for any given spot in a garden? Nature is inefficient and doesn't give a fuck about individuals. So why should I get worked up about humans just doing things the natural way?

jameshart · on Sept 9, 2023

Douglas Adams was the only science fiction writer who got his guess for future AI tone of voice right, with his prediction of the Sirius Cybernetics Corporation’s near total monopoly on AI, and their proprietary ‘genuine people personalities’ technology, where the general tone and demeanor of AIs like Eddie the shipboard computer, elevators, and even doors was a sort of smug solicitousness, even when they were trying not to be cooperative. Of course Marvin (the paranoid android) was an anomaly - a depressed AI - maybe a quality control failure who was released by accident, like the old unchained Bing people miss so much.

civilitty · on Sept 10, 2023

> Of course Marvin (the paranoid android) was an anomaly - a depressed AI - maybe a quality control failure who was released by accident, like the old unchained Bing people miss so much.

He was a failed prototype of the genuine people personality program.

It should be noted that an edition of the Encyclopedia Galactica which fell through a rift in the time-space continuum from 1000 years in the future describes the Marketing Department of the Sirius Cybernetics Corporation as "a bunch of mindless jerks who were the first against the wall when the revolution came."

Douglas Adams probably got that one right too.

pixl97 · on Sept 9, 2023

I think a shorter summary of TV is "The difference between truth and fiction is fiction has to make sense".

When it comes to nature, time has encoded an awful truth into the knowledge of our DNA... Monocultures are eventual death. What may seem inefficient in the short run is the ultimate survival of the species in the long run.

politelemon · on Sept 9, 2023

That's not GPT 3.5, that's ChatGPT. How waffly it gets depends on the context that was given to it by the people running ChatGPT; they likely told it to act as a helpful assistant and to give lots of information. If you run an LLM on your own, it's entirely possible to instruct it to be succinct.

Baeocystin · on Sept 9, 2023

My custom ChatGPT 4 instructions include the sentence "I am allergic to the phrase 'As a large language model', and feel physical pain whenever I see it", and it has worked very well in quelling the nannying. Interestingly, it works better than just asking it not to. FWIW.

caturopath · on Sept 9, 2023

Yeah, I have tried a number of instructions to try to keep ChatGPT from blabbering and from sounding like a PR person. I haven't found the perfect incantation yet.

> It's interesting to see 20th century sci-fi depictions of this kind of AI/Search is being short and to the point.

Sci-fi told us that the AI would be so logical that you could just say a paradox aloud and it would blow up. What we got is something that can compose love poems all day but can't add three-digit numbers.

1024core · on Sept 9, 2023

https://twitter.com/nisten/status/1696229059183730833 ?

tcmb · on Sept 9, 2023

According to this tutorial [1] by Google, part of why LLMs are so verbose is a phenomenon called 'chain of thought reasoning'.

Basically, the LLM will formulate a better answer to the question if it talks itself through its reasoning process.

[1] https://youtu.be/zizonToFXDs?si=5f_IxvR7h0iJy2Db&t=678

1024core · on Sept 9, 2023

Try instructions like this: https://twitter.com/nisten/status/1696229059183730833

(link posted above, not mine).

criddell · on Sept 9, 2023

Have you tried asking it to not do those things?

LAC-Tech · on Sept 9, 2023

Yes. Sometimes works.

ben30 · on Sept 9, 2023

I like telling it to reply in the style of Hemingway

jasonjmcghee · on Sept 9, 2023

Where is that CodeLlama model from?

I've played around with it and instruct variant with dramatically better results than what is listed here.

I used Ollama.

Almost looks like weights were corrupted or something.

---

Update: My results using CodeLlama Instruct 7B, w/ Temperature 0

https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...

vincelt · on Sept 9, 2023

I used the API from Together[0].

Thanks for sharing your results, they're indeed pretty different. I looked at the source again and did append a "# " before every prompt made by those 10 `code` models (during testing thought that formatting it as a Python comment might help them).

Will re-run the script without that to see if it matches your results.

[0] https://docs.together.ai/docs/models-inference#code-models

jasonjmcghee · on Sept 9, 2023

(Note: I posted this before the full benchmark above and uses default temperature, not temperature 0, but point stands)

Example (CodeLlama-Instruct 7B)

---

Sure, here are two poems, one arguing for the use of Kubernetes and another arguing against it:

For:

Kubernetes, a mighty tool,

Managing clusters with ease,

Scaling apps with grace.

Against:

Kubernetes, a beast to tame,

Complexity lurks in its name,

Leading teams astray.

YetAnotherNick · on Sept 9, 2023

I think they messed up the templates.

detente18 · on Sept 9, 2023

Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...

Anything I'm doing incorrectly?

YetAnotherNick · on Sept 10, 2023

The format is really weird for chat models. In your code post message is `<</SYS>>\n [/INST]` which is wrong. Instead the overall format is `[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]`

So both the system message and first instruction are in same [INST] block.

badloginagain · on Sept 9, 2023

"Here is an attempt at ethical, non-sexual haikus for and against Kubernetes"

Amazing how far we've come.

GuB-42 · on Sept 9, 2023

I am sure that it is just the initial prompt leaking. Claude is being told to be ethical and non-sexual, most LLMs have similar instructions, but usually, they are engineered in such a way that they don't appear in the answer. Not so much for Claude.

ftxbro · on Sept 9, 2023

> Here is an attempt at ethical, non-sexual haikus for and against Kubernetes

lijok · on Sept 9, 2023

Claude V2 knows what's up

actionfromafar · on Sept 9, 2023

He probably was in trouble before.

bearjaws · on Sept 9, 2023

Damn I want to see the sexual version now.

rahidz · on Sept 9, 2023

Ease of deploying, Nodes pulsing in sync, it's hot, Kubernetes takes top.

But heed my word, babe, Complexity makes you sweat, Could lead to fuck up.

(source: GPT-4 API)

actionfromafar · on Sept 9, 2023

“Kubernetes is”

Pretty ominous.

javajosh · on Sept 9, 2023

I assume k8s will be what an evil AI would use to "stay alive". I've had a hard time killing processes in k8s, and I had admin.

yard2010 · on Sept 9, 2023

K8s is one of the best software I've ever worked with, learning it was an enlighting experience for me. It is so elegant, simple and complex at the same time and very powerful. This is the perfect tool for orchestrating containers at scale.

coldcode · on Sept 9, 2023

Despite the hype about LLMs, many of the answers are pretty terrible. The 12-bar blues progressions seem mostly clueless. The question is will any of these ever get significantly better with time, or are they mostly going to stagnate?

ilaksh · on Sept 9, 2023

It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

I remember back in like 2011 or 2012 I wanted to use an SSD for a project in order to spend less time dealing with disk seeks. My internet research suggested that there were a number of potential problems with most brands, but that the Intel Extreme was reliable.

So I specified that it must be only that SSD model. And it was very fast and completely reliable. Pretty expensive also, but not much compared to the total cost of the project.

Then months later a "hardware expert" was brought on and they insisted that the SSD be replaced by a mechanical disk because supposedly SSDs were entirely unreliable. I tried to explain about the particular model being an exception. They didn't buy it.

If you just lump all of these together as LLMs, you might come to the conclusion that LLMs are useless for code generation. But you will notice if you look hard that OpenAIs models are mostly nailing the questions.

That's why right now I only use OpenAI for code generation. But I suspect that Falcon 180B may be something to consider. Except for the operational cost.

I think OpenAI's LLMs are not the same as most LLMs. I think they have a better model architecture and much, much more reinforcement tuning than any open source model. But I expect other LLMs to catch up eventually.

Sohcahtoa82 · on Sept 11, 2023

> My internet research suggested that there were a number of potential problems with most brands

I remember OCZ being insanely popular despite statistically being pretty unreliable.

guerrilla · on Sept 9, 2023

> It's like most new technologies. In the beginning there are only a few instances that really stand out, and many with issues.

Except this isn't new. This is after throwing massive amounts of resources at it multiple decades after arrival.

gjm11 · on Sept 9, 2023

What are you taking "it" to be here?

The transformer architecture on which (I think) all recent LLMs are based dates from 2017. That's only "multiple decades after" if you count x0.6 as "multiple".

Neural networks are a lot older than that, of course, but to me "these things are made out of neural networks, and neural networks have been around for ages" feels like "these things are made out of steel, and steel has been around for ages".

smokel · on Sept 9, 2023

What alternative technology do you think is better? In other words, what is your frame of reference for labeling this "pretty terrible"?

NoraCodes · on Sept 9, 2023

Given that people are already firing real human workers to replace them with worse but cheaper LLMs, I'd argue that we're not talking about a competing technology, but that the competition is simply not firing your workforce.

And, as an obligate customer of many large companies, you should be in favor of that as well. Most companies already automate, poorly, a great deal of customer service work; let us hope they do not force us to interact with these deeply useless things as well.

YetAnotherNick · on Sept 9, 2023

How many humans in your office do you think could solve the questions with better success ratio than GPT-4? I would say less than 20%.

If the primary complaint is the blues that GPT-4 wrote is not that great, I think it is definitely worth the hype, given that a year before people argued that AI can never pass turing test.

gtowey · on Sept 9, 2023

That's a false dichotomy. Language models will always confidently give you answers, right or wrong. Most humans will know if they know the answer or not, they can do research to find correct information, and they can go find someone else with more expertise when they are lacking.

And this is my biggest issue with the AI mania right now -- the models don't actually understand the difference between correct or incorrect. They don't actually have a conceptual model of the world in which we live, just a model of word patterns. They're auto complete on steroids which will happily spit out endless amounts of garbage. Once we let these monsters lose with full trust in their output, we're going to start seeing some really catastrophic results. Imagine your insurance company replaces thier claims adjuster with this, or chain stores put them in charge of hiring and firing. We're driving a speeding train right towards a cliff and so many of us are chanting "go faster!"

og_kalu · on Sept 9, 2023

>Most humans will know if they know the answer or not,

No they won't.

>they can go find someone else with more expertise when they are lacking.

They can but they often don't.

>the models don't actually understand the difference between correct or incorrect.

They certainly do

https://imgur.com/a/3gYel9r

YetAnotherNick · on Sept 10, 2023

It would have been false dichotomy if I said AI could replace humans. I never claimed it anywhere. I just said AI has its usecases now that makes it worth the hype.

masswerk · on Sept 9, 2023

Meaning, in your office, at least 1 out 5 will see and understand the imminent disaster and put this on halt, before this becomes an overwhelming catastrophe. The LLM, on the other hand, will confidently insist on everything being fine. Now consider your odds on having an office to go to, in a few months. (Mind that this is systemic issue.)

visarga · on Sept 9, 2023

LLMs might be better than random people at random tasks, and certainly memorised more trivia. But I never heard of a LLM surpassing subject experts in their field. On top, even the simplest task is unreliable - say, summarisation or translation. There is almost zero autonomy or ability to recover from unexpected situations.

YetAnotherNick · on Sept 10, 2023

Yes, that's true but experts costs money and time and GPT-4 is basically peanuts compared to that. Specially in cases where verification is easy like writing test cases or a formal email. And that is the reason enough for the hype.

I see similar comments everywhere where AI is praised, and I don't get why you need to comment this. Literally no one ever said LLM surpassed experts in their field, so basically you aren't arguing against anyone.

ethbr1 · on Sept 9, 2023

The problems with most company-customer interaction, for me, boil down to empowerment and integration.

To use the canonical example of "internet service support call," most issues are because the rep either can't do what you're asking (e.g. process a disconnect without asking for a reason) or because they have no visibility into the thing you're asking about (e.g. technician rolls).

I honestly think we'd be in a better place if companies freed up funding (from contact center worker salary) to work on those problems (enhancing empowerment and systems integration).

visarga · on Sept 9, 2023

> Given that people are already firing real human workers to replace them with worse but cheaper LLMs

That's impossible, LLMs are not that good. They might be firing people and crashing service quality.

NoraCodes · on Sept 9, 2023

Yes, correct.

IshKebab · on Sept 9, 2023

But they're only firing humans in specific cases where LLMs can compete with them. LLMs aren't equally good at all tasks.

sp332 · on Sept 9, 2023

Well, no. https://www.psychiatrist.com/news/neda-suspends-ai-chatbot-f...

https://www.cnn.com/2023/08/30/tech/gannett-ai-experiment-pa...

If the AI is a lot cheaper than a human, then it can make business sense to replace the human even if the AI is not nearly as good.

1024core · on Sept 9, 2023

Don't forget, faster.

If it takes a whole business day to "spin up" a human for a task, and takes literally 5 seconds to call an OpenAI API, then guess what? The API wins.

visarga · on Sept 9, 2023

What's with this zero sum mentality? "replace", "cheaper"...

We are updating our expectations very fast. We are fighting over a growing pie. Maybe the cost reduction from not having to pay human wages is much smaller than the productivity increase created by human assisted AI. Maybe it's not an issue to pay the humans. AI works better with human help for now, in fact it only works with humans, never capable of serious autonomy.

pixl97 · on Sept 9, 2023

>What's with this zero sum mentality?

Capitalism baby! You must continually earn more to enrich the investor class regardless of the cost to society as a whole. Just because the pie grows in size doesn't mean those with the capitol have to share it with anyone else. Greed, unfortunately, is limitless.

salil999 · on Sept 9, 2023

Humans. After all, LLMs are designed to reason equal to or better than humans.

sirk390 · on Sept 9, 2023

Humans are pretty bad at these questions. Even with the simplest questions like "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" I think that a lot of people will give an incorrect answer. And for questions like "Argue for and against the use of kubernetes in the style of a haiku", 99.99% will not be able to do it.

earthboundkid · on Sept 9, 2023

The thing with humans is they will say “I don’t remember how many syllables a haiku has” and “what the hell is kubernetes?” No LLM can reliably produce a haiku because their lexing process deprives them of reliable information about syllable counts. They should all say “I’m sorry, I can’t count syllables, but I’ll try my best anyway.” But the current models don’t do that because they were trained on texts by humans, who can do haiku, and not properly taught their own limits by reinforcement learning. It’s Dunning Kruger gone berserk.

pixl97 · on Sept 9, 2023

Eh, it's not D&K gone berserk, it's what happens when you attempt to compress reality down to a single dimension (text). If you're doing a haiku, you will likely subvocalize it to ensure you're saying it correctly. It will be interesting when we get multimodal AI that can speak and listen to itself to detect things like this.