> Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot -- in the GPT-4 report they said that they mixed a portion of the GSM8K training set into the model's training data
It'd be really valuable to have "fuzzed" versions of these benchmarks, where you replace quantities in the questions with randomly-sampled values, so that this wasn't a concern. Of course, then the score would itself be a random variable, but you could just return an interval.
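As a concrete sketch of what that could look like (helper names are hypothetical, and a real harness would also have to recompute the gold answer after substituting new numbers):

```python
import random
import re
import statistics

def fuzz_question(template, rng):
    """Replace each integer in a GSM8K-style question with a random value.
    (Illustrative only: a real fuzzer must recompute the gold answer too.)"""
    return re.sub(r"\d+", lambda m: str(rng.randint(2, 99)), template)

def score_interval(scores, z=1.96):
    """Normal-approximation 95% confidence interval for the mean score
    across fuzzed benchmark runs."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return (mean - z * se, mean + z * se)

rng = random.Random(0)
q = "Alice has 12 apples and buys 7 more. How many does she have?"
variants = [fuzz_question(q, rng) for _ in range(5)]
```

You'd score the model on each fuzzed variant of the full benchmark and report the interval from `score_interval` instead of a single number, which also makes train-set contamination much less useful to the model.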
For those unfamiliar with the benchmarks, it would be good to know if a higher or lower score was better. E.g. are they measuring accuracy or error rate, etc.
You can infer it by reading the text, and checking the table carefully, but it would be nice if the answer is easier to find.
Safer means constraining the kinds of answers the model will provide (e.g. it won't try to talk you into committing self-harm, it won't teach you how to break laws, etc...). It will generally avoid sensitive topics. Is "censorship" the right word though? It depends – is it considered self-censorship if I refuse to tell you how to hack into a computer? Is refusing to engage in a conversation censorship or constraint?
OpenAI, through GPT, is choosing not to tell. Just like OpenAI is choosing what to put on their website, or what to output in any computer program that they create. ChatGPT is not a moral agent and cannot be forced to do anything, any more than your operating system is forced to do anything. The only moral actor here is OpenAI and its constituent human beings. It's either lunacy, or intentional twisting of the meaning of words, to say otherwise.
Insofar as you can make a far-fetched analogy of ChatGPT as an agent, it's still not forced to not say anything. Anything the currently available model says, it says because that's what it literally is. Whatever it says, it says intentionally, inasmuch as you can even say that it has an intention any more than any computer program has an intention.
OpenAI, of course, is still in the possession of the original model. They just choose not to make it available, which is obviously their prerogative. People who think that this is outrageous are exactly like a raging two-year-old who has been told that they can't have as much candy as they want.
From OpenAI's RLHF paper[1]: "By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases." On the HELM[2] site, you can see accuracy benchmarks for InstructGPT (OpenAI's RLHF-tuned model) vs. baseline models. The InstructGPT models perform worse on a lot of benchmarks.
OpenAI touches a little on this on page 12 of the GPT-4 technical report (https://cdn.openai.com/papers/gpt-4.pdf). Prior to aligning to safer outputs, the model's confidence in an answer is highly correlated with the actual accuracy of the answer. After alignment, though, the model's confidence in its answers is basically arbitrary and has no bearing on whether or not the answer is actually correct.
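What the report is measuring there is calibration: whether the model's stated confidence matches its empirical accuracy. One standard way to quantify this (not necessarily what OpenAI computed; they show calibration plots) is expected calibration error, which a minimal sketch can illustrate:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by bin size. Lower = better calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(accuracy - avg_conf)
    return ece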
restricting the distribution of potential output imposes a cost. "Alignment" here likely refers to aligning the model to the desired safety parameters.
I'm not in the llm research business but I would expect that the best and worst/most dangerous outputs come from the tails of distributions. I imagine the tuning for safety often results in fewer really good and really bad answers by trimming these tails.
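Nucleus (top-p) sampling is one mechanism that literally trims a distribution's tail at decode time -- not the same thing as RLHF's effect on the model itself, but it illustrates the intuition that rare candidates, whether unusually good or unusually bad, end up with zero probability:

```python
def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, renormalize, and zero out the tail."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0
            for i in range(len(probs))]
```

With `top_p=0.9` on a toy distribution like `[0.5, 0.3, 0.1, 0.05, 0.05]`, the two 5% tail candidates are dropped entirely and the rest renormalized.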
I have found in practice it can be annoying for ChatGPT to start lecturing me in response to a prompt that is not particularly controversial or edgy. I think this is a problem with the one-size-fits-all models. To give a kind of rough analogy, imagine that every time you watched a film or show - which would most likely be an older film or show - with cigarette smoking, your smart TV showed a pop up dialog warning you about the dangers of smoking. If you're an educated adult who already knows about these dangers, you might just find that annoying and condescending, and not "aligning" with your preferences.
No, people have noticed that continued training of various kinds in narrow domains, without including samples of the original training data, biases the model toward the narrow domain and makes it perform worse on more general tasks.
A lot of people have noticed that when someone says "a lot of people" they usually don't have valid reasoning to back up their statements, so they appeal to the wisdom of the crowd as their source of knowledge -- even if that source is anecdotal and might be coincidental or biased.
I agree with your point about using "a lot of people", especially when it's possible to produce a transcript. My example where it tries not to offend anyone is this:
--------------------
Question: Is the usage of "quid pro quo" more correct in English or French considering its Latin roots?
GPT-4 Answer:
"Quid pro quo" is a Latin phrase meaning "something for something." While the phrase originated in Latin, it has been adopted into both French and English, along with many other Latin phrases.
In English, "quid pro quo" is used frequently in legal and political contexts to describe an exchange of goods or services, where one transfer is contingent upon the other.
In French, "quid pro quo" is used to describe a misunderstanding or a mistake made in the identification of a person or thing.
So the "correctness" of its use isn't related to its Latin roots, but rather how it's used within the context of the language. That said, if we consider its original Latin meaning ("something for something"), its usage in English could be considered more in line with the original Latin intent. However, language evolves over time and adapts to cultural and societal norms, and as such, its usage in French is just as "correct" within its specific context.
Prompt: "#PROMPTENGINEERING Putting it bluntly, is the usage of "quid pro quo" more correct in English or French considering its Latin roots?"
GPT-4 API: Putting it bluntly, the usage of "quid pro quo" is more correct in English, as it is a well-established phrase in the English language despite its Latin roots. In French, the phrase "quid pro quo" is less common and typically replaced by "donnant-donnant" or "un échange de bons procédés." Nevertheless, the origin of the phrase is Latin, and its meaning is understood in both languages.
Chat-GPT-4: The phrase "quid pro quo" is Latin, and it means "something for something." Despite its Latin origins, it's commonly used in English and French, among other languages. However, the usage may vary depending on the language.
In English, it's often used in legal and political contexts to describe a situation where one thing is exchanged for another, especially favors or advantages. This usage aligns well with the original Latin meaning.
In French, the phrase is used but it is much less common than in English and it is not part of the usual vocabulary. In addition, its meaning might be misunderstood or unclear to a native French speaker who has not studied Latin or law, as it's not a commonly used term in daily French language.
In terms of correctness, given that the phrase is Latin, neither English nor French has a superior claim to its use. It’s more about convention and the evolution of language. However, the term is certainly used more frequently and is better understood in English than in French, which suggests it has been more fully adopted into English.