
I don't want to speak for the OP, but one of the issues they may be poking at is the transference of intelligence. E.g., you are "trained" to play a game, but at some point someone decides to change the rules of the game. Human children can transfer their previous knowledge to the new rule set fairly well. The most intelligent children can do so very quickly without supplemental "training" on the game played under the new rules.

Humans certainly have flaws when it comes to this. I've heard some discussion about the successes and failures of AI in this regard. Can someone in this domain elaborate on the current state-of-the-art performance?




1. LLMs are general pattern machines. They can generalize and complete, in context, a wide variety of complex non-linguistic patterns. https://general-pattern-machines.github.io/

2. LLMs see positive transfer in multilingual capability. For example, an LLM trained on 500B tokens of English and 10B tokens of French will not speak French like a model trained on only 10B tokens of French. Instead, the model will be nearly as competent in French as it is in English. https://arxiv.org/abs/2108.13349

3. Language models trained on code reason better, even when the benchmarks have nothing to do with code.

https://arxiv.org/abs/2210.07128


This is interesting in the context of the other response that links to poor performance on counterfactuals. I wonder if it is related to how well one domain maps to another? E.g., can they transfer from English to French well because both share similar word classes (nouns, verbs, etc.)? But I believe other languages change more based on social context (e.g., Japanese) than English does. Would an LLM transfer just as well to the latter? In that case, my guess is humans would find it more difficult to transfer as well, so it may not be a good measure.

(Apologies to any linguists. Please correct anything above if I'm off).


I was just using French as an example. Korean and Japanese also transfer very well. Just as well? Not sure about that.

As for the other post, the degraded performance there is still highly non-trivial. Some of it isn't actually poor, just worse.

Even the authors admit humans would see degraded performance on counterfactuals unless given "enough time to reason and revise", something they don't try to do with GPT-4.

Think about it. Do you genuinely believe you would score as accurately on a multiplication test taken in base 8?


>Think about it. Do you genuinely believe you would score as accurately on a multiplication test taken in base 8?

No, but I believe this is a different question. I think the more relevant question is whether a human can (even with the caveat of needing more time to reason about it). The larger question for an LLM is whether it can answer it at all, and explain why, without additional training data.

The paper seems to show that an LLM's ability to transfer is related to proximity to the default case. E.g., if the default is base 10, it is better at base 9 than base 2. I would interpret that as indicating simple pattern recognition more than deductive reasoning. The implication being that real transference is more dependent on the latter.
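To make concrete what I mean by deductive reasoning here: the actual procedure for multiplication doesn't care which base you're in. A toy sketch (mine, not the paper's):

    # Toy sketch (not from the paper): the same routine multiplies in any base.
    # If you're applying the positional-notation principle, base 9 is no
    # "closer" to base 10 than base 2 is.
    def to_base(n, base):
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, base)
            out = digits[r] + out
        return out or "0"

    def multiply_in_base(a, b, base):
        return to_base(int(a, base) * int(b, base), base)

    print(multiply_in_base("17", "6", 8))  # "132" (15 * 6 = 90 in base 10)
    print(multiply_in_base("17", "6", 9))  # "116" (16 * 6 = 96 in base 10)

Being better at the bases near 10 looks more like recall of nearby memorized facts than like running a procedure such as that one.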


>No, but I believe this is a different question.

Arithmetic in a different base is one of the counterfactual examples in the paper. That's why I mentioned it. And yes, it can answer them, with worse performance.

You can already juice arithmetic performance as is with algorithmic instructions: https://arxiv.org/abs/2211.09066. I see no reason the same wouldn't work for other bases.
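To give a flavor of what I mean, adapted to base 8 off the top of my head (a paraphrase of the idea, not the paper's actual prompt):

    # My paraphrase of an "algorithmic instruction", not the prompt from the paper.
    BASE8_ADDITION_PROMPT = """
    Add the two numbers digit by digit, rightmost digit first.
    At each position, add the two digits plus the carry.
    If that sum is 8 or more, subtract 8 and carry 1 to the next position;
    otherwise the carry is 0. Show every intermediate step, then the final answer.

    Q: 157 + 64 (both in base 8)
    A:
    """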

Even if you gave a human substantial time for it (say a week of study), I believe they would almost certainly not reach the same accuracy unless they had access to specific instructions for working in base 8 that they could call upon when taking the test.


>Arithmetic in a different base is one of the counterfactual examples in the paper.

I know, that's why I referenced the proximity of bases seemingly being important to the LLM. I think this is what differentiates it.

>And yes, it can answer them, with worse performance.

Its accuracy is dependent on proximity to its training set (going back to my original point). I think that points to a different mechanism than in humans, and that's what my last post was focusing on.

I think we agree that humans would do less well in most bases other than base-10. But that side-steps the point I was making. Will humans do worse in base-3 than base-9? I doubt it, but according to the article, it's reasonable to assume the LLM would be progressively worse. That, IMO, is an indicator that something different is going on. I.e., humans are deriving principles to work from rather than just doing pattern recognition. Those principles can be modified on-the-fly to adjust to novel circumstances without needing additional training data. Humans are using reasoning in addition to pattern recognition.

This is probably a clunky example, but I'll try. Suppose an autonomous vehicle is trained to recognize that when a ball rolls into the street, it needs to slow down or stop because a child may not be far behind. A human can infer that seeing a kite blow into the street may warrant the same response, even though they've never witnessed a kite blow into the street. The question is: can the autonomous vehicle infer the same? (This shouldn't be conflated with the general case of "see object obstructing the street and slow down/stop." The case I'm drawing here specifically adjusts the risk based on the object being a child's toy. So, can the AV not only recognize the object as a kite but also adjust the risk accordingly?) I think one of the possible pitfalls is that we solve a simpler problem, like image/pattern recognition, and conflate that with solving the more difficult one.
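If it helps, here's the distinction in toy code (made up purely for illustration, obviously nothing like a real AV stack):

    # Toy illustration only, not a real system. "ball" was in the
    # training data; "kite" was not.
    TRAINED_CHILD_TOY_PRIOR = {"ball": 0.8}

    def child_nearby_risk(detected_object):
        baseline = 0.1  # generic "something is in the road" caution
        return TRAINED_CHILD_TOY_PRIOR.get(detected_object, baseline)

    print(child_nearby_risk("ball"))  # 0.8, learned from data
    print(child_nearby_risk("kite"))  # 0.1; the open question is whether the system
                                      # can infer this should also be high with no entry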

Circling back to the original point, one guess is that it's not understanding context so much as merely matching patterns really, really well. That can be incredibly useful, but it may be something different from what's going on in our heads, and maybe we should be careful not to conflate the two. Or it's possible that all we're doing is matching patterns in context too, and eventually LLMs will get there as well.


>I doubt it, but according to the article, it's reasonable to assume the LLM would be progressively worse.

I genuinely don't see how that would be a reasonable assumption.

>Will humans do worse in base-3 than base-9?

Why not? If you haven't learned base 3 but you have learned base 9, you'll do worse on it.

>That, IMO, is an indicator that something different is going on.

Whether something different is going on is about as relevant as the question of whether submarines swim, planes fly, or cars run.

>I.e., humans are deriving principles to work from rather than just pattern recognition.

Not really. Nearly all your brain does with sense data is predict what it should be and adjust your perception to fit. You can mold these predictions implicitly with your experiences but you're not deriving anything from first principles.

>This is probably a clunky example, but I'll try. Suppose an autonomous vehicle is trained to recognize that when a ball rolls into the street, it needs to slow down or stop because a child may not be far behind. A human can infer that seeing a kite blow into the street may warrant the same response, even though they've never witnessed a kite blow into the street. The question is: can the autonomous vehicle infer the same? (This shouldn't be conflated with the general case of "see object obstructing the street and slow down/stop." The case I'm drawing here specifically adjusts the risk based on the object being a child's toy. So, can the AV not only recognize the object as a kite but also adjust the risk accordingly?) I think one of the possible pitfalls is that we solve a simpler problem, like image/pattern recognition, and conflate that with solving the more difficult one.

Causal reasoning? All evidence points to LLMs being more than capable of that: https://arxiv.org/abs/2305.00050


>I genuinely don't see how that would be a reasonable assumption.

It's not an assumption. It's literally based on the results of your own reference:

>LM performance generally decreases monotonically with the distance

If you can't be bothered to read your own reference, I don't think additional conversation is worthwhile because it becomes apparent that it's more dogmatic than reasoned.

Your newest link is not really supportive of your "all evidence" claim. It goes into further detail about how LLMs can have high accuracy while also making simple, unpredictable mistakes. That's not good evidence of a robust causal model that can extrapolate knowledge to other contexts. If I didn't know better, I'd assume you could just as well be a chat bot who only reads abstracts and replies in an overconfident manner.


>It's not an assumption. It's literally based on the results of your own reference:

Human performance generally decreases with level of exposure so I figured you were talking about something else. Guess not.

>I don't think additional conversation is worthwhile because it becomes apparent that it's more dogmatic than reasoned

By all means, end the conversation whenever you wish.

>It goes into further detail about how LLMs can have high accuracy while also making simple, unpredictable mistakes.

I'm well aware. So? Weird failure modes are expected. Humans make simple, unpredictable mistakes that don't make any sense without the lens of evolutionary biology. LLMs will have odd failure modes regardless of whether it's the "real deal" or not, either adopted from the data or from the training scheme itself.

>If I didn't know better, I'd assume you could just as well be a chat bot who only reads abstracts and replies in an overconfident manner.

Now you're getting it. Think on that.


>Human performance generally decreases with level of exposure

Are you saying that as humans get more experience, they perform worse? I disagree, but irrespective of that point it’s wild that you can have this many responses while still completely bypassing the entire point I was making.

I don’t think most would dispute that performance increases with experience. The point is how well performance can be maintained when there is little or no exposure, because that implies principled reasoning rather than simple pattern mapping. That is the through line behind my comments regarding context-dependent language, novel driving scenarios, etc.

>Think on that

In the context of the above, I don’t think this is nearly as strong a point as you seem to think it is. There’s nothing novel about a text-based discussion.


1. We anchored this discussion on arithmetic so I stuck to that. If a child never learns (no exposure) how to do base 16 arithmetic, then a test quizzing them on base 16 arithmetic will result in zero performance.

If that child had the basic teaching most children do (little exposure), then a quiz will result in much worse performance than an equivalent base 10 test. This is very simple. I don't know what else to tell you here.

2. You must understand that a human driver who stops because a kite suddenly comes across the road doesn't do so because of any kite>child>must not hurt reasoning. Your brain doesn't even process information that quickly. The human driver stops (or perhaps doesn't) and then rationalizes a reason for the decision after the fact. Humans are very good at doing this sort of thing. Except that this rationalization might not have anything at all to do with what you believe to be "truth". Just because you think or believe it is so doesn't actually mean it is so. Choices shape preferences just as much as the inverse. For all anyone knows, and indeed most likely, "child" didn't even enter the equation until well after the fact.

Now if you're asking whether LLMs as a matter of principle can infer/grok these sorts of causal relationships between different "objects", then yes, as far as anyone is able to test.


Your first statement seems to contradict your previous. Did you originally mistype what you meant when you said greater exposure leads to worse outcomes? Because now you’re implying more exposure has the opposite effect.

Regardless, it still misses the point. I’ve never been explicitly exposed to base-72, yet I can reason my way through it. I would argue my performance wouldn’t be any different than base-82. So I can transfer basic principles. What the LLM result you referenced shows is that it is not learning basic principles. It sure seems like you just read the abstract and ran with it.

As far as the psychology of decision making goes, again, I think you're speaking with greater confidence than is warranted. In time-critical examples, I’m inclined to agree, and there are certainly some notable psychologists who would expand that beyond snap judgments. But there are also some notable psychologists who tend to disagree. It’s not a settled science, despite your confidence. But again, that’s getting stuck in the limitations of the example and missing the forest for the trees. The point is not whether decisions are made consciously or subconsciously, but rather how learning can be inferred from previous experience and transferred to novel experiences. Whether this happens consciously or not is beside the point. And you are going further down what I was explicitly arguing against: confusing image/pattern recognition for contextual reasoning. You can see this in the relatively recent Go issue; any human could see what the issue was because they understand the contextual reasoning of the game but the AI could not and was fooled by a novel strategy. The points I’ve been making have completely flown over your head to the point where you’re shoehorning in a completely different conversation.


>Did you originally mistype what you meant when you said greater exposure leads to worse outcomes? Because now you’re implying more exposure has the opposite effect.

I guess so. I've never meant to imply greater exposure leads to worse outcomes.

>I would argue my performance wouldn’t be any different than base-82.

Even if that were true, and I don't know that I agree, the authors of that paper make no attempt to test in circumstances that might make this true for LLMs as it might for people. So the paper is not evidence for the claim (no basic principles) either way. For example, I reckon your performance on the subsequent base-82 test would be better if taken immediately after than if taken weeks or months later. So surrounding context is important even if you're right.

>What the LLM result you referenced shows is that it is not learning basic principles.

I disagree here, and I've explained why.

>You can see this in the relatively recent Go issue; any human could see what the issue was because they understand the contextual reasoning of the game but the AI could not and was fooled by a novel strategy.

You're talking about this? https://www.zmescience.com/future/a-human-just-defeated-an-a...

KataGo taught itself to play Go by explicitly deprioritizing “losing” strategies. This means it didn’t play many amateur strategies, because they lost early in training. This is hard for a human to understand because humans all generally share a learning curve going from beginner to amateur to expert, so all humans have more experience with “losing” techniques. Basically what I’m saying is, it might be that the training scheme of this AI explicitly prioritized having little understanding of these specific tactics, which is different from not having any understanding.

This circles back to the point I made earlier. Having failure modes humans don't have or won't understand is not the same as a lack of "true understanding".

We have no clue what "basic principles" actually are at a low level. The less inductive bias we try to shoehorn into models, the better they perform. Models literally tend to perform worse the more we try to bake "basic principles" in. So the presence of an odd failure mode that we *think* reveals a lack of "basic principles" is not necessarily evidence of such a lack.

>The points I’ve been making have completely flown over your head to the point where you’re shoehorning in a completely different conversation.

You're convinced it's just "very good pattern matching", whatever that means. I disagree.


I think the short of it is that it seems to me that you are confusing a system having very good heuristics for having a solid understanding of principles of reality. Heuristics, without an understanding of principles, are what I mean by rote pattern matching. But heuristics break down, particularly in edge cases. Yes, humans also rely heavily on heuristics because we generally seek to use the least effort possible. But we can also mitigate those shortcomings by reasoning about basic principles. This shortcoming is why I think both humans and AI can make seemingly stupid mistakes. The difference is, I don't think you've provided evidence that AI can have a principled understanding while we can show that humans can. Having a principled understanding is important to move from simple "cause-effect" relationships to understanding "why". This is important because the "why" can transfer to many unrelated domains or novel scenarios.

E.g., racism/sexism/...most -isms appear to be general heuristics that help us make quick judgements. But we can also correct our decision-making process by reverting to basic principles, like the idea that humans have equal moral worth regardless of skin tone or gender. AI can even mimic these mitigations, but you haven't convinced me that it can fundamentally change away from its training set based on an understanding of basic principles.

As for the Go example, a novice would be able to identify that somebody is drawing a circle around its pieces; your link even states this. But your recharacterizing this as a specific strategy is weird when that strategy causes you to lose the game. It misses the entire meaning of strategy. We see the limitations of AI in its reliance on training data, from autonomous vehicles to healthcare. They range from the serious (cancer detection) to the humorous (Marines fooling robots by hiding in boxes like in Metal Gear). The paper you referenced similarly shows it is reliant on proximity to the training set, rather than actually understanding the underlying principles.


>Did you read the paper? The authors admit it is only narrowly learning and cannot transfer its knowledge to unknown areas. From the article: "we do not expect our language model to generate proteins that belong to a completely different distribution or domain"

Good thing they don't make sweeping declarations or say that this means narrow learning without transfer. Jumping the shark yet again.

https://www.pnas.org/doi/full/10.1073/pnas.2016239118

>We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect.

From the sequences of the proteins alone, language models learn underlying properties that transfer to a wide variety of use cases. So yes, they understand proteins by any definition that has any meaning.


Wrong comment to respond to; if you can’t wait to reply, that might indicate it’s time to take a step back.

>Good thing they don't make sweeping declarations or say anything about that meaning narrow learning without transfer.

That's exactly what that previous quote means. Did you read the methodology? They train on a universal training set and then have to tune it using a closely related training set for it to work. In other words, the first step is not good enough to be transferable and needs to be fine-tuned. In that context, the quote implies the fine-tuning pushes the model away from a generalizable one into a narrow model that no longer works outside that specific application. Apropos of this entire discussion, it means it doesn't perform well in novel domains. If it could truly "understand proteins in any definition", it wouldn't need to be retrained for each application. The word you used ('any') literally means "without specification"; the model needs to be specifically tuned to the protein family of interest.

You are quoting an entirely different publication in your response. You should use the paper from which I quoted to refute my statement; otherwise this is the definition of cherry picking. Can you explain why the two studies came to different conclusions? It sure seems like you're not reading the work to learn and instead just grasping at straws to be "right." I have zero interest in having a conversation where someone jumps from one abstract to another just to argue rather than adding anything of substance.


>I think the short of it is that it seems to me that you are confusing a system having very good heuristics for having a solid understanding of principles of reality.

Humans don’t have a grasp of the “principles of reasoning” and as such are incapable of distinguishing “true”, "different" or “heuristic”, assuming such a distinction is even meaningful. Where you are convinced of a “faulty shortcut”, I simply think “different”. There are multiple ways to skin a cat. A plane's flight is as "true" as any bird's. There's no "faulty shortcut" even when it fails in ways a bird will not.

You say humans are "true" and LLMs are not but you base it on factors that can be probed in humans as well so to me, your argument simply falls apart. This is where our divide stems from.

>I don't think you've provided evidence that AI can have a principled understanding while we can show that humans can.

What would be evidence to you? Let’s leave conjecture and assumptions. What evaluations exist that demonstrate this “principled understanding” in humans? And how would we create an equitable test for LLMs?

>a novice would be able to identify that somebody is drawing a circle around its pieces; your link even states this. But your recharacterizing this as a specific strategy is weird when that strategy causes you to lose the game.

You misunderstand. I did not characterize this as a specific “strategy”. Not only do modern Go systems not learn like humans, but they also don’t learn from human data at all. KataGo didn’t create a heuristic to play like a human because it didn’t even see humans play.

>The paper you referenced similarly shows it is reliant on proximity to the training set, rather than actually understanding the underlying principles.

Even the authors make it clear this isn’t necessarily the conclusion to draw, so it’s odd to see you die on this hill.

The counterfactual for syntax is finding the main subject and verb of something like “Think are the best LMs they.” in verb-obj-subj order (they, think) instead of “They think LMs are the best.” in subj-verb-obj order (they, think). LLMs are not being trained on text like the former to any significant degree, if at all, yet the performance is fairly close. So what, it doesn’t grasp the “underlying principles of syntax” but still manages that?
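To spell the transform out with that example (my toy reconstruction, not the paper's code): reorder constituents from subj-verb-obj to verb-obj-subj, embedded clause included, and the target answer stays the same.

    # Toy reconstruction of the counterfactual, not the paper's code.
    def to_vos(subject, verb, obj):
        return " ".join([verb, obj, subject])

    embedded = to_vos("LMs", "are", "the best")      # "are the best LMs"
    sentence = to_vos("they", "think", embedded)     # "think are the best LMs they"
    print(sentence[0].upper() + sentence[1:] + ".")  # "Think are the best LMs they."
    # Task: name the main subject ("they") and main verb ("think"); the answer is unchanged.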

The problem is that you take a fairly reasonable conclusion from these experiments, i.e. that LLMs can and often do rely on narrow, non-transferable procedures for task-solving, and proceed to jump the shark from there.

>but you haven't convinced me that it can fundamentally change away from its training set based on an understanding of basic principles.

We see language models create novel, functioning protein structures after training, no folding necessary.

https://www.researchgate.net/publication/367453911_Large_lan... So does it still not understand the “basic principles of protein structures”?


Did you read the paper? The authors admit it is only narrowly learning and cannot transfer its knowledge to unknown areas. From the article:

"we do not expect our language model to generate proteins that belong to a completely different distribution or domain"

So, no, I do not think it displays a fundamental understanding.

>What would be evidence to you?

We've already discussed this ad nauseam. Like all science, there is no definitive answer. However, when the data shows that something like proximity to training data is predictive of performance, that seems more like evidence of learning heuristics than of underlying principles.

Now, I'm open to the idea that humans just have a deeper level of heuristics rather than principled understanding. If that's the case, it's just a difference of degree rather than type. But I don't think that's a fruitful discussion because it may not be testable/provable so I would classify it as philosophy more than anything else and certainly not worthy of the confidence that you're speaking with.


Exactly. Current LLMs fall over when facing counterfactuals: https://arxiv.org/abs/2307.02477.

This is why it's mostly meaningless for an LLM to pass the bar, but not meaningless for a human to do so. We (rightly, for the most part) assume that a human who passes the bar can transfer those skills to unique and novel situations. We can't make that assumption for LLMs, because they lack the adaptability needed for true intelligence.


That doesn't show that they “fall over”. All the degraded performance is highly non-trivial. And even the paper admits humans would see degraded performance on counterfactuals as well. The authors think humans might not, but only with “enough time to reason and revise”, something the LLMs being evaluated don't get to do here.

If you took arithmetic tests in base 8, you wouldn't reach the same accuracy either.


Well, sure, but the problem is that LLMs can’t reason and revise, architecturally. Perhaps we can chain together a system that approximates this, but it still wouldn’t be the LLM doing the reasoning itself.
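Something like the sketch below is what I mean by chaining a system together (llm() is just a stand-in for whatever model call you have, not a real API); the revising lives in the outer loop, not in the model.

    # Sketch only; llm() is a placeholder, not any real API.
    def llm(prompt):
        raise NotImplementedError("plug in an actual model call here")

    def reason_and_revise(question, rounds=3):
        answer = llm("Answer step by step: " + question)
        for _ in range(rounds):
            critique = llm("Point out any mistakes in this answer:\n" + answer)
            answer = llm("Question: " + question +
                         "\nPrevious answer: " + answer +
                         "\nCritique: " + critique +
                         "\nWrite a revised answer.")
        return answer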



