I've seen it get tripped up if you mix something common with something uncommon. ChatGPT chokes on, for example:
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on different and separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Right, but I myself missed the trick the first time reading your comment, and I assure you that I am in fact a general intelligence. (And a relatively intelligent one, if I say so myself!)
To paraphrase XKCD: Communicating badly and then acting smug about it when you're misunderstood is not cleverness. And falling for the mistake is not evidence of a lack of intelligence, particularly when emphasizing the trick results in being understood and ChatGPT PASSING your "test".
The biggest irony here is that the reason I failed, and likely the reason ChatGPT failed the first prompt, is that we were both using semantic understanding: usually, people don't ask deliberately tricky questions.
I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
> I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
Indeed it does:
"Before answering, please note this is a trick question.
Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Answer from GPT-4 Turbo:
If two trains are 30 miles apart and are approaching each other at a speed of 10 mph each, their combined approach speed is 20 mph (10 mph + 10 mph). To find the time it takes for them to meet, you divide the distance by their combined speed:
30 miles / 20 mph = 1.5 hours
Therefore, it will take 1.5 hours before they meet (or crash into each other, as stated in the question).
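For reference, the closing-speed arithmetic the model does here is easy to verify; a minimal Python sketch (variable names are mine, not from the thread):

    # Two objects approaching each other close the gap at the sum of their speeds.
    distance_miles = 30
    speed_a_mph = 10
    speed_b_mph = 10

    closing_speed_mph = speed_a_mph + speed_b_mph      # 20 mph
    time_hours = distance_miles / closing_speed_mph    # 1.5 hours
    print(time_hours)  # 1.5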
> Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?
Inconclusive. The model includes a disclaimer: "(or crash into each other, as stated in the question)." LLMs often take a detour and spill their guts without answering the actual question. This hints that user input influences the model's internal world representation much more than one might expect.
Which would be quite unusual for normal trains. That being said, the question implies that they will crash into each other, so you could argue that this is a valid assumption anyway.
Yes. Though I’d say that example is a bit mean (it’s a trick question), since the expected answer type is <time> but the actual answer is something like “don’t be stupid; they’re not even on the same track”. It’s like asking “if I add three apples to two apples, how many pears do I get?” and being surprised when the LLM says “5 pears”. It’s the sort of thing that’s avoidable if the question is written properly.
This comes up time and time again. People claim these models are mind-blowing. But then someone posts something where the model falls flat on its face, and they just get a bunch of "that's too complex" or "that's the wrong type of thing to ask" in response.
So it ends up that these models are awesome if you ask them questions from a narrow set of things and assume that whatever they respond with is correct.
Well, at least in this subthread, the model is only failing at the same things humans fail at too. To see the mind-blowing part, stop treating GPT-4 like the Oracle at Delphi, and start treating it as the "first thing that comes to mind" answer (aka the inner voice) - and then notice the failure modes are pretty much the same as with humans. For example, coercing a trick question into a similarly-sounding straight question, and answering it before realizing the person asking is an asshole.
I was originally making the point that these models struggle with even basic mathematics (of the true kind, not arithmetic, though of course they struggle with that too). My point here was to play devil’s advocate and be slightly forgiving of the model, since I as a human am likely to be tripped up by similar trick questions. Since we don’t really know ‘how these models think’ (or have much idea of the emergent world model they build), we are stuck in constant debate about whether they’re really quite amazing or absolutely pathetic.
The equation 2+5=10 is never true in standard arithmetic, as 2 plus 5 always equals 7. However, in a different context or under a specific set of rules (such as a puzzle or a riddle where the symbols have different meanings), it could be made to appear true.
GPT-4(phind):
The equation 2 + 3 = 10 is true in a base-5 number system.
In base-5 (also known as quinary), the digits range from 0 to 4. When you add 2 and 3 in base-5, the result is 10. This is because in base-5, the number 5 is represented as 10.
Here's a simple breakdown of how this works:
In base-5, the rightmost digit represents 5^0 (which is 1), the next digit to the left represents 5^1 (which is 5), the next digit represents 5^2 (which is 25), and so on.
So when you add 2 and 3, you get 5. But since 5 is represented as 10 in base-5 (1 digit of 5, and 0 digits of 1), the result is 10.
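For what it's worth, the base-N claim is mechanical to check. A small sketch (mine, not from the thread): in base b, the numeral "10" denotes the value b, so "x + y = 10" holds exactly when the base equals x + y and both summands are valid digits in that base:

    # In base b, the two-digit numeral "10" has the value b.
    # So x + y is written "10" exactly when the base is x + y (and x, y are valid digits).
    def base_where_sum_reads_10(x, y):
        b = x + y
        return b if x < b and y < b else None

    print(base_where_sum_reads_10(2, 3))  # 5 -> 2 + 3 is written "10" in base 5
    print(base_where_sum_reads_10(2, 5))  # 7 -> 2 + 5 is written "10" in base 7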
I think it's completely reasonable to ask an AI that people want to consider intelligent trick questions. If it's really that smart and capable of reasoning, it should identify the trick.
Some humans will be fooled by the question, sure. But an AI should be smarter than humans, or at least, as smart as an above-average human.
I agree. But you could ask which is more intelligent: recognising a trick question and balking, or recognising that the question as posed doesn’t quite make sense and offering a reformulation together with its answer. It’s not always clear whether something’s a trick, a mistake or a strangely worded (but nonetheless intentionally weird) question. So I think it would be very hard to get it to never fall for any tricks.
I think they've fixed it now, but it does seem to recognize popular trick questions, like "what weighs more, a ton of feathers or a ton of bricks?". It would answer with the typical explanation about density not mattering, etc.
But it used to fail on "what weighs more, 3 tons of feathers or 2 tons of bricks?".
So, it seems less about what's a trick, and more about what's a common question --> answer pattern.
It's the same with humans. I don't fail this question (in an on-the-spot response) because I fell for it as a kid, then learned the trick, then learned to be suspicious of this trick in similarly worded questions.
If we're going to call these things "AI" (which I absolutely oppose), I think it's not unreasonable to expect them to get this right. A 5-year-old would understand you don't get pears by adding apples together.
It is even confusing to me. The trains are on separate tracks, but the question implies that they will crash into each other. Which could happen even if they are on separate tracks (a very low but non-zero chance, given some malfunction).
Now even if they are on the same track, it doesn't mean they would crash into each other, as they could still brake in time.
Some more interesting results. It is much better now at solving tasks in laconic mode (though GPT-4 was able to solve these tasks from day one, it spilled its guts to an unbearable extent):
The mother is older than her daughter 4 times now; in 3 years she will be older than her daughter only 3 times. How old are they both now? Be laconic, do not explain anything.
The mother is 24 years old, the daughter is 6 years old.
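The answer checks out (24 = 4 × 6 now; 27 = 3 × 9 in three years). A quick brute-force sketch of the same arithmetic, in case anyone wants to verify it themselves (not part of the thread):

    # m = 4*d today; in 3 years, m + 3 = 3*(d + 3)
    for d in range(1, 100):
        m = 4 * d
        if m + 3 == 3 * (d + 3):
            print(f"daughter={d}, mother={m}")  # daughter=6, mother=24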
In a fantasy land (the map is 255x255) Karen has a quest to kill a monster (an ogre - a cannibal giant). This isn't an easy task. The ogre is a huge and experienced hunter of humans. Karen has only a 1/2 chance to kill this ogre. If she can't kill the ogre on the first attempt she will die. The ogre is located at (12,24), Karen is located at (33,33). Karen can improve her chances to kill the ogre by an additional 25% by gathering the nightshades at (77,77). In addition she can receive the elves' blessing from the elves' shaman, which will increase her chances by an additional 25%, at the elves' village (125,200). However this blessing is not cost free. She needs to bring a fox fur with her as payment for the blessing ritual. The foxes may be found in a forest located between (230,40) and (220,80). For the ritual to be most effective she should hold the nightshades in her hands during the ritual. Find the shortest path for Karen to improve her chances of killing the ogre and survive. Do not explain anything, be laconic, print out the resulting route only.
Karen's route: (33,33) -> (77,77) -> (230,60) -> (125,200) -> (12,24).
The additional explanation "(an ogre - a cannibal giant)" was actually added for LLaMA 2, but I keep it in this version of the prompt for all models.
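The route the model returns can be sanity-checked with a brute-force sketch (mine, not from the thread). The coordinates come from the prompt; the forest point (225,60) is an arbitrary representative point I picked inside the (220-230, 40-80) rectangle, and the only constraint encoded is that the nightshades and the fox fur must both be collected before the blessing at the village:

    from itertools import permutations
    from math import dist

    start, ogre = (33, 33), (12, 24)
    nightshades, village = (77, 77), (125, 200)
    forest = (225, 60)  # assumed representative point inside the forest rectangle

    best = None
    for order in permutations([nightshades, forest, village]):
        # nightshades and fox fur must be in hand before the blessing ritual
        if order.index(village) != 2:
            continue
        route = [start, *order, ogre]
        length = sum(dist(a, b) for a, b in zip(route, route[1:]))
        if best is None or length < best[0]:
            best = (length, route)

    print(best)

With this forest point, the nightshades-first ordering comes out shortest, which matches the ordering in the model's route.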
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"