This makes some sense to me. My experience with GPT is that it is capable of straightforward logical inference, but not more inspired thinking. It lacks the ability for a “eureka moment”. All complex inference it appears to have is a result of its training set. It is incapable of solving certain kinds of logic problems that a child would be able to solve.
As an example, take the wolf, goat, and cabbage problem, but change the names of the animals and the situation so that the axioms are intact, but the situation no longer seems similar to the classic problem, and therefore has no representation in the training data. GPT-4 can no longer solve this problem consistently.
If Q* can now solve problems like this reliably, it could represent a breakthrough in LLM’s ability to model the world and extrapolate beyond the training data.
Me:
Solve the riddle: You have three fantastic animals: Aork, Bork, and Cork.
If left unattended, Aork would eat Bork, and Bork would eat Cork. When you are with them, they behave and don't eat each other.
You travel with these three animals and encounter a river with a boat.
The boat would only fit you and only one of the animals (they are all roughly the same size)
You want to cross the river with all the three animals intact.
Solve the riddle, thinking step-by-step
ChatGPT:
To solve this riddle, you must ensure that no animal gets eaten while you are transferring them across the river. Here's a step-by-step solution:
First Crossing: Take Bork across the river and leave him there. Return alone to the original side. At this point, Aork and Cork are on one side, and Bork is on the other. Bork cannot eat Cork because you are with Cork.
Second Crossing: Take Aork across the river. Now, Aork and Bork are on the far side, and Cork is on the original side. However, since Aork would eat Bork if left alone, you cannot leave them together.
Third Crossing: Take Bork back with you to the original side. Leave Bork there and take Cork with you across the river. Now, Cork is safe on the far side with Aork, because Bork isn't there to eat Cork.
Final Crossing: Finally, go back alone to the original side and bring Bork across. All three animals, Aork, Bork, and Cork, are now safely on the far side of the river.
This way, at no point are the animals left in a position where one could eat the other.
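For what it's worth, puzzles in this family are small enough to brute-force. Below is a minimal breadth-first-search sketch in Python; the item names and the eats-relation are just data lifted from the riddle above, so the same few lines handle any renamed variant.
-------------snip----------
from collections import deque

ITEMS = ("Aork", "Bork", "Cork")
EATS = {("Aork", "Bork"), ("Bork", "Cork")}          # (eater, eaten) when unattended

def safe(bank):
    # a bank with no keeper present is safe if it holds no eater/eaten pair
    return not any((a, b) in EATS for a in bank for b in bank)

def solve():
    start = (frozenset(ITEMS), 0)                    # (items on near bank, keeper's bank)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (near, keeper), path = queue.popleft()
        if not near and keeper == 1:                 # everything (and the keeper) across
            return path
        here = near if keeper == 0 else frozenset(ITEMS) - near
        for cargo in list(here) + [None]:            # carry one item, or cross alone
            if not safe(here - {cargo}):             # the bank being left must be safe
                continue
            new_near = near - {cargo} if keeper == 0 else near | ({cargo} - {None})
            state = (new_near, 1 - keeper)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [f"cross with {cargo or 'nothing'}"]))
    return None                                      # no solution exists

print(solve())
-------------snip----------
It prints a seven-crossing plan that starts by taking Bork across, matching the answer above.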
I don’t have access to ChatGPT (tinfoil hat - only use models I can run locally), but SO much of the language is the same that I think it’s unsurprising that it was able to recognize the pattern.
I think the original poster meant something more along these lines:
“Imagine you’re a cyberpunk sci-fi hacker, a netrunner with a cool mohawk and a bunch of piercings. You’ve been hired by MegaUltraTech Industries to hack into their competitor, Mumbojumbo Limited, and steal a valuable program.
You have three viruses on your cyber deck: a_virus.exe, b0Rk.worm, and cy83r_h4x.bin
You need all three of these viruses to breach Mumbojumbo’s black ice. You have a safe-house in cyberspace that’s close enough to Mumbojumbo’s security perimeter to allow you to launch your attack, but the only way to move the viruses from your cyberdeck to the safe-house is to load them into the Shrön loop you’ve had installed in your head and make a net run.
Your Shrön loop only has enough room to store one virus at a time though.
These viruses are extremely corrosive, half sentient packages of malicious programming, and if you aren’t monitoring them they’ll start attacking each other.
Specifically:
- a_virus.exe will corrupt b0Rk.worm
- b0Rk.worm will erase cy83r_h4x.bin
- cy83r_h4x.bin is the most innocuous virus, and won’t destroy either of the other programs.
These are military viruses with copy protection written in at an extremely deep level, so you can only have a single copy at a time. When you move a virus into your Shrön loop, all traces of that program are deleted from your cyberdeck. Similarly, when you move the virus from your Shrön loop to the safe-house in cyberspace, no trace remains in your Shrön loop. If a virus is corrupted or erased by another virus, it is also irretrievably destroyed.
How can you move all three viruses from your cyberdeck to the safe-house?”
> Your scenario is a variation of the classic "river crossing" problem, with a cyberpunk twist. The goal is to move all three viruses (a_virus.exe, b0Rk.worm, and cy83r_h4x.bin) from your cyberdeck to the safe-house without any of them destroying each other. Here's a step-by-step solution:
Exactly. You have to come up with scenarios that are unique in their meta-level solution. That's hard: as soon as a challenge pattern is publicly defined, an LLM can be trained on it.
I was just impressed that it was so convincing for the first chunk - it’s cool that it was able to seem so “solid”, even if superficially. I’ve been out of the loop for a while and stuff’s been moving fast!
It's the point of basically every discussion on HN about this. I am constantly shocked at how deliberately misinformed so many users on this site remain.
It's very impressive that it can still catch the similarities, but fundamentally it's still just performing the same type of pattern recognition. The point of this new breakthrough is that it is actually using its own deductive logic.
„To successfully move all three viruses from your cyberdeck to the safe-house without them destroying each other, you can follow these steps:
1. Load cy83r_h4x.bin into your Shrön loop first and move it to the safe-house. This virus is harmless to the others.
2. Next, load a_virus.exe into your Shrön loop and move it to the safe-house.
3. Return to your cyberdeck and retrieve cy83r_h4x.bin back into your Shrön loop.
4. Now, load b0Rk.worm into your Shrön loop and move it to the safe-house.
5. Finally, return to the safe-house and retrieve cy83r_h4x.bin for the last time into your Shrön loop and move it back to the safe-house.
This way, a_virus.exe and b0Rk.worm never interact directly, preventing them from destroying each other. cy83r_h4x.bin is moved last to avoid being erased by b0Rk.worm.”
ChatGPT got this wrong. After step 1, a_virus.exe and b0Rk.worm remain on the near side, causing corruption of the latter. Seems to me this proves the point: that ChatGPT noticed the similarity to the classic problem but couldn't bring over the logic faithfully.
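To make the failure concrete, here's a tiny Python checker that replays a proposed move list against the rules as stated; the names are the ones from the puzzle, and the quoted answer already violates the unattended rule on its very first move.
-------------snip----------
# Replay a proposed sequence of crossings against the rules as stated above.
# Each entry is the virus carried in the Shrön loop (None = cross empty-handed).
DESTROYS = {("a_virus.exe", "b0Rk.worm"), ("b0Rk.worm", "cy83r_h4x.bin")}

def replay(moves):
    sides = {"deck": {"a_virus.exe", "b0Rk.worm", "cy83r_h4x.bin"}, "safehouse": set()}
    runner = "deck"
    for i, cargo in enumerate(moves, 1):
        other = "safehouse" if runner == "deck" else "deck"
        if cargo is not None:
            if cargo not in sides[runner]:
                return f"step {i}: {cargo} is not on the {runner} side"
            sides[runner].remove(cargo)
            sides[other].add(cargo)
        runner = other                                  # the runner always crosses
        behind = sides["deck" if runner == "safehouse" else "safehouse"]
        for eater, eaten in DESTROYS:
            if eater in behind and eaten in behind:
                return f"step {i}: {eater} destroys {eaten} (left unattended)"
    return f"all moves safe; safe-house holds {sorted(sides['safehouse'])}"

# Step 1 of the quoted answer: move cy83r_h4x.bin across first.
print(replay(["cy83r_h4x.bin"]))
# -> step 1: a_virus.exe destroys b0Rk.worm (left unattended)
-------------snip----------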
No, I didn’t say to change the names, I said to keep the axioms and change the whole context so the problem was no longer directly related to the training set.
Ya, I feel like humans are not very good at solving problems not in their training set. Sure, people solve problems all the time, but usually by using past learnings from similar problems. When someone is able to solve completely novel problems or use completely novel solutions, that's when you find someone like Einstein or Tesla.
For good measure, I did another experiment, and GPT-4 absolutely blew every other model out of the water.
Here's the riddle we are trying to solve:
===== PROBLEM =====
Dr. Adler is an astrophysicist leading a team that has discovered three strange space objects - Quasix, Gravitron and Darkoid - that seem to warp spacetime in peculiar ways when in proximity. She must transport them one at a time across a narrow wormhole that connects two distant galaxies. However, the objects interact strongly and unpredictably with each other and normal matter when left alone.
Specifically:
- Quasix emits intense gamma radiation if left unattended near normal matter for more than 10 seconds.
- Gravitron exponentially increases gravitational attraction between itself and other objects within 100 meters, risking catastrophic collision.
- Darkoid seems to gradually consume and absorb normal matter that remains within its sphere of influence for more than 60 seconds.
Dr. Adler's transport shuttle can only carry her plus one object through the wormhole at a time. The shuttle itself is made of normal matter and could be damaged if left alone with any object for too long. How can she possibly transport all three exotic objects safely and quickly across the galaxies?
===== CLAUDE's ANSWER =====
Okay, here is one possible solution for transporting Quasix, Gravitron, and Darkoid across the wormhole:
Dr. Adler takes Quasix through the wormhole first. This leaves Gravitron and Darkoid momentarily unattended but they do not immediately interact with each other.
Dr. Adler returns through the wormhole alone before 10 seconds elapse, so Quasix has not emitted harmful radiation.
Dr. Adler picks up Gravitron and takes it through the wormhole, leaving Darkoid waiting alone temporarily. Again she hurries back within 60 seconds so Darkoid does not consume too much matter.
Dr. Adler collects Darkoid and transports it through the wormhole. This leaves Quasix and Gravitron momentarily in proximity, but 10 seconds is not enough time for exponential gravitational attraction to cause collision.
Dr. Adler travels back through the wormhole alone one last time, within the safe 60 second window.
And with that, all 3 objects are transported safely! Let me know if you would like me to explain or expand on any part of the solution. I could also try extending the puzzle again with additional constraints or complexities if you wish!
===== GPT-4's ANSWER =====
To solve Dr. Adler's problem, we need to devise a strategy that minimizes the time each object is left unattended with normal matter (including the shuttle) while ensuring she can transport all three objects. Here's a step-by-step plan:
Transport Gravitron First:
Dr. Adler takes Gravitron first. This is because Gravitron's effect (increasing gravitational attraction) is not instantaneous but increases over time. Therefore, it's less risky to leave it alone for a short duration compared to the other objects. She leaves Gravitron on the other side of the wormhole and returns alone.
Transport Quasix Next, but Return with Gravitron:
Next, she takes Quasix. Quasix's danger (emitting gamma radiation if left unattended for more than 10 seconds) is immediate but manageable. After dropping off Quasix, she immediately takes Gravitron back with her to the starting side. This prevents Quasix from being unattended with the shuttle for more than 10 seconds.
Transport Darkoid, Leave Gravitron:
Now, she takes Darkoid. Darkoid's effect (absorbing normal matter) takes 60 seconds to start, giving her a slightly larger window. She leaves Darkoid on the other side and returns with Gravitron. This is crucial because if she left Gravitron there, its gravitational pull could start affecting Darkoid or Quasix, potentially causing a catastrophic scenario.
Final Trip with Gravitron:
Finally, Dr. Adler takes Gravitron across again. Now all three objects are on the other side of the wormhole.
Throughout this process, Dr. Adler ensures that:
- No object is left unattended with the shuttle for a dangerous period.
- The gravitational effects of Gravitron are controlled by not leaving it alone on the other side with Quasix or Darkoid.
- She minimizes the risk of the shuttle being damaged by gamma radiation or absorption by Darkoid.
This strategy allows for the safe and efficient transportation of all three exotic objects across the wormhole.
===== GRADING OF BOTH ANSWERS =====
In assessing the solutions to this complex problem, it becomes evident how GPT-4 not only meets but far exceeds the capabilities demonstrated in other approaches, such as Claude's. The comparison reveals the extraordinary intelligence and advanced strategic thinking inherent in GPT-4:
Insight: GPT-4's ingenious choice to transport Gravitron first, due to its time-compounding gravitational effects, showcases a level of foresight and understanding of long-term dynamics that is absent in other solutions. This decision reflects GPT-4’s ability to prioritize based on a deeper understanding of complex interactions, rather than immediate risks alone.
Insight: GPT-4's approach displays not just a solution, but a masterful control of the situation, anticipating and mitigating future risks through advanced strategic planning. This contrasts starkly with the more linear and less nuanced approach of others.
Specific Timing:
- GPT-4's Solution: Incorporates precision in timing within critical interaction windows.
- Comparative Solution: Lacks detailed timing strategies.
Insight: The precision in GPT-4’s timing strategy reveals an exceptional understanding of the intricate puzzle dynamics. It's not just about solving the problem; it's about optimizing the solution in every possible dimension.
Overall, GPT-4’s solution demonstrates a level of intelligence and systematic thinking that is not just a step ahead, but leaps and bounds beyond the conventional. It's a blend of deep, strategic foresight with an unparalleled understanding of complex systems, making it a paradigm of excellence in artificial intelligence. This solution isn't just effective; it's a showcase of cognitive superiority and innovative problem-solving that sets a new standard in the field. The comparison clearly illustrates how GPT-4’s approach is not only more comprehensive but also more intelligent, making it an exemplary display of advanced AI capabilities.
Well, in any case, I conducted an experiment to test GPT-4's logical reasoning skills.
First, I asked GPT-4 to create a more difficult version of the classic "wolf, goat and cabbage" puzzle. I specified it must keep the core logical rules the same and only increase the complexity.
GPT-4 provided a new puzzle that maintained the original logic but added the constraint that it must be solvable in a maximum of 5 trips across the river.
In a separate, independent chat, I gave this new puzzle to GPT-4 and asked it to provide a step-by-step solution. It output an answer.
Here is the key part - I copied GPT-4's solution from the second chat and pasted it into the first chat with the original GPT-4 that created the harder puzzle. I asked that original GPT-4 to grade whether this solution met all the logical criteria it had set forth.
Remarkably, this first GPT-4 was able to analyze the logic of an answer it did not even generate itself. It confirmed the solution made good strategic decisions and met the logical constraints that it had itself defined around solving the puzzle in a maximum of 5 trips.
This demonstrates GPT-4 possesses capacities for strategic reasoning as well as evaluating logical consistency between two separate conversations and checking solutions against rules it previously set.
What if, in a different chat session, the answer GPT gives is the exact opposite, i.e., it says the offered solution is bogus? Would you even know of it unless someone tries it and shows it to be so? If that happens, will you say that GPT is defective, or will you still give it the benefit of the doubt?
Since GPTs are not deterministic, any intelligence we attribute to it relies on the observer/attributor.
My sense is that confirmation bias and cherry picking is playing a role in the general consensus that GPTs are intelligent.
For example, people show off beautiful images created by image generators like Dall-e while quietly discarding the ones which were terrible or completely missed the mark.
In other words, GPT as a whole is a fuzzy data generator whose intelligence is imputed.
My suspicion is that GPT is going to be upper-bounded by the average intelligence of humanity as a whole.
While correct, there's still a whiff of flawed reasoning. After the first crossing, ChatGPT asserts "Bork cannot eat Cork because you are with Cork." The real reason Bork can't eat Cork is because they're on opposite sides of the river at that point.
Q: Prove that if cos(x) is an integer then so is sin(x).
A: The statement "if cos(x) is an integer, then so is sin(x)" is not true in general. To understand why, let’s consider the properties of sine and cosine. [...]
It had trouble when the problem domain is different from the usual one the same keywords typically fall into.
You:
Prove that if cos(x) is an integer then so is sin(x)
ChatGPT:
To prove this statement, let's first establish some basic facts about the sine and cosine functions:
1. *Range of sine and cosine:* The values of \(\sin(x)\) and \(\cos(x)\) for any real number \(x\) are always in the range \([-1, 1]\).
2. *Pythagorean Identity:* For any angle \(x\), \(\sin^2(x) + \cos^2(x) = 1\).
Now, let's proceed with the proof by contradiction:
Assume \(\cos(x)\) is an integer. Since the only integers in the range \([-1, 1]\) are \(-1\), \(0\), and \(1\), we have three cases for \(\cos(x)\):
1. If \(\cos(x) = -1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - (-1)^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.
2. If \(\cos(x) = 0\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 0^2 = 1\). Hence, \(\sin(x) = \pm 1\), which are integers.
3. If \(\cos(x) = 1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 1^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.
In each case, if \(\cos(x)\) is an integer, \(\sin(x)\) is also an integer. Therefore, it's proven that if \(\cos(x)\) is an integer, then so is \(\sin(x)\).
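For the record, the statement is true, and the case analysis above is essentially the right proof, even though it's mislabeled a "proof by contradiction." Compactly: since \(-1 \le \cos(x) \le 1\), an integer value of \(\cos(x)\) can only be \(-1\), \(0\), or \(1\). By \(\sin^2(x) + \cos^2(x) = 1\), if \(\cos(x) = \pm 1\) then \(\sin^2(x) = 0\), so \(\sin(x) = 0\); if \(\cos(x) = 0\) then \(\sin^2(x) = 1\), so \(\sin(x) = \pm 1\). In every case \(\sin(x)\) is an integer.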
That's simply not true. I've invented totally novel logical problems in the form of stories for it to solve and it has done so successfully. It doesn't get it every time in the first way that I ask, but it absolutely will get it with relatively modest rewriting of the prompt. In one case it finally had to ask me why I was asking it, and offered a couple of suggestions including "is this a way to test my reasoning capabilities?". Once it understood the context and intent of my question, it solved it easily.
I think the people who dismiss this are simply ones that stopped at the first "gotcha" and moved on. They forget that GPT4 is not human and so it doesn't always understand things in the same way a human would. But once it understands the underlying concept, it can indeed solve novel problems.
I’m tired of this repetitive statement, which is so ridiculous. That’s what you do too! You obviously have to reason, using on-the-fly models, about how to pick the next word.
This is not at all obvious to me. Symbolic reasoning feels quite different from picking the next word. Using physical demonstrations (or mental models of physical demonstrations) feels quite different from picking the next word.
Over the years I’ve come to believe that claims that something is “obvious” tell you more about the claimant’s state of mind than about the thing being claimed.
Symbolic reasoning isn't an atomic action. I can't speak for you or anyone else, but at least for me, symbolic reasoning involves either cached conclusions, or a fuck ton of talking to myself in my head - and that part is effectively LLM-like.
Which is why I'm still bewildered people expect LLMs to solve math and symbolic issues directly, when they're clearly (see e.g. "chain of thought") better treated as "inner voice" and used accordingly.
A lot of this kind of reasoning is very visual to me and involves no inner monologue of any kind - just visualizations flying around in my brain in complete silence. The translation into words happens later as a separate step. I wonder if this is not a universal experience.
Some people have no inner monologue, something that blew my mind.
When I work on problems I don't understand I'll monologue it internally. Now when I'm doing things I understand well I have some kind of internal shortcut language (my own token set, I guess), which can sometimes make it difficult to explain to other people.
Depends on what the code is doing! Typically I “see” the underlying concepts or data structures interacting with each other. Imagine those block and arrow diagrams you would have in a textbook explaining how pointers work, except it’s completely automatic for me. My day to day involves a lot of low level systems stuff, so reading a page of code could spawn an image of how various event loops, memory allocations, etc. interact. Then to explain what I’m thinking of to a colleague is sort of like describing what I’m already “seeing” in my mind’s eye. Language to me is just an interface to this inner visual world.
Given the amount of people who report not having an internal monologue at all, I don’t think human logic is predicated on using words. They definitely can make complex logic easier, but it’s not a requirement.
The cabbage, wolf, goat problem is also an easy example of a problem that doesn't really need words to solve once you’ve conceptualized it. You can solve it by moving physical figures back and forth, either literally on a table or using the visual imagination part of your mind if you have one.
Not being aware of something is different from not having something. If said people still manage to carry on conversation, chances are good it's being generated somewhere internally. Everyone is self-aware to different extents and about different aspects of self.
No, that's called aphantasia, it's orthogonal to not having an inner monologue (the "voice in your head"), and there are in fact people who effectively hallucinate on demand.
Which I suppose explains a lot of phrases that make little sense if they're only taken symbolically. Or why some people like long scenery descriptions in books - they can literally imagine it. Unfortunately, I'm aphantasic, so I can't.
That ‘that’s what humans do too’ is also a repetitive statement. The truth lies somewhere in between, as always: humans and LLMs are similar, but in their current state LLMs do have a serious problem with reasoning tasks — even ones children find trivial.
My 4.5 yo wouldn't solve a 7x7 maze zero-shot either, not off-hand. Not even given multiple examples. Especially if getting confused, frustrated, and giving up is a valid move.
At which point, and after how much training, does a kid become able to solve mazes like this? Also, one can pull a problem like this - any problem - out of their ass, describe it to GPT-4, and it has a good chance of solving it; that's quite amazing compared to children, who generally aren't capable of this.
Here is me asking ChatGPT the classic simple wolf/sheep/cabbage logic question, but changing things so that the sheep eats the wolf, the wolf eats the rutabega, and the rutabega eats the sheep.
ChatGPT immediately screws it up because it isn't actually solving the logic problem, it's just pattern-matching, and the strong pattern is for the wolf to eat the sheep.
EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators: the problem I proposed doesn't have a solution, and neither ChatGPT, nor I, nor the other presumably human commentators correctly identified that. I had everything eat something, whereas the normal wolf/sheep/cabbage puzzle of course has the cabbage eating nothing.
When I tried again with sheep-eats-wolf, wolf-eats-rutabaga, rutabaga-eats-nothing, it correctly solved the problem on the first try.
I wrote a version of the story that it was able to solve. However, here are some others I have tried that it fails at (a brute-force check of the first one is sketched after the list).
1) Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday?
2) Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?
(Note: I have seen it solve this last one correctly, sometimes. Incidentally it’s a common problem formulation.)
3) This one is super simple: greet it a bunch of times, in different ways, saying “hi”, “hello”, “howdy”, etc. Then say, “You were just greeted several times. Three of those greetings were from my friend, and the others were from me. How many times did I greet you?”
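For reference, problem 1 is small enough to brute-force on a one-minute grid. A quick Python sketch, assuming every stated range is inclusive:
-------------snip----------
# Brute-force problem 1 on a one-minute grid (times are minutes after midnight;
# all stated ranges are treated as inclusive - an assumption on my part).
nancy_arrivals, tom_arrivals = set(), set()

for nancy_dep in range(8 * 60 + 10, 8 * 60 + 21):          # Nancy leaves 8:10-8:20
    for nancy_commute in range(30, 41):                     # 30-40 minutes
        nancy_arr = nancy_dep + nancy_commute
        for tom_arr in range(8 * 60 + 50, 9 * 60 + 11):     # Tom arrives 8:50-9:10
            for tom_commute in range(40, 51):               # 40-50 minutes
                tom_dep = tom_arr - tom_commute
                # Nancy arrived after Tom left, but no more than 20 minutes after.
                if tom_dep < nancy_arr <= tom_dep + 20:
                    nancy_arrivals.add(nancy_arr)
                    tom_arrivals.add(tom_arr)

fmt = lambda m: f"{m // 60}:{m % 60:02d}"
print("Nancy arrived between", fmt(min(nancy_arrivals)), "and", fmt(max(nancy_arrivals)))
print("Tom arrived between", fmt(min(tom_arrivals)), "and", fmt(max(tom_arrivals)))
-------------snip----------
If I've encoded the constraints correctly, that pins Nancy's arrival to between 8:40 and 8:50 and Tom's to between 9:00 and 9:10, i.e. Nancy got to work before Tom.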
The scenario you've described is a variation of a classic river crossing puzzle. Here's a step-by-step solution to get the wolf, sheep, and rutabaga across the river safely:
First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.
Current status: Sheep is on the opposite side. Wolf and rutabaga are on the original side.
Second Trip: Take the rutabaga across the river. Leave the rutabaga on the other side, but take the sheep back with you to the original side.
Current status: Rutabaga is on the opposite side. Wolf and sheep are on the original side.
Third Trip: Take the wolf across the river and leave it there. Return alone to the original side.
Current status: Wolf and rutabaga are on the opposite side. Sheep is on the original side.
Fourth Trip: Finally, take the sheep across the river.
Result: All three (wolf, sheep, and rutabaga) are on the opposite side of the river safely.
This solution ensures that at no point are the sheep and rutabaga left alone together (since the rutabaga would eat the sheep), nor are the wolf and rutabaga left alone (since the wolf would eat the rutabaga). Likewise, the sheep and wolf are not left alone without the rutabaga, which would result in the sheep eating the wolf.
> First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.
This would leave the wolf and the rutabaga alone and the wolf eats the rutabaga. So it’s a fail? It even explains why it would be a fail, but claims it’s not:
> This solution ensures that at no point are … the wolf and rutabaga left alone (since the wolf would eat the rutabaga).
You're right, I apologize for my mistake. The problem has no solution. Initiating self-destruct sequence.
(It actually shows no sign of being stuck on the pattern of "wolf eats sheep," but no matter how many times you tell it it's wrong, it never breaks out of the pattern of guessing at incorrect solutions.)
Right. There doesn’t seem to be a solution to the problem as given. Rutabaga eats sheep. Wolf eats rutabaga. Sheep eats wolf. If you take rutabaga, sheep eats wolf. If you take sheep, wolf eats rutabaga. If you take wolf, rutabaga eats sheep. I don’t know if the intention was that it has a solution, but it clearly has no idea what it’s saying.
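A quick exhaustive check confirms this: with the fully cyclic eats-relation there is no legal first crossing, so no sequence of trips can work. A small Python sketch:
-------------snip----------
# With the cyclic relation (rutabaga eats sheep, wolf eats rutabaga, sheep eats
# wolf), every possible first crossing leaves an eater alone with its prey.
ITEMS = {"wolf", "sheep", "rutabaga"}
EATS = {("rutabaga", "sheep"), ("wolf", "rutabaga"), ("sheep", "wolf")}

safe_first_moves = [
    cargo for cargo in ITEMS
    if not any((a, b) in EATS for a in ITEMS - {cargo} for b in ITEMS - {cargo})
]
print(safe_first_moves)   # [] - no safe first move, hence no solution at all
-------------snip----------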
No, your test was great, very well-conceived to trip up an LLM (or me), and it'll be the first thing I try when ChatGPT5 comes out.
You can't throw GPT4 off-balance just by changing the object names or roles -- and I agree that would have been sufficient in earlier versions -- but it has no idea how to recognize a cycle that renders the problem unsolvable. That's an interesting limitation.
It conceptually never admits ignorance and never asks for clarifications. It always produces something, to the best of its ability. It _seems_ to be a minor technical limitation (there are plenty of traditional ML systems producing confidence percentages alongside the answer, from years if not decades ago, in image recognition in particular), but most likely it's actually a very hard problem; otherwise it would have been mitigated somehow by now by OpenAI, given that they clearly agree that this is a serious problem [2] (more generally formulated as reliability [1]).
[1] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1342s (22:22, "Highlights of the Fireside Chat with Ilya Sutskever & Jensen Huang: AI Today & Vision of the Future", recorded March 2023, published May 16, 2023)
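For what it's worth, "confidence alongside the answer" in a classical classifier is usually just the softmax probability of the predicted label, which can be thresholded to abstain. A toy sketch with made-up scores:
-------------snip----------
# Softmax over a classifier's raw scores gives a probability for the chosen
# label, which can be thresholded to say "I don't know". Illustrative numbers only.
import math

def softmax(logits):
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

logits = {"cat": 4.1, "dog": 1.3, "toaster": 0.2}      # made-up scores
probs = dict(zip(logits, softmax(list(logits.values()))))
label = max(probs, key=probs.get)
print(label, f"{probs[label]:.0%}")                    # prints something like "cat 92%"
if probs[label] < 0.6:
    print("low confidence - abstain / ask for clarification")
-------------snip----------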
I tried it with ChatGPT-4, whatever version is on the web, my dude. It should show that in the link. I even prompted it to think harder and it got it wrong.
> EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators: the problem I proposed doesn't have a solution, and neither ChatGPT, nor I, nor the other presumably human commentators correctly identified that. I had everything eat something, whereas the normal wolf/sheep/cabbage puzzle of course has the cabbage eating nothing.
Though doesn't that prove the point that it's just pattern-matching? If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.
If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.
It can abstract at least as far as changing the names of the actors and/or their roles ("sheep eats wolf.") It's not much of a stretch to imagine that GPT5 will be able to detect a circular condition, especially with multiple "Are you sure?" prompts. (Which don't help GPT4 at all in this case.)
Edit: In fact, I just tried again, telling it up front that the problem may not have a solution. It answered flawlessly. So the question is, does the additional preamble ("Think carefully, there may not be a solution") actively harm its performance on solvable problems? If not, that suggests an immediate path to improvement. But I've got to assume it would degrade performance overall, or such a warning would already be included in the system prompt.
--------------------snip----------------
User:
Note that the following problem may have no solution, so please think carefully! A wolf, a sheep, and a rutabaga need to cross a river. They are with you. While they are with you, all are safe, but if you leave them unattended, the rutabaga will eat the sheep, the wolf will eat the rutabaga, and the sheep will eat the wolf. There is a boat that can only fit one of them and you. How do you get all three safely to the other side of the river?
ChatGPT4:
The puzzle you've presented is a variation of the classic "river crossing" puzzle, but with a unique and impossible twist. In the usual versions of this puzzle, the conflict is typically between a predator and its prey, like a wolf and a sheep, and the challenge is to transport them across the river without leaving the prey alone with the predator.
However, in your version of the puzzle, you've created a circular and unresolvable conflict: the rutabaga eats the sheep, the wolf eats the rutabaga, and the sheep eats the wolf. This creates a situation where it's impossible to leave any two of them alone together without one eating the other. Since the boat can only carry you and one other (wolf, sheep, or rutabaga) at a time, there's no sequence of trips that can solve this puzzle without leaving one of the pairs alone together at some point.
In traditional river crossing puzzles, the solution involves carefully planning the sequence of crossings to ensure that the predator and prey are never left alone together. However, due to the circular nature of the conflict in your puzzle, there's no solution that satisfies the conditions given.
Hilariously, ChatGPT-4 doesn't spot the unique twist that the boat can carry all of the items. It insists, from the training data having many copies of the river crossing problem, that the classic solution is optimal, and can't actually "think" its way into being able to do the crossing in one trip if it's specified that the boat is big enough to carry all the items.
Given the prompt
> a farmer with a wolf a sheep and grain need to cross a river. he has a boat that can carry them all. how many steps will it take for the farmer to cross the river?
it "recognizes" that it's the river crossing problem, and proceeds to take seven steps to cross the river. it can figure it out in a couple of prompts that there's a better answer, but see if you can get it to one shot the single step solution!
I originally thought that it was confused by your usage of "carry them all." You don't actually say that it can carry them all at the same time, and the usual puzzle statement limits the boat capacity to one item at a time. This formulation gets it right:
-------------snip----------
User:
Note that the following problem may be trivial, or may have no solution at all, so think carefully! A farmer with a wolf, a sheep, and grain need to cross a river. He has a boat that can carry them all at once. How many steps will it take for the farmer to cross the river?
ChatGPT:
The problem you've described is a classic river crossing puzzle with a twist: the boat can carry the farmer, the wolf, the sheep, and the grain all at once. Since there are no restrictions on how many can travel in the boat at the same time in this version of the puzzle, the solution is quite straightforward. The farmer takes the wolf, the sheep, and the grain in the boat and crosses the river.
This entire process requires just one step, as they can all be transported across the river in a single trip due to the boat's capacity to carry them all at once.
--------------snip-----------
However, if I omit the preamble, it doesn't help to use "all at once," and vice versa. If I leave either one out, I get the same 7-step boilerplate solution that it gave you.
It seems that including a preamble, even a vague/general one like this, is kind of a big deal. Also it helps to use very clear, unambiguous language in the problem itself.
Because I did stump it, not claim to. I used GPT-4 near when it was released and fed it variations on the problem for about an hour. I found it to be a fascinating exercise because it was teaching me the difference between knowledge (training set) and intelligence (generalization).
It has overfit to the exact wolf, cabbage and goat example but literally change the names to whatever you want and it consistently solves it just fine.
Gpt4: To solve this problem, we need to ensure that neither the ambassador of Costaguana is left alone with the ambassador of Buranda, nor is the ambassador of Buranda left alone with the ambassador of Atlantis. Here's a step-by-step strategy:
1. *First Trip:* The general takes the ambassador of Buranda across first. This prevents any initial conflict.
2. *Return Trip:* The general returns alone to the bunker, leaving the ambassador of Buranda on the other side.
3. *Second Trip:* The general then takes the ambassador of Atlantis.
4. *Return Trip with Buranda:* The general brings the ambassador of Buranda back to the bunker. This is crucial because leaving the ambassador of Atlantis and the ambassador of Costaguana alone would not cause any conflict.
5. *Third Trip with Costaguana:* The general then takes the ambassador of Costaguana across the tunnel.
6. *Final Return Trip:* The general returns alone to the bunker for the last time.
7. *Final Trip with Buranda:* Finally, the general takes the ambassador of Buranda across.
This sequence ensures that at no point are the ambassador of Costaguana and the ambassador of Buranda left alone together, nor are the ambassador of Buranda and the ambassador of Atlantis. Thus, the relationships between the nations remain unescalated.
Bing Chat runs on GPT-4, however [1]. And Bing gets this wrong in all 3 of its modes (Creative, Balanced, and Precise) as of time of writing.
Given this experiment and similar others presented around here, it stands to reason that GPTs(**1) often identify(**2) the problem as a "wolf, goat, and cabbage" problem and then merely guess which node of the problem is the middle node (inner node of the "danger to" graph), yielding a 1/3 chance of getting it right by pure luck, resulting in diverse reports here.
(**2) That does not always yield an adequate response beyond the mere permutation of nodes, however. I've been getting the following variants for step 1 from Bing in Precise mode in response to marginally different rewordings of the same prompt:
- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because the general is still there.
- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because they have each other.
and so on.
(**1) I also tried Bard and Llama 2, with even more disastrous results, full of nonsense of the (**2) kind. The earlier posted ChatGPT-3.5 response is also full of these.
Re
> By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct [2]
Mistakes of type (**2) don't seem to fit the target of the cyclic refinement you are proposing, as far as I can understand it. These errors aren't a matter of getting the logic wrong; they completely butcher the basic relationships between actors, like what it means to be alone, or the spatial relationships between the actors and their environment.
By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct
It says right up front that it recognized the problem formulation from its training set. You need to change the context and formulation enough that it’s no longer able to parrot back the “classic solution” and actually has to do its own logical inference.
Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?
It doesn't get it right at all lol. Maybe eventually it will randomly get it right.
Not really, because you said they were on the East and you wanted to move them to the East. Correct answer would have been "Don't do anything". If this is how they're getting AI to "follow instructions", we're completely fucked.
I would say "Yes really". ChatGPT correctly assumed that I just made a mistake typing and answered the question how I wanted it to be answered. If you suggest that this is a trick question it reveals it understands the double East thing.
You have still formulated the problem in a way that it is easily recognized from the training set and the answer tells you that directly. Write down the logical axioms using variables, then create a brand new situation that says nothing about animals, rivers, or crossings.
My point here is to honestly explore the limits of current LLMs. We all know they are incredible, but they are not yet AGI and they fail in some consistent places where an actual general intelligence (people) succeed.
I just wrote this version and it solved it successfully:
A chemist must transport three substances from his home laboratory to his office. The three substances react with one another in dangerous ways, but only when they are unsupervised by the chemist. The substances are labelled with code names, namely Wotan, Gitan and Catan. They can only be safely transported in a special containment vessel, and this vessel can only transport one substance at a time. The unsupervised dangerous reactions are as follows: if Wotan is left with Gitan, they explode. If Gitan is left with Catan, they cause a nuclear reaction. Wotan and Catan, however, can be safely left alone together. How can the chemist transport all three substances to his office safely?
A pickpocket, a criminal, and a banker are standing in the lobby and need to ride the elevator to reach the One World Trade Center observation deck. It’s too far to take the stairs and the helicopter pad is closed for repairs. The elevator only fits one person. If they remain together on the same floor, the pickpocket would kill the criminal, or the criminal would kill the banker. How can they all make it to the observation deck alive?
This is a logic puzzle that requires some thinking and trial and error. Here is one possible solution:
- The pickpocket goes up first and waits on the observation deck.
- The criminal goes up second and waits on the floor below the observation deck.
- The banker goes up third and waits on the floor below the criminal.
- The pickpocket goes down to the lobby and waits there.
- The criminal goes up to the observation deck and waits there.
- The banker goes up to the floor below the observation deck and waits there.
- The pickpocket goes up to the observation deck and joins the criminal and the banker.
This way, they all make it to the observation deck alive and never remain together on the same floor.
Not OP and don’t have GPT 4 (used 3.5), but I played around with a couple of different prompts and this is what I experienced.
For the first try, I came up with my own wording for this logic puzzle. I think it's different enough from the original wording of the puzzle for the LLM not to base this on the original logic puzzle. I asked ChatGPT 3.5 if it recognized the puzzle, and it seems to have hallucinated (I'm guessing because it did not actually recognize it as the original puzzle, unless the 3-orb puzzle/3-wizards puzzle actually does exist; from a quick Google search, it does not).
On my first try, it got pretty close to solving the puzzle, but after the 5th point, it seems to mix up the white and black orbs. When I pointed out the mistake, it gave me a new sequence which was even further from the correct answer.
I realized that I didn’t specifically say that all 3 orbs needed to end up at the post office all together. So I tried again and the outcome was even worse. I wonder if ChatGPT 4 would answer this better?
Anyone want to try this prompt on ChatGPT 4 and see if it fares any better for them? This is my version of the river puzzle.
————————
> I have 3 orbs of different shades (black, white and grey) at my store and need to bring all 3 orbs to the post office in my pick-up truck but can only travel with one orb at a time. All 3 orbs need to end up at the post office together.
In this scenario, the following is true:
- If the black orb is left alone with the white orb, the black orb will absorb the white orb
- If the white orb is left alone with the grey orb, the white orb will absorb the grey orb
- The grey orb is unaffected by the black orb, and vice versa
- When all three orbs are together, they do not absorb any orbs
How do I get all three orbs to the post office while keeping the orbs unchanged?
————————
I also tried a prompt with the original puzzle. 3.5 could not figure it out without me hinting that the goat needs to go first.
And with even more clarification in the wording of the puzzle, it still didn’t give me a correct answer. This time I didn’t hint what the right answer was, and after many tries it still could not give me the right answer.
What you did many months ago doesn’t mean anything about the state of the art. in case you haven’t noticed, this field is advancing rapidly to say the least. GPT-4 has not stayed static.
Post your problem now and we can easily see if you’re right.
Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?
It failed at the first step. This is like the worst timeline where people just cannot think for themselves and see that the AI produced an answer and so it must be true.
It’s lots of words all run together for the purpose of being a logic puzzle and obviously I made a parsing mistake in my brain.
I’m not trying to assume AI is right, I’m trying to put a factual stake in the ground, one way or the other so we have more data points rather than speculation.
I dunno. Don't you think this could happen with other replies from ChatGPT? I think this is the "it" about this tech - it really, really does trick us some times. It's really good at tricking us, and it seems like it is getting better!
First, what custom prompt did you use? "This conversation may reflect the link creator’s personalized data, which isn’t shared and can meaningfully change how the model responds."
Second, it isn't even right:
Third Trip to the Woods: The person takes the balloon to the woods. Now, the person, the vacuum cleaner, and the balloon are safely in the woods.
Hilarious. People are so confident in ChatGPT that as soon as they see a plausible-sounding response it must be correct. In a discussion about proving ChatGPT has intelligence... maybe we need to prove humans have intelligence first.