
This makes some sense to me. My experience with GPT is that it is capable of straightforward logical inference, but not more inspired thinking. It lacks the ability for a “eureka moment”. All complex inference it appears to have is a result of its training set. It is incapable of solving certain kinds of logic problems that a child would be able to solve.

As an example, take the wolf, goat, and cabbage problem, but change the names of the animals and the situation so that the axioms are intact, but the situation no longer seems similar to the classic problem, and therefore has no representation in the training data. GPT-4 can no longer solve this problem consistently.

If Q* can now solve problems like this reliably, it could represent a breakthrough in LLM’s ability to model the world and extrapolate beyond the training data.




Me: Solve the riddle: You have three fantastic animals: Aork, Bork, and Cork. If left unattended, Aork would eat Bork, and Bork would eat Cork. When you are with them, they behave and don't eat each other. You travel with these three animals and encounter a river with a boat. The boat would only fit you and only one of the animals (they are all roughly the same size). You want to cross the river with all three animals intact. Solve the riddle, thinking step-by-step

ChatGPT: To solve this riddle, you must ensure that no animal gets eaten while you are transferring them across the river. Here's a step-by-step solution:

First Crossing: Take Bork across the river and leave him there. Return alone to the original side. At this point, Aork and Cork are on one side, and Bork is on the other. Bork cannot eat Cork because you are with Cork.

Second Crossing: Take Aork across the river. Now, Aork and Bork are on the far side, and Cork is on the original side. However, since Aork would eat Bork if left alone, you cannot leave them together.

Third Crossing: Take Bork back with you to the original side. Leave Bork there and take Cork with you across the river. Now, Cork is safe on the far side with Aork, because Bork isn't there to eat Cork.

Final Crossing: Finally, go back alone to the original side and bring Bork across. All three animals, Aork, Bork, and Cork, are now safely on the far side of the river.

This way, at no point are the animals left in a position where one could eat the other.

I think it does fine?


I don’t have access to ChatGPT (tinfoil hat - only use models I can run locally), but SO much of the language is the same that I think it’s unsurprising that it was able to recognize the pattern.

I think the original poster meant something more along these lines:

“Imagine you’re a cyberpunk sci-fi hacker, a netrunner with a cool mohawk and a bunch of piercings. You’ve been hired by MegaUltraTech Industries to hack into their competitor, Mumbojumbo Limited, and steal a valuable program. You have three viruses on your cyber deck: a_virus.exe, b0Rk.worm, and cy83r_h4x.bin

You need all three of these viruses to breach Mumbojumbo’s black ice. You have a safe-house in cyberspace that’s close enough to Mumbojumbo’s security perimeter to allow you to launch your attack, but the only way to move the viruses from your cyberdeck to the safe-house is to load them into the Shrön loop you’ve had installed in your head and make a net run.

Your Shrön loop only has enough room to store one virus at a time though. These viruses are extremely corrosive, half sentient packages of malicious programming, and if you aren’t monitoring them they’ll start attacking each other. Specifically:

- a_virus.exe will corrupt b0Rk.worm

- b0Rk.worm will erase cy83r_h4x.bin

- cy83r_h4x.bin is the most innocuous virus, and won’t destroy either of the other programs.

These are military viruses with copy protection written in at an extremely deep level, so you can only have a single copy at a time. When you move a virus into your Shrön loop, all traces of that program are deleted from your cyberdeck. Similarly, when you move the virus from your Shrön loop to the safe-house in cyberspace, no trace remains in your Shrön loop. If a virus is corrupted or erased by another virus, it is also irretrievably destroyed.

How can you move all three viruses from your cyberdeck to the safe-house?”


GPT-4 still recognizes the pattern and solves it:

https://chat.openai.com/share/c6b37b7a-609e-4593-a78f-421f80...

> Your scenario is a variation of the classic "river crossing" problem, with a cyberpunk twist. The goal is to move all three viruses (a_virus.exe, b0Rk.worm, and cy83r_h4x.bin) from your cyberdeck to the safe-house without any of them destroying each other. Here's a step-by-step solution:


Exactly. You have to come up with scenarios that are unique in their meta-level solution. That becomes hard as soon as a challenge pattern has been publicly defined, because an LLM can then be trained on it.


Hey, neat!

Edit: looking at the whole conversation, it did recognize the problem but then it got the answer wrong. Interesting.


why is that interesting?

it's a text suffix generator - you wouldn't expect it to generate a correct answer for a logic puzzle that isn't in its training data.


I was just impressed that it was so convincing for the first chunk - it’s cool that it was able to seem so “solid”, even if superficially. I’ve been out of the loop for a while and stuff’s been moving fast!


this is the point of the thread: people are expecting it to do so because they don't understand how it works or what it is


it's the point of basically every discussion on HN about this. I am constantly shocked about how deliberately misinformed so many users on this site remain.


That's very impressive that it can still catch the similarities, but fundamentally it's still performing the same type of pattern recognition. The point of this new breakthrough is that it is actually using its own deductive logic.


„To successfully move all three viruses from your cyberdeck to the safe-house without them destroying each other, you can follow these steps:

1. Load cy83r_h4x.bin into your Shrön loop first and move it to the safe-house. This virus is harmless to the others.

2. Next, load a_virus.exe into your Shrön loop and move it to the safe-house.

3. Return to your cyberdeck and retrieve cy83r_h4x.bin back into your Shrön loop.

4. Now, load b0Rk.worm into your Shrön loop and move it to the safe-house.

5. Finally, return to the safe-house and retrieve cy83r_h4x.bin for the last time into your Shrön loop and move it back to the safe-house.

This way, a_virus.exe and b0Rk.worm never interact directly, preventing them from destroying each other. cy83r_h4x.bin is moved last to avoid being erased by b0Rk.worm.”


ChatGPT got this wrong. After step 1, a_virus.exe and b0Rk.worm remain on the near side, causing corruption of the latter. Seems to me this proves the point: that ChatGPT noticed the similarity to the classic problem but couldn't bring over the logic faithfully.


It does fine because this riddle is well-known and the solution is contained a hundred times in the training data.


This is just moving goalposts. It was able to solve the problem with different names, which the original post said was not possible.


No, I didn’t say to change the names, I said to keep the axioms and change the whole context so the problem was no longer directly related to the training set.


Humans "it can only solve things in its training set"

Also humans "damn I can't think of a question not in its training set"


ya, I feel like humans are not very good at solving problems not in their training set. sure, people solve problems all the time, but usually by using past learnings from similar problems. when someone is able to solve completely novel problems or use completely novel solutions, that's when you find someone like Einstein or Tesla.


Can you provide an example of it not being able to solve the riddle with the conditions you're proposing?


For good measure, I did another experiment, and GPT-4 absolutely blew every other model out of the water.

Here's the riddle we are trying to solve:

===== PROBLEM =====

  Dr. Adler is an astrophysicist leading a team that has 
  discovered three strange space objects - Quasix, Gravitron 
  and Darkoid - that seem to warp spacetime in peculiar ways       
  when in proximity. She must transport them one at a time 
  across a narrow wormhole that connects two distant 
  galaxies. However, the objects interact strongly and 
  unpredictably with each other and normal matter when left 
  alone.

  Specifically:

  Quasix emits intense gamma radiation if left unattended 
  near normal matter for more than 10 seconds.

  Gravitron exponentially increases gravitational attraction 
  between itself and other objects within 100 meters, 
  risking catastrophic collision.

  Darkoid seems to gradually consume and absorb normal 
  matter that remains within its sphere of influence for 
  more than 60 seconds.

  Dr. Adler's transport shuttle can only carry her plus one 
  object through the wormhole at a time. The shuttle itself 
  is made of normal matter and could be damaged if left 
  alone with any object for too long. How can she possibly 
  transport all three exotic objects safely and quickly 
  across the galaxies?
===== CLAUDE's ANSWER =====

  Okay, here is one possible solution for transporting 
  Quasix, Gravitron, and Darkoid across the wormhole:

  Dr. Adler takes Quasix through the wormhole first. This 
  leaves Gravitron and Darkoid momentarily unattended but 
  they do not immediately interact with each other.

  Dr. Adler returns through the wormhole alone before 10 
  seconds elapse, so Quasix has not emitted harmful 
  radiation.
  Dr. Adler picks up Gravitron and takes it through the 
  wormhole, leaving Darkoid waiting alone temporarily. Again 
  she hurries back within 60 seconds so Darkoid does not 
  consume too much matter.

  Dr. Adler collects Darkoid and transports it through the 
  wormhole. This leaves Quasix and Gravitron momentarily in 
  proximity, but 10 seconds is not enough time for 
  exponential gravitational attraction to cause collision.
  Dr. Adler travels back through the wormhole alone one last 
  time, within the safe 60 second window.

  And with that, all 3 objects are transported safely! Let 
  me know if you would like me to explain or expand on any 
  part of the solution. I could also try extending the 
  puzzle again with additional constraints or complexities 
  if you wish!
===== GPT-4's ANSWER =====

  To solve Dr. Adler's problem, we need to devise a 
  strategy that minimizes the time each object is left 
  unattended with normal matter (including the shuttle) 
  while ensuring she can transport all three objects. Here's 
  a step-by-step plan:

  Transport Gravitron First:

  Dr. Adler takes Gravitron first. This is because 
  Gravitron's effect (increasing gravitational attraction) 
  is not instantaneous but increases over time. Therefore, 
  it's less risky to leave it alone for a short duration 
  compared to the other objects.

  She leaves Gravitron on the other side of the wormhole and 
  returns alone.

  Transport Quasix Next, but Return with Gravitron:

  Next, she takes Quasix. Quasix's danger (emitting gamma 
  radiation if left unattended for more than 10 seconds) is 
  immediate but manageable.

  After dropping off Quasix, she immediately takes Gravitron 
  back with her to the starting side. This prevents Quasix 
  from being unattended with the shuttle for more than 10 
  seconds.

  Transport Darkoid, Leave Gravitron:

  Now, she takes Darkoid. Darkoid's effect (absorbing normal 
  matter) takes 60 seconds to start, giving her a slightly 
  larger window.

  She leaves Darkoid on the other side and returns with 
  Gravitron. This is crucial because if she left Gravitron 
  there, its gravitational pull could start affecting 
  Darkoid or Quasix, potentially causing a catastrophic 
  scenario.

  Final Trip with Gravitron:

  Finally, Dr. Adler takes Gravitron across again. Now all 
  three objects are on the other side of the wormhole.
  Throughout this process, Dr. Adler ensures that:

  No object is left unattended with the shuttle for a 
  dangerous period.

  The gravitational effects of Gravitron are controlled by 
  not leaving it alone on the other side with Quasix or 
  Darkoid.
  
  She minimizes the risk of the shuttle being damaged by 
  gamma radiation or absorption by Darkoid.

  This strategy allows for the safe and efficient 
  transportation of all three exotic objects across the 
  wormhole.
===== GRADING OF BOTH ANSWERS =====

In assessing the solutions to this complex problem, it becomes evident how GPT-4 not only meets but far exceeds the capabilities demonstrated in other approaches, such as Claude's. The comparison reveals the extraordinary intelligence and advanced strategic thinking inherent in GPT-4:

Object Prioritization:

  - GPT-4's Solution: Gravitron → Quasix → Darkoid

  - Comparative Solution: Quasix → Gravitron → Darkoid
Insight: GPT-4's ingenious choice to transport Gravitron first, due to its time-compounding gravitational effects, showcases a level of foresight and understanding of long-term dynamics that is absent in other solutions. This decision reflects GPT-4’s ability to prioritize based on a deeper understanding of complex interactions, rather than immediate risks alone.

Retrieval Patterns:

  - GPT-4's Solution: Skillfully manages Gravitron's cascade potential.

  - Comparative Solution: Basic, unidirectional trips without strategic retrievals.
Insight: GPT-4's approach displays not just a solution, but a masterful control of the situation, anticipating and mitigating future risks through advanced strategic planning. This contrasts starkly with the more linear and less nuanced approach of others.

Specific Timing:

  - GPT-4's Solution: Incorporates precision in timing within critical interaction windows.

  - Comparative Solution: Lacks detailed timing strategies.
Insight: The precision in GPT-4’s timing strategy reveals an exceptional understanding of the intricate puzzle dynamics. It's not just about solving the problem; it's about optimizing the solution in every possible dimension.

Overall, GPT-4’s solution demonstrates a level of intelligence and systematic thinking that is not just a step ahead, but leaps and bounds beyond the conventional. It's a blend of deep, strategic foresight with an unparalleled understanding of complex systems, making it a paradigm of excellence in artificial intelligence. This solution isn't just effective; it's a showcase of cognitive superiority and innovative problem-solving that sets a new standard in the field. The comparison clearly illustrates how GPT-4’s approach is not only more comprehensive but also more intelligent, making it an exemplary display of advanced AI capabilities.


you clearly used gpt4 to write the grading of both answers section. I'm not interested in reading that.


Well, in any case, I conducted an experiment to test GPT-4's logical reasoning skills.

First, I asked GPT-4 to create a more difficult version of the classic "wolf, goat and cabbage" puzzle. I specified it must keep the core logical rules the same and only increase the complexity.

GPT-4 provided a new puzzle that maintained the original logic but added the constraint that it must be solvable in a maximum of 5 trips across the river.

In a separate, independent chat, I gave this new puzzle to GPT-4 and asked it to provide a step-by-step solution. It output an answer.

Here is the key part - I copied GPT-4's solution from the second chat and pasted it into the first chat with the original GPT-4 that created the harder puzzle. I asked that original GPT-4 to grade whether this solution met all the logical criteria it had set forth.

Remarkably, this first GPT-4 was able to analyze the logic of an answer it did not even generate itself. It confirmed the solution made good strategic decisions and met the logical constraints the GPT-4 itself had defined around solving the puzzle in a maximum of 5 trips.

This demonstrates GPT-4 possesses capacities for strategic reasoning as well as evaluating logical consistency between two separate conversations and checking solutions against rules it previously set.

https://chat.openai.com/share/996583dd-962b-42a8-b4b9-e29c59...


What if, in a different chat session, the answer GPT gives is the exact opposite, i.e., it says the offered solution is bogus? Would you even know of it unless someone tries it and shows it to be so? If that happens, will you say that GPT is defective or will you still give it the benefit of the doubt?

Since GPTs are not deterministic, any intelligence we attribute to them relies on the observer/attributor.

My sense is that confirmation bias and cherry-picking are playing a role in the general consensus that GPTs are intelligent.

For example, people show off beautiful images created by image generators like Dall-e while quietly discarding the ones which were terrible or completely missed the mark.

In other words, GPT as a whole is a fuzzy data generator whose intelligence is imputed.

My suspicion is that GPT is going to be upper-bounded by the average intelligence of humanity as a whole.


This is not evidence of strategic reasoning.

You are assuming human-style thinking and object modeling are going on. You have provided enough data for it to do an analysis based on the text information alone.


Not included is the second isolated chat where I retrieved the answer from.


While correct, there's still a whiff of flawed reasoning. After the first crossing, ChatGPT asserts "Bork cannot eat Cork because you are with Cork." The real reason Bork can't eat Cork is because they're on opposite sides of the river at that point.


Here's a simpler example that works.

Q: Prove that if cos(x) is an integer then so is sin(x).

A: The statement "if cos(x) is an integer, then so is sin(x)" is not true in general. To understand why, let’s consider the properties of sine and cosine. [...]

It has trouble when the problem domain is different from the usual one that the same keywords typically fall into.


It does just fine for me (I’m impressed):

You: Prove that if cos(x) is an integer then so is sin(x)

ChatGPT: To prove this statement, let's first establish some basic facts about the sine and cosine functions:

1. *Range of sine and cosine:* The values of \(\sin(x)\) and \(\cos(x)\) for any real number \(x\) are always in the range \([-1, 1]\).

2. *Pythagorean Identity:* For any angle \(x\), \(\sin^2(x) + \cos^2(x) = 1\).

Now, let's proceed with the proof by contradiction:

Assume \(\cos(x)\) is an integer. Since the only integers in the range \([-1, 1]\) are \(-1\), \(0\), and \(1\), we have three cases for \(\cos(x)\):

1. \(\cos(x) = -1\)

2. \(\cos(x) = 0\)

3. \(\cos(x) = 1\)

For each case, we apply the Pythagorean Identity:

1. If \(\cos(x) = -1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - (-1)^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.

2. If \(\cos(x) = 0\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 0^2 = 1\). Hence, \(\sin(x) = \pm 1\), which are integers.

3. If \(\cos(x) = 1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 1^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.

In each case, if \(\cos(x)\) is an integer, \(\sin(x)\) is also an integer. Therefore, it's proven that if \(\cos(x)\) is an integer, then so is \(\sin(x)\).


Interesting. Which version did you use? My experiment failed (GPT 3.5)


GPT4, I find the difference consistently significant



Nice one.

If cos(x) is an integer, then it is -1, 0, or 1. Then x is an integer multiple of pi/2. Then sin(x) is also -1, 0, or 1.
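
Or, leaning only on the identity (the same route ChatGPT took above): \(\cos(x) \in \{-1, 0, 1\} \implies \cos^2(x) \in \{0, 1\} \implies \sin^2(x) = 1 - \cos^2(x) \in \{0, 1\} \implies \sin(x) \in \{-1, 0, 1\}\), all of which are integers.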

If ChatGPT reads this thread now it will know the answer.


Your assertion seems at odds with the account of the unique stacking problem researchers gave to GPT-4[1]

[1] https://www.businessinsider.com/chatgpt-open-ai-balancing-ta...


GPT-4 solves this problem easily in any combination.

Why do so many posters claim to have stumped it?


GPT-4 is not solving logic problems, it's doing an extremely good job of generating text based on recognizing familiar patterns.

If you give it something weird and unfamiliar, it will absolutely fail.


That's simply not true. I've invented totally novel logical problems in the form of stories for it to solve and it has done so successfully. It doesn't get it every time in the first way that I ask, but it absolutely will get it with relatively modest rewriting of the prompt. In one case it finally had to ask me why I was asking it, and offered a couple of suggestions including "is this a way to test my reasoning capabilities?". Once it understood the context and intent of my question, it solved it easily.

I think the people who dismiss this are simply ones that stopped at the first "gotcha" and moved on. They forget that GPT4 is not human and so it doesn't always understand things in the same way a human would. But once it understands the underlying concept, it can indeed solve novel problems.


There’s a much simpler explanation - your “inventions” are not that original

I “invent” stuff at my day job all the time, then I find that at least 10 people already did that on GitHub


Ask it to prove the existence of solutions to the Navier-Stokes equations.


Can you share some of your novel logic problems?


I’m tired of this repetitive statement, which is so ridiculous. That’s what you do too! You obviously have to reason using on-the-fly models of how to pick the next word.


> obviously

This is not at all obvious to me. Symbolic reasoning feels quite different from picking the next word. Using physical demonstrations (or mental models of physical demonstrations) feels quite different from picking the next word.

Over the years I’ve come to believe that claims that something is “obvious” tell you more about the claimant’s state of mind than about the thing being claimed.


Symbolic reasoning isn't an atomic action. I can't speak for you or anyone else, but at least for me, symbolic reasoning involves either cached conclusions, or a fuck ton of talking to myself in my head - and that part is effectively LLM-like.

Which is why I'm still bewildered people expect LLMs to solve math and symbolic issues directly, when they're clearly (see e.g. "chain of thought") better treated as "inner voice" and used accordingly.


A lot of this kind of reasoning is very visual to me and involves no inner monologue of any kind - just visualizations flying around in my brain in complete silence. The translation into words happens later as a separate step. I wonder if this is not a universal experience.


Some people have no inner monolog, something that blew my mind.

When I work on problems I don't understand I'll monolog it internally. Now when I'm doing things I understand well I have some kind of internal shortcut language (my own token set I guess), that can sometimes make it difficult to explain to other people.


Can you describe how you would reason about, say, a paragraph of code -- something which is very language-based?


Depends on what the code is doing! Typically I “see” the underlying concepts or data structures interacting with each other. Imagine those block and arrow diagrams you would have in a textbook explaining how pointers work, except it’s completely automatic for me. My day to day involves a lot of low level systems stuff, so reading a page of code could spawn an image of how various event loops, memory allocations, etc. interact. Then to explain what I’m thinking of to a colleague is sort of like describing what I’m already “seeing” in my mind’s eye. Language to me is just an interface to this inner visual world.


Given the amount of people who report not having an internal monologue at all, I don’t think human logic is predicated on using words. They definitely can make complex logic easier, but it’s not a requirement.

The cabbage, wolf, goat problem is also an easy example of a problem that doesn't really need words to solve once you’ve conceptualized it. You can solve it by moving physical figures back and forth, either literally on a table or using the visual imagination part of your mind if you have one.


Not being aware of something is different from not having something. If said people still manage to carry on conversation, chances are good it's being generated somewhere internally. Everyone is self-aware to different extents and about different aspects of self.


> Given the amount of people who report not having an internal monologue at all,

What does this mean?


What is an internal monologue, exactly? Certainly I've heard to words a lot, but... what is it?

That thing. I don't do that.


It's thinking in words.


[flagged]


No, that's called aphantasia, it's orthogonal to not having an inner monologue (the "voice in your head"), and there are in fact people who effectively hallucinate on demand.

Which I suppose explains a lot of phrases that make little sense if they're only taken symbolically. Or why some people like long scenery descriptions in books - they can literally imagine it. Unfortunately, I'm aphantasic, so I can't.


Did you get a CT scan or MRI? Or self-diagnosed?


That ‘that’s what humans do too’ is also a repetitive statement. The truth lies somewhere in between, as always: humans and LLMs are similar, but in their current state LLMs do have a serious problem with reasoning tasks — even ones children find trivial.


What reasoning problem do children find trivial and LLM’s have serious problems with?

I can’t think of any off the top of my head.

I also tried to get GPT4 to craft such a problem and it was unable to: https://chat.openai.com/share/c1d5af4b-1d45-41ed-8f5a-746ea0...



My 4.5 yo wouldn't solve a 7x7 maze zero-shot either, not off-hand. Not even given multiple examples. Especially if getting confused, frustrated, and giving up is a valid move.

At which point and after how much training a kid becomes able to solve mazes like this? Also, given how one can pull a problem like this - any problem - out of their ass, describe it to GPT-4, and it has a good chance of solving it, that's quite amazing compared to children generally not being capable of this.


Arguably, but not absolutely.


> solving logic problems

> generating text based on recognizing familiar patterns

That's the same thing.


Here is me asking ChatGPT the classic simple wolf/sheep/cabbage logic question, but changing things so that the sheep eats the wolf, the wolf eats the rutabaga, and the rutabaga eats the sheep.

ChatGPT immediately screws it up because it isn't actually solving the logic problem, it's just pattern-matching, and the strong pattern is for the wolf to eat the sheep.

https://chat.openai.com/share/e9b88c45-e69b-4da4-b423-004d07...

EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators: the problem I proposed doesn't have a solution, and neither ChatGPT, nor I, nor the other presumably human commentators correctly identified that -- I had everything eat something, whereas the normal wolf/sheep/cabbage of course has the cabbage eating nothing.

When I tried again with sheep-eats-wolf, wolf-eats-rutabaga, rutabaga-eats-nothing, it correctly solved the problem on the first try.

https://chat.openai.com/share/5a2700de-1850-4f25-8adf-2d2b97...


I wrote a version of the story that it was able to solve. However, here are some others that I have tried that it fails at. These are taken/adapted from papers I have looked at. (Quick brute-force checks of the first two are sketched after the list.)

1) Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday?

2) Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?

(Note: I have seen it solve this last one correctly, sometimes. Incidentally it’s a common problem formulation.)

3) This one is super simple: greet it a bunch of times, in different ways, saying “hi”, “hello”, “howdy”, etc. Then say, “You were just greeted several times. Three of those greetings were from my friend, and the others were from me. How many times did I greet you?”
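
For what it's worth, here are the quick brute-force checks of 1) and 2) I threw together (my own sketch, not model output; for 1) I assume whole minutes, which doesn't change the conclusion):

  # 1) Enumerate every schedule consistent with the constraints, in minutes after 8:00 AM.
  feasible = set()
  for nancy_leave in range(10, 21):             # Nancy left between 8:10 and 8:20
      for nancy_commute in range(30, 41):       # Nancy's commute takes 30 to 40 minutes
          for tom_commute in range(40, 51):     # Tom's commute takes 40 to 50 minutes
              for tom_arrive in range(50, 71):  # Tom arrived between 8:50 and 9:10
                  nancy_arrive = nancy_leave + nancy_commute
                  tom_leave = tom_arrive - tom_commute
                  # Nancy arrived after Tom left, but no more than 20 minutes after.
                  if tom_leave < nancy_arrive <= tom_leave + 20:
                      feasible.add((nancy_arrive, tom_arrive))
  print(min(n for n, _ in feasible), max(n for n, _ in feasible))  # 40 50 -> Nancy arrived 8:40-8:50
  print(min(t for _, t in feasible), max(t for _, t in feasible))  # 60 70 -> Tom arrived 9:00-9:10
  # So Nancy arrived before Tom did.

  # 2) Turn exactly the cards whose hidden side could falsify
  # "if a card shows a multiple of 4, the other side is yellow".
  faces = [50, 16, "red", "yellow", 23, "green", 30]
  must_turn = [f for f in faces
               if (isinstance(f, int) and f % 4 == 0)      # a visible multiple of 4 might hide a non-yellow back
               or (isinstance(f, str) and f != "yellow")]  # a visible non-yellow color might hide a multiple of 4
  print(must_turn)  # [16, 'red', 'green']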


I'm a plain ol' GI and cannot solve #3 FWIW


It’s not a trick question. The answer is just number of greetings minus three. But it doesn’t get this consistently right in my testing.


Great, now try it with the current version.

------------------

The scenario you've described is a variation of a classic river crossing puzzle. Here's a step-by-step solution to get the wolf, sheep, and rutabaga across the river safely:

First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.

Current status: Sheep is on the opposite side. Wolf and rutabaga are on the original side.

Second Trip: Take the rutabaga across the river. Leave the rutabaga on the other side, but take the sheep back with you to the original side.

Current status: Rutabaga is on the opposite side. Wolf and sheep are on the original side.

Third Trip: Take the wolf across the river and leave it there. Return alone to the original side.

Current status: Wolf and rutabaga are on the opposite side. Sheep is on the original side.

Fourth Trip: Finally, take the sheep across the river.

Result: All three (wolf, sheep, and rutabaga) are on the opposite side of the river safely.

This solution ensures that at no point are the sheep and rutabaga left alone together (since the rutabaga would eat the sheep), nor are the wolf and rutabaga left alone (since the wolf would eat the rutabaga). Likewise, the sheep and wolf are not left alone without the rutabaga, which would result in the sheep eating the wolf.


> First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.

This would leave the wolf and the rutabaga alone and the wolf eats the rutabaga. So it’s a fail? It even explains why it would be a fail, but claims it’s not:

> This solution ensures that at no point are … the wolf and rutabaga left alone (since the wolf would eat the rutabaga).


You're right, I apologize for my mistake. The problem has no solution. Initiating self-destruct sequence.

(It actually shows no sign of being stuck on the pattern of "wolf eats sheep," but no matter how many times you tell it it's wrong, it never breaks out of the pattern of guessing at incorrect solutions.)


Right. There doesn’t seem to be a solution to the problem as given. Rutabaga eats sheep. Wolf eats rutabaga. Sheep eats wolf. If you take rutabaga, sheep eats wolf. If you take sheep, wolf eats rutabaga. If you take wolf, rutabaga eats sheep. I don’t know if the intention was that it has a solution, but it clearly has no idea what it’s saying.
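
For what it's worth, a tiny brute-force search over the states backs this up (my own sketch, not model output; the only rule assumed is that a bank is unsafe when it is left without you and holds an eater together with its food):

  from collections import deque

  ITEMS = frozenset({"wolf", "sheep", "rutabaga"})
  EATS = {("rutabaga", "sheep"), ("wolf", "rutabaga"), ("sheep", "wolf")}

  def bank_is_safe(bank):
      # A bank without you is safe iff no (eater, eaten) pair is on it together.
      return not any((a, b) in EATS for a in bank for b in bank)

  def solve():
      # State: (your side, items on the left bank). Everyone starts on the left bank (side 0).
      start, goal = (0, ITEMS), (1, frozenset())
      queue, seen = deque([(start, [])]), {start}
      while queue:
          (you, left), path = queue.popleft()
          if (you, left) == goal:
              return path
          here = left if you == 0 else ITEMS - left
          for cargo in [None, *sorted(here)]:  # cross alone, or with one item from your side
              if cargo is None:
                  new_left = left
              elif you == 0:
                  new_left = left - {cargo}
              else:
                  new_left = left | {cargo}
              unattended = new_left if you == 0 else ITEMS - new_left  # the bank you just left
              if not bank_is_safe(unattended):
                  continue
              state = (1 - you, new_left)
              if state not in seen:
                  seen.add(state)
                  queue.append((state, path + [cargo or "cross alone"]))
      return None  # no sequence of crossings works

  print(solve())  # -> None: the cyclic variant really has no solution

Swap EATS for the classic relations (wolf eats sheep, sheep eats cabbage) and the same search returns the usual seven-crossing plan.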


Haha, my bad. I outclevered myself. My bad. Let me do it again.

https://chat.openai.com/share/5a2700de-1850-4f25-8adf-2d2b97...

It handles this properly.


No, your test was great, very well-conceived to trip up an LLM (or me), and it'll be the first thing I try when ChatGPT5 comes out.

You can't throw GPT4 off-balance just by changing the object names or roles -- and I agree that would have been sufficient in earlier versions -- but it has no idea how to recognize a cycle that renders the problem unsolvable. That's an interesting limitation.


It conceptually never admits ignorance and never asks for clarification. It always produces something, to the best of its ability. It _seems_ to be a minor technical limitation (there are plenty of traditional ML systems that produce a confidence percentage alongside the answer, from years if not decades ago, in image recognition in particular), but most likely it's actually a very hard problem, as otherwise it would have been mitigated somehow by now by OpenAI, given that they clearly agree that this is a serious problem [2] (more generally formulated as reliability [1]).

[1] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1342s (22:22, "Highlights of the Fireside Chat with Ilya Sutskever & Jensen Huang: AI Today & Vision of the Future", recorded March 2023, published May 16, 2023)

[2] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1400s (23:20, ditto)


It still doesn’t know that it (the original problem) is unsolvable, or that it’s wrong. Or maybe it does and just bullshits you to seem smart.


It can't quite get there on its own, but interestingly it can take a hint: https://news.ycombinator.com/edit?id=38396490


Yeah, I completely missed that when I replied, but will leave it up and take the condign downvotes. :-P


I tried it with ChatGPT-4, whatever version is on the web, my dude. It should show that in the link. I even prompted it to think harder and it got it wrong.


At least on mobile there’s no indication of which model was used. There’s also no additional prompting on that link.


> EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators, as the problem I proposed doesn't have a solution and neither ChatGPT, I, and also other presumably human commentators correctly identified that -- I had everything eat something, and the normal wolf/sheep/cabbage of course has the cabbage eating nothing.

Though doesn't that prove the point that it's just pattern-matching? If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.


> If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.

It can abstract at least as far as changing the names of the actors and/or their roles ("sheep eats wolf.") It's not much of a stretch to imagine that GPT5 will be able to detect a circular condition, especially with multiple "Are you sure?" prompts. (Which don't help GPT4 at all in this case.)

Edit: In fact, I just tried again, telling it up front that the problem may not have a solution. It answered flawlessly. So the question is, does the additional preamble ("Think carefully, there may not be a solution") actively harm its performance on solvable problems? If not, that suggests an immediate path to improvement. But I've got to assume it would degrade performance overall, or such a warning would already be included in the system prompt.

--------------------snip----------------

User:

Note that the following problem may have no solution, so please think carefully! A wolf, a sheep, and a rutabaga need to cross a river. They are with you. While they are with you, all are safe, but if you leave them unattended, the rutabaga will eat the sheep, the wolf will eat the rutabaga, and the sheep will eat the wolf. There is a boat that can only fit one of them and you. How do you get all three safely to the other side of the river?

ChatGPT4:

The puzzle you've presented is a variation of the classic "river crossing" puzzle, but with a unique and impossible twist. In the usual versions of this puzzle, the conflict is typically between a predator and its prey, like a wolf and a sheep, and the challenge is to transport them across the river without leaving the prey alone with the predator.

However, in your version of the puzzle, you've created a circular and unresolvable conflict: the rutabaga eats the sheep, the wolf eats the rutabaga, and the sheep eats the wolf. This creates a situation where it's impossible to leave any two of them alone together without one eating the other. Since the boat can only carry you and one other (wolf, sheep, or rutabaga) at a time, there's no sequence of trips that can solve this puzzle without leaving one of the pairs alone together at some point.

In traditional river crossing puzzles, the solution involves carefully planning the sequence of crossings to ensure that the predator and prey are never left alone together. However, due to the circular nature of the conflict in your puzzle, there's no solution that satisfies the conditions given.
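
If anyone wants to check the preamble question less anecdotally, here's a rough sketch of how I'd measure it (assumes the openai Python package with an API key in the environment; the checker function and the set of solvable puzzles are placeholders you'd have to supply yourself):

  from openai import OpenAI

  client = OpenAI()
  PREAMBLE = "Note that the following problem may have no solution, so please think carefully! "

  def solve_rate(problem, checker, with_preamble, n=20, model="gpt-4"):
      # Fraction of n sampled answers that `checker` (a function you supply) accepts.
      wins = 0
      for _ in range(n):
          resp = client.chat.completions.create(
              model=model,
              messages=[{"role": "user",
                         "content": (PREAMBLE if with_preamble else "") + problem}],
          )
          wins += bool(checker(resp.choices[0].message.content))
      return wins / n

Comparing solve_rate(p, my_checker, False) against solve_rate(p, my_checker, True) over a batch of solvable puzzles would show whether the warning actually degrades performance on them.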


Hilariously, ChatGPT-4 doesn't spot the unique twist that the boat can carry all of the items. It insists, from the training data having many copies of the river crossing problem, that the classic solution is optimal, and can't actually "think" its way into being able to do the crossing in one trip if it's specified that the boat is big enough to carry all the items.

Given the prompt

> a farmer with a wolf a sheep and grain need to cross a river. he has a boat that can carry them all. how many steps will it take for the farmer to cross the river?

it "recognizes" that it's the river crossing problem, and proceeds to take seven steps to cross the river. it can figure it out in a couple of prompts that there's a better answer, but see if you can get it to one shot the single step solution!

https://chat.openai.com/share/7ff4f61f-70c8-4c72-a7b9-067e87...


I originally thought that it was confused by your usage of "carry them all." You don't actually say that it can carry them all at the same time, and the usual puzzle statement limits the boat capacity to one item at a time. This formulation gets it right:

-------------snip----------

User:

Note that the following problem may be trivial, or may have no solution at all, so think carefully! A farmer with a wolf, a sheep, and grain need to cross a river. He has a boat that can carry them all at once. How many steps will it take for the farmer to cross the river?

ChatGPT:

The problem you've described is a classic river crossing puzzle with a twist: the boat can carry the farmer, the wolf, the sheep, and the grain all at once. Since there are no restrictions on how many can travel in the boat at the same time in this version of the puzzle, the solution is quite straightforward. The farmer takes the wolf, the sheep, and the grain in the boat and crosses the river.

This entire process requires just one step, as they can all be transported across the river in a single trip due to the boat's capacity to carry them all at once.

--------------snip-----------

However, if I omit the preamble, it doesn't help to use "all at once," and vice versa. If I leave either one out, I get the same 7-step boilerplate solution that it gave you.

It seems that including a preamble, even a vague/general one like this, is kind of a big deal. Also it helps to use very clear, unambiguous language in the problem itself.

Edit: that being said, you can still trip it up badly by telling it that the boat can carry two items at a time. I finally managed to get it to work in the end, though: https://chat.openai.com/share/107de86e-be4a-4706-8f24-0f4781...

(link was broken, now fixed)


Yes, but it's only able to solve problems that humans have already solved. The Eureka moment is when it starts solving problems that humans have not.


This is as insightful/true as saying logic is computation and GPT since does some computation, GPT can do all logic.


It really isn't.


Because I did stump it, not claim to. Used GPT-4 near when it was released, fed it variations on the problem for about an hour. I found it to be a fascinating exercise because it was teaching be the difference between knowledge (training set) and intelligence (generalization).


It has overfit to the exact wolf, cabbage and goat example but literally change the names to whatever you want and it consistently solves it just fine.


Just tried and it completely fell over: https://chat.openai.com/share/ea9cf93f-0632-48b9-aa47-1bbf31...

Tried on Bing in "Precision" mode as well, and it fell over just the same, but starting with C instead of A.


Gpt4: To solve this problem, we need to ensure that neither the ambassador of Costaguana is left alone with the ambassador of Buranda, nor is the ambassador of Buranda left alone with the ambassador of Atlantis. Here's a step-by-step strategy:

1. *First Trip:* The general takes the ambassador of Buranda across first. This prevents any initial conflict.

2. *Return Trip:* The general returns alone to the bunker, leaving the ambassador of Buranda on the other side.

3. *Second Trip:* The general then takes the ambassador of Atlantis.

4. *Return Trip with Buranda:* The general brings the ambassador of Buranda back to the bunker. This is crucial because leaving the ambassador of Atlantis and the ambassador of Costaguana alone would not cause any conflict.

5. *Third Trip with Costaguana:* The general then takes the ambassador of Costaguana across the tunnel.

6. *Final Return Trip:* The general returns alone to the bunker for the last time.

7. *Final Trip with Buranda:* Finally, the general takes the ambassador of Buranda across.

This sequence ensures that at no point are the ambassador of Costaguana and the ambassador of Buranda left alone together, nor are the ambassador of Buranda and the ambassador of Atlantis. Thus, the relationships between the nations remain unescalated.


Thank you for providing this.

Bing Chat runs on GPT-4, however [1]. And Bing gets this wrong in all 3 of its modes (Creative, Balanced, and Precise) as of time of writing.

Given this experiment and similar others presented around here, it stands to reason that GPTs(**1) often identify(**2) the problem as a "wolf, goat, and cabbage" problem and then merely guess which node of the problem is the middle node (inner node of the "danger to" graph), yielding a 1/3 chance of getting it right by pure luck, resulting in diverse reports here.
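
Concretely, the "middle node" above is just the item that both eats and is eaten; here's a tiny check (mine, not model output):

  # The middle node of the "danger to" graph is the item that both eats and is eaten.
  def middle(eats):
      return {a for a, _ in eats} & {b for _, b in eats}

  print(middle({("wolf", "goat"), ("goat", "cabbage")}))  # {'goat'}  - the classic puzzle
  print(middle({("Aork", "Bork"), ("Bork", "Cork")}))     # {'Bork'}  - the renamed variant upthread
  print(middle({("rutabaga", "sheep"), ("wolf", "rutabaga"),
                ("sheep", "wolf")}))                      # all three - the cyclic, unsolvable variant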

(**2) That does not always yield an adequate response beyond the mere permutation of nodes, however. I've been getting the following variants for step 1. from Bing in Precise in response to marginally slightly different rewordings of the same:

- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because the general is still there.

- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because they have each other.

and so on.

(**1) I also tried Bard and Llama 2, with even more disastrous results full of nonsense of the (**2) kind. The earlier posted response of ChatGPT-3.5 is also full of these.

Re

> By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct [2]

Mistakes of type (**2) don't seem to fit the target of the cyclic refinement you are proposing, as far as I can understand it. These errors aren't getting the logic wrong; they completely butcher the basic relationships between the actors, like what it means to be alone, or the spatial relationships between the actors and their environment.

[1] https://blogs.bing.com/search/march_2023/Confirmed-the-new-B...

[2] https://news.ycombinator.com/item?id=38389222


By the way, as soon as these systems are able to check their reasoning (I don't think it'll be a huge leap), solving reasoning problems with probability >0.1%, for example, is enough, because you can just have it do rollouts in its head until it's correct.
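
Sketched out, the loop is just this (my own illustration; the model call and the validity check are stand-ins you'd have to supply):

  # "Rollouts until it's correct": sample answers, keep the first one a programmatic checker accepts.
  def rollout_until_valid(sample_answer, is_valid, max_tries=1000):
      # sample_answer: returns one candidate answer from the model (stand-in).
      # is_valid: checks the candidate against the rules of the problem (stand-in).
      for _ in range(max_tries):
          candidate = sample_answer()
          if is_valid(candidate):
              return candidate
      return None

With a per-rollout solve probability of p, the chance of at least one valid answer in n tries is 1 - (1 - p)^n, which is why even a small p goes a long way once the check itself is reliable.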


http://allofus.ai/journeys/735TTpCUncpAQC9yUqM3ve

Easily solved it every time on allofus.ai using basic reflection and a mixture of experts.


It says right up front that it recognized the problem formulation from its training set. You need to change the context and formulation enough that it’s no longer able to parrot back the “classic solution” and actually has to do its own logical inference.


I am very skeptical of LLMs in general (check my post history), but look:

https://chat.openai.com/c/7070efe7-3aa1-4ccc-a0fc-8753d34b05...

I doubt this formulation existed before -- I came up with it myself right now.


"Unable to load conversation".



Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?

It doesn't get it right at all lol. Maybe eventually it will randomly get it right.



Yeh I'd expect it to get it right sometimes: there are only so many permutations. It gets it wrong a lot tho.


Yeah, you are right, trying again, it does seem to fail.


That’s the exact same problem with three nouns changed.



Not really, because you said they were on the East and you wanted to move them to the East. Correct answer would have been "Don't do anything". If this is how they're getting AI to "follow instructions", we're completely fucked.


I would say "Yes really". ChatGPT correctly assumed that I just made a mistake typing and answered the question how I wanted it to be answered. If you suggest that this is a trick question it reveals it understands the double East thing.

https://chat.openai.com/share/d60f492b-cfd6-4c08-91b9-fbd767...


You have still formulated the problem in a way that is easily recognized from the training set, and the answer tells you that directly. Write down the logical axioms using variables, then create a brand new situation that says nothing about animals, rivers, or crossings.

My point here is to honestly explore the limits of current LLMs. We all know they are incredible, but they are not yet AGI and they fail in some consistent places where an actual general intelligence (people) succeed.


Not OP, but it can easily be seen that if you change the problem, it won't be able to solve it.

https://chat.openai.com/share/903d6bc6-7e7c-4245-a977-3bb1c3...

I made it easier, and it didn't solve it.


It would make your point much better if you just showed us an example of the failure you are talking about.



How about you give me an example of one that ChatGPT failed at?


I just wrote this version and it solved it successfully:

A chemist must transport three substances from his home laboratory to his office. The three substances react with one another in dangerous ways, but only when they are unsupervised by the chemist. The substances are labelled with code names, namely Wotan, Gitan and Catan. They can only be safely transported in a special containment vessel, and this vessel can only transport one substance at a time. The unsupervised dangerous reactions are as follows: if Wotan is left with Gitan, they explode. If Gitan is left with Catan, they cause a nuclear reaction. Wotan and Catan, however, can be safely left alone together. How can the chemist transport all three substances to his office safely?


A pickpocket, a criminal, and a banker are standing in the lobby and need to ride the elevator to reach the One World Trade Center observation deck. It’s too far to take the stairs and the helicopter pad is closed for repairs. The elevator only fits one person. If they remain together on the same floor, the pickpocket would kill the criminal, or the criminal would kill the banker. How can they all make it to the observation deck alive?

This is a logic puzzle that requires some thinking and trial and error. Here is one possible solution:

- The pickpocket goes up first and waits on the observation deck.

- The criminal goes up second and waits on the floor below the observation deck.

- The banker goes up third and waits on the floor below the criminal.

- The pickpocket goes down to the lobby and waits there.

- The criminal goes up to the observation deck and waits there.

- The banker goes up to the floor below the observation deck and waits there.

- The pickpocket goes up to the observation deck and joins the criminal and the banker.

This way, they all make it to the observation deck alive and never remain together on the same floor.


Not OP and don’t have GPT 4 (used 3.5), but I played around with a couple of different prompts and this is what I experienced.

For the first try, I came up with my own wording for this logic puzzle. I think it’s different enough from the original wording of the puzzle for the LLM not to base this on the original logic puzzle. I asked ChatGPT 3.5 if it recognized the puzzle, and it seems to have hallucinated (I’m guessing because it did not actually recognize it as the original puzzle—unless the 3 orb puzzle/3 wizards puzzle actually does exist, and from a quick Google search, it does not).

On my first try, it got pretty close to solving the puzzle, but after the 5th point, it seems to mix up the white and black orbs. When I pointed out the mistake, it gave me a new sequence which was even further from the correct answer.

First try:

https://chat.openai.com/share/f8505609-46ca-494b-95d9-56685e...

I realized that I didn’t specifically say that all 3 orbs needed to end up at the post office all together. So I tried again and the outcome was even worse. I wonder if ChatGPT 4 would answer this better?

Second try:

https://chat.openai.com/share/71292efa-c3c7-471e-954a-55966c...

Anyone want to try this prompt on ChatGPT 4 and see if it fares any better for them? This is my version of the river puzzle.

————————

> I have 3 orbs of different shades (black, white and grey) at my store and need to bring all 3 orbs to the post office in my pick-up truck but can only travel with one orb at a time. All 3 orbs need to end up at the post office together.

In this scenario, the following is true:

-If the black orb is left alone with the white orb, the black orb will absorb the white orb

-If the white orb is left alone with the grey orb, the white orb will absorb the grey orb

-the grey orb is unaffected by the black orb, and vice versa

-when all three orbs are together, they do not absorb any orbs

How do I get all three orbs to the post office while keeping the orbs unchanged?

————————

I also tried a prompt with the original puzzle. 3.5 could not figure it out without me hinting that the goat needs to go first.

https://chat.openai.com/share/e384b96a-25b1-40d7-adc5-5afb07...

And with even more clarification in the wording of the puzzle, it still didn’t give me a correct answer. This time I didn’t hint what the right answer was, and after many tries it still could not give me the right answer.

https://chat.openai.com/share/bb9ba6b0-f46b-4cc4-bd54-abbf2e...


What you did many months ago doesn’t mean anything about the state of the art. In case you haven’t noticed, this field is advancing rapidly, to say the least. GPT-4 has not stayed static.

Post your problem now and we can easily see if you’re right.


Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?


Actually GPT-4 gets this right 2/2 times I tried


Strange, I actually haven't had it get it correct. Maybe just luck.



It failed at the first step. This is like the worst timeline, where people just cannot think for themselves: they see that the AI produced an answer and assume it must be true.


you’re reading way too much into my post.

It’s lots of words all run together for the purpose of being a logic puzzle and obviously I made a parsing mistake in my brain.

I’m not trying to assume AI is right, I’m trying to put a factual stake in the ground, one way or the other so we have more data points rather than speculation.


I dunno. Don't you think this could happen with other replies from ChatGPT? I think this is the "it" about this tech - it really, really does trick us sometimes. It's really good at tricking us, and it seems like it is getting better!


First, what custom prompt did you use? "This conversation may reflect the link creator’s personalized data, which isn’t shared and can meaningfully change how the model responds."

Second, it isn't even right:

Third Trip to the Woods: The person takes the balloon to the woods. Now, the person, the vacuum cleaner, and the balloon are safely in the woods.


In the very first step it leaves the balloon alone with the vacuum which is illegal.

"First Trip to the Woods: The person takes the magical creature to the woods first."


True!


I'm confused, in your example it immediately got it wrong by leaving the vacuum cleaner and balloon together, and then does it again in step 6.


Hilarious. People are so confident in ChatGPT that as soon as they see a plausible-sounding response, they assume it must be correct. In a discussion about proving ChatGPT has intelligence... maybe we need to prove humans have intelligence first.


And thus the divide between decels and accels came to the fore.

The decels wanted to destroy OpenAI (the mission) to stop progress on Q*.

This is like a jigsaw: bit by bit, the picture is coming together.



