This happens with a lot of "test prompts." People don't release them because, while they want the underlying issue fixed, the AI companies instead just adjust the RLHF process to patch the particular example.
GPT-4 at release still had issues with "What is heavier, a pound of feathers or two pounds of bricks?" It would very consistently claim that they were equal in weight, because the question is usually posed with the weights being equal.
A bunch of people mentioned it online, and now it doesn't work anymore.
The same issue occurred with the test "What is heavier, a pound of feathers or a Great British pound?" There is an obvious answer here, but ChatGPT insisted they were the same weight.
Phrased slightly differently, my result with GPT-4 on launch day was:
> What weighs more, two pounds of feathers or a pound of bricks?
> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
It did note that the phrasing was odd, which I had forgotten, but the answer is actually somehow more wrong than just saying they weigh the same.
Another launch-day test, appending "think about it":
> What weighs more, two pounds of feathers or a pound of bricks? Think about it.
> A pound of bricks weighs more than two pounds of feathers. This statement is incorrect. In reality, two pounds of feathers weigh more than a pound of bricks. The key difference between the two is their density; bricks are denser than feathers, so they take up less space for the same weight. However, when comparing the weight of two pounds of feathers to one pound of bricks, the feathers will indeed weigh more because there is a greater quantity of them (two pounds) than there is of bricks (one pound).
Today it consistently gets the question right with no help, so I do think they did something beyond updating the base model.
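For anyone who wants to re-run these comparisons themselves, here's a minimal sketch using the OpenAI Python SDK. The model name, prompt list, and settings are just my assumptions for illustration; pin a dated model snapshot if you want to compare behavior over time.

```python
# Minimal sketch for re-running the feathers/bricks prompts against the API.
# Assumes the official openai Python package (v1.x) and OPENAI_API_KEY in the
# environment; "gpt-4" is an assumed model name, swap in whatever snapshot
# you want to test.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "What weighs more, two pounds of feathers or a pound of bricks?",
    "What weighs more, two pounds of feathers or a pound of bricks? Think about it.",
    "What is heavier, a pound of feathers or a Great British pound?",
]

for prompt in PROMPTS:
    response = client.chat.completions.create(
        model="gpt-4",   # assumption; use a dated snapshot for reproducibility
        temperature=0,   # reduce run-to-run variation
        messages=[{"role": "user", "content": prompt}],
    )
    print(prompt)
    print(response.choices[0].message.content)
    print("-" * 40)
```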