The needle in the haystack test gives a very limited view of the model’s actual ...

WhitneyLand · 2024-05-15T01:57:26.000000Z

Maybe, but

1. The article is not about NIHS; it’s their own variation, so it could be more relevant.

2. The whole claim of the article is that GPT-4o does better, but the test you’re pointing to hasn’t benchmarked it.

sftombu · 2024-05-15T03:44:49.000000Z

The models benchmarked by RULER do worse in needle in a needlestack. It will be interesting to see how 4o does with RULER.