The needle in the haystack test gives a very limited view of a model's actual long-context capabilities. It's mostly used because early models were terrible at it and it's easy to test. In fact, most recent models now do pretty well on this one task, but in practice their ability to do anything complex drops off sharply after 32K tokens.
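
For reference, the vanilla NIAH setup is simple enough to sketch in a few lines. This is just my own illustration of the idea, not any particular benchmark's code; the needle phrasing, filler text, and the ~4-chars-per-token heuristic are all made up here:

    import random

    def build_niah_prompt(needle, filler, target_tokens):
        # Repeat neutral filler until the prompt is roughly target_tokens long,
        # assuming ~4 characters per token as a crude heuristic.
        chunks = []
        while sum(len(c) for c in chunks) < target_tokens * 4:
            chunks.append(filler)
        # Bury the needle at a random depth in the haystack.
        chunks.insert(random.randrange(len(chunks) + 1), needle)
        return ("\n".join(chunks)
                + "\n\nWhat is the magic number mentioned above? Answer with the number only.")

    prompt = build_niah_prompt(
        needle="The magic number is 48613.",
        filler="The grass is green. The sky is blue. The sun is yellow.",
        target_tokens=32_000,
    )
    # Scoring is just: does the model's reply contain "48613"?

Retrieval of one verbatim string is the entire task, which is why near-perfect scores on it say little about harder long-context work.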

RULER is a much better test:

https://github.com/hsiehjackson/RULER

> Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except for Gemini-1.5-pro) exhibit large degradation on tasks in RULER as sequence length increases.

> While all models claim context size of 32k tokens or greater (except for Llama3), only half of them can effectively handle sequence length of 32K by exceeding a qualitative threshold, Llama2-7b performance at 4K (85.6%).
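
To see why RULER's tasks are harder than vanilla NIAH: its multi-key variant, for example, scatters several key-value needles through the haystack and asks for just one, so the other needles act as distractors and defeat shallow pattern matching. A minimal sketch of the idea (my own illustration, not RULER's actual code; the key format and phrasing are hypothetical):

    import random
    import string

    def build_multikey_prompt(num_needles, filler, target_tokens):
        # Several key -> value needles are scattered through the haystack;
        # only one key is queried, so the rest act as distractors.
        keys = ["".join(random.choices(string.ascii_lowercase, k=8))
                for _ in range(num_needles)]
        needles = {k: random.randrange(10_000, 100_000) for k in keys}
        chunks = []
        while sum(len(c) for c in chunks) < target_tokens * 4:
            chunks.append(filler)
        for key, value in needles.items():
            chunks.insert(random.randrange(len(chunks) + 1),
                          f"The special magic number for {key} is {value}.")
        query = random.choice(keys)
        return ("\n".join(chunks)
                + f"\n\nWhat is the special magic number for {query}?")

RULER also includes tasks beyond retrieval (e.g. variable tracking and aggregation), which is where the degradation past 32K really shows up.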

Maybe, but

1. The article is not about NIAH; it's their own variation, so it could be more relevant.

2. The whole claim of the article is that GPT-4o does better, but the test you're pointing to hasn't benchmarked it.


The models benchmarked by RULER also do worse on Needle in a Needlestack. It will be interesting to see how GPT-4o does with RULER.
