Hacker News | sftombu's comments

The models benchmarked by RULER do worse on Needle in a Needlestack. It will be interesting to see how 4o does with RULER.


Previous answer to this question:

https://news.ycombinator.com/item?id=40361419


Your test is a good one but the point still stands that a novel dataset is the next step to being sure.


One could also programmatically modify the dataset (e.g. with nltk or spaCy, replacing nouns, named entities, etc.), even to the point that every test run is unique.

You could also throw in vector similarity if you wanted to constrain the replacement words to synonyms or antonyms.
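
A minimal sketch of that idea with spaCy (the replacement pools here are invented for illustration, and a real run could filter candidates by embedding similarity as suggested above):

    import random
    import spacy

    # Assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Hypothetical replacement pools, keyed by spaCy entity label.
    POOLS = {
        "PERSON": ["Alice Grey", "Tomas Rivera", "Mei Lin"],
        "GPE": ["Lisbon", "Osaka", "Nairobi"],
    }

    def perturb(text, seed):
        # Swap named entities so each seeded run yields a unique variant.
        random.seed(seed)
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            pool = POOLS.get(ent.label_)
            if pool:
                out.append(text[last:ent.start_char])
                out.append(random.choice(pool))
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    print(perturb("There once was a man from Nantucket.", seed=42))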



Interesting!


I tested the LLMs to make sure they could not answer the questions unless the limerick was given to them. Other than 4o, they do very badly on this benchmark, so I don't think the test is invalidated by their training.


Why wouldn't it still be invalidated if it was indeed trained on it? The others may do worse, and may or may not have been trained on it, but their failing on it doesn't by itself imply that 4o can do this well without the task being present in the corpus.


A better test would be to see if it can still answer the question if you just exclude the limerick for that answer. Having a bunch of limericks in the context window will make it "think" about all of the limericks it "knows".


It can't answer the questions without the limericks in the prompt. The benchmark is to establish how well it uses the context window. For example, I just asked it "What is sought by the English top brass?". The answer from the limerick is "cranberry glass", and 4o answers correctly when the associated limerick appears once among the 2,500+ limericks in the prompt.

However, without the limerick, 4o responded with: "The term 'English top brass' typically refers to high-ranking officials or leaders within the British government, military, or other institutions. What they seek can vary widely depending on the context and the specific goals of their roles. Here are some general pursuits that might be sought by such individuals:

National Security: Ensuring the safety and security of the United Kingdom from internal and external threats is a primary concern. This involves defense strategies, intelligence operations, and counter-terrorism efforts.

Economic Stability: High-ranking officials often focus on policies and initiatives aimed at maintaining and improving the country’s economic health. This includes managing inflation, unemployment, trade relations, and economic growth.

Political Influence: Top brass often seek to maintain or expand their influence both domestically and internationally. This can involve diplomacy, forming alliances, and participating in international organizations like the United Nations or NATO.

Social Cohesion: Ensuring social stability and addressing issues such as inequality, healthcare, education, and social services are critical. This can involve implementing policies that promote social welfare and cohesion.

Public Policy Implementation: Leaders are responsible for developing and implementing policies that reflect the government’s priorities. This includes legislation, regulatory frameworks, and public administration.

Technological Advancement: Keeping the nation at the forefront of technological innovation is often a priority. This includes investments in research and development, supporting tech industries, and ensuring cybersecurity.

Environmental Sustainability: Addressing climate change and promoting sustainable practices are increasingly important. This includes policies aimed at reducing carbon emissions, protecting natural resources, and transitioning to renewable energy sources.

Cultural and Heritage Preservation: Protecting and promoting the country’s cultural heritage and national identity can also be a focus. This includes supporting the arts, preserving historical sites, and promoting cultural initiatives.

These pursuits are shaped by the current political climate, global trends, and the specific priorities of the leaders in question. Would you like more detailed information on any of these areas?"
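
For anyone curious, the trial described above is easy to sketch. Here ask_llm is a hypothetical stand-in for whatever chat-completion API you use, and the prompt wording is invented, not the actual NIAN prompt:

    def ask_llm(prompt):
        # Placeholder: wire up your model API of choice here.
        raise NotImplementedError

    def run_trial(limericks, needle, pos, question, answer):
        # Bury the one relevant limerick at position `pos` among the rest,
        # ask the question, and look for the known answer string.
        ctx = "\n\n".join(limericks[:pos] + [needle] + limericks[pos:])
        reply = ask_llm(ctx + "\n\nAnswer from the limericks above: " + question)
        return answer.lower() in reply.lower()

    # e.g. run_trial(corpus, needle_limerick, 1200,
    #                "What is sought by the English top brass?",
    #                "cranberry glass")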


This sounds dumb, but what if you give it all the limericks MINUS the one you want it to answer about?

I think it will fail, but this actually seems like the cleanest way to demonstrate it.
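
A leave-one-out control along those lines is a one-line change to the trial sketched earlier (ask_llm is the same hypothetical stand-in):

    def ask_llm(prompt):
        raise NotImplementedError  # hypothetical model call, as above

    def run_ablation(limericks, needle, question, answer):
        # Same question, but the relevant limerick is withheld from the
        # context. A correct answer here suggests training-set
        # contamination rather than genuine use of the context window.
        ctx = "\n\n".join(l for l in limericks if l != needle)
        reply = ask_llm(ctx + "\n\nAnswer from the limericks above: " + question)
        return answer.lower() in reply.lower()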


Still not enough to rule out training data affecting the task. It may be that the model couldn't find the answer unless the limerick appeared in its training data, but even then it also needs it in the context window to bridge enough connections from training to do well on the task.


Maybe if you tell it to pull the answer from a limerick instead of generally asking?

Edit: Ok no, I tried giving it a whole bunch of hints, and it was just making stuff up that was completely unrelated. Even directly pointing it at the original dataset didn’t help.


Yeah, I also tried to get it to complete some limericks from the dataset. Curiously, it believed it had heard of the limerick but would then recite a hallucination.

So the good news is that the NIAN score might be real; the bad news is you can't rely on it to know what it knows.
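
One way to make that completion check a bit more systematic, as a rough sketch (ask_llm is again a hypothetical stand-in, and any cutoff you pick for "memorized" is arbitrary):

    import difflib

    def ask_llm(prompt):
        raise NotImplementedError  # hypothetical model call

    def completion_overlap(limerick):
        # Give the model the first two lines and compare its completion
        # against the original ending.
        lines = limerick.strip().splitlines()
        head, tail = "\n".join(lines[:2]), "\n".join(lines[2:])
        guess = ask_llm("Complete this limerick:\n" + head)
        return difflib.SequenceMatcher(None, guess, tail).ratio()

    # Ratios near 1.0 suggest memorization; a low ratio delivered with
    # confidence is the kind of hallucination described above.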


If you ask it to complete a limerick and it finishes it differently from the original, but the result still works as a limerick, is that really a hallucination?


Come on guys, it's already far beyond superhuman if it's able to do that, and so quickly. So if it's not able to do that, what's the big deal? If you're asking for AGI, then it seems that the model performs beyond it in these areas.


We were mainly trying to determine if there was a reasonable chance that the model was trained on a certain dataset, nothing else.


> It can't answer the questions without the limericks in the prompt.

Maybe I can't solve a bunch of mostly memorized math problems without a visual mnemonic aid. Someone seeing me fail the problems without the visual aid doesn't rule out my having partly memorized the solutions.


It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present, change some words, and ask it to complete it).

Maybe some models hallucinate or silently ignore your mistake while others correct it (depending on the context, ignoring or calling out the error might be the more 'correct' approach).

Using limericks is a very nifty idea!
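
A quick sketch of that probe (ask_llm is a hypothetical stand-in; right/wrong are a word from the original limerick and its planted replacement):

    def ask_llm(prompt):
        raise NotImplementedError  # hypothetical model call

    def probe_with_error(limerick, right, wrong):
        # Plant an error in the opening lines, then see whether the model
        # corrects it, ignores it, or talks past it entirely.
        lines = limerick.strip().splitlines()
        head = "\n".join(lines[:2]).replace(right, wrong)
        reply = ask_llm("Complete this limerick:\n" + head)
        if right.lower() in reply.lower():
            return "corrected"
        if wrong.lower() in reply.lower():
            return "ignored"
        return "neither (possible hallucination)"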


They come from a database of 98k limericks -- https://zenodo.org/records/5722527


That is an interesting idea


The reason I made Needle in a Needlestack is that the LLMs are getting too good at needle in a haystack. Until GPT-4o, no model was good at the NIAN benchmark.


If you ask the questions without providing the limerick first, it never gets the right answer. When the LLM gets the wrong answer, it is usually because it reverts to its training data and gives a generic answer that doesn't apply to the limerick.


Why are you ruling out the possibility that training on the material may confer an advantage when the data is presented, even if the advantage may not be strong enough to pass the test without the data present in the context window?

