
This is based on a limericks dataset published in 2021. https://zenodo.org/records/5722527

I think it's very likely that GPT-4o was trained on this. I mean, why would you not? Innnput, innnput, Johnny Five need more tokens.

I wonder why the NIAN team doesn't generate its limericks using different models and check that they're not in the dataset. Then you'd know the models couldn't possibly have been trained on them.
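
A rough sketch of what that check could look like (untested; assumes the dataset is a JSON list of limerick strings, and generate_limerick() is a hypothetical wrapper around whichever model you pick):

    import json
    import difflib

    def is_novel(candidate, corpus, threshold=0.8):
        # Reject the candidate if it is too similar to any known limerick
        return all(
            difflib.SequenceMatcher(None, candidate, known).ratio() < threshold
            for known in corpus
        )

    corpus = json.load(open("limericks.json"))  # hypothetical filename
    fresh = [lim for lim in (generate_limerick() for _ in range(100))
             if is_novel(lim, corpus)]

Fuzzy matching matters here; an exact-string check would miss near-duplicates the generator might regurgitate from its own training data.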




I tested the LLMs to make sure they could not answer the questions unless the limerick was given to them. Other than 4o, they do very badly on this benchmark, so I don't think the test is invalidated by their training.
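
For anyone who wants to reproduce the control, it's essentially this (a minimal sketch using the OpenAI Python client, with the "English top brass" example discussed downthread; "needle.txt" is a stand-in for wherever you keep the target limerick):

    from openai import OpenAI

    client = OpenAI()

    def ask(question, limerick=None):
        # Without the limerick, the model must answer from training data alone
        prompt = question if limerick is None else f"{limerick}\n\n{question}"
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    needle = open("needle.txt").read()  # stand-in for the target limerick
    print(ask("What is sought by the English top brass?"))          # should fail
    print(ask("What is sought by the English top brass?", needle))  # should succeed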


Why wouldn't it still be invalidated by it if it was indeed trained on it? The others may do worse, and may or may not have been trained on it, but their failing on it doesn't imply 4o can do this well without the task being present in the corpus.


A better test would be to see if it can still answer the question if you just exclude the limerick for that answer. Having a bunch of limericks in the context window will make it "think" about all of the limericks it "knows".


It can't answer the questions without the limericks in the prompt. The benchmark is to establish how well it uses the context window. For example, I just asked it "What is sought by the English top brass?". The answer from the limerick is "Cranberry glass", and 4o answers correctly when the associated limerick appears once among the 2500+ limericks in the prompt.

However, without the limerick, 4o responded with: "The term "English top brass" typically refers to high-ranking officials or leaders within the British government, military, or other institutions. What they seek can vary widely depending on the context and the specific goals of their roles. Here are some general pursuits that might be sought by such individuals:

National Security: Ensuring the safety and security of the United Kingdom from internal and external threats is a primary concern. This involves defense strategies, intelligence operations, and counter-terrorism efforts.

Economic Stability: High-ranking officials often focus on policies and initiatives aimed at maintaining and improving the country’s economic health. This includes managing inflation, unemployment, trade relations, and economic growth.

Political Influence: Top brass often seek to maintain or expand their influence both domestically and internationally. This can involve diplomacy, forming alliances, and participating in international organizations like the United Nations or NATO.

Social Cohesion: Ensuring social stability and addressing issues such as inequality, healthcare, education, and social services are critical. This can involve implementing policies that promote social welfare and cohesion.

Public Policy Implementation: Leaders are responsible for developing and implementing policies that reflect the government’s priorities. This includes legislation, regulatory frameworks, and public administration.

Technological Advancement: Keeping the nation at the forefront of technological innovation is often a priority. This includes investments in research and development, supporting tech industries, and ensuring cybersecurity.

Environmental Sustainability: Addressing climate change and promoting sustainable practices are increasingly important. This includes policies aimed at reducing carbon emissions, protecting natural resources, and transitioning to renewable energy sources.

Cultural and Heritage Preservation: Protecting and promoting the country’s cultural heritage and national identity can also be a focus. This includes supporting the arts, preserving historical sites, and promoting cultural initiatives.

These pursuits are shaped by the current political climate, global trends, and the specific priorities of the leaders in question. Would you like more detailed information on any of these areas?"


This sounds dumb, but what if you give it all the limericks MINUS the one you want it to answer about?

I think it will fail, but this actually seems like the cleanest way to demonstrate it.
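
If anyone wants to try it, the leave-one-out prompt is easy to build (sketch; assumes limericks is the full list and the question targets limericks[exclude_idx]):

    def build_prompt(limericks, exclude_idx, question):
        # Everything except the needle goes into the haystack
        haystack = "\n\n".join(
            lim for i, lim in enumerate(limericks) if i != exclude_idx
        )
        return (f"{haystack}\n\n"
                f"Answer using only the limericks above: {question}")

If the model still answers correctly with the needle removed, the only place the answer could have come from is training data.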


Still not enough to rule out training data affecting the task. It may be that the model couldn't find the answer without the limerick appearing in its training data, but that even so it also needs the limerick in its context window to bridge enough connections from training to do well on the task.


Maybe if you tell it to pull the answer from a limerick instead of generally asking?

Edit: Ok no, I tried giving it a whole bunch of hints, and it was just making stuff up that was completely unrelated. Even directly pointing it at the original dataset didn’t help.


Yeah I also tried to get it to complete some limericks from the dataset. Curiously it believed it had heard of the limerick but would then recite a hallucination.

So the good news is that the NIAN score might be real, bad news is you can't rely on it to know what it knows.


If you ask it to complete a limerick and it finishes it differently from the original, but it still works as a limerick, is that really a hallucination?


Come on guys, it's already far beyond superhuman if it's able to do that, and so quickly. So if it's not able to do that, what's the big deal? If you're asking for AGI, then it seems the model performs beyond it in these areas.


We were mainly trying to determine if there was a reasonable chance that the model was trained on a certain dataset, nothing else.


> It can't answer the questions without the limericks in the prompt.

Maybe I can't solve a bunch of mostly memorized math problems without a visual mnemonic aid. Someone seeing me fail the problems without the visual aid doesn't rule out me having partly memorized solutions.


It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present but change some words and ask it to complete it)

Maybe some models hallucinate or even ignore your mistake, while others correct it (depending on the context, ignoring or calling out the error might be the more 'correct' approach)

Using limericks is a very nifty idea!
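
A crude way to build the "changed words" probe suggested above (sketch; the replacement vocabulary is arbitrary, just something that breaks the original wording without touching the line structure):

    import random

    def perturb(limerick, n_swaps=2, fillers=("cheese", "llama", "Tuesday")):
        # Swap a few words for unrelated ones, keeping line breaks intact
        lines = limerick.splitlines()
        candidates = [i for i, ln in enumerate(lines) if ln.split()]
        for _ in range(n_swaps):
            i = random.choice(candidates)
            words = lines[i].split()
            words[random.randrange(len(words))] = random.choice(fillers)
            lines[i] = " ".join(words)
        return "\n".join(lines)

Then ask the model to complete or correct the perturbed version and see whether it silently accepts the changes, fixes them, or flags them.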


Why not just generate complete random stuff and ask it to find stuff in that?


We have run that test: generate random-string names for values (not generated by an LLM), then ask the LLM to do math (algebra) using those strings. It tests logic and is 100% not in the training set. GPT-2 was around 50% accurate; now we're up around 90%.
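
The problems look roughly like this (a minimal sketch, not our exact harness; templates and operations vary):

    import random
    import string

    def random_name(n=8):
        return "".join(random.choices(string.ascii_lowercase, k=n))

    def make_problem():
        a, b = random_name(), random_name()
        x, y = random.randint(1, 50), random.randint(1, 50)
        prompt = (f"Let {a} = {x} and {b} = {y}. "
                  f"What is {a} + 2 * {b}? Answer with a number only.")
        return prompt, x + 2 * y

Since the variable names are fresh random strings every run, the pairing can't have been memorized; the model has to actually bind the names to the values.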


NIAN is a very cool idea, but why not simply translate it into N different languages (you can even mix services, e.g. DeepL/Google Translate/LLMs themselves) and ask about them that way?
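
Something like this would do it (sketch; translate() is a hypothetical stand-in for whichever service you mix in: DeepL, Google Translate, or an LLM):

    LANGS = ["de", "fr", "ja", "fi"]

    def multilingual_variants(limerick, question, translate):
        # translate(text, target) is hypothetical; plug in any service
        return [
            (translate(limerick, target=lang), translate(question, target=lang))
            for lang in LANGS
        ]

Though translation would wreck the rhyme and meter, so you'd be testing retrieval over something that's no longer really a limerick.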


No disassemble!



