Someone needs to come up with a "synthesis from haystack" test that tests not ju...

jddj · 2024-05-14T21:32:58.000000Z

An elaborate Agatha Christie style whodunit, with a series of plot-twists and alibis which can be chopped off the end of the piece to modify who is the most likely suspect

jddj · 2024-05-14T21:48:58.000000Z

Or a spot the difference.

Generate 1000 generic facts about Alice and the same 1000 facts about Eve. Randomise the order and change one minor detail then ask how they differ.

youssefabdelm · 2024-05-15T03:05:00.000000Z

That seems to go back in the direction of needle in the haystack again

pushedx · 2024-05-14T22:33:25.000000Z

    sort alice.txt | diff - <(sort eve.txt)

That's not a task for an LLM

IanCal · 2024-05-14T22:46:50.000000Z

Asking students to write an essay about Napoleon isn't something we do because we need essays about Napoleon - the point is it's a test of capabilities.

pushedx · 2024-05-16T14:31:10.000000Z

My point was more so that this task is so trivial, that's it's not testing the model's ability to distinguish contextual nuances, which would supposedly be the intention.

The idea presented elsewhere in this thread about using an unpublished novel and then asking questions about the plot is sort of the ideal test in this regard, and clearly on the other end of the spectrum in terms of a design that's testing actual "understanding".

semi-extrinsic · 2024-05-15T06:03:20.000000Z

I see you are being downvoted, but I agree with you.

A useful test would copy all Alice statements to Eve statements, then rewrite all of the Eve statements using synonyms, and then finally change one or two details for Eve.

visarga · 2024-05-14T21:47:26.000000Z

The needles form a graph and the prompt asks graph based tasks.

sftombu · 2024-05-14T22:00:45.000000Z

That is an interesting idea

Eisenstein · 2024-05-14T21:21:09.000000Z

My idea is to buy to a unpublished novel or screenplay with a detailed, internally consistent world built in to it and a cast of characters that have well crafted motivations and then ask it to continue writing from an arbitrary post-mid-point by creating a new plot line that combines two characters that haven't yet met in the story. If it understands the context it should be able to write a new part of the story and will be able to use a reader's intuitive sense of the character's motivations to move through their arc.

This whole thing would have to be kept under lock-and-key in order to be useful, so it would only serve as a kind of personal benchmark. Or it could possibly be a prestige award that is valued for its conclusions and not for its ability to use the methodology to create improvements in the field.

semi-extrinsic · 2024-05-15T06:04:20.000000Z

Just use memes. People generate new high-quality niche memes so fast it's impossible for the LLMs to keep up.

visarga · 2024-05-14T21:48:36.000000Z

You can only use it for a short while, they get a copy as well.

Eisenstein · 2024-05-14T22:19:03.000000Z

I have been thinking about this for use in evaluating locally run models, so I didn't make that connection in this case. I guess it would have limited utility.

sftombu · 2024-05-14T20:53:08.000000Z

I was thinking about something similar -- to make part of the question be sufficient information that the LLM can find the limerick. Then the 2nd part would ask something that would require a deeper understanding of the limerick (or other text).

borgdefense · 2024-05-15T12:24:23.000000Z

There is no understanding, it can't do this.

GPT4o still can't do the intersection of two different ideas that are not in the training set. It can't even produce random variations on the intersection of two different ideas.

Further though, we shouldn't expect the model to do this. It is not fair to the model and its actual usefulness and how amazing what the models can do with zero understanding. To believe the model understands is to fool yourself.

nebula8804 · 2024-05-15T02:19:14.000000Z

I wonder if there is some way to have an AI help humans improve their "reading comprehension" aka reasoning across a large body of text. As far as I can tell the only way to do this is to cut out mindless scrolling and force yourself to read a lot of books in the hopes that this skill might be improved.

I am many years out of my grade school years where I was required to read a multitude of novels every year and I guess years of mindless reddit scrolling + focusing on nothing but mathematics and the sciences in college have taken their toll: I read long articles or books but completely miss the deeper meaning.

As an example: my nerd like obsession with random topics of the decade before I was born (until I get bored) caused me to read numerous articles and all of Wikipedia + sources on the RBMK reactors and Chernobyl nuclear accident as well as the stories of the people involved.

But it wasn't until I sat down and watched that famous HBO mini seres that I finally connected the dots of how the lies and secretive nature of the soviet system led to the design flaws in the reactor, and the subsequent suicide of Valery Legasov helped finally expose them to the world where they could no longer be hidden.

Its like I knew of all these events and people separately but could not connect them together to form a deep realization and when I saw it acted out on screen it all finally hit me like a ton of bricks. How had I not seen it?

Hoping one day AI can just scan my existing brain structure and recommend activities to change the neuronal makeup to what I want it to be. Or even better since im a lazy developer, it should just do it for me.

adamgordonbell · 2024-05-14T21:04:27.000000Z

I've been thinking about that as well.

It's hard, but if you have a piece of fiction or non-fiction it hasn't seen before, then a deep reading comprehension question can be a good indicator. But you need to be able to separate a true answer from BS.

"What does this work says about our culture? Support your answer with direct quotes."

I found both gpt-4 and haiku to do alright at this, but sometimes give answers that imply fixating on certain sections of a 20,000 k context. You could compare it against chunking the text, getting the answer for each chunk and combining them.

I suspect if you do that then the chunking would win for things that are found in many chunks, like the work is heavy handed on a theme, but the large context would be better for a sublter message, except sometimes it would miss it altogether and think a Fight Club screenplay was a dark comedy.

Interpretation is hard I guess.

segmondy · 2024-05-14T21:52:07.000000Z

Why can't you be that someone?

gremlinsinc · 2024-05-14T21:57:44.000000Z

lol, made me think of the euphemism: be the change you want to see.