There is actually a paper by OpenAI themselves on summarizing long documents.
Essentially: break a longer text into smaller chunks and run a multi-stage sequential summarization, where each chunk uses a trailing window of the previous chunk as context, and apply this recursively.
https://arxiv.org/abs/2109.10862
I did a rough implementation myself; it works well even for articles around 20k tokens, but it's kind of slow (and more costly) because of all the additional overlapping runs required.
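For anyone curious, here is a minimal sketch of that flow, not the paper's actual implementation; the chunk sizes, overlap, model, prompt, and the pre-1.0 openai Python client are all my own assumptions:

```python
import openai

def summarize_chunk(text):
    # One summarization call; model, prompt, and limits are placeholders.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Summarize the following text:\n\n{text}\n\nSummary:",
        max_tokens=256,
        temperature=0.2,
    )
    return resp["choices"][0]["text"].strip()

def summarize_long_text(text, chunk_chars=6000, overlap_chars=500):
    # Overlapping chunks: each chunk starts with the tail of the previous one,
    # so every call sees a trailing window of prior context.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[max(0, start - overlap_chars):start + chunk_chars])
        start += chunk_chars

    combined = "\n".join(summarize_chunk(c) for c in chunks)

    # Recurse on the concatenated summaries until they fit in a single call.
    if len(combined) > chunk_chars:
        return summarize_long_text(combined, chunk_chars, overlap_chars)
    return summarize_chunk(combined)
```

The overlapping slices are where the extra (slower, costlier) calls come from: every chunk re-sends part of the previous one.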
A technique I have had success with is to do it in multiple passes.
Map-reduce it with overlapping sections, then propagate the results back downwards and repeat the process; now each map-reduce node knows the context it's operating in and can summarize more salient details.
Concretely, on the first pass, your leaf nodes are given a prompt like "The following is lines X-Y of a Z length article. Output a 1 paragraph summary."
You then summarize those summaries, etc. But then you can propagate that info back down for a second pass, so in the second pass, your leaf nodes are given a prompt like "The following is lines X-Y of a Z length article. The article is about <topic>. The section before line X is about <subtopic>. The section after Y is about <subtopic>. Output a 1 paragraph summary that covers details most relevant to this article in the surrounding context."
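Roughly, something like this sketch (single-level tree for brevity; the line-based chunking, the complete() helper, and the exact prompt wording are illustrative assumptions, using the pre-1.0 openai client):

```python
import openai

def complete(prompt):
    # Placeholder completion helper; model and limits are assumptions.
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0.2
    )
    return resp["choices"][0]["text"].strip()

def two_pass_summary(lines, lines_per_chunk=50):
    chunks = [lines[i:i + lines_per_chunk] for i in range(0, len(lines), lines_per_chunk)]

    # Pass 1: summarize each leaf chunk with no surrounding context.
    first_pass = []
    for i, chunk in enumerate(chunks):
        start, end = i * lines_per_chunk + 1, i * lines_per_chunk + len(chunk)
        first_pass.append(complete(
            f"The following is lines {start}-{end} of a {len(lines)}-line article.\n\n"
            + "\n".join(chunk)
            + "\n\nOutput a 1 paragraph summary."
        ))

    # Reduce: summarize the summaries to get a global view of the article.
    article_summary = complete(
        "Summarize the article described by these section summaries:\n\n"
        + "\n\n".join(first_pass)
    )

    # Pass 2: re-summarize each leaf, now telling it what surrounds it.
    second_pass = []
    for i, chunk in enumerate(chunks):
        start, end = i * lines_per_chunk + 1, i * lines_per_chunk + len(chunk)
        before = first_pass[i - 1] if i > 0 else "(start of article)"
        after = first_pass[i + 1] if i + 1 < len(chunks) else "(end of article)"
        second_pass.append(complete(
            f"The following is lines {start}-{end} of a {len(lines)}-line article.\n"
            f"The article is about: {article_summary}\n"
            f"The section before line {start} is about: {before}\n"
            f"The section after line {end} is about: {after}\n\n"
            + "\n".join(chunk)
            + "\n\nOutput a 1 paragraph summary that covers details most relevant "
              "to this article in the surrounding context."
        ))

    # Final reduce over the context-aware leaf summaries.
    return complete(
        "Combine these section summaries into one coherent summary:\n\n"
        + "\n\n".join(second_pass)
    )
```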
Could you expand on this? Is the idea to embed paragraphs (or some other arbitrary subsection) of text, and then semantic search for the most relevant paragraphs, and then only summarize them?
Yes, that's exactly right, but it presumes you know what to look for and what you want in your summary. Our use case is picking out action items or next steps from meeting notes, so this can work, but not for all use cases, e.g. "summarize this paper".
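For the meeting-notes case, the retrieve-then-summarize flow might look like this sketch (the query text, model names, and top-k are assumptions, again using the pre-1.0 openai client):

```python
import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def action_items(paragraphs, query="action items and next steps", top_k=5):
    para_vecs = embed(paragraphs)
    query_vec = embed([query])[0]

    # Cosine similarity between the query and every paragraph.
    sims = para_vecs @ query_vec / (
        np.linalg.norm(para_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = [paragraphs[i] for i in np.argsort(sims)[::-1][:top_k]]

    # Summarize only the retrieved paragraphs.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="Extract the action items and next steps from these meeting-note "
               "excerpts:\n\n" + "\n\n".join(top),
        max_tokens=256,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()
```

The fixed query string is exactly where the "you have to know what to look for" caveat bites: a generic "summarize this paper" request has no obvious query to retrieve against.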
Agreed; you can try sending it in chunks, but then you lose context. Perhaps the ChatGPT-based API will help if they expose conversational memory as a feature.
Maybe OP has figured out a method with the current API?
I saw in another thread that people were working around this by asking for a summary of sections and then combining the summaries and asking for a joint summary.
This is an issue. I haven't experimented to see if there are workarounds, so the service currently checks the length of the article text and, if it's very long, sends only a portion; otherwise we'd exceed the token limit. There's a note on the front page about it: "Limitations: The OpenAI API does not allow submission of large texts, so summarization may only be based on a portion of the whole article."
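The length check itself can be a few lines, e.g. this sketch using tiktoken for token counting (the 3000-token budget is an arbitrary placeholder, not Summate's actual limit):

```python
import tiktoken

def truncate_for_prompt(article_text, max_tokens=3000, model="text-davinci-003"):
    # Count tokens and, if the article is too long, keep only the leading portion.
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(article_text)
    if len(tokens) <= max_tokens:
        return article_text
    return enc.decode(tokens[:max_tokens])
```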
I tried; they don't. It seems that when they were ranking #1 on HN yesterday, someone posted a summary (the top comment) of what they're for that isn't quite correct.
I can't find it for some reason, can you provide a link? Did they summarize with GPTSimpleVectorIndex or GPTListIndex? GPTSimpleVectorIndex is in the getting-started examples and is cheaper, but it provides worse results.
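For context, the difference between the two is roughly this (assuming the llama_index API as it looked at the time, with indices constructed directly from documents; treat the exact calls as an approximation):

```python
from llama_index import GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # "data" is a placeholder path

# Vector index: retrieves only the top-k most similar chunks per query, so it's
# cheap, but a "summarize" query only ever sees a few chunks.
vector_index = GPTSimpleVectorIndex(documents)
print(vector_index.query("Summarize this document."))

# List index: walks every chunk and builds the summary bottom-up over all of
# them, which costs more tokens but gives better summaries.
list_index = GPTListIndex(documents)
print(list_index.query("Summarize this document.", response_mode="tree_summarize"))
```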
The problem is, why should I trust that it doesn't drop or insert a "not" in a crucial place?
ChatGPT has been caught in a blatant lie (if we can say such about a language model), presented with 100% confidence (if we can say such about a language model), multiple times.
This is great! Pocket and other read-it-later services should add a similar feature.
I encounter far more articles than I make time to read. A quick, bulleted summary would be a great way to help determine which articles I want to spend more time on.
Back in the day, the demo-scene was rife with Easter eggs, often with a personal touch.
It always puts a smile on my face when I come across clever ones like this, even though that's becoming rarer these days. I guess that's due to concerns about 'hidden code' in commercial codebases, which third-party license holders might be skeptical about.
That is cool, but it doesn't show me the key takeaways like the one in this post (Summate) does. I still have to RTFA after seeing the blurb on your service. It is generally not too far off, but in most cases it's too short to be useful.
Take the following as an example (a post from yesterday):
Loss of epigenetic information as a cause of mammalian aging
The article discusses how a loss of epigenetic information causes yeast cells to lose their identity and age, and how this process can be reversed by OSK-mediated rejuvenation.
Summate:
- Aging is caused by the loss of epigenetic information, leading to a variety of age-related diseases and processes.
- Epigenetic regulation of aging is affected by environmental inputs, including DNA damage and changes in the TGFbeta signaling pathway.
- Studies in mammals, such as mice, have revealed a wide range of epigenetic changes associated with aging, including increased transcriptional stress, changes in chromatin structure, and increased Wnt signaling.
I guess it boils down to the difference in prompts. Mine is just "Summarize the following article in one sentence: %s\n\n", whereas Summate’s asks for more detail. Also, I’m on text-davinci-002, which tends to produce less verbose output than text-davinci-003.
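To make the contrast concrete, here's a sketch with the older Completion API; the one-sentence prompt is the one quoted above, while the bulleted prompt is only a guess at what a Summate-style prompt might look like, not their actual prompt:

```python
import openai

def complete(model, prompt):
    resp = openai.Completion.create(model=model, prompt=prompt, max_tokens=300)
    return resp["choices"][0]["text"].strip()

article = "..."  # full article text

# One-sentence summary (the prompt quoted above), on text-davinci-002.
one_liner = complete(
    "text-davinci-002",
    "Summarize the following article in one sentence: %s\n\n" % article,
)

# Bulleted key takeaways (hypothetical prompt), on text-davinci-003.
key_takeaways = complete(
    "text-davinci-003",
    "List the 3 most important takeaways from the following article as bullet "
    "points:\n\n%s\n\n" % article,
)
```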
I want to see something like this applied to files saved locally. It would be much easier to find files based on what they're actually about rather than the filename, metadata, or specific quotations.