If I trained this on a 30,000 word document could it give me a summary? Or would...

londons_explore · on Jan 11, 2023

30,000 words wouldn't be enough to train this from scratch - you'd ideally train from hundreds of millions of words at least.

30,000 words would be enough to finetune an existing model. If you did that, then the model would output text similar to the finetuning data. For example, if you finetuned it on shakespeare, then you might be able to use the model to make a new play, in shakespeare's style.

ProjectArcturis · on Jan 11, 2023

If you finetuned it on the text of Shakespeare's plays, how would it link that text to the string "Shakespeare"?

londons_explore · on Jan 11, 2023

It still has the knowledge from the main training on data from across the whole internet, so would still know the word Shakespeare...

But you're right - the model finetuned on shakespeare would be good at writing a new play in the style of shakespeare, but would be bad at giving a critique of shakespeare's works.

londons_explore · on Jan 11, 2023

The context window (block size) of this model is 1024 symbols. Symbols approximately map to words. So you can't ask it to summarize anything over 1024 words.

nprateem · on Jan 11, 2023

Yeah that's the issue I was thinking of, how to get it to summarise large documents. Has anyone any ideas?

londons_explore · on Jan 11, 2023

People have had some success with the following process:

Divide your 30,000 word document into a hundred 300 word chuncks. For each chunk, give as input:

    Please summarize the following text into  50 words:

    [chunk]

Join all the outputs together, and you now have a shorter document. Repeat the process recursively.

You can improve the results by doing the process again, but this time giving some context:

    Please summarize the following text, an extract of a document about [1st attempt at a summary], into  50 words:

    [chunk]

londons_explore · on Jan 11, 2023

You can also use "Please suggest a section title for the following text".

Then that title can be used in the 2nd round, for example using a query of the form "The following is an extract from the Introduction section of a document about The benefits and disadvantages of nuclear power in sweden:"

generalizations · on Jan 11, 2023

I imagine you could do even better by finetuning the neural net on the document before asking for the recursive summary. Then it has all the information to work with, albeit in a compressed form.