Automatic summarization is the topic of my MS thesis. In my research, there are two types of summary: abstraction and extraction. I agree with the post: abstraction, or paraphrasing the text, is still the holy grail of automatic summarization. Extraction, on the other hand, just lifts the most important sentences from the text; it is the method used by Summly, mentioned in the post, and also by my algorithm. The problem with extraction is that the extracted sentences sometimes don't seem connected to each other.
Several features are commonly considered in automatic summarization. Title or headline: the title of the document. Sentence position: where the sentence is located in the text (introduction, body, or conclusion). Sentence length: how many words are in the sentence. Lastly, keyword frequency: how many times a word appears in the text. There are other features too, but I think those four are the most important.
How those feature scores are computed, which stop words are used, and the values of some constants will all affect the output.
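To make the four features concrete, here is a minimal sketch of an extractive scorer in Python. It is not Summly's method or the algorithm from my thesis; the stop-word list, the `IDEAL_LENGTH` constant, the equal default weights, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Toy stop-word list and constant: both are assumptions, and changing them changes the output.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "it"}
IDEAL_LENGTH = 20  # assumed "ideal" number of content words per sentence


def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(title, text, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return (sentence, score) pairs using title, position, length and keyword features."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))
    max_freq = max(freq.values(), default=1)
    title_words = set(tokenize(title))
    w_title, w_pos, w_len, w_kw = weights
    last = len(sentences) - 1
    scored = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            scored.append((sentence, 0.0))
            continue
        # Title: fraction of title words that also appear in the sentence.
        title_score = len(title_words & set(words)) / len(title_words) if title_words else 0.0
        # Position: favor the opening and closing sentences over the body.
        position_score = 1.0 if i in (0, last) else 0.5
        # Length: closer to IDEAL_LENGTH is better, clipped to the range [0, 1].
        length_score = 1.0 - min(abs(IDEAL_LENGTH - len(words)) / IDEAL_LENGTH, 1.0)
        # Keyword frequency: average word frequency, normalized by the most frequent word.
        keyword_score = sum(freq[w] for w in words) / (len(words) * max_freq)
        total = (w_title * title_score + w_pos * position_score
                 + w_len * length_score + w_kw * keyword_score)
        scored.append((sentence, total))
    return scored


def summarize(title, text, k=2):
    """Pick the k best-scoring sentences and return them in document order."""
    scored = score_sentences(title, text)
    top = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)[:k]
    return [scored[i][0] for i in sorted(top)]
```

Calling `summarize("Cats sleep", "Cats sleep a lot. Dogs bark loudly at night. The weather is nice.", k=1)` picks the first sentence, mostly because of the title overlap. Tweaking the weights or the stop-word list will change which sentences win.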
Lastly, to evaluate your summary, you may want to use the ROUGE evaluation toolkit (http://www.berouge.com). It needs a reference summary created by a human for comparison, and it determines the quality of your summary based on precision, recall, and F-score. ROUGE has different methods for evaluating a summary. There is ROUGE-L, which considers the longest common subsequence. There is also ROUGE-W, which adds weights. And there are many more that I don't remember.
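As a rough illustration of the ROUGE-L idea, here is a small Python sketch that computes precision, recall, and F-score from the longest common subsequence of words. It is not the official toolkit's implementation: the whitespace tokenization, the lowercasing, and the names `lcs_length` and `rouge_l` are my own assumptions, and it compares one candidate against a single reference.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Return (precision, recall, f_score) based on the word-level LCS."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    f_score = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return precision, recall, f_score
```

For example, with the candidate "the cat sat on the mat" and the reference "the cat is on the mat", the longest common subsequence is "the cat on the mat" (5 words), so precision, recall, and F-score all come out to 5/6.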
That's it. I'm really a fan of automatic summarization, and I'm also hoping to create a good algorithm for abstraction.
Edit: sorry for the lack of links and references; it's hard to add those on mobile.
I totally agree! There is a huge gap between my naive algorithm and a real, working one. The idea of this post was just to introduce the “automatic summarization world” to those who aren't familiar with it at all.
You may be interested in checking out an automatic summarizer I developed this summer. It uses numerous techniques to weight sentences for extraction - https://github.com/shanedownfall/CNGLSummarizer
I did not know that the method used by Summly was publicly known (though I knew they licensed it from SRI, so chances for that were high). Is the algorithm used by Wavii also known?
Hi, I don't know the exact method used by Summly, but based on their output, they are just doing extraction. How they extract those sentences, I don't know.
I hadn't really heard of Wavii before their acquisition, so I don't know what they actually do. Sorry.