Show HN: Pass a URL, get summarized content

peterldowns · on Feb 28, 2016

Not summary, just sentence ranking and extraction. Still cool but not anything new. Sweet side project though! For anyone wondering how this was done, I have a similar project up at http://bookshrink.com (source code at https://github.com/peterldowns/bookshrink), although I don't fetch article text.
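
For anyone who wants to see what "sentence ranking and extraction" can look like, here is a minimal sketch. It is not the submitter's code or bookshrink's. The stopword list, the naive sentence splitter, and the function names are all assumptions made for illustration. It scores each sentence by the average corpus frequency of its words and returns the top few in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use larger ones).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be"}


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase words with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def rank_sentences(text, top_n=3):
    sentences = split_sentences(text)
    # Word frequencies over the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Pick the indexes of the best-scoring sentences...
    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:top_n]
    # ...and return those sentences in document order.
    return [sentences[i] for i in sorted(best)]
```

Nothing here generates new text, which is why the output of tools like this reads as extracted sentences rather than a written summary.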

mck- · on Feb 28, 2016

I did a similar hack a while back, which summarizes a piece of text in a single sentence. I'm that lazy.

https://github.com/mck-/oneliner

gravypod · on Feb 28, 2016

I'm writing a book right now. Would you mind if I used your program to make the title and the chapter titles?

franciscop · on Feb 28, 2016

Really nice. I made a similar project for "rich text" as my Final Year Project, where I took into account HTML tags such as h1, bold, etc. It also mixed a few analyses, including TF-IDF and others. I'll have a look at the source when I get some time (:
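
For readers unfamiliar with TF-IDF, a minimal sketch follows. It is the textbook formula only, not the Final Year Project described above; the function name is hypothetical and the HTML tag weighting is left out.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency times log inverse document frequency.
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

A term that appears in every document gets weight zero, so only terms that distinguish one document (or sentence) from the rest contribute to its score.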

despinozist · on Feb 28, 2016

You should render the output as actual JSONAPI (http://jsonapi.org):

    {
      "links": {
        "self": "...",
        "prev": "...",
        "next": "..."
      },
      "data": [],
      "included": []
    }

So that we can discover the API beyond the form. Use http://www.iana.org/assignments/link-relations/link-relation... as the starting point for link relations ideas.
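
A rough sketch of how the summary sentences could be wrapped in a document of that shape. The resource type name "sentences", the "rank" and "text" attributes, and the function name are hypothetical; only the top-level "links" and "data" members and the required "type" and "id" fields on each resource object come from the JSON API format itself.

```python
import json


def to_jsonapi(sentences, self_url):
    # Each resource object needs a "type" and a string "id".
    # The "sentences" type and the attribute names are made up for this sketch.
    return {
        "links": {"self": self_url},
        "data": [
            {
                "type": "sentences",
                "id": str(rank),
                "attributes": {"rank": rank, "text": text},
            }
            for rank, text in enumerate(sentences, start=1)
        ],
    }


if __name__ == "__main__":
    doc = to_jsonapi(["first sentence", "second sentence"],
                     "http://example.com/summary")
    # JSON API responses are served as application/vnd.api+json.
    print(json.dumps(doc, indent=2))
```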

wyldfire · on Feb 28, 2016

I was pretty quick to knee jerk ask myself "Why is this any better than any other schema?" (I was not convinced that "API discovery" was, by itself, a good enough case).

Then I read the very practical first sentence of the jsonapi page: "If you've ever argued with your team about the way your JSON responses should be formatted, JSON API can be your anti-bikeshedding tool." That alone is probably huge. May not mean much for individual projects, but it's good enough for me to bookmark for the future.

fishnchips · on Feb 28, 2016

Can't help but think of https://xkcd.com/927/ ;)

Not sure if standards like this can prevent bikeshedding. You can always bikeshed about the need to stick to any particular standard. One counterexample to what I'm saying may be one standard Go language formatting with gofmt but that was introduced very early on and became a part of the culture. Too late for that with JSON APIs.

despinozist · on Feb 28, 2016

Ya I was like why all the downvotes y'all?

Then I realized I had no actual link ("http://") in my initial post.

I tend to assume all have read exactly what I have read. JSONAPI has huge implications for distributed systems architecture.

developer2 · on Feb 29, 2016

I'd like to point out that this formatting convention is not a widespread standard. "You should" is biased towards making life simpler for a small number of people who have used the format before, while complicating and bloating your API responses - for both the developer(s) and consumers of your API.

Consumers are now expected to add a full library to their project to parse/understand the JSON responses. Also, implementation overload for many languages: http://jsonapi.org/implementations/

_hyn3 · on Feb 28, 2016

> You should [...]

At least there's no namespacing or DTD's!

JSON plus a loose adherence to REST won out over SOAP/XML-RPC/WSDL/etc because of its simplicity. Services discovery seems to be a solution in search of a problem.

Even the simple idea of embedded links, with apologies to Dr. Fielding, seems to be a less critical component of REST than was initially believed, since very few modern REST APIs actually provide them.

dragonwriter · on Feb 29, 2016

Very few modern "REST" APIs have even a remote resemblance to the REST architecture described by Fielding; it's mostly just RPC over HTTP with data in a specialized subset of JSON or XML that is specified per endpoint out-of-band rather than indicated by media type.

I really wish people would stop calling it REST, since that robs the term of its meaning.

_hyn3 · on Feb 29, 2016

To some point, I agree! I also used to be a REST purist, but I've become more pragmatic in recent years.

Some crucial points that are often preserved even in today's APIs, and that distinguish them from RPC:

- Any REST API will have the concept of resources that are acted upon by HTTP verbs (methods), instead of RPC-style calling a method named in the URI.

- statelessness (no session state assumed on the server)

- use of HTTP status codes

- resource path generally indicates hierarchy or at least a specific 'one path to this representation'

- broad use of existing HTTP headers for metadata instead of a separate "envelope" in the body as in SOAP

- use of common HTTP headers such as Authorization rather than cookies or other carriers of state

Many of the other compromises are made not out of ignorance, but in order to be broadly useful in the most common cases.

I don't disagree with your point about not calling it REST, because this trend does diverge from Dr. Fielding's dissertation in several important areas (for example, content negotiation, as you point out). That's why I prefer the term "loose REST".

ComputerGuru · on Feb 28, 2016

I'm not sure I am seeing the same wow-factor results from the service that everyone else is raving about.

I submitted this link [0], which was on the HN homepage a couple of days ago, and the results that I got back were either the least important bits or in some ways implied the opposite of the article, so either the writing was really bad or the algorithm needs some work.

Submitting a "simpler" less-ranty article [1] was even less successful, leading to paraphrases of less-important sentences as the results.

Then I submitted the BBC article from this morning about Philae [3] and received much, much better results. I think it works best on articles that have single sentences that clearly sum up the gist of the post as a single, hard fact and doesn't work with anything that works towards logical conclusions or tries to build an argument. Which makes sense, because this isn't an AI and can't actually deduce anything.

0: https://neosmart.net/blog/2016/on-the-growing-intentional-us...

1: https://neosmart.net/blog/2016/when-is-the-2016-retina-macbo...

3: http://www.bbc.com/news/science-environment-35559503

detaro · on Feb 28, 2016

> I'm not sure I am seeing the same wow-factor results from the service that everyone else is raving about.

Um... where is someone raving about the result? Most of the comments seem neutral to negative to me?

lpage · on Feb 28, 2016

> Most of the comments seem neutral to negative to me?

Shameless plug, thanks to HackerMoods [1] I can quantify that statement: 0.85 neutral, 0.08 positive, 0.07 negative. The average Show HN is 0.17 positive and 0.04 negative, so your assessment is in line with the numbers.

[1]: https://news.ycombinator.com/item?id=11188633

Shamiq · on Feb 29, 2016

I'm color blind, and the charts you use are unintelligible to me.

lpage · on Feb 29, 2016

Sorry about that. Design isn't my wheelhouse but I updated it to what google tells me is a colorblind friendly palette. I would definitely appreciate it if you could take a look and let me know how it is.

rokhayakebe · on Feb 28, 2016

Re:0

Search disruption is overdue.

xlayn · on Feb 28, 2016

I would venture to say it works by assigning an information weight to words, based on the number of non-repeating words and the way they are related, and then filtering top down.

I did try it with a link I particularly like

http://multivax.com/last_question.html

with the following response.

{"1":"nor could anyone for the day had long since passed zee prime knew when any man had any part of the making of a universal ac","2":"zee prime's mentality was guided into the dim sea of galaxies and one in particular enlarged into stars","3":"he gave no further thought to dee sub wun whose body might be waiting on a galaxy a trillion light-years away or on the star next to zee prime's own","4":"the universal ac said man's original star has gone nova","5":"the universal ac interrupted zee prime's wandering thoughts not with words but with guidance"}}

ShinyCyril · on Feb 28, 2016

Hmm didn't have much luck with: https://mikeanthonywild.com/stopping-blocking-threads-in-pyt...

    { 
        "1":"betterthreads provides an enhanced replacement for the an enhanced replacement for the python this isn't actually a true thread instead it uses gevent to",
        "2":"the widely-accepted solution is to set a timeout on our blocking functions so we can periodically check a which we set from the main thread to indicate we want the child thread to stop",
        "3":"if the thread is still alive the when the *timeout* argument is not present or ``none`` the operation will block until the thread terminates",
        "4":"`runtimeerror` if an attempt is made to join the current thread as that would cause a deadlock",
        "5":"`join` a thread before it has been started and attempts to do so raises the same exception"
    }

That said, I think summarising that particular article would require a certain level of domain expertise, something which a general bot couldn't provide.

phdsummary · on Feb 28, 2016

The summary of that guy's PhD summary http://jxyzabc.blogspot.com/2016/02/my-phd-abridged.html

{"1":"for various reasons i also spend a lot of weekends in new york and make more friends with people working on data and journalism","2":"my friend jean-baptiste who reads it asks why my blog is so good but my paper drafts are so bad","3":"at the beginning of this year i start telling people that i wish i had more female friends since i realize that there are many fewer women around me than before","4":"to keep myself from thinking about my uncertain future all the time i start a cybersecurity accelerator cybersecurity factory with my friend frank wang with the goal of helping research-minded people start companies","5":"i am too lazy to make many friends so i spend my free time reading cooking doing yoga and running"}

Animats · on Feb 28, 2016

Summarization used to be a feature in Microsoft Word through Word 2007, and it did a decent job. That feature was taken out in Word 2010.[1]

[1] https://support.office.com/en-US/article/Automatically-summa...

skewart · on Feb 28, 2016

I didn't know that. Any idea why it was taken out?

lallysingh · on Feb 29, 2016

I suspect that Office has to garbage collect features once in a while. Otherwise the maintenance cost would be (more?) horrible.

fiatjaf · on Feb 28, 2016

http://52.90.112.133/recommend/app/getSummary?query=http%3A%...

_RPM · on Feb 28, 2016

An array would be a better choice of structure for the sentences than hard-coding the indexes ("1", ...).
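
Until the response changes, a client can do the conversion itself. A minimal sketch, assuming the response keeps the "summarized_text" wrapper with numeric string keys seen elsewhere in this thread; the function name is made up.

```python
import json


def sentences_as_list(payload):
    # Keys are strings like "1", "2", "10": sort them numerically, not
    # lexically, so that "10" does not land before "2".
    numbered = payload["summarized_text"]
    return [numbered[key] for key in sorted(numbered, key=int)]


if __name__ == "__main__":
    raw = '{"summarized_text": {"2": "second", "1": "first"}}'
    print(sentences_as_list(json.loads(raw)))
```

With an array in the response, the ordering would be carried by the structure and this step would not be needed.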

brudgers · on Feb 28, 2016

I see your point and don't disagree.

Thinking about why someone might make the choice to use text, text is more in keeping with *nix philosophy. Not that I'm saying it's better, but grep is pretty lightweight and a lot of people use the command line and/or languages other than JavaScript. YMMV.

_RPM · on Feb 28, 2016

I'm not sure I understand. JSON is text. JSON provides an array as part of the grammar.

jlarocco · on Feb 28, 2016

I'm not sure I follow you. The JSON response itself is already text.

Besides that, nobody processes JSON on the command line without a tool like jq, and I'd suspect all of those JSON tools support arrays.

microcolonel · on Feb 28, 2016

Good quality summaries, but it seems it caches pages based on their base URL, and throws away the query parameters.

Some blogs use query parameters to distinguish between articles, so it makes it kinda useless if you want to do more than one article.
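
I can't see the backend, so this is only a guess at the fix: a sketch of a cache key that keeps the query string instead of dropping it. The function name is hypothetical, and sorting the parameters is an extra assumption so that ?a=1&b=2 and ?b=2&a=1 share one cache entry.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url):
    parts = urlsplit(url)
    # Keep the query parameters, in a stable order.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop only the fragment, which never reaches the server anyway.
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or "/", query, ""))


if __name__ == "__main__":
    print(cache_key("http://blog.example/?p=1"))
    print(cache_key("http://blog.example/?p=2"))
```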

andreygrehov · on Feb 28, 2016

This is an off-topic, but I'd like to mention it.

I absolutely love the fact that the OP did not get a domain name for this demo. This is an interesting "technique" I haven't seen for quite a while. People tend to own and renew tens of domain names, which are just sitting there for a "just in case" moment. This is a great example of how things can really be simplified: spin up an instance, make a demo, shut the instance down.

developer2 · on Feb 29, 2016

Good luck with the link still being usable in a month or a year. There's a reason we use domain names for sharing. Not only because they are friendly to read and remember, but also because IPs are typically far more transient than domain names.

The IPs behind my projects have changed dozens of times over the years (new server, changing hosting provider, adding a load balancer, etc.). A simple DNS change allows the same domain name to follow the project.

I'm actually surprised HN permits links to IP addresses. While links posted here are not guaranteed to point to the same content in the future anyway, it is more likely that an IP address will change before the project is taken down entirely. Search engine posterity and all.

giancarlostoro · on Feb 28, 2016

There's always free sub-domains:

https://freedns.afraid.org/

8bitben · on Feb 28, 2016

Could also be a good use for subdomains - people often forget how many variations of whatever.domain.com can be useful

shloub · on Feb 28, 2016

« URL's must start with "http://" » « https://medium.com/@darrenrovell/all-journalists-need-to-be-... »
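
A sketch of a scheme check that would accept both forms. The form's actual validation is not visible from outside, so this is an assumption about what a fix could look like, and the function name is invented.

```python
from urllib.parse import urlsplit


def is_fetchable_url(text):
    # Accept http and https, and require a host part.
    parts = urlsplit(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    print(is_fetchable_url("https://medium.com/"))
    print(is_fetchable_url("just a block of text"))
```

Anything that fails the check could then be treated as a block of text rather than rejected.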

hluska · on Feb 28, 2016

This is an exact duplicate (even posted by the same person) of a link submitted 11 hours ago.

https://news.ycombinator.com/item?id=11190008

Edit - it doesn't work for me either, or maybe it is just very slow?

meeper16 · on Feb 28, 2016

Yes, it is. I sent it out too late last night and thought more people might want to see this in the morning.

gus_massa · on Feb 28, 2016

I think this is ok here. From the FAQ: https://news.ycombinator.com/newsfaq.html

> Are reposts ok?

> If a story has had significant attention in the last year or so, we kill reposts as duplicates. If not, a small number of reposts is ok.

> Please don't delete and repost the same story, though. Accounts that do that eventually lose submission privileges.

pooper · on Feb 28, 2016

I tried it and it doesn't work. What am I doing wrong?

http://52.90.112.133/recommend/app/getSummary?query=This+is+...

mohaps · on Feb 29, 2016

Any details about the backend/implementation?

shameless plugs for two similar projects(open sourced both) I did a while back 1) Algorithmic Summarizer: https://github.com/mohaps/tldrzr 2) Readability Clone / Article Body Extractor with summary, significant image and text : https://github.com/mohaps/xtractor

Both are deployed on heroku and the urls are in the github readme files.

h1fra · on Feb 28, 2016

Hmm, not quite sure what I was expecting, but the results were not great :(

But I could see the use of this kind of service.

Also, it does not work with accented characters.

an_ko · on Feb 28, 2016

I'd like more details. How does it work?

LinkPlug · on Feb 28, 2016

What is it built with? (Stack, Foss etc)

Mark_B · on Feb 28, 2016

Fun with Lorem Ipsum:

http://52.90.112.133/recommend/app/getSummary?query=http%3A%...

meeper16 · on Feb 28, 2016

It seems to be multi-lingual

jack9 · on Feb 29, 2016

http://www.slashdot.org and https://news.ycombinator.com

{"summarized_text": {}}

franze · on Feb 29, 2016

if you submit http://54.86.121.4/recommend/getSummary.html to http://54.86.121.4/recommend/getSummary.html you get

  {"summarized_text": {"1":"insert any block of text or single url url's must start with biomimic@gmail.com"}}

which is of course completely wrong

tuananh · on Feb 29, 2016

urgh: empty

http://54.86.121.4/recommend/app/getSummary?query=http%3A%2F...

LinkPlug · on Feb 28, 2016

What are some alternatives to this?

jbeda · on Feb 28, 2016

Check out https://algorithmia.com/. Stuff like this plus a bunch more. Real business model so you can have more confidence in it.

subinsebastien · on Feb 29, 2016

By far the best results I have seen come from http://www.textteaser.com/.

stkbach · on Feb 28, 2016

http://thegi.st

It's a bit buggy still. To get a sense of it, stick to BBC articles.

meeper16 · on Feb 28, 2016

None of them work as well as this one.

gkumartvm · on Feb 28, 2016

WordPress site URLs are not working!!

orliesaurus · on Feb 28, 2016

Not really working as expected :(

dang · on Feb 28, 2016

Url changed from http://52.90.112.133/recommend/getSummary.html by submitter's request.