Show HN: Pass a URL, get summarized content (54.86.121.4)
126 points by meeper16 on Feb 28, 2016 | 56 comments



Not summary, just sentence ranking and extraction. Still cool but not anything new. Sweet side project though! For anyone wondering how this was done, I have a similar project up at http://bookshrink.com (source code at https://github.com/peterldowns/bookshrink), although I don't fetch article text.
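For readers who want a concrete picture of "sentence ranking and extraction", here is a minimal sketch, assuming a plain word-frequency score. It is not the OP's method or bookshrink's, and the stopword list, the sentence splitter, and the function names are all illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "was", "but"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words, minus stopwords.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def rank_sentences(text, top_n=5):
    """Return the top_n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average document-wide frequency of the sentence's words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]

print(rank_sentences("Cats chase mice. Cats and more cats. Dogs bark.", top_n=1))
```

Nothing is generated here: the output is always a subset of the input sentences, which is why this counts as extraction rather than summarization.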


I did a similar hack a while back, which summarizes a piece of text in a single sentence. I'm that lazy.

https://github.com/mck-/oneliner


I'm writing a book right now. Would you mind if I used your program to make the title and the chapter titles?


Really nice. I made a similar project for "rich text" as my Final Year Project, where I took into account HTML tags such as h1, bold, etc. It also mixed in a few analyses, including TF-IDF. I'll have a look at the source when I get some time (:


You should render the output as actual JSONAPI (http://jsonapi.org):

    {
      "links": {
        "self": "...",
        "prev": "...",
        "next": "..."
      },
      "data": [],
      "included": []
    }

So that we can discover the API beyond the form. Use http://www.iana.org/assignments/link-relations/link-relation... as the starting point for link relation ideas.
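
To illustrate what link-based discovery could look like from the client side, here is a rough sketch that walks `links.next` from one document to the next. The `/summaries` paths, the `PAGES` table, and the `fetch` callback are hypothetical stand-ins for real HTTP calls; the service in this thread does not expose such an API.

```python
def link_href(link):
    """Return the URL of a link given as a string, a {"href": ...} object, or None."""
    if isinstance(link, dict):
        return link.get("href")
    return link

def iter_pages(first_url, fetch):
    """Yield each document reachable by following links.next from first_url.

    fetch is a caller-supplied function mapping a URL to a parsed JSON document.
    """
    url, seen = first_url, set()
    while url and url not in seen:  # stop at the last page or on a link cycle
        seen.add(url)
        doc = fetch(url)
        yield doc
        url = link_href(doc.get("links", {}).get("next"))

# Hypothetical in-memory "server" used instead of the network.
PAGES = {
    "/summaries?page=1": {
        "links": {"self": "/summaries?page=1", "next": "/summaries?page=2"},
        "data": [{"type": "summaries", "id": "1"}],
    },
    "/summaries?page=2": {
        "links": {"self": "/summaries?page=2", "next": None},
        "data": [{"type": "summaries", "id": "2"}],
    },
}

ids = [item["id"]
       for doc in iter_pages("/summaries?page=1", PAGES.__getitem__)
       for item in doc["data"]]
print(ids)
```

The client only needs to know the entry URL; every later URL comes from the responses themselves.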


My knee-jerk reaction was to ask myself "Why is this any better than any other schema?" (I was not convinced that "API discovery" was, by itself, a good enough case.)

Then I read the very practical first sentence of the jsonapi page: "If you've ever argued with your team about the way your JSON responses should be formatted, JSON API can be your anti-bikeshedding tool." That alone is probably huge. May not mean much for individual projects, but it's good enough for me to bookmark for the future.


Can't help but think of https://xkcd.com/927/ ;)

Not sure if standards like this can prevent bikeshedding. You can always bikeshed about the need to stick to any particular standard. One counterexample to what I'm saying may be Go's single standard formatting via gofmt, but that was introduced very early on and became part of the culture. It's too late for that with JSON APIs.


Ya, I was like, why all the downvotes, y'all?

Then I realized I had no actual link ("http://") in my initial post.

I tend to assume all have read exactly what I have read. JSONAPI has huge implications for distributed systems architecture.


I'd like to point out that this formatting convention is not a widespread standard. "You should" is biased towards making life simpler for a small number of people who have used the format before, while complicating and bloating your API responses - for both the developer(s) and consumers of your API.

Consumers are now expected to add a full library to their project to parse/understand the JSON responses. Also, implementation overload for many languages: http://jsonapi.org/implementations/


> You should [...]

At least there's no namespacing or DTDs!

JSON plus a loose adherence to REST won out over SOAP/XML-RPC/WSDL/etc. because of its simplicity. Service discovery seems to be a solution in search of a problem.

Even the simple idea of embedded links, with apologies to Dr. Fielding, seems to be a less critical component of REST than was initially believed, since very few modern REST APIs actually provide them.


Very few modern "REST" APIs have even a remote resemblance to the REST architecture described by Fielding. It's mostly just RPC over HTTP, with data in a specialized subset of JSON or XML that is specified per endpoint out-of-band rather than indicated by media type.

I really wish people would stop calling it REST, since that robs the term of its meaning.


To some point, I agree! I also used to be a REST purist, but I've become more pragmatic in recent years.

Some crucial points that are often preserved even in today's APIs and that distinguish them from RPC:

- Any REST API will have the concept of resources that are acted upon by HTTP verbs (methods), instead of RPC-style calling a method named in the URI.

- statelessness (no session state assumed on the server)

- use of HTTP status codes

- resource path generally indicates hierarchy or at least a specific 'one path to this representation'

- broad use of existing HTTP headers for metadata instead of a separate "envelope" in the body as in SOAP

- use of common HTTP headers such as Authorization rather than cookies or other carriers of state

Many of the other compromises are not made out of ignorance, but in order to be broadly useful in the most common cases.

I don't disagree with your point about not calling it REST, because this trend does diverge from Dr. Fielding's dissertation in several important areas (for example, content negotiation, as you point out). That's why I prefer the term "loose REST".


I'm not sure I am seeing the same wow-factor results from the service that everyone else is raving about.

I submitted this link [0], which was on the HN homepage a couple of days ago, and the results I got back were either the least important bits or, in some ways, implied the opposite of the article. So either the writing was really bad or the algorithm needs some work.

Submitting a "simpler" less-ranty article [1] was even less successful, leading to paraphrases of less-important sentences as the results.

Then I submitted the BBC article from this morning about Philae [3] and received much, much better results. I think it works best on articles that have single sentences that clearly sum up the gist of the post as a single, hard fact and doesn't work with anything that works towards logical conclusions or tries to build an argument. Which makes sense, because this isn't an AI and can't actually deduce anything.

0: https://neosmart.net/blog/2016/on-the-growing-intentional-us...

1: https://neosmart.net/blog/2016/when-is-the-2016-retina-macbo...

3: http://www.bbc.com/news/science-environment-35559503


> I'm not sure I am seeing the same wow-factor results from the service that everyone else is raving about.

Um... where is someone raving about the result? Most of the comments seem neutral to negative to me?


> Most of the comments seem neutral to negative to me?

Shameless plug, thanks to HackerMoods [1] I can quantify that statement: 0.85 neutral, 0.08 positive, 0.07 negative. The average Show HN is 0.17 positive and 0.04 negative, so your assessment is in line with the numbers.

[1]: https://news.ycombinator.com/item?id=11188633


I'm color blind, and the charts you use are unintelligible to me.


Sorry about that. Design isn't my wheelhouse, but I updated it to what Google tells me is a colorblind-friendly palette. I would definitely appreciate it if you could take a look and let me know how it is.


Re:0

Search disruption is overdue.


I would venture that it works by assigning an information weight to words, based on the number of non-repeating words and the way they are related, and then filtering from the top down.

I did try it with a link I particularly like

http://multivax.com/last_question.html

with the following response.

{"1":"nor could anyone for the day had long since passed zee prime knew when any man had any part of the making of a universal ac","2":"zee prime's mentality was guided into the dim sea of galaxies and one in particular enlarged into stars","3":"he gave no further thought to dee sub wun whose body might be waiting on a galaxy a trillion light-years away or on the star next to zee prime's own","4":"the universal ac said man's original star has gone nova","5":"the universal ac interrupted zee prime's wandering thoughts not with words but with guidance"}}


Hmm didn't have much luck with: https://mikeanthonywild.com/stopping-blocking-threads-in-pyt...

    { 
        "1":"betterthreads provides an enhanced replacement for the an enhanced replacement for the python this isn't actually a true thread instead it uses gevent to",
        "2":"the widely-accepted solution is to set a timeout on our blocking functions so we can periodically check a which we set from the main thread to indicate we want the child thread to stop",
        "3":"if the thread is still alive the when the *timeout* argument is not present or ``none`` the operation will block until the thread terminates",
        "4":"`runtimeerror` if an attempt is made to join the current thread as that would cause a deadlock",
        "5":"`join` a thread before it has been started and attempts to do so raises the same exception"
    }

That said, I think summarising that particular article would require a certain level of domain expertise, something which a general bot couldn't provide.


The summary of that guy's PhD summary http://jxyzabc.blogspot.com/2016/02/my-phd-abridged.html

{"1":"for various reasons i also spend a lot of weekends in new york and make more friends with people working on data and journalism","2":"my friend jean-baptiste who reads it asks why my blog is so good but my paper drafts are so bad","3":"at the beginning of this year i start telling people that i wish i had more female friends since i realize that there are many fewer women around me than before","4":"to keep myself from thinking about my uncertain future all the time i start a cybersecurity accelerator cybersecurity factory with my friend frank wang with the goal of helping research-minded people start companies","5":"i am too lazy to make many friends so i spend my free time reading cooking doing yoga and running"}


Summarization used to be a feature in Microsoft Word through Word 2007, and it did a decent job. That feature was taken out in Word 2010.[1]

[1] https://support.office.com/en-US/article/Automatically-summa...


I didn't know that. Any idea why it was taken out?


I suspect that Office has to garbage collect features once in a while. Otherwise the maintenance cost would be (more?) horrible.



An array would be a better structure for the sentences than hard-coded index keys like "1", "2", and so on.
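
A small sketch of what consumers currently have to do to get an ordered list out of the keyed object, assuming the flat `{"1": ..., "2": ...}` shape quoted elsewhere in this thread. The function name and sample payload are made up for illustration.

```python
import json

def to_sentence_list(response_text):
    """Convert {"1": "...", "2": "..."} into an ordered list of sentences."""
    payload = json.loads(response_text)
    # Sort keys numerically: as plain strings, "10" would sort before "2".
    return [payload[key] for key in sorted(payload, key=int)]

raw = '{"2": "second", "10": "tenth", "1": "first"}'
sentences = to_sentence_list(raw)
print(json.dumps(sentences))
```

If the service returned a JSON array in the first place, the ordering would be carried by the format and the sort step would disappear.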


I see your point and don't disagree.

Thinking about why someone might make the choice to use text: text is more in keeping with the *nix philosophy. Not that I'm saying it's better, but grep is pretty lightweight, and a lot of people use the command line and/or languages other than JavaScript. YMMV.


I'm not sure I understand. JSON is text. JSON provides an array as part of the grammar.


I'm not sure I follow you. The JSON response itself is already text.

Besides that, nobody processes JSON on the command line without a tool like jq, and I'd suspect all of those JSON tools support arrays.


Good quality summaries, but it seems it caches pages based on their base URL, and throws away the query parameters.

Some blogs use query parameters to distinguish between articles, so it makes it kinda useless if you want to do more than one article.
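
A minimal sketch of a cache key that keeps the query string, assuming the service caches by URL as described above (the thread does not show its actual caching code). The `cache_key` name and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def cache_key(url):
    """Normalize a URL for caching while keeping its query parameters."""
    parts = urlsplit(url)
    # Sort parameters so ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))

print(cache_key("http://example.com/blog?p=1"))
print(cache_key("http://example.com/blog?p=2"))
```

With a key like this, two articles that differ only in `?p=` no longer collide, at the cost of separate entries for URLs that differ only in tracking parameters.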


This is off-topic, but I'd like to mention it.

I absolutely love the fact that the OP did not get a domain name for this demo. This is an interesting "technique" I haven't seen for quite a while. People tend to own and renew tens of domain names, which just sit there for a "just in case" moment. This is a great example of how things can really be simplified: spin up an instance, make a demo, shut the instance down.


Good luck with the link still being usable in a month or a year. There's a reason we use domain names for sharing. Not only because they are friendly to read and remember, but also because IPs are typically far more transient than domain names.

The IPs behind my projects have changed dozens of times over the years (new server, changing hosting provider, adding a load balancer, etc.). A simple DNS change allows the same domain name to follow the project.

I'm actually surprised HN permits links to IP addresses. While links posted here are not guaranteed to point to the same content in the future anyway, it is more likely that an IP address will change before the project is taken down entirely. Search engine posterity and all.


There are always free subdomains:

https://freedns.afraid.org/


Could also be a good use for subdomains - people often forget how many variations of whatever.domain.com can be useful



This is an exact duplicate (even posted by the same person) of a link submitted 11 hours ago.

https://news.ycombinator.com/item?id=11190008

Edit - it doesn't work for me either, or maybe it is just very slow?


Yes, it is. I sent it out too late last night and thought more people might want to see this in the morning.


I think it's OK here. From the FAQ: https://news.ycombinator.com/newsfaq.html

> Are reposts ok?

> If a story has had significant attention in the last year or so, we kill reposts as duplicates. If not, a small number of reposts is ok.

> Please don't delete and repost the same story, though. Accounts that do that eventually lose submission privileges.


I tried it and it doesn't work. What am I doing wrong?

http://52.90.112.133/recommend/app/getSummary?query=This+is+...


Any details about the backend/implementation?

Shameless plugs for two similar projects (both open source) I did a while back:

1) Algorithmic Summarizer: https://github.com/mohaps/tldrzr

2) Readability clone / article body extractor with summary, significant image, and text: https://github.com/mohaps/xtractor

Both are deployed on heroku and the urls are in the github readme files.


Hmm, not quite sure what I was expecting, but the results were not great :(

But I could see the use of this kind of service.

Also, it does not work with accented characters.


I'd like more details. How does it work?


What is it built with? (Stack, FOSS, etc.)



It seems to be multi-lingual



If you submit http://54.86.121.4/recommend/getSummary.html to http://54.86.121.4/recommend/getSummary.html you get

  {"summarized_text": {"1":"insert any block of text or single url url's must start with biomimic@gmail.com"}}

which is of course completely wrong.



What are some alternatives to this?


Check out https://algorithmia.com/. Stuff like this plus a bunch more. Real business model so you can have more confidence in it.


I have seen http://www.textteaser.com/ give by far the best results.


http://thegi.st

It's a bit buggy still. To get a sense of it, stick to BBC articles.


None of them work as well as this one.


WordPress site URLs are not working!!


Not really working as expected :(


URL changed from http://52.90.112.133/recommend/getSummary.html by submitter's request.



