Show HN: Top PDFs Posted to Hacker News in 2018

burtonator · on Jan 9, 2019

Hey guys. This listed is computed via the data we have at Datastreamer (http://www.datastreamer.io/) ... we basically index the web and have a petabyte search engine that we license for people very serious about open data.

I realized that we had Hacker News for every day and wanted to compute the top PDFs for my own usage so I wanted to share with you guys.

My side project, Polar (https://getpolarized.io/) is used for managing PDF and other reading so you might want to check that out too. It's basically a tool for researcher, students, or anyone passionate about long term education and use read a great deal of technical and research material.

PDFs are supported but we also support offline caching of web page content. We also support Anki sync so you can create flashcards and sync them with Anki so that you never forget what you've read.

EDIT. Awesome! This landed on the home page in less than 15 minutes and is now #2.. Super excited you guys found this helpful. Great to contribute back to such an awesome community!

temny · on Jan 9, 2019

One more broken link: #37 Self-Awareness for Introverts [pdf]

Correct link seems to be http://cliffc.org/blog/wp-content/uploads/2018/05/2018_AWarO...

qwerty456127 · on Jan 9, 2019

And this is one of the papers I feel the most interested in.

tapland · on Jan 9, 2019

Here is a link to the original submission: https://news.ycombinator.com/item?id=17010199

Which includes a blog post that goes into this without being just a presentation-aid PDF: http://cliffc.org/blog/2017/07/30/introverts-emotional-proce...

liamzebedee · on Jan 9, 2019

Wow, your product is super cool! I was going to post this in an Ask HN but maybe you would like to share instead - what is it like to architect a crawler for social media sites?

With Twitter Music shutdown, I was looking back on their acquihire of WeAreHunted, a music ranking service which at its core was a crawler that indexed torrents/tumblr/soundcloud to find what's up and coming. As I was pondering this, I was thinking about how they would normalise the data.

My main question was, how much difficulty do you encounter indexing a social site? I can imagine Tumblr, Facebook, and other sites have a plurality of new content appearing at arbitrary intervals (posts, comments, etc) - and I don't imagine that there are RSS feeds to diff here. So how would it function?

burtonator · on Jan 9, 2019

There are a lot of papers/research on the design of a traditional web crawler.

Basically you have a link rank build out your crawl frontier and then you have an incremental ranking algorithm re-rank the frontier.

The problem is that latency is a major factor.

Many of the top social news posts start from some random user that witnesses something very interesting and then that shoots up rapidly.

Our goal basically has to be to index anything that's not spam and has a potential for being massive.

Additionally, a lot of the old school Google architecture applies. A lot of our infra is devoted to solving problems that would be insanely expensive to build out in the cloud.

We keep re-running the math but to purchase our infra on Amazon web services would be like 150-250k per month but we're doing it for about 12-15k per month.

It's definitely fun to have access to this much content though.

Additionally, our customers are brilliant and we get to work with the CTOs of some very cool companies which is always fun!

mandeepj · on Jan 9, 2019

Thanks for the details. I'm exploring a side project to find viral\interesting\fresh content on web\social media. I know this is a hard problem to solve as 'interesting' varies from person to person.

pryelluw · on Jan 9, 2019

This a good marketing effort relevant to the stuff outlined here: https://news.ycombinator.com/item?id=18861371

You now gave me (and others) a reason to download your app. Ive downloaded the pdfs and am tracking my reading progress with your app.

I would consider making pdf reading lists and sharing them (with a great pitch to use your app) as a marketing effort in multiple verticals.

uxp100 · on Jan 9, 2019

Polar seems neat. I'm a little confused by the landing page. It says Offline-first, but you can use git or dropbox to share between machines. fine. Then later in the page it mentions Cloud Syncing. What's the distinction here?

burtonator · on Jan 10, 2019

Not much of one... it can be first. It works totally fine offline and doesn't REQUIRE cloud. Even if you enable cloud support.

MetalGuru · on Jan 9, 2019

Is polar open source. It has a lot of functionality I was planning on building for my own note-taking system. I'd love to contribute.

Edit: Nvm, just saw the git icon. Awesome!

lettergram · on Jan 9, 2019

I built a system which also functions as a kind of search engine.

I too built an application for HN: https://hnprofile.com/

If you're interested, I'd be happy to discuss it as what I focused on was ranking content (not indexing exactly). Might be some interesting synergy. They system requires a fraction of the data of a regular search engine and is often more effective.

tux1968 · on Jan 9, 2019

Hey there, looks great.

One of them was a broken link (at least from here) :

#15 Cognitive Distortions of People Who Get Stuff Done (2012) [pdf]

burtonator · on Jan 9, 2019

oh.. interesting.. Maybe I'll remove it or try to fix the link.

larrywright · on Jan 9, 2019

I think maybe this is a valid link: http://quarry.stanford.edu/xapm1111126lse/docs/02_LSE_Cognit...

burtonator · on Jan 9, 2019

Thanks.. fixed it.

venuur · on Jan 9, 2019

Polar looks awesome! Do you plan to have support for a mobile app? For better or worse I do a ton of reading of PDFs on my phone while commuting by train. It’d be great to sync with my laptop too, but only mobile would be more than enough to get me started.

burtonator · on Jan 9, 2019

Thanks. Yeah. Some initial ideas but nothing concrete yet. I might apply for grants to continue dev on that front

blueagle7 · on Jan 9, 2019

Would love if I could sync up with my Pocket list and download all of my articles.

burtonator · on Jan 9, 2019

Ah. Check out Polar. Same thing only better

blueagle7 · on Jan 9, 2019

I just did haha I wish I wish there was an easy way for me to import all of my Pocket articles into Polar without doing them manually. I guess i'll do the research to figure out a way to script this.

_mlxl · on Jan 9, 2019

You can make your pocket RSS feed public and then write a python script to download all the urls returned by the data dump of the feed.

I did something like this for a side project, you can check out he code: https://github.com/afallon02/pocket2kindle

blueagle7 · on Jan 10, 2019

Thanks this was really helpful.

burtonator · on Jan 9, 2019

Yeah.. Import I think is going to be big feature for us as we try to convert people from Mendeley, and Pocket which have huge existing repositories. I don't have that problem yet.

Right now Polar can import a whole directory full of PDFs but that doesn't really get tagging.

I might end up building a file format so that we can do imports.

So you could take the Pocket RSS list, then convert it to the Polar import format, then just import that directly.

We would probably try to bundle up standard importers though.

azhenley · on Jan 9, 2019

My dissertation made it on the list! Only #176 but that is more attention than I ever expected it to get :)

burtonator · on Jan 9, 2019

Ha. Nice. Well glad we could get you another link and more love.

createdjustnow · on Jan 9, 2019

Just in case if you thinking to download these links

import re import requests

from bs4 import BeautifulSoup

def download_file(download_url, name): #create response object r = requests.get(download_url, stream = True)

    #download started
    with open("repo" + name, 'wb') as f:
        for chunk in r.iter_content(chunk_size = 1024*1024):
            if chunk:
                f.write(chunk)

html = requests.get("https://getpolarized.io/2019/01/08/top-pdfs-of-2018-hackerne...) soup = BeautifulSoup(html.content) sAll = soup.findAll("a")

for href in sAll: if(href.has_attr('href')): link = href['href'] if(link.find(".pdf") > 0): print(link) last_index = link.rindex("/") name = link[last_index + 1:] print(name) try: download_file(link, name) except: print("error downloading " + link )

diminoten · on Jan 9, 2019

    from __future__ import print_function

    import re
    import requests

    from bs4 import BeautifulSoup

    def download_file(download_url, name):
        r = requests.get(download_url, stream = True)

        with open("repo" + name, 'wb') as f:
            for chunk in r.iter_content(chunk_size = 1024*1024):
                if chunk:
                    f.write(chunk)

    html = requests.get("https://getpolarized.io/2019/01/08/top-pdfs-of-2018-hackernews.html")
    soup = BeautifulSoup(html.content, features='html.parser')
    sAll = soup.findAll("a")

    for href in sAll:
        if href.has_attr('href'):
            link = href['href']
            if link.find(".pdf") > 0:
                print(link)

                last_index = link.rindex("/")
                name = link[last_index + 1:]
                print(name)

                try:
                    download_file(link, name)
                except:
                    print("error downloading " + link )

ekphrasis · on Jan 9, 2019

Very nice!

It's also interesting that it differs from this search[1] performed in HN's own search system.

[1] [pdf] with filtering on Past Year -> 4,422 hits.

https://hn.algolia.com/?query=%5Bpdf%5D&sort=byPopularity&pr...

burtonator · on Jan 9, 2019

Yea.. there would be more total hits for PDFs found on page 2 or page 3... which I didn't analyze. 500 I think is good enough. Enough reading already ;)

ekphrasis · on Jan 9, 2019

Sure, although the positions of the items in the answer sets at cutoff = 10 differ as well.

burtonator · on Jan 9, 2019

meaning the ranking? Algolia is probably using a different score algorithm. If I were designing it I would probably also factor in number of comments but I think it's fair to use number of upvotes.

ekphrasis · on Jan 9, 2019

I'm assuming that too. Just an observation. Very interesting work of you nonetheless!

smartbit · on Jan 11, 2019

The SRE book was available as PDF till August 25, 2018 https://news.ycombinator.com/item?id=17614907#17624523. I tried to find it but seems not be availble anywhere as PDF. I could only find Kindle (with limited anotation), at safaribooksonline (with even less anotation capabilities), at https://landing.google.com/sre/sre-book/toc/index.html or as print-on-demand.

app4soft · on Jan 9, 2019

Something wrong with score

  22. Software-Defined Radio for Engineers [pdf]
  score: 292 comments

This article[0] posted by me and reached score is 352:

  Software-Defined Radio for Engineers [pdf] (analog.com)
  352 points by app4soft 6 months ago | 50 comments

[0] https://news.ycombinator.com/item?id=17399554

burtonator · on Jan 9, 2019

Thanks.. probably the same article but on a different URL.

app4soft · on Jan 10, 2019

> probably the same article but on a different URL

What you talking about? Article URL[0] is same!

[0] https://news.ycombinator.com/item?id=17399554

jackfoxy · on Jan 9, 2019

This link gets me to the free download of the document manager. I'm not seeing the list on this page.

dvfjsdhgfv · on Jan 9, 2019

> This link gets me to the free download of the document manager. I'm not seeing the list on this page.

That's the power of "Above the fold" in action.

tu7001 · on Jan 11, 2019

Also this link: www.dcs.gla.ac.uk/~trinder/papers/sac-18.pdf seems to be broken.

voiceclonr · on Jan 9, 2019

Interesting project! Kudos on shipping.