Show HN: Updating archive of HN's /newest, with pre-rendering of the URLs (hnflood.com)
32 points by avifreedman on Nov 2, 2013 | 16 comments



I'm frequently on mobile or in the air when scanning /newest, so the browser CPU hit from opening 50+ tabs at once, plus the latency and loss across all the RTTs on modern pages, is a time waster.

And I couldn't find any RSS feeds of /newest (no one else is that crazy, I guess), so I wrote something to grab pointers to the submissions and render them as GIF, HTML, and text.

If this is useful to anyone I'm happy to improve it (including design).
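For illustration (this is not the hnflood code, just a minimal PhantomJS sketch of a link-grabbing step; the CSS selector is an assumption about HN's markup and may need adjusting):

    // list_newest.js -- illustrative sketch only, not the real hnflood tool.
    // Prints the outbound URLs currently listed on /newest, one per line.
    var page = require('webpage').create();

    page.open('https://news.ycombinator.com/newest', function (status) {
      if (status !== 'success') {
        phantom.exit(1);
      }
      var urls = page.evaluate(function () {
        // Selector is a guess at HN's markup; adjust if the layout differs.
        var links = document.querySelectorAll('td.title a');
        return Array.prototype.map.call(links, function (a) { return a.href; });
      });
      urls.forEach(function (u) { console.log(u); });
      phantom.exit(0);
    });

Each URL would then be handed off to a separate capture/render step (a similar sketch of that part appears further down the thread).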


There are i18n bugs in the characters you're sending. I notice you've used "text/html; charset=iso-8859-1", but HN itself is "text/html; charset=utf-8", i.e. you can only correctly display a subset of HN's headlines. There's an example on the page right now: the Lockheed story from medium.com.
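For reference, the straightforward fix is to serve the generated pages as UTF-8, matching what HN sends, declared in both the HTTP header and the HTML itself:

    Content-Type: text/html; charset=utf-8

    <meta charset="utf-8">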

Also, I really dislike the target="_blank" on every link; I can get that anyway with ctrl-click.

I was surprised you're generating GIF previews but not inlining them as thumbnails?


Good point(s). Will look at encoding later and rerun all the pages.

Re: _blank - it was a peeve of mine that I'd accidentally miss the ctrl on a click while reading /newest, but with this interface that's not a problem, since going back will work and keep your place. I've changed it, and the historic pages are regenerating now.

Re: thumbnails, I wanted pages I could load quickly on an iPad, Gogo wireless, bad 3G, etc. Happy to do a version with thumbs this week if you'd like.


Don't do it for me! I'm happy enough using http://ihackernews.com/ on my mobile; I was just asking the question.


I like this personally. As someone who has submitted something to HN, I don't mind my content being used this way, since a link points back to the main site and it's obvious where the main site is; but if it downloaded every post I made, then I might take issue.

However, some website owners may not like that you're downloading the content of their articles and hosting it elsewhere. There is a great army of spam sites that just watch http://pingomatic.com/ and scrape each new entry, then host it on a splog and stick adverts on it, which gets website owners annoyed.

The trouble is that Google might not realize which is the main site, so the original doesn't get the PageRank, or visitors end up on an alternative rather than the owner's site. Owners get annoyed because they can't monetize those visitors or see who is reading the page through their analytics.

So you may want to set up a DMCA page and an abuse email address to stay on the right side of the law, and also a robots.txt that blocks Google and Bing from crawling the pages you've downloaded.
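For example, something along these lines in robots.txt would keep Google and Bing out of the saved copies (the /db/ path is taken from the reply below; adjust it to wherever the copies actually live):

    # Keep search engines out of the archived page copies
    User-agent: Googlebot
    Disallow: /db/

    User-agent: Bingbot
    Disallow: /db/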


Thanks for the suggestions. robots.txt should now be blocking /db/, which holds all of the saved content, and a link to the DMCA page has been added on every generated page (putting it only at the bottom would be too obscure, since the pages can get so long).

I'm not planning on copying any of the actual HN content, and I don't present a copy at all if the submission is on news.yc itself. At some point I'll hook into the API to grab comments/points every so often to update the index pages, and probably allow voting from the pages.


Great! Thank you for doing that.


How do you generate the gifs? This one didn't work too well: http://hnflood.com/db/ab/70a/ab70a72b415067b3130bd0f3ea77baa...

Very nice idea otherwise! I wonder if you could get in (legal) trouble for providing the screenshots/texts.


Using PhantomJS. For Medium it just grabs an image on all renders. I considered not showing the GIF at all in the index for Medium-hosted sites for now... but I'll try to get that debugged soonish by someone familiar with PhantomJS.
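For anyone curious, a capture like this boils down to a few lines of PhantomJS. The following is an illustrative sketch, not the actual hnflood script; the fixed delay is a guess, and it's exactly the sort of thing that trips up JS-heavy sites like Medium:

    // capture.js -- illustrative sketch, not the real hnflood renderer.
    // Usage: phantomjs capture.js <url> <outfile.gif|.png|.jpeg>
    var page = require('webpage').create();
    var system = require('system');
    var url = system.args[1];
    var out = system.args[2];

    page.viewportSize = { width: 1024, height: 768 };
    page.open(url, function (status) {
      if (status !== 'success') {
        phantom.exit(1);
      }
      // Crude: wait a bit for client-side rendering to settle before snapshotting.
      window.setTimeout(function () {
        page.render(out);   // output format is inferred from the file extension
        phantom.exit(0);
      }, 2000);
    });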


A lot of sites provide screenshots. No legal trouble is likely there (unless you're attempting to reproduce entire sites that way). The text can be another matter depending on how much you're displaying / reproducing.



I'm working on something similar, but using metadata, and I get mixed results. Saving screenshots is a really good idea.


This is actually really awesome from a research perspective. I imagine you could do things like analyze word pools, website layouts, colors, and other latent link data, and correlate that with success/failure rates on HN. If I had more time, I could definitely see myself building things on top of this. Some sort of API would be useful for that.

Great work!


I think the ThriftDB folks (integrated with HN and hnsearch) do a great job with some of what you're looking for. If not, I'm happy to make the data accessible.

Actually... it's just using the Unix filesystem as a DB right now; the structure is open and pretty easy to decipher. The dir tree for the objects is just "db/substr(md5hashofurl, 0, 2)/substr(md5hashofurl, 2, 3)/md5hash", and bytime/yyyy/m/d/hr/min is just symlinks to the md5hash (0-length files).
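In other words (an illustrative Node.js sketch of the layout just described, not the actual code; the example URL is made up):

    // Path layout: db/<md5[0:2]>/<md5[2:5]>/<md5-of-url>
    // e.g. db/ab/70a/ab70a72b41...  as in the screenshot link above.
    var crypto = require('crypto');

    function objectPath(url) {
      var h = crypto.createHash('md5').update(url).digest('hex');
      return 'db/' + h.slice(0, 2) + '/' + h.slice(2, 5) + '/' + h;
    }

    console.log(objectPath('http://example.com/some-article'));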


Could you save something better than the incredibly lossy GIF? Even a JPEG would be better if you're trying to save size on disk.


I tried all of the formats PhantomJS supports, and GIF didn't look worse than the JPEG. Disk space isn't an issue... I'll take another look, or maybe just add JPEG or BMP as well.
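(For reference, PhantomJS's page.render also accepts an optional settings object, so writing a JPEG at a chosen quality is roughly the one-liner below; treat the exact values as a sketch.)

    page.render('preview.jpeg', { format: 'jpeg', quality: '80' });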

Usually I'm just scanning to get a sense of whether it's worth reading every word, and if so I go to the original source.



