Homemade RSS aggregator followup

abengoam · on Dec 20, 2015

That's awesome! I should know because I also created my own RSS aggregator after the demise of Google Reader.

Here's a screenshot https://imgur.com/YHJOiEX

It fills my needs perfectly because I created it specifically for myself and I control it fully in all aspects.

It's been such a tremendous success for me and so fun to create that I am thinking about replacing other online services with custom-made versions, such as google calendar, google tasks, etc. Something to look forward to in 2016.

Great job, and keep at it!

m_mozafarian · on Dec 21, 2015

After the demise Google reader, pretty much everybody I know started building their very own RSS Reader. We also made one for the Relevant app. It's beautiful card called RSS Reader that you can add it from the library. Then just paste your rss urls in the back of the card and it works. (Currently iOS only). http://relevant.ai/

scrollaway · on Dec 21, 2015

Your reader looks awesome. Did you open source it?

abengoam · on Dec 21, 2015

Thank you! Alas, I did not. I am fully behind the open source/free software movement(s), but right now I am at a point in my life where I can't manage and support an open source project as my availability is super spotty. Also, you don't want to see the css... it makes gotos look good.

scrollaway · on Dec 21, 2015

You should not be afraid of open sourcing code you can't support or are not happy with :)

Here's my most popular Github repository. I never use it and it's entirely supported by its users, I just keep an eye on it:

https://github.com/jleclanche/django-push-notifications

And here is some of my worst production code. Fully undocumented and with my first Python code ever in it, never rewritten/cleaned up:

https://github.com/jleclanche/pywow

There, I just showed you terrible code and unmaintained crap ;) Don't be ashamed!

mmb · on Dec 21, 2015

The bigger problem is that the amount of information you can consume using RSS feeds is declining. Most sites don't publish RSS feeds of their content any more.

Sadly RSS is left over from a time when things were more open. Now everything is an app and everyone wants you to stay in their walled garden.

anotherevan · on Dec 21, 2015

Yeah, there's a couple of interesting sites I would like to follow, but they don't provide RSS feeds, so I don't bother.

l1n · on Dec 21, 2015

I personally use http://changedetection.com/ to create RSS feeds for sites that don't have them (though I may move to Huginn agents in future).

fulafel · on Dec 21, 2015

There seem to be many services for addressing this. I wonder if anyone has recommendations about which one to use? Searching for "auto generate rss" returns at least a screenful of these.

rcarmo · on Dec 20, 2015

I have a fair amount of code that people can re-use to build their own aggregators, since I did a number of experiments when Google Reader died.

One was a Fever clone that had a number of strategies for doing parallel fetching:

- https://github.com/rcarmo/bottle-fever

Andrea Peltrin took that and evolved it into Coldsweat: https://github.com/passiomatic/coldsweat - which I recommend if you want a web UI.

I did a number of other things, but eventually went back to what I used _before_ Google Reader: e-mail.

I was one of the contributors for http://newspipe.sourceforge.net/, and after getting bottle-fever going I decided to investigate the state of the art and did a quick fork of rss2email that injected messages into an IMAP store instead of sending them via SMTP, to avoid spam traps.

It was a quick hack, but it allowed me to read feeds using any mobile IMAP client, and a friend eventually did a Go version, which I've also tweaked to my liking:

- https://github.com/rcarmo/rss2imap (Python) - https://github.com/rcarmo/go-rss2imap (Go)

Any of the above are likely to save people a fair amount of time (do bear in mind that the Python version was a hack atop code that was written by Aaron Swartz a decade ago, and it shows its age).

These days I ended up going back to Feedly, simply because I have to use Windows, the Web UI is good enough and there are lots of good clients for the platforms I use (NextGen, Reeder, etc.)

Plus I realised that trying to archive stuff from hundreds of feeds was somewhat pointless -- the stuff I really want to keep around goes into Pocket or OneNote, and that's that.

Edit: Also, here are some notes from 2008 on Bayesian classification and its effectiveness: http://taoofmac.com/space/blog/2008/01/27/2203#an-update-on-...

voltagex_ · on Dec 21, 2015

I like the idea of (ab)using protocols to do not-quite what they were intended to do.

Pushing RSS feeds into IMAP is a great idea - I wonder how much work it'd take to make NewsBeuter to that, then expose it somewhere and have FastMail pull it into a folder for me.

These hacks eventually start looking like Rube-Goldberg machines, but they've got a certain charm.

Offtopic: I wrote a Wake-on-Lan server that allows me to turn on VMs as if they were physical machines - https://github.com/voltagex/junkcode/tree/master/CSharp/Virt...

The next one for me is probably going to be a DNS server that resolves the name and IPs of VMs.

aw3c2 · on Dec 20, 2015

My perfect aggregator would also create a WARC archive of the webpage of each post, including all external references, maybe the referenced external websites and their references (with that single depth of recursion). The internet is friggen fragile and I would love to archive what I consume.

derefr · on Dec 20, 2015

To go further: there's basically no point in the "description" part of an RSS item. RSS is broken in that authors need to lure people onto their sites, so they make the RSS item itself enclose just enough of a preview to make you "click through"—whereupon their site can show you ads and they can make money.

How RSS should work, in an ideal technical sense, is to eschew enclosing any content-body in feed items themselves, and instead just encourage RSS consumers (feed-reader clients; feed-muxer daemons) to scrape the permalinks of the feed items, and then heuristically extract the body-content from the scrape-result, and cache both the resulting page-archive and the resulting cleaned-up text, making both representations available offline.

This, obviously, kills blog ad revenue. But it's better to kill it and replace it with something better (402 micropayment-required errors at point-of-caching, handled automatically by the RSS content-spidering daemon as an HTTP client, with costs passed on to its subscribers?) than to continue on with this semi-braindamaged "I have an offline cache but that doesn't actually mean I can read anything offline" world.

okasaki · on Dec 21, 2015

All the blogs I read include the full content in the feed. Some even include top comments.

And if you're parsing sites then you have no use for RSS - you can just parse the index to see what's new. Sounds like an nightmare to me though. Who's going to maintain these parsers for hundreds of sites?

derefr · on Dec 21, 2015

RSS tells you what's actually a new page. Usually, if a site wants to be Google-friendly, it'll have a page for every new content item—so, if a site publishes an RSS feed, that's usually enough to "chunk" their content by time.

Heuristics to decide what's new on a site are much, much harder to code than heuristics to extract content when you're explicitly told what's new. There was a service called Dapper, who tried the "extract content from a CSS-selector-specified zone of a page when content changes" approach... and it didn't go well. Yahoo Pipes had similar aspirations; again, got shuttered.

There are always sites that are really bad internet citizens. Some sites just change their front page to update, without creating a canonical archival URL—so then, if they do publish an RSS feed, every item is just a link back to said front page. There's not much you can do in these cases beyond just crowdsourcing "parsers for hundreds of sites" (which does happen—webcomics being a frequent and horrifying example.)

But there is a pretty good alternative, I think: if you can scrape the site itself as a whole at regular intervals, with fine enough granularity that any two contiguous scrapes will detect either 0 or 1 change, then you can probably convert that into a useful RSS feed without trying to figure out from "what changed" from DOM deltas. Basically, shove each scrape into the working index of a git repo, commit, and generate an RSS feed of the diffs. (Wikipedia can give you this in some form; I think gwern.net has an RSS feed that also basically works this way.) And, if the site has even the most minimal source RSS feed, you don't need the "regular intervals" approach: you can use the source RSS feed to provide "cueing" information (i.e. timestamps of when the site has changed) for your scrapes.

Which is all to say, RSS is a very "utilitarian" format. You don't need to rely on all of this processing happening on the server side, but rather can just take one RSS feed, and then use whatever bits of it you want to generate another RSS feed, and then someone can consume that RSS feed and write a heuristic to generate cluster and combine the feed items from it into higher-level summaries, etc.

This sort of thing really shows the "theme" of RSS, to me. RSS wasn't really designed or intended for direct consumption by readers, but rather to make it easy to have something on your site (even your statically-generated site, or your crap one-off PHP blog) that's easy for other services to consume and turn into stuff. It's "Really Simple Syndication": it's not the whole process of getting stuff to clients, it's just the lowest barrier-to-entry to get the supply-side of the equation involved, so that well-made delivery infrastructure can take over from there.

There were never meant to be "RSS reader" software, really, or even e.g. consume-side RSS-to-email gateways. Instead, RSS was meant to be lowest-common-denominator format for other supplier-side services to consume. Instead of writing "lifecycle emails", for example, you were meant to just have your web app generate a lifecycle notification stream RSS feed for each user—and then subscribe an emailing service to it.

We've replaced this behavior, for the most part, with webhooks—having our web-apps actually reach out using a REST client and prod other services' APIs. But RSS (with one consumer—the hooked service) is much, much cheaper than a REST client, from all perspectives: your web-app can likely already generate XML, it likely already has a list of changes and "view rendering" capabilities, etc. Webhooks are for the 1% of apps that can both 1. run server-side, 2. reach-out from their sandboxes to poke something elsewhere on the web, and 3. have a public URL where they can be poked back at. An RSS-based pub-sub event system, meanwhile, will work even if the event-generator is on your local machine: as long as you proxy over the RSS feed URL itself, you don't need to worry about machines trying to figure out whether the site that prodded them with a webhook request is coming from the right IP for that client or not, etc. They just get to blindly consume a URL, like any other piece of idiomatic web software.

jefftk · on Dec 21, 2015

Most people I follow on RSS include the full content of posts in their RSS feeds. When someone doesn't, it's generally an oversight and they fix it when I bring it up.

derefr · on Dec 21, 2015

Anyone I follow that generates their own RSS feed usually does. A lot of CMS software doesn't, though. A Tumblr-hosted blog, for example, will cut off its RSS feed items if they include an explicit fold (i.e. a "Read More" link)—even though the intended purpose of those is to make posts not take up fifty pages on the Tumblr dashboard (the dashboard being a "limited preview" view), not to cut down or restrict the item itself.

zrail · on Dec 20, 2015

That's a really great idea. Here's a one-line wget that will grab the provided URL and all of the data necessary to render, to one level of recursion, and dump it to a WARC:

    wget -e robots=off \
    --user-agent="Mozilla" \
    -r 1 -p -E -H -k -K \
    --warc-file=/path/to/your/warc/file/without/warc/extension \
    'http://www.example.com'

I think I might start capturing these. Shouldn't take up too much additional disk space.

edit: previously it was `-r 2` which is two levels of recursion.

aw3c2 · on Dec 21, 2015

You might have more fun with wpull, it has more sensible options to grab resources on referenced urls. https://github.com/chfoo/wpull

scrollaway · on Dec 21, 2015

You can shorten that to `-pEHkKr 1`, so you know.

rakoo · on Dec 20, 2015

I know it's not a feed aggregator, but you could intrumentalize pinboard (pinboard.in), they have the option of retrieving and storing an archive of all your links if you so desire. They even resolve first level dependencies, so external images are also stored (see https://blog.pinboard.in/2010/11/bookmark_archives_that_don_...). See some numbers on link rot here: https://blog.pinboard.in/2011/05/remembrance_of_links_past/

Pinboard is built as a bookmark manager, but if you say that all entries in a feed is a bookmark then it should work for you. Oh and there's full-text search as well.

Gmo · on Dec 20, 2015

I think you can replace "they" by "he", since, as far as I know, it's a one-man shop !

epaulson · on Dec 21, 2015

Pinboard is fantastic, but it doesn't look like it exposes the archived pages through the API. If you ever wanted to leave Pinboard and do a data export, you might not be able to take everything with you.

vidarh · on Dec 20, 2015

I agree with that. I've recently been going through my archived blog posts while revamping my blog design, and so much has just flat out disappeared. And so much of what has disappeared have been things I would have guessed would stay...

sheraz · on Dec 20, 2015

What if, instead of a WARC file, just a high-res screenshot available in desktop, tablet, and mobile mode?

jimmyswimmy · on Dec 21, 2015

Why would a binary file be better than its source? Seems like a lot more work (you'd need to format and store for each display size) in exchange for less capability (can't search, probably a larger overall file).

sheraz · on Dec 21, 2015

..was just thinking out loud :-)

But one advantage is that this is immediately viewable on any device, because otherwise you would need a WARC viewer, which are not so easy to setup.

pmoriarty · on Dec 20, 2015

I've recently come back to using newsbeuter[1] and have been quite impressed. It's really feature rich and very customizable. It's a terminal app, which some might not like, but for me that's a plus.

[1] - http://www.newsbeuter.org/index.html

gerty · on Dec 21, 2015

I guess Tiny Tiny RSS hasn't been mentioned yet. FOSS, self-hosted with multiple Android clients. I had been using Feedly since Google Reader went down but should have actually been using TTRSS since the beginning. I ain't no power user but it has definitely more than I ever would ask for.

wanda · on Dec 20, 2015

If anybody happens to be looking for an RSS aggregator, I'd like to recommend GoRead.

Obviously I wouldn't pay for it, but self-hosting is pretty straightforward and it has a companion Android app.

Never cared for Feedly and I don't really fancy making my own.

It's the best Google Reader clone I've found.

https://www.goread.io

ents · on Dec 21, 2015

I don't like feedly either, but as a backend for apps it works fine, and is free.

petercooper · on Dec 20, 2015

I made a similar script but that also has 'plugins' so 'URLs' like @username, twitter:topic, /r/subreddit, and hn:topic load up the tweets, Twitter search results, sub-Reddit items, or HN search results respectively for certain keywords, using their respective APIs.

oneloop · on Dec 21, 2015

Care to share?

krylon · on Dec 20, 2015

Very interesting!

I am currently building an RSS aggregator, too. Mine is little more complex, though - I wanna be able to rate items as interesting or boring and use some kind of filter (currently, a simple Bayesian classifier, I intend to replace or at least enhance with something more sophisticated over the holidays) to weed out news that I am not interested in.

The biggest problem is that web design is not my strong suite (to put it mildly), so the thing looks pretty ugly. Classification does not work very well, yet, but I am not sure if this is because the classifier sucks in general or if my training set is too small at the moment (I've only been using the thing for a couple of days now).

Anyway, it is quite interesting to see another approach to the problem.

alanpost · on Dec 20, 2015

Do you use a single classifier for your entire feed, or do you categorize the feeds and maintain a classifier for each topic? (Or, as always, secret option #3: neither.)

krylon · on Dec 21, 2015

Currently, I use a single classifier for all the news items.

I store the items, along with manual ratings, in a database, from which the classifier is trained. (This also allows me to keep a history of older items, which I can search.)

I hope to make it more sophisticated, eventually.

dubbel · on Dec 20, 2015

If you use Bootstrap CSS it looks pretty generic, but it's also nearly impossible to make it look bad.

Sounds like a cool project.

pyre · on Dec 20, 2015

There's also stuff like MDL[1] (or Angular Material[2] if you're using AngularJS).

[1] http://www.getmdl.io/

[2] https://material.angularjs.org/latest/#/

oxplot · on Dec 20, 2015

Given that email clients are fairly mature and advanced already (especially Gmail's), it seemed logical to go the unix way and use it as the UI to stream of feeds sent as email. I wrote a bit of python [1] and stuck on it a free openshift cartridge and it sends me one email per feed item. It's been up since April this year shortly after I abandoned feedly. I like it more than Google Reader now.

[1]: https://github.com/oxplot/lapafeed

axx · on Dec 28, 2015

I'm also working on an open source RSS Reader called HappyFeed. It's compatible with Fever RSS API, so you can use it with Reeder, ReadKit, Press and so on.

I work on this project mainly for myself, but if you're interested and want to contribute, feel free to get in touch!

Screenshots and development blog: https://need.computer/happyfeed/2015/12/20/happyfeed-drag-an...

GitHub: https://github.com/aleks/HappyFeed

fasouto · on Dec 20, 2015

I started creating an RSS aggregator some time ago (https://github.com/fasouto/django-feedaggregator) and it was more difficult than expected. There are many broken feeds and different interpretations of the standard.

One day I should finish it...

rcarmo · on Dec 20, 2015

Check my top-level comment.

younata · on Dec 20, 2015

Been working on my own RSS reader for iOS. https://github.com/younata/RSSClient/

Pretty much all the internal logic (parsing feeds/opml files) are also written from scratch, which was interesting to do.

newtang · on Dec 21, 2015

I built a similarly simple feed aggregator for anyone to use at https://plumfeed.com Shows the most recent post of each feed. I've found it particularly nice for blogs and comics that update once a day or less.

hasteur · on Dec 22, 2015

Adding annother "Google Reader" replacement for TinyTiny-RSS. There is some jiggery-pokery" that needs to be done, but it does nearly everything you could want (I wouldn't mind getting the behind the cut reveals)

ojiikun · on Dec 21, 2015

Surprised there has been no mention of NewsBlur. Though you can pay the author for use of the central instance, it is also fully open sourced on github. Saying it is more feature-rich than most examples would be an understatement.