I assume that there are people who want to only see articles which are similar to the ones that they've already seen, but that defeats the purpose of RSS for me.
I want to see new things from the same sources that I've already identified as being interesting. To that end, I frequently add subscriptions to a category called "New", and after a dozen articles or so either file it properly or stop pulling it.
Great. I'm not claiming that existing RSS aggregators are broken, but this method is certainly better for me personally. Findka's meant to make things a little more automatic for those who want to spend very little time on managing their sources. The bandit algorithm makes it so you don't need to explicitly mark which sources you've decided are interesting. And the main, non-RSS part of Findka (article recommendations via collaborative filtering) helps expose you to new articles and sources.
This new feature is probably most useful for people who aren't already using RSS regularly.
Oh, trust me, Feedly is broken. They keep adding features in search of paying subscribers, and 80% of those features are rubbish if all you want is to read articles from feeds you curate yourself. Everything else is just noise.
Considering that almost all readers I’ve used have features that at least partially support this use case, I think even some people who already use RSS will like it :)
My current, somewhat vague end vision for Findka is to be a stream processing tool for end users. You subscribe to a number of feeds/streams/sources, and then Findka lets you transform them into output feeds/sinks. You'll be able to choose how much you want algorithms to help out. The goal is "all your content in one place," but more sophisticated than what you can do with current aggregators like Feedly.
This is it! This is the fundamental problem that we need to solve, and what RSS is very poor at addressing today. I think that being able to process massive content streams in a human-friendly way is one of those problems that would be world-changing if solved effectively. Facebook and Twitter have failed at this so far, with far-reaching negative consequences for societies.
My gut feeling is that this probably means that we need a paradigm shift in how we relate to content streams. I think your move towards algorithmic transparency is a step in the right direction, though it may have to be bigger than that.
I think the fundamental design flaw of modern content aggregation sites such as YouTube, Facebook, and Twitter is the lack of transparency. They think they know what's best for you.
1. You don't know why you're being recommended what you're seeing.
2. You can't teach them to avoid topics. You can only mark individual videos/posts as "not interested" and hope The Algorithm takes the hint. Likewise, you can't teach them to recommend more of a certain topic.
3. You can't see a feed as someone else, so everyone is in their own bubble by default. Reddit and Hacker News avoid this issue by showing everyone the same feed.
4. They tend to have poorly thought-out metrics for relevancy. For example, in Facebook comments, users tend to use the Laugh and Angry reacts as a negative signal, and those comments also attract a large number of negative replies. Facebook is completely blind to this and shows them at the top, since it treats all interactions as effectively equal. Cool, they've just built a system that boosts toxicity.
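To make point 4 concrete, here's a toy sketch (hypothetical scoring functions, not Facebook's actual code) contrasting "all interactions are equal" with treating downvote-style reacts as negative:

```python
def naive_score(reacts):
    # every interaction counts the same, as described above,
    # so a pile of Angry reacts *boosts* a comment
    return sum(reacts.values())

def signed_score(reacts, negative=("angry", "laugh")):
    # hypothetical fix: subtract reacts that readers use as downvotes
    return sum(-n if kind in negative else n for kind, n in reacts.items())
```

A comment with 50 Angry and 30 Laugh reacts scores higher than almost anything under the naive metric, and deeply negative under the signed one.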
Yes, exactly! The nice thing about framing information discovery as a stream-processing problem is that you can plug other methods into it. For a long time I was thinking just in terms of user-item rating matrices, i.e. collaborative filtering. You've got a list of users, a list of items, and ratings for some of those pairs. Pick 10 items to show a given user. But that doesn't give the user a whole lot of control.
If you think in terms of stream processing instead, you can treat collaborative filtering/other types of recommender systems as some black box that generates a stream. For example, Findka's ML recommendation algorithm picks a handful of articles for each user every day. It's just another stream, and it can be treated exactly the same way as any of the RSS feeds that users subscribe to.
So you get a list of sources, some of which are "regular" RSS feeds, some of which are algorithmically generated--and they can come from within Findka or from outside--and then the user should be able to decide how those feeds are aggregated. Bandit algorithms I think are a great fit (including contextual bandits!), but you can also include manual controls. e.g. you could say "feed X should be 10% of my main sink feed", or "feed Y should be no more than 5% of my main sink feed." Or give priorities--sample from all the feeds in priority 1 until the contents are exhausted, then move to priority 2 etc. There are probably lots of interesting things that could be done in this space.
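As a rough sketch of what those manual controls might look like (hypothetical function names, just illustrating the idea, not Findka's implementation):

```python
import random

def mix_feeds(feeds, weights, n):
    """Build a sink feed of n items from several source feeds.

    feeds:   {feed_name: [item, ...]} (newest first)
    weights: {feed_name: share of the sink, e.g. 0.1 for 10%}
    Sampled items are popped, so a feed can run dry.
    """
    pools = {name: list(items) for name, items in feeds.items()}
    out = []
    while len(out) < n:
        live = [f for f in pools if pools[f]]
        if not live:
            break
        # weighted pick among the feeds that still have items
        feed = random.choices(live, weights=[weights[f] for f in live])[0]
        out.append(pools[feed].pop(0))
    return out

def by_priority(tiers, n):
    """Exhaust priority-1 feeds before touching priority 2, etc.
    tiers: list of item lists, highest priority first."""
    out = []
    for tier in tiers:
        out.extend(tier[:n - len(out)])
        if len(out) == n:
            break
    return out
```

A bandit would replace the fixed `weights` dict with shares learned from which items the user actually engages with.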
Recommendation algorithms are often opaque out of necessity (I guess more generally, ML algorithms), but at least this way they wouldn't be in control of everything.
I wouldn't mind a recommendation engine based on my RSS feeds. But when I subscribe to something via RSS I want to see all of that feed. I can manage unsubscribing myself.
Good to know. I'm planning to add settings so you can control the behavior of each subscription (e.g. choose which ones you want a sample of, choose which ones you want the whole thing), so that workflow should be possible with Findka at some point.
> I'm planning to add settings so you can control the behavior of each subscription (e.g. choose which ones you want a sample of, choose which ones you want the whole thing)
This is exactly what I'd need. I don't really use RSS anymore, but when I did I had the problem that some sources (e.g. some random person's blog) would publish once every few weeks and I'd want to see every post, but others (e.g. the Guardian) would put out multiple posts an hour to the point that subscribing to them would drown out everything else in my feed.
Awesome, glad to get some confirmation. Findka's current sampling method might help with this already, by the way--since it starts out by picking a feed uniformly, posts from the Guardian would take up the same percentage of your daily recommendations as posts from the random blog, no matter how often the Guardian publishes (at least until the random blog's posts run out).
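Here's roughly the idea in code (a simplified sketch, not Findka's actual implementation): because the random choice is over feeds rather than over posts, a high-volume feed can't crowd out a quiet one.

```python
import random

def daily_sample(feeds, n):
    """Pick n posts by choosing a *feed* uniformly at random,
    then taking that feed's next unread post. A feed that posts
    hourly gets no bigger share than one that posts monthly,
    until the quieter feed runs out of posts."""
    queues = {name: list(posts) for name, posts in feeds.items()}
    picks = []
    while len(picks) < n:
        live = [f for f in queues if queues[f]]
        if not live:
            break
        feed = random.choice(live)  # uniform over feeds, not over posts
        picks.append(queues[feed].pop(0))
    return picks
```

With 500 Guardian posts and 50 blog posts, a sample of 60 still splits roughly 30/30 between the two sources.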
This is interesting. It reminds me of, and yet feels like the inverse of, Shaun Inman’s old Fever RSS reader, which would try to identify the “hot” articles as the ones most linked to by your various feeds.
Ah, I miss the good ol’ (though perhaps slightly less productive!) days of checking into my RSS reader a few times a day. Google Reader and Fever were my favorites.
Man, even though the third-party Fever apps were mostly terrible, they had such a scrappy vibe to them. Really liked Fever, back in the days when you could sell a self-hosted thing for 30 bucks.
Good luck adding in an almost limitless number of one-off exceptions to handle broken or non-conformant RSS feeds.
Also be prepared to receive bug reports from all the users of those feeds, who just expect everything to work and are stupefied and angry, because it must be your parser that sucks, not the feed: CNN's feed would never be buggy.
/s
Seriously though, is RSS feed parsing a solved problem?
I'm currently using an off-the-shelf Python library[1], so the entirety of my RSS parsing code is `[x.link for x in feedparser.parse(url).entries]`. There's also Superfeedr[2], which provides an API, but it's expensive (likely because they give realtime updates, which I don't need).
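For contrast, here's what a naive stdlib-only approach looks like (a strict sketch that assumes well-formed RSS 2.0--exactly the assumption that fails on real-world feeds, which is why a lenient library like feedparser is worth using):

```python
import xml.etree.ElementTree as ET

def links_from_rss(xml_text):
    """Extract item links from a *well-formed* RSS 2.0 document.
    Zero tolerance for malformed XML, unlike feedparser."""
    root = ET.fromstring(xml_text)
    return [link.text for link in root.findall("./channel/item/link")]

sample = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""
```

And this doesn't even touch Atom, which uses different element names and namespaces--one more of the many variations a dedicated library already handles.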
I love that people still build beautiful things on top of RSS.
The author writes "I supplement this with content-based filtering, which involves analyzing the text of each essay". How is this implemented? Will he harvest the site for every RSS item featured, or use the shallow description provided?
I am asking because I am heavily into the RSS topic as well [0] and [1]
First I get the text of each article using Newspaper[1] (no RSS involved). Then I use tf-idf and k-means to cluster the articles (I followed this tutorial[2]). Then I combine the clusters with my collaborative filtering model using feature augmentation: for each cluster, I generate N "fake users" who like every item in that cluster, and I add those ratings to the rest of the rating data.
So in effect, content-based filtering is used to handle cold start (when a new article is submitted which doesn't have many ratings yet, the fake user ratings will dominate it), and then as real user ratings are gathered, it will switch gradually to relying only on collaborative filtering.
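The augmentation step itself is simple--here's a toy sketch (hypothetical names, not the exact production code):

```python
def augment_ratings(real_ratings, clusters, n_fake=3):
    """real_ratings: list of (user, item, rating) triples.
    clusters: {cluster_id: [item, ...]} from k-means.
    Adds n_fake synthetic users per cluster who 'like' (rate 1)
    every item in their cluster, so a brand-new article inherits
    the appeal of its cluster-mates until real ratings arrive."""
    fake = [("fake-%s-%d" % (cid, i), item, 1)
            for cid, items in clusters.items()
            for i in range(n_fake)
            for item in items]
    return real_ratings + fake
```

The combined triples then feed into the collaborative filtering model unchanged; as real ratings accumulate, they outnumber the fake ones and dominate the predictions.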
Not going to criticize because (a) there are some cool features I would be interested to try out and (b) RSS sorely needs some revitalization effort.
However, let me point out 3 things I cannot exactly agree with:
1) I do not read feeds individually. Instead, I have a set of folders sorted by priority descending and post frequency ascending (EU projects I am involved in, for example, are in the TOP folder at the top, while HBR and MIT Tech Review are in the SPAM folder at the bottom). I use BazQux as my reader; it allows me to select a folder, read all articles in it using j/k, and when I hit j once the folder is empty, it jumps to the next folder. With Findka, I'd lose this ability to browse things by topic. Off-topic: I tried to do the same on Twitter with lists, and used ManageFlitter to do a reverse sort on the people I follow and drop those who post the most, but failed miserably.
2) I do not read RSS daily. With Twitter or HN (unless you go to hckrnews.com instead of the main website) you are always afraid of missing something, or feel overwhelmed trying to catch up. I read the most important / least frequent folders first, and all posts in all folders are sorted chronologically. A few weeks ago I was catching up with things posted in October in my feed. Folder-wise chronological grouping and sorting means I first catch up on all ~30 posts since October in the TOP folder (again, not on a feed-by-feed basis) and only then get to those 400 posts from HBR and MIT Tech Review. I spend some time there, sign off, and know it will still be there when I'm back. With Findka, I feel I would go back to this anxiety of missing things if I didn't read the newsletter every day. HN is an exception because I want not only to read the article but also to participate in the discussion.
3) I don't want any algorithmic filtering of articles, except perhaps a metric comparing the number of entries posted to the number I starred, opened in a new tab, or read for more than 10-15s before jumping to the next post.
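Something as simple as this would do (a rough sketch with hypothetical names, not a feature of any existing reader):

```python
def engagement_ratio(posted, engaged):
    """posted:  {feed: entries published in some window}
    engaged: {feed: entries starred, opened in a tab, or read
              for more than ~10-15s}
    Returns {feed: engaged/posted} -- a rough "earns its place" score
    for deciding which subscriptions to keep or demote."""
    return {f: engaged.get(f, 0) / n for f, n in posted.items() if n}
```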
I guess this sounds more off-putting than I intended. Go ahead boldly! I will try Findka if it has OPML import, and I wish you great success! Also, now that I am reading my points above, I can totally see why my friends use Twitter and not RSS.
All good points, and thank you! I think Findka is a better fit for people who are not already RSS power users (or RSS users at all)--the idea is to use algorithms to handle this kind of source curation automatically.
Ultimately, I'd like to provide the best of both worlds: automatic, algorithmic curation by default, with controls that allow you to override things manually.
So, I just signed up! There was no OPML import and it made me think for a second and then I realised that I want to offload the SPAM folder to the algorithm (no real spam, just in case). I hope Findka will make catching up on sources that post a lot a bit easier. Let's see how 7 links every 5 days would work :) And thank you for putting the effort into the development!
In certain contexts (like blog articles that aren't dealing with implementation specifics) I use "RSS" to mean "Atom/RSS", same as how "classical music" in many contexts also includes music from the baroque and romantic periods. I used "Atom/RSS" for a while, but recently my feeling is "eh".
> I want to see new things from the same sources that I've already identified as being interesting. To that end, I frequently add subscriptions to a category called "New", and after a dozen articles or so either file it properly or stop pulling it.
Most RSS readers can handle this workflow.