It seems that RSS feed generators are a bit like static site generators: it's often thought to be easier to make your own than to learn to use someone else's.
Because the design of RSS/Atom put all of the complexity on the client (polling, state management, etc.) it's literally the same as static site generation. And by "the same", I don't mean "an equivalent but separate problem". I actually think having two separate generators (one outputting HTML, the other RSS) seems a bit wasteful. They're both parsing (presumably) the same content hierarchy and outputting it as SGML/XML-ish documents served over HTTP. One app should probably just do both (and it's easy to make your own that does).
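To illustrate the "one app does both" point, here is a minimal sketch in Python, not taken from any tool mentioned in this thread. The `POSTS` list, the function names and the example.com URLs are all made up for illustration, and the Atom output is only a skeleton (a valid feed would also need an author, for instance).

```python
from html import escape
from xml.etree import ElementTree as ET

# Hypothetical content store; a real generator would parse files on disk.
POSTS = [
    {"title": "Hello & welcome", "url": "https://example.com/hello",
     "updated": "2021-04-01T12:00:00Z"},
    {"title": "Second post", "url": "https://example.com/second",
     "updated": "2021-04-02T12:00:00Z"},
]


def render_html(posts):
    """Render the post list as an HTML index fragment."""
    items = "\n".join(
        f'  <li><a href="{escape(p["url"])}">{escape(p["title"])}</a></li>'
        for p in posts
    )
    return f"<ul>\n{items}\n</ul>"


def render_atom(posts, feed_title, feed_url):
    """Render the same post list as a skeleton Atom feed."""
    feed = ET.Element("feed", xmlns="http://www.w3.org/2005/Atom")
    ET.SubElement(feed, "title").text = feed_title
    ET.SubElement(feed, "id").text = feed_url
    # Timestamps share one format, so string comparison finds the latest.
    ET.SubElement(feed, "updated").text = max(p["updated"] for p in posts)
    for p in posts:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "title").text = p["title"]
        ET.SubElement(entry, "id").text = p["url"]
        ET.SubElement(entry, "link", href=p["url"])
        ET.SubElement(entry, "updated").text = p["updated"]
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    print(render_html(POSTS))
    print(render_atom(POSTS, "Example blog", "https://example.com/"))
```

Both renderers walk the same list, so adding a post updates the page and the feed together.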
This entire post has galvanized me to write up an idea I've been noodling over as I work on a reader myself: a standard that would eliminate precisely the waste you mention by specifying within the HTML all that's needed for a feed.
See https://sfeed.org. In the spirit of the multi-meaninged RSS acronym itself, the S might stand for scrape, selector, speed, or of course Scotty.
Vinni, might you be interested in enabling the standard in Feed Me Up?
Happy to take a merge request that adds the option to set `extends = "sfeed"` or `extends = "hatom"` and then automatically sets the correct selectors though.
(That said, if a publisher goes through the trouble of adhering to those selectors, they might as well publish a feed while they're at it :)
Ah thanks, didn't know about hAtom, will investigate that.
I do think it's (much) easier to add some HTML classes than output an entire separate file.
I've been using Nuxt + Strapi as my new CMS stack, and while it's a big step forward in so many other ways, outputting an RSS feed is far from automatic.
I disagree. I find it wasteful that every source implements its own way of rendering data. If we ignore the ad problem for a moment, I would love it if RSS were the output of every website and the client then rendered HTML to achieve the best UX possible. No broken layouts, no distractions, no dark patterns, just content.
I mean, it's exactly like a static site generator — I'd call it JAMstack, except the "API" is a plain HTML page and the markup is RSS :)
So yeah, definitely straightforward enough for a case of NIH syndrome. I think putting together the website took more time than writing the tool itself...
Very nice! I work on Feed Creator - https://createfeed.fivefilters.org - which is similar, although unlike yours it doesn't use a headless browser, so selecting JavaScript-inserted elements isn't supported.
That looks great! I ran into a couple of feed generators when I first needed this myself, but they were either paid or clumsy to use. That's when I thought, "alright, I can build this myself, and then I can even use CSS selectors, since I'm comfortable with those anyway". Of course, that is the point I should've thought about explicitly searching for one that supported that :)
Ah well. What I do like about my current approach is that I have full control without having to run my own server, which is nice.
The biggest issue I've found is people struggling to work out which CSS selectors to use. I wrote a blog post not that long ago to help people use the browser's developer tools to do that. Might help too with anyone trying to use this (although perhaps most of the HN audience doesn't need help here): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...
Since everyone is pitching their own, I built https://github.com/fran-penedo/rssify, which started as a fork of https://github.com/h43z/rssify. The basic functionality is similar to Vinnl's: give it a URL and some selectors and it builds the RSS feed. From this, I added a few things: templates (if you want to subscribe to individual projects within a webpage, like fanfics in ao3), transforms (when the data is not quite the text of the DOM element), a flask server you can use to add new URLs you have a template for and update the feeds, and a userscript to add the current URL using the server.
My effort in this space is "furss", though it starts from an rss feed then aims to scrape the full article instead of an extract. https://github.com/jepler/furss
Kinda on a related note I found myself needing to make a bunch of these sorts of scraped feeds. The problem for me was the lack of date parsing support which I sorely needed (and it doesn't appear like this option supports it either)
I did write it specifically for my use-case so there are some "warts" on it like custom generators for HackerOne and Google's Monorail bug tracker. But perhaps someone else might benefit from its ability to create slightly more complicated RSS, Atom, or JSON feeds.
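For anyone wondering what "date parsing support" might look like in a scraper, here is a rough Python sketch. It is not the code of the tool described above. The format list is hypothetical, and treating every scraped date as UTC is an assumption that will be wrong for many sites.

```python
from datetime import datetime, timezone

# Hypothetical formats; extend the list to match the sites being scraped.
KNOWN_FORMATS = ["%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%d/%m/%Y %H:%M"]


def parse_scraped_date(text, formats=KNOWN_FORMATS):
    """Try each format in turn; return an aware datetime or None.

    Assumption: the site publishes dates in UTC. Month names are matched
    using the current locale (English in the default C locale).
    """
    cleaned = " ".join(text.split())  # collapse stray whitespace from markup
    for fmt in formats:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc)
    return None
```

Calling `.isoformat()` on the result gives a timestamp in the shape Atom's `<updated>` element expects. Returning `None` lets the caller decide whether to skip the entry or fall back to the fetch time.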
Couldn't those selectors be maintained by the community? Instead of everyone deploying this on their own GitHub Actions, and having to fix it independently when it breaks, a single repo with all kinds of feeds maintained by everyone?
That's an interesting idea: something like DefinitelyTyped, but instead of type definitions for npm packages it provides selectors for URLs. Main challenge there would be organising the moderation, I suppose.
It should be feasible to analyse the structure over time of the extracted data. Therefore, any proposed change which breaks the anticipated rhythm would be suspect.
Why would a project on GitLab be running GitHub Actions? The project's doc acknowledges that git is distributed and can run an automation script of the user's choice of environment, and so should our language match this rather than implicitly endorsing a centralized service.
I think they simply meant 'instead of everyone running their own CI', not GH Actions in general - they're asking for a centralized feed repository and not one that is subject to people putting in work every few months to fix things that break, as that's likely to hurt adoption. In my opinion it sounds like a request for an open source/more community maintained feedly.com.
I'm aware of this. It's more that words matter and have power and we should not choose to adopt the language of a centralized platform when there are words that encompass all.
In case anyone wants to detect the selectors automatically, here's a small python library I wrote that does it for you: https://github.com/lorey/mlscraper
A good idea and very cleanly implemented.
I imagine that there's a ton of other possible applications that don't require much modification to the code.
Thanks for sharing!
I made a thing like this once to scrape school websites to figure out if there was a snow day. I found that most schools have Twitter and it was simpler and faster to subscribe to Twitter for updates.
For anyone curious:
Free - 2 RSS feeds 24 hour refresh rate
$9.99 - 40 RSS feeds 30 min refresh rate
$19.99 - 100 RSS feeds 15 min refresh rate
My company uses RSS.app & I find it extremely nice to work with. You don't even have to provide selectors/click on elements for the vast majority of websites, they do that all for you.
We could roll our own solution using any of the above offerings (or something built in house), but it's cheap enough for our usecase that we don't see a point.
On a job website I wondered if I could use CSS selectors to get to the important stuff. I think a lot of people have had those thoughts, and the website knew. While the page might have some repeatable structure, the ids and classes were just randomly generated. Also, after you opened a job listing, they were all different too, because the job poster had the ability to set up their own mini HTML that was embedded in the page.
It is entirely possible that they did this to prevent scraping, but it is also possible it isn't intentional like that - for example, if you use the 'styled components' react library, it ends up spitting out a ton of class names that are basically gibberish, not intentionally.
That's been my experience too, as well as encountering a lot of legacy HTML markup without many useful class/id attributes. The random values are likely the result of modern build tools like the one you mention, but there's also the growing popularity of utility-first CSS frameworks like Tailwind, which also make finding useful selectors difficult.
In these situations it might be best to rely on position of elements and other CSS selectors rather than attribute values. Unfortunately the :nth-child and :nth-of-type selectors still trip many people up. In Feed Creator we borrowed from XPath to make selecting by position a little easier. We've got a comparison here: https://help.fivefilters.org/feed-creator/css-selectors.html...
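The difference that trips people up can be shown with a small Python sketch. It uses the XPath-style positional predicates in the standard library's ElementTree, not Feed Creator's own syntax, and the markup is invented for illustration.

```python
from xml.etree import ElementTree as ET

# Invented, well-formed snippet; real pages usually need an HTML parser.
SNIPPET = """
<div>
  <h2>Listings</h2>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</div>
"""

root = ET.fromstring(SNIPPET)

# XPath-style "p[2]": the second <p> among its siblings.
# This matches what CSS calls p:nth-of-type(2).
second_p = root.find("./p[2]")

# CSS p:nth-child(2) asks a different question: "is the second child a <p>?"
# Here the second child is the first <p>, because <h2> occupies position 1.
second_child = list(root)[1]

print(second_p.text)      # Second paragraph
print(second_child.text)  # First paragraph
```

So `p:nth-child(2)` and `p:nth-of-type(2)` pick different paragraphs as soon as a sibling of another type comes first.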
We've since rebranded to New Sloth: https://newsloth.com, which is now a simple integrated feed builder, reader and clusterer/deduplicator, specially aimed for knowledge workers with hundreds and thousands of feeds to monitor daily.
Besides visual selector-based feed generation, our API can auto-magically detect relevant selectors (in most cases) as well.
We were a bunch of friends having a "web ring" (remember those?) and I did something similar using XHTML and XPath back in 2001. We stopped when RSS became commonplace in 2003 and predicted a wonderful future. Almost 20 years later someone reinvented it, but having to rely on styling rather than any semantic or clear structural information. It makes me a bit sad.
Not to piss on his/her parade, though: this is a great thing. I am going to tinker with it tonight.
Haha yeah, came up with the idea and went right ahead and implemented it. Probably should do a Show HN on April 1st though — just set a reminder for next year.
The problem with RSS/Atom feeds has always been that they were mostly auto-generated. If only they were authored like a primary medium (instead of being just crappy auto-cuts of the articles' beginnings with links to the website) they would be way better than social networks.
Thanks. I initially used a regular HTML parser as well, but I quickly ran into sites that wouldn't render without JavaScript. I'm therefore now using a regular browser controlled by Playwright to fetch the websites.
Care to name any sites? I've always managed to find workarounds for everything I've wanted to follow. Most websites want to be indexed by search engines, and googlebot doesn't do javascript. So sometimes a forged user agent is all you need. Occasionally, finding the actual json file and parsing the info you need out of it does the job. etc.
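For reference, the user agent trick is only a few lines with Python's standard library. This is a sketch: whether a given site serves different markup to this crawler-style string is something to check per site, and the function names are made up.

```python
from urllib.request import Request, urlopen

# Crawler-style User-Agent string; whether a site treats it differently
# is an assumption to verify per site.
CRAWLER_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url, user_agent=CRAWLER_UA):
    """Build a GET request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch the page body as text (performs a real network request)."""
    with urlopen(build_request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Splitting request construction from fetching makes it easy to check the headers without touching the network.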
The User Agent trick is a good one that I should've tried, but I just checked and it didn't work for this one. Parsing actual JSON wasn't really an option, as I wanted to be able to quickly and easily add RSS feeds.
Possibly SEO is less a concern for the type of website I initially made this for, i.e. Dutch real estate agents. Most people find their listings through funda.nl rather than through search engines; I was just hoping to see them listed before they got posted there.
Send me a message on Twitter or email me (hacker_news@ my domain) if you still want the URL of a failing website to play around with.
This changes from time to time, of course, but when last I investigated, around two years ago, consensus was that it mostly wouldn’t do JavaScript until you nudged it into doing so in some way that I forget, and that it was always slower to index/update if it needed to do JavaScript.
(For my part, I disable JavaScript by default for various reasons, mostly performance, and it’s decidedly uncommon for a general-internet site to be completely broken by it. Sites that get posted on HN are disproportionately JS-dependent, especially if they’re new.)
> (Of course, for the combined feed this would be problematic.)
Not so; unlike the HTML <base> element which applies to a document, the xml:base attribute is applied to an element and its descendants. The typical pattern (as shown in the RFC 4287 example) is to put it on each entry’s <content>. In your markup, you’ll end up with each entry having its URL in three places:
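The snippet from the original comment is not reproduced here. As a stand-in, here is a minimal Python sketch of the pattern, assuming the three places are the entry's `<id>`, its `<link href>`, and `xml:base` on `<content>`. The function name and URL are made up.

```python
from xml.etree import ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", ATOM_NS)  # serialise Atom elements without a prefix


def build_entry(url, title, html_body):
    """Build an Atom <entry> whose <content> carries its own xml:base."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = url      # place 1 (assumed)
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=url)     # place 2 (assumed)
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", type="html")
    content.set(f"{{{XML_NS}}}base", url)                    # place 3
    content.text = html_body  # escaped on serialisation, as type="html" needs
    return entry


if __name__ == "__main__":
    entry = build_entry(
        "https://example.com/posts/1", "A post", '<img src="photo.jpg">'
    )
    print(ET.tostring(entry, encoding="unicode"))
```

Because `xml:base` is scoped to the element it sits on, each entry in a combined feed can carry a different base URL.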
Excellent! I'll look into actually implementing this before making further comments, since I'm sure I'll find out such things as I do :P
Edit: the package I'm using to generate the feeds does not support that attribute yet, so it'll have to wait a bit for my PR to hopefully be accepted: https://github.com/jpmonette/feed/pull/158
Similarly to my repository, I would suggest an option to fetch the configuration file from an external resource defined via an action secret. For my automation I'm using a Gist (not sure if GitLab has the same thing; also private but publicly accessible snippets).
At least that way you can keep your own feed configuration while allowing those that fork the repository to not have to manually fix conflicts within the feeds.toml config.
Woah this is cool! What you did with the setup documentation and the bit about automation was a nice touch, I wish more projects had this attention to the detail. Very simple, useful and elegant. Thanks for sharing this with us, Vincent!
This reminds me of the Soupault (https://soupault.app/) philosophy for building static sites. You write it in any language you like, pass it through Pandoc or AsciiDoctor as a preprocessor, and postprocess with Lua and CSS selectors.
I have a 7-year-old little girl and I have a hard time explaining CSS selectors to her. Any chance someone can help me? On the other hand, she has no problem grasping the concept of XML and RSS feeds.
Actions should not be used for:
- any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.
I'm not sure I understand the criticism. What exactly is supposed to be the alternative? It's not like this is supposed to be used for high-availability purposes or stock trading or something. It scratches a particular itch for a handful of end users who just want to get all their news in their feed reader. It's expected that they'll maintain their own scraper parameters.
Pick a number of programming languages and you can scrape a page in 10-20 lines of code in many cases. The barrier to entry is understanding the DOM layout for a particular website, which is subject to change at any moment.
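As a rough illustration of the "10-20 lines" claim, here is a sketch using only Python's standard library. The class name to match and the markup are invented, and a real site would need its own matching logic.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href, self._text = attrs.get("href"), []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html, css_class):
    """Return every (href, text) pair for matching links in the document."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links
```

The code is the easy part. The `css_class` argument is where the site-specific knowledge lives, and it is also the part that breaks when the markup changes.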
Purely IMO, a more friendly way to go about it to abstract from code and CSS knowledge is to run a UI that highlights elements, lets you select them, select the title, description, link etc and there you go. Same thing but without the knowledge of DOM/XPath/selectors.
You must have missed the point, or I'm missing the point of this thing. Select whatever you like, do I need an external tool to do that? Either I have knowledge of the DOM or I don't, and if I don't then a UI selector would be the next best guess.
Website doesn’t offer API. So we must scrape. This is a tool for that paradigm.
Your comment is “scraping is brittle”, which is not helpful. Everyone knows scraping sucks. It’s why these tools are being made to make it less unpleasant.
It's not even that 'scraping sucks', that's just a stigma to the word. The point is that it's not stable and having to have DOM knowledge to select the paths is less inclusive.
Yes, I am. And this project is obviously not designed to handle that. (It can't even do pagination.)
But it can bring feed capabilities to simple, timeline like sites. Sites like most frameworks produce. And with the dev tools you can quickly find the needed dom path. It's limited, but easy to use. If it doesn't work, you need a real scraper which is an order of magnitude more complex. (I maintain some of them as well.)
Presenting something on the home page of HN would suggest something novel. I must be missing the point because of the downvotes. Happy to be enlightened. Pick CSS selectors and scrape something? That's quintessentially scraping.
I'd say the novel part of the project is the GitHub Actions/GitLab CI integration. I haven't seen that used for RSS generation yet. That the selectors are stored in a config file is also unusual for these types of feed generation projects.
Though Show HNs do not need to be novel to get to the top. And RSS related projects seem to go to the frontpage simply because of the fondness HN has for the technology. Also, don't forget https://xkcd.com/1053/ :)
If you scrape using accessible parts of the page (e.g. `aria-` attributes) it is less likely to be brittle, since if it were it would mean their site had stopped being accessible to screen readers, etc.
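A small Python sketch of what that could look like, using only the standard library. The markup with gibberish class names is invented, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AriaLabelCollector(HTMLParser):
    """Collect (tag, aria-label, href) for every element with an aria-label."""

    def __init__(self):
        super().__init__()
        self.labelled = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = attrs.get("aria-label")
        if label:
            self.labelled.append((tag, label, attrs.get("href")))


def find_labelled(html):
    """Return all aria-labelled elements found in the document."""
    parser = AriaLabelCollector()
    parser.feed(html)
    return parser.labelled


if __name__ == "__main__":
    # Invented markup: generated class names, but stable aria-labels.
    page = (
        '<nav class="sc-bdVaJa">'
        '<a class="x1y2z3" aria-label="Next page" href="/page/2">&gt;</a>'
        "</nav>"
    )
    print(find_labelled(page))
```

The class names here could change on every build without affecting the result, since the match only depends on the `aria-label` values.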
Is there someplace I can read more about this idea? Any aspect of it. I'm one of the less technically literate members of HN, but I'm a co-owner of a blind dev group.
Not sure of the reason for the downvotes. Like I said, scraping is brittle. I've done hundreds of scraping projects in the past. A recent one was taking 50 shopping sites and looking for what products they covered. Safe to say that when you look at it a month later, the XPaths/CSS selectors/whatever scraping method you used no longer match.
I can't reply to the other subcomments because of the downvotes.
Just to be clear, I'm not against scraping. Done it myself plenty, point was that it's hard to maintain and if I were to make a tool for the masses, it'd be in a UI that highlights elements and lets you select them rather than dig around the DOM. Pretty sure things like this have existed for 10 years +
Ah that's fine, this is explicitly not a tool for the masses. The problem I had with tools like the ones you describe are that either they just use string matching, or the selectors they generate are particularly brittle (e.g. incorporating randomised class names, like you mentioned elsewhere).
Given that there wasn't really an alternative to scraping for me, I wanted to at least be able to pick selectors myself that were less likely to break with minor changes. Then I figured there might be more people who know CSS and have the same desires, hence my sharing it here :)
I do have some code lying around somewhere that would revive wordpress blogs via wayback, using a UI where you could select elements in the 'theme' and restore the backend database. It's simply a messy business IMO. But for one off jobs, scraping is the go to.
Indeed, but the scraping workaround is brittle. We could make rules for all sites on the web that don't have a feed or API but it's not all that manageable.
It also begs the question of why they don't make their content that way, perhaps it was by choice. Especially if you use this kind of tool to syndicate things.
Not quite sure of the novelty of this one given that people have scraped things for decades.
There's probably a wiki endpoint for their example, maybe, maybe not. A lot of wiki's stuff is free to download and they have extensive endpoints for acquiring data.
Re-stating the "brittleness" over and over doesn't add any weight to your argument. Regardless of whether or not the solution is brittle - it is still a viable solution for users who don't care or mind maintaining it. It sounds like you're not the target user and that's ok...
If you have a less brittle alternative that allows me to get the latest real estate listings in my area in my feed reader before it hits the aggregator sites, I'm all ears!
Heh, one thing I've learned over the years is that a great way to not actually publish my side projects is to overthink the name, so nowadays I just pick one and go with it when I'm ready to publish it.
I’m glad to see that you’re using Atom rather than RSS, but you’re still calling it RSS. Can I convince you to stop calling it that? It’s factually inaccurate, and keeps RSS, the inferior format, more popular by virtue of mindshare. Just call them feeds, the proper generic term.
Same problem with SSL encryption. But with both RSS and SSL, if users have a modern client, they shouldn't need to care whether it's internally using RSS, Atom, SSL or TLS.
I for one know most of my pals do know about RSS, but haven't heard the term 'Atom feed'.
“SSL” is finally dying as a term; the significant majority of what I see calls it TLS now.
But the problem didn’t exist in the same way with SSL/TLS: SSL was actively killed off in favour of TLS so that regardless of what you call it, you’re actually dealing with TLS. But with feeds, RSS is still supported, and so the mindshare problem happens: people hear about RSS and so implement the inferior and problematic RSS rather than Atom, because they’ve never even heard of Atom and so don’t know that it’s what they should have implemented instead, because it’s more robust. Calling feeds RSS perpetuates RSS, which is undesirable.
I say just advertise it as “feeds”, not “RSS feeds”. Atom is an implementation detail, just like RSS should be. (Well, RSS should be dead in general feeds, with only podcast feeds keeping it alive.)
> I say just advertise it as “feeds”, not “RSS feeds”.
I think that's even worse. For most ordinary people that has a bunch of unintended meanings. "My Facebook has a feed, do you mean subscribing to your site on Facebook? Why do I see a bunch of code when I click the link?"
I think it's the opposite. People continuing to call it RSS is limiting their understanding of this domain. Most people are all about RSS this, reviving that, and talk about the specific format instead of the purpose for which RSS is actually used.
This also leads to rather poor solutions and discussions which are just reinventing the old stuff again and again, instead of letting the domain evolve to match modern demand.
Take for example this project here. Is there any feedreader out there who has this built-in to make it comfortable for the user to use this? The state of tooling as I know it is still at the level where we have separate tools which push out a news feed format, which then must be maintained separately in our feed readers. It is an extraordinary hassle to maintain configuration in multiple places.
> Take for example this project here. Is there any feedreader out there who has this built-in to make it comfortable for the user to use this?
Maybe I'm misunderstanding the question. This project is easy to use in all feed readers precisely because it just generates a feed URL that you can subscribe to like any other feed.
On the other hand, maybe you're asking why the feed readers don't have this feature built in. Some of them do! The one I've used for years, Inoreader, has it, although it might be a premium feature.
On the other other hand, sometimes it seems from your comment like your issue is with the very idea of RSS. That a website can publish a feed in a well-defined / open format and that feed can then be consumed by any conforming reader. That's as I see it the beauty of the whole system. I don't need to use the site itself, if it sucks. I can subscribe to anything I like using whatever software I like, rather than relying on a centralized service. What's the alternative, something like Twitter where I subscribe to a bunch of accounts and they push whatever they like into my feed?
I do not care. RSS and Atom both win over closed platforms. If I could get back to an RSS-oriented internet, I would rejoice even if Atom didn’t exist.
RSS is fine for most uses, and is a much better "brand" than Atom.
There never was a valid reason to define a separate format rather than extending RSS to clarify ambiguities - I think it's time to drop this bikeshedding and just think of Atom as a slightly different type of RSS.
Anyway, here's another self-hosted open source RSS feed generator for arbitrary websites: https://github.com/hueyy/HungryHippo