Show HN: RSS feeds for arbitrary websites using CSS selectors

DeLopSpot · on July 6, 2021

It seems that RSS feed generators are a bit like static site generators: it's often thought to be easier to make your own than to learn to use someone else's.

Anyway, here's another self-hosted open source RSS feed generator for arbitrary websites: https://github.com/hueyy/HungryHippo

lucideer · on July 6, 2021

Because the design of RSS/Atom put all of the complexity on the client (polling, state management, etc.) it's literally the same as static site generation. And by "the same", I don't mean "an equivalent but separate problem". I actually think having two separate generators—one outputting HTML, the other RSS—seems a bit wasteful. They're both parsing (presumably) the same content hierarchy and outputting it as SGML/XML-ish documents served over HTTP. One app should probably just do both (and it's easy to make your own that does)

askhan · on July 6, 2021

This entire post has galvanized me to write up an idea I've been noodling over as I work on a reader myself: a standard that would eliminate precisely the waste you mention by specifying within the HTML all that's needed for a feed.

See https://sfeed.org. In the spirit of the multi-meaninged RSS acronym itself, the S might stand for scrape, selector, speed, or of course Scotty.

Vinni, might you be interested in enabling the standard in Feed Me Up?

Vinnl · on July 6, 2021

There already is such a standard! https://microformats.org/wiki/hatom

Happy to take a merge request that adds the option to set `extends = "sfeed"` or `extends = "hatom"` and then automatically sets the correct selectors though.

(That said, if a publisher goes through the trouble of adhering to those selectors, they might as well publish a feed while they're at it :)

detaro · on July 6, 2021

https://microformats.org/wiki/h-feed is the up-to-date version

askhan · on July 6, 2021

I like this name better — the h prefix is cool in that it shows things are coming all the way from basic HTML.

Microformats, I see this is the proper way.

Vinnl · on July 6, 2021

Ah thanks - been out of that world for a long time now.

askhan · on July 6, 2021

Ah thanks, didn't know about hAtom, will investigate that.

I do think it's (much) easier to add some HTML classes than output an entire separate file.

I've been using Nuxt + Strapi as my new CMS stack, and while it's a big step forward in so many other ways, outputting an RSS feed is far from automatic.

pedro1976 · on July 8, 2021

I disagree. I find the it wasteful, that every source will implement their way of rendering data. If we ignore the ad-problem for a moment, I would love if RSS would be the output of every website and the client then renders HTML to achieve the best UX possible. No broken layouts, no distractions, no dark pattern, just content.

niutech · on July 8, 2021

This is more less how Google Web Light worked.

Vinnl · on July 6, 2021

I mean, it's exactly like a static site generator — I'd call it JAMstack, except the "API" is a plain HTML page and the markup is RSS :)

So yeah, definitely straightforward enough for a case of NIH syndrome. I think putting together the website took more time than writing the tool itself...

pedro1976 · on July 6, 2021

I created a automated HTML->Feed mapper [0], that simply analyzes the structure and offers you potential feeds.

[0] https://github.com/damoeb/rss-proxy/

k1m · on July 5, 2021

Very nice! I work on Feed Creator - https://createfeed.fivefilters.org - which is similar, although unlike yours doesn't use a headless browser, so selecting Javacript-inserted elements isn't supported.

Vinnl · on July 5, 2021

That looks great! I ran into a couple of feed generators when I first needed this myself, but they were either paid or clumsy to use. That's when I thought, "alright, I can build this myself, and then I can even use CSS selectors, since I'm comfortable with those anyway". Of course, that is the point I should've thought about explicitly searching for one that supported that :)

Ah well. What I do like about my current approach is that I have full control without having to run my own server, which is nice.

k1m · on July 5, 2021

The more the merrier! :) And always nice to see more in the RSS space.

Feed Creator is in PHP and unlike yours doesn't produce a static file. The CSS selectors, URL and other parameters are all embedded in the URL, e.g.

  https://createfeed.fivefilters.org/extract.php?url=http%3A%2F%2Fjohnpilger.com%2Farticles&item=.entry&item_desc=.intro&item_date=.entry-date

The biggest issue I've found is people struggling to work out which CSS selectors to use. I wrote a blog post not that long ago to help people use the browser's developer tools to do that. Might help too with anyone trying to use this (although perhaps most of the HN audience doesn't need help here): https://www.fivefilters.org/2021/how-to-turn-a-webpage-into-...

axiolite · on July 5, 2021

Here's a list of RSS feed generators:

https://github.com/AboutRSS/ALL-about-RSS#webpagehtml

fran-penedo · on July 5, 2021

Since everyone is pitching their own, I built https://github.com/fran-penedo/rssify, which started as a fork of https://github.com/h43z/rssify. The basic functionality is similar to Vinnl's: give it a URL and some selectors and it builds the RSS feed. From this, I added a few things: templates (if you want to subscribe to individual projects within a webpage, like fanfics in ao3), transforms (when the data is not quite the text of the DOM element), a flask server you can use to add new URLs you have a template for and update the feeds, and a userscript to add the current URL using the server.

jepler · on July 6, 2021

My effort in this space is "furss", though it starts from an rss feed then aims to scrape the full article instead of an extract. https://github.com/jepler/furss

kdbg · on July 5, 2021

Kinda on a related note I found myself needing to make a bunch of these sorts of scraped feeds. The problem for me was the lack of date parsing support which I sorely needed (and it doesn't appear like this option supports it either)

I ended up writing my own CLI tool that similarly supports CSS selectors for feed generation: https://github.com/dayzerosec/feedgen

I did write it specifically for my use-case so there are some "warts" on it like custom generators for HackerOne and Google's Monorail bug tracker. But perhaps someone else might benefit from its ability to create slightly more complicated RSS, Atom, or JSON feeds.

Example config with date parsing: https://github.com/dayzerosec/feedgen/blob/main/configs/bish...

remram · on July 5, 2021

Couldn't those selectors be maintained by the community? Instead of everyone deploying this on their own GitHub Actions, and having to fix it independently when it breaks, a single repo with all kinds of feeds maintained by everyone?

Vinnl · on July 5, 2021

That's an interesting idea: something like DefinitelyTyped, but instead of type definitions for npm packages it provides selectors for URLs. Main challenge there would be organising the moderation, I suppose.

contingencies · on July 5, 2021

It should be feasible to analyse the structure over time of the extracted data. Therefore, any proposed change which breaks the anticipated rhythm would be suspect.

specialist · on July 6, 2021

Yes, say more, please.

I've long been waiting for the rise of the unblocker. Community curated, like you suggest.

Ad blockers use filters to exclude. Like a blacklist.

My hunch is a strategy of using scrapers to extract OC, completely skipping over ads, would also be viable.

The output could be a bit more rich than today's Reader View.

toastal · on July 6, 2021

Why would a project on GitLab be running GitHub Actions? The project's doc acknowledges that git is distributed and can run an automation script of the user's choice of environment, and so should our language match this rather than implicitly endorsing a centralized service.

judge2020 · on July 6, 2021

I think they simply meant 'instead of everyone running their own CI', not GH Actions in general - they're asking for a centralized feed repository and not one that is subject to people putting in work every few months to fix things that break, as that's likely to hurt adoption. In my opinion it sounds like a request for an open source/more community maintained feedly.com.

toastal · on July 6, 2021

I'm aware of this. It's more that words matter and have power and we should not choose to adopt the language of a centralized platform when there are words that encompass all.

lorey · on July 5, 2021

In case anyone wants to detect the selectors automatically, here's a small python library I wrote that does it for you: https://github.com/lorey/mlscraper

clickok · on July 5, 2021

A good idea and very cleanly implemented. I imagine that there's a ton of other possible applications that don't require much modification to the code. Thanks for sharing!

lorey · on July 9, 2021

Thank you so much, happy to hear! There's a more versatile version coming soon :)

kwerk · on July 5, 2021

This looks interesting! Will you be adding an open source license?

lorey · on July 9, 2021

Will do, just haven't decided yet :)

earthboundkid · on July 6, 2021

I made a thing like this once to scrape school websites to figure out if there was a snow day. I found that most schools have Twitter and it was simpler and faster to subscribe to Twitter for updates.

earthboundkid · on July 6, 2021

The project: https://github.com/baltimore-sun-data/track-changes

ianberdin · on July 5, 2021

My friend is working on a quite simple feed generator from any website, social media: https://rss.app. Maybe helpful.

rnkn · on July 6, 2021

Some feedback for the project, please reveal pricing/limits before signup.

goleary · on July 8, 2021

Agreed, they should show it before signup.

For anyone curious: Free - 2 RSS feeds 24 hour refresh rate $9.99 - 40 RSS feeds 30 min refresh rate $19.99 - 100 RSS feeds 15 min refresh rate

My company uses RSS.app & I find it extremely nice to work with. You don't even have to provide selectors/click on elements for the vast majority of websites, they do that all for you.

We could roll our own solution using any of the above offerings (or something built in house), but it's cheap enough for our usecase that we don't see a point.

peanutz454 · on July 5, 2021

On a job website I wondered if I could use CSS selectors to get to the important stuff. I think a lot of people have had those thoughts, and the website knew. While the page might have some repeatable structure, the ids and classes were just randomly generated. Also, After you open the job listing, they too were all different because the job poster had the ability to set up their own mini html that was embedded in the page.

VoidWhisperer · on July 5, 2021

It is entirely possible that they did this to prevent scraping, but it is also possible it isn't intentional like that - for example, if you use the 'styled components' react library, it ends up spitting out a ton of class names that are basically gibberish, not intentionally.

k1m · on July 5, 2021

That's been my experience too, as well as encountering a lot of legacy HTML markup without many useful class/id attributes. The random values are likely result of modern build tools like the one you mention, but there's also the growing popularity of utility-first CSS frameworks like Tailwind which also make finding useful selectors difficult.

In these situations it might be best to rely on position of elements and other CSS selectors rather than attribute values. Unfortunately the :nth-child and :nth-of-type selectors still trip many people up. In Feed Creator we borrowed from XPath to make selecting by position a little easier. We've got a comparison here: https://help.fivefilters.org/feed-creator/css-selectors.html...

dugite-code · on July 6, 2021

Always good to see RSS projects pop up on hackernews. I'm still maintaining the Feediron plugin for TT-RSS - https://github.com/feediron/ttrss_plugin-feediron

Unlike this project Feediron is only for modifying existing RSS feeds to extract the desired information. Typically uses xpaths to select content

pacman2 · on July 5, 2021

Thank you, will try it out.

Two things I am using:

Twitter to RSS: https://github.com/RSS-Bridge/rss-bridge

Arbitrary RSS feeds: https://feedity.com

nreece · on July 6, 2021

Thanks for using and mentioning Feedity!

We've since rebranded to New Sloth: https://newsloth.com, which is now a simple integrated feed builder, reader and clusterer/deduplicator, specially aimed for knowledge workers with hundreds and thousands of feeds to monitor daily.

Besides visual selector-based feed generation, our API can auto-magically detect relevant selectors (in most cases) as well.

pacman2 · on July 6, 2021

oh no. My free feedity feeds will keep working?

bjoli · on July 6, 2021

We were a bunch of friends having a "web ring" (remember those?) and I did something similar using XHTML and xpath back in 2001. We stopped when RSS became commonplace in 2003and predicted a wonderful future. Almost 20 years later someone reinvented it, but having to rely om styling rather any semantic or clear structural indormation. It makes me a bit sad.

Not to piss on his/her parade, though: this is a great thing. I am going to tinker with it tonight.

canada_dry · on July 5, 2021

Browsing OP's repositories was this gem: https://gitlab.com/vincenttunru/flatuscode A VSCode extension. That adds farts on every keypress. That's all.

hallway_monitor · on July 5, 2021

Pretty great for April 1st... Or whenever you find an unlocked machine!

Vinnl · on July 5, 2021

If you're back in the office, I can recommend enabling it for just one particular programming language on your coworker's VSCode.

canada_dry · on July 5, 2021

When I saw it, I did do the quick calc based on the last update: worked out to around Christmas. OP might have just been bored over the holidays.

Vinnl · on July 5, 2021

Haha yeah, came up with the idea and went right ahead and implemented it. Probably should do a Show HN on April 1st though — just set a reminder for next year.

qwerty456127 · on July 9, 2021

The problem with RSS/Atom feeds has always been they were mostly auto-generated. If only those was authored like a primary medium (instead of being just crappy auto-cuts of the articles' beginnings with links to the web site) it would be way better than social networks.

rcarmo · on July 5, 2021

This is nice. I’m actually doing much the same with Node-RED’s HTML parser, which also supports simple selectors.

Vinnl · on July 5, 2021

Thanks. I initially used a regular HTML parser as well, but I quickly ran into sites that wouldn't render without JavaScript. I'm therefore now using a regular browser controlled by Playwright to fetch the websites.

axiolite · on July 5, 2021

Care to name any sites? I've always managed to find workarounds for everything I've wanted to follow. Most websites want to be indexed by search engines, and googlebot doesn't do javascript. So sometimes a forged user agent is all you need. Occasionally, finding the actual json file and parsing the info you need out of it does the job. etc.

Vinnl · on July 5, 2021

The User Agent trick is a good one that I should've tried, but I just checked and it didn't work for this one. Parsing actual JSON wasn't really an option, as I wanted to be able to quickly and easily add RSS feeds.

Possibly SEO is less a concern for the type of website I initially made this for, i.e. Dutch real estate agents. Most people find their listings through funda.nl rather than through search engines; I was just hoping to see them listed before they got posted there.

Send me a message on Twitter or email me (hacker_news@ my domain) if you still want the URL of a failing website to play around with.

moehm · on July 5, 2021

> googlebot doesn't do javascript

But it does.

Primary source: I experienced it on my own site.

Secondary source: https://developers.google.com/search/docs/guides/fix-search-...

chrismorgan · on July 5, 2021

This changes from time to time, of course, but when last I investigated, around two years ago, consensus was that it mostly wouldn’t do JavaScript until you nudged it into doing so in some way that I forget, and that it was always slower to index/update if it needed to do JavaScript.

(For my part, I disable JavaScript by default for various reasons, mostly performance, and it’s decidedly uncommon for a general-internet site to be completely broken by it. Sites that get posted on HN are disproportionately JS-dependent, especially if they’re new.)

bellyfullofbac · on July 5, 2021

I wish there's a snippet of what the XML looks like, even better if it's "rendered"...

Vinnl · on July 5, 2021

That's a good one, I'll see if I can add something to the site. Meanwhile you can see the generated examples here:

- https://vincenttunru.gitlab.io/feeds/funfacts.xml

- https://vincenttunru.gitlab.io/feeds/wikivoyage.xml

And the combined feed:

- https://vincenttunru.gitlab.io/feeds/all.xml

Add those links to your feed reader to see rendered examples while I update the site.

Edit: preview added to the website.

moehm · on July 5, 2021

FYI, the links inside your feeds don't work, because they are relative, not absolute.

Vinnl · on July 5, 2021

Ah, you mean the ones inside the contents? That's a good one. I'm not sure if that's easily fixable, but I'll give it some thought. For those interested, I'll track it here: https://gitlab.com/vincenttunru/feed-me-up-scotty/-/issues/1

chrismorgan · on July 5, 2021

The appropriate way is to use the xml:base attribute, as demonstrated in the example at https://datatracker.ietf.org/doc/html/rfc4287#page-4.

Vinnl · on July 5, 2021

Thanks! I'll look at that.

chrismorgan · on July 5, 2021

> (Of course, for the combined feed this would be problematic.)

Not so; unlike the HTML <base> element which applies to a document, the xml:base attribute is applied to an element and its descendants. The typical pattern (as shown in the RFC 4287 example) is to put it on each entry’s <content>. In your markup, you’ll end up with each entry having its URL in three places:

  <id>http://example.com/item</id>
  <link href="http://example.com/item"/>
  <content type="html" xml:base="http://example.com/item">…</content>

Vinnl · on July 5, 2021

Excellent! I'll look into actually implementing this before making further comments, since I'm sure I'll find out such things as I do :P

Edit: the package I'm using to generate the feeds does not support that attribute yet, so it'll have to wait a bit for my PR to hopefully be accepted: https://github.com/jpmonette/feed/pull/158

Thanks for the pointers!

mhitza · on July 5, 2021

This could nicely supplement my GitHub automation that emails feed digests https://github.com/mhitza/subscriptions-digest

Similarly to my repository, I think I would suggest the option to fetch the configuration file from an external resource defined via an action secret. For my automation I'm using a Gist (not sure if Gitlab has same thing; also private but publicly accessible snippets).

At least that way you can keep your own feed configuration while allowing those that fork the repository to not have to manually fix conflicts within the feeds.toml config.

0des · on July 5, 2021

Woah this is cool! What you did with the setup documentation and the bit about automation was a nice touch, I wish more projects had this attention to the detail. Very simple, useful and elegant. Thanks for sharing this with us, Vincent!

spacec0wb0y · on July 5, 2021

What do I do to get this working?

So far I've forked feeds, edited feeds.toml, checked it out as a branch gh-pages, pushed the branch up to github.

I can see the page at https://<username>.github.io/feeds/ but https://<username>.github.io/feeds/actions is just a 404.

Vinnl · on July 5, 2021

Ah, those instructions are unclear — as far as I know, you first have to go to https://github.com/<username>/feeds/actions to enable Workflows for your repository. Then, your feeds should be published to https://<username>.github.io/feeds/<feedname>.xml.

Does that work?

spacec0wb0y · on July 5, 2021

not exactly. I saw below the instruction to run `npx feed-me-up-scotty` so I did and it generated the public/ dir and feeds.

ok I managed to access the actions route by a different URL to the readme. I copied the pages.yml workflow from your repo.

Few minutes later I could see my feed. Very nice! I need to clean up my selectors now!

spicybright · on July 5, 2021

I very simple plumbing tools like these.

toastal · on July 5, 2021

This reminds me of the Soupault (https://soupault.app/) philosophy for building static sites. You write it in any language you want like, pass it trough Pandoc or AsciiDoctor as preprocessor, and postprocess with Lua and CSS selectors.

vhodges · on July 5, 2021

You both might find https://stitcherd.vhodges.dev/ interesting then.

ossusermivami · on July 13, 2021

I have a 7 years old litle girl and i have hard time explaining her CSS selectors, any chance someone can help me ? on the other hand she has no problem grasping the concept of XML and RSS feeds,

galambo · on July 9, 2021

Related: https://galambo.wordpress.com/2021/04/21/rss-tools/

colordrops · on July 5, 2021

This seems like it could be used to build a simple scraper as well.

Vinnl · on July 5, 2021

I'd recommend using the tools I used for this directly if you're looking to do this. Playwright in particular: https://playwright.dev

alanchen · on July 5, 2021

This perhaps has more flexibility and can deal with almost any website.

shadowbanned69 · on July 6, 2021

https://weboas.is/tech/ quite a collection of tech RSS feeds

flowerlad · on July 6, 2021

I would love to have an RSS feed for stock quotes that works by scraping Google finance. A lot of people would love to have that as well.

tgel0 · on July 6, 2021

I created a free sheet-to-RSS service which lets you do that (somewhat). Here is a short write-up including the links:

https://www.notion.so/How-to-create-your-own-stock-and-crypt...

asdff · on July 6, 2021

You can do this with python libraries already

jazzdev · on July 8, 2021

So GitHub Actions is essentially a free server to run any background cron jobs I want?

jazzdev · on July 8, 2021

I guess not. Github terms[1]:

Actions should not be used for: - any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.

[1] https://docs.github.com/en/github/site-policy/github-terms-f...

ricardo81 · on July 5, 2021

Nice, but it's essentially scraping? Scraping is brittle.

Some design changes and the whole thing breaks. Maintenance nightmare.

bscphil · on July 5, 2021

I'm not sure I understand the criticism. What exactly is supposed to be the alternative? It's not like this is supposed to be used for high-availability purposes or stock trading or something. It scratches a particular itch for a handful of end users who just want to get all their news in their feed reader. It's expected that they'll maintain their own scraper parameters.

ricardo81 · on July 5, 2021

Pick a number of programming languages and you can scrape a page in 10-20 lines of code in many cases. The barrier to entry is understanding the DOM layout for a particular website, which is subject to change at any moment.

Purely IMO, a more friendly way to go about it to abstract from code and CSS knowledge is to run a UI that highlights elements, lets you select them, select the title, description, link etc and there you go. Same thing but without the knowledge of DOM/XPath/selectors.

cromka · on July 5, 2021

> highlights elements, lets you select them,

And how would that software then remember your chosen elements if not by their CSS ids/classes?

ricardo81 · on July 5, 2021

You must have missed the point, or I'm missing the point of this thing. Select whatever you like, do I need an external tool to do that? Either I have knowledge of the DOM or I don't, and if I don't then a UI selector would be the next best guess.

z3c0 · on July 5, 2021

By positional indexing, of course.

Just kidding, that's horrendous.

kortilla · on July 5, 2021

Website doesn’t offer API. So we must scrape. This is a tool for that paradigm.

Your comment is “scraping is brittle”, which is not helpful. Everyone knows scraping sucks. It’s why these tools are being made to make it less unpleasant.

ricardo81 · on July 5, 2021

It's not even that 'scraping sucks', that's just a stigma to the word. The point is that it's not stable and having to have DOM knowledge to select the paths is less inclusive.

moehm · on July 5, 2021

> having to have DOM knowledge to select the paths is less inclusive

Right click on the element in Firefox, click "inspect element" and it shows you the unique selector for that element.

ricardo81 · on July 5, 2021

Are you aware that some websites use randomised css ids and classes? Have to wonder why. Google search results being an example.

moehm · on July 5, 2021

Yes, I am. And this project is obviously not designed to handle that. (It can't even do pagination.)

But it can bring feed capabilities to simple, timeline like sites. Sites like most frameworks produce. And with the dev tools you can quickly find the needed dom path. It's limited, but easy to use. If it doesn't work, you need a real scraper which is an order of magnitude more complex. (I maintain some of them as well.)

playpause · on July 5, 2021

There’s usually some way to target what you want through descendent/sibling, tag name, and attribute selectors.

ricardo81 · on July 5, 2021

Yes, if you're determined enough you can scrape anything. Just not sure what the new thing here is.

detaro · on July 5, 2021

who has claimed there is something fundamentally new here?

ricardo81 · on July 5, 2021

Presenting something on the home page of HN would suggest something novel. I must be missing the point because of the downvotes. Happy to be enlightened. Pick CSS selectors and scrape something? That's quintessentially scraping.

onli · on July 5, 2021

I'd say the novel part of the project is the Github actions/Gitlab CI integration. I haven't seen that used for RSS generation yet. That the selectors are stored in a config file is also unusual for these types of feed generation projects.

Though Show HNs do not need to be novel to get to the top. And RSS related projects seem to go to the frontpage simply because of the fondness HN has for the technology. Also, don't forget https://xkcd.com/1053/ :)

lhnz · on July 5, 2021

If you scrape using accessible parts of the page (e.g. `aria-` attributes) it is less likely to be brittle, since if it were it would mean their site had stopped being accessible to screen readers, etc.

KirillPanov · on July 5, 2021

Why the downvotes here? This is objectively true and, in my experience, very very useful.

Screen readers are scrapers too. "Scraper" does not mean "user agent I don't like".

DoreenMichele · on July 5, 2021

Is there someplace I can read more about this idea? Any aspect of it. I'm one of the less technically literate members of HN, but I'm a co-owner of a blind dev group.

Please and thank you.

KirillPanov · on July 5, 2021

https://developer.mozilla.org/en-US/docs/Web/Accessibility/A...

ricardo81 · on July 5, 2021

Not sure for the downvote reason. Like I say scraping is brittle. I've done hundreds of scraping projects in the past. A recent one was taking 50 shopping sites looking for what products they covered. Safe to say you look at it a month later and the XPaths/CSS selectors/whatever scraping method you used has changed.

This is why APIs exist.

Vinnl · on July 5, 2021

Ideally sites would just publish RSS feeds themselves, but not all of them do — let alone an API…

ricardo81 · on July 5, 2021

I can't reply to the other subcomments because of the downvotes.

Just to be clear, I'm not against scraping. Done it myself plenty, point was that it's hard to maintain and if I were to make a tool for the masses, it'd be in a UI that highlights elements and lets you select them rather than dig around the DOM. Pretty sure things like this have existed for 10 years +

Vinnl · on July 5, 2021

Ah that's fine, this is explicitly not a tool for the masses. The problem I had with tools like the ones you describe are that either they just use string matching, or the selectors they generate are particularly brittle (e.g. incorporating randomised class names, like you mentioned elsewhere).

Given that there wasn't really an alternative to scraping for me, I wanted to at least be able to pick selectors myself that were less likely to break with minor changes. Then I figured there might be more people who know CSS and have the same desires, hence my sharing it here :)

ricardo81 · on July 5, 2021

I do have some code lying around somewhere that would revive wordpress blogs via wayback, using a UI where you could select elements in the 'theme' and restore the backend database. It's simply a messy business IMO. But for one off jobs, scraping is the go to.

ricardo81 · on July 5, 2021

Indeed, but the scraping workaround is brittle. We could make rules for all sites on the web that don't have a feed or API but it's not all that manageable.

It also begs the question of why they don't make their content that way, perhaps it was by choice. Especially if you use this kind of tool to syndicate things.

Not quite sure of the novelty of this one given that people have scraped things for decades.

There's probably a wiki endpoint for their example, maybe, maybe not. A lot of wiki's stuff is free to download and they have extensive endpoints for acquiring data.

eyedontgetit · on July 5, 2021

Re-stating the "brittleness" over and over doesn't add any weight to your argument. Regardless of whether or not the solution is brittle - it is still a viable solution for users who don't care or mind maintaining it. It sounds like you're not the target user and that's ok...

Vinnl · on July 5, 2021

If you have a less brittle alternative that allows me to get the latest real estate listings in my area in my feed reader before it hits the aggregator sites, I'm all ears!

cyborgx7 · on July 5, 2021

this is something I've been looking for

neat

markdown · on July 5, 2021

Great project, but that's a shitty name.

Vinnl · on July 5, 2021

Heh, one thing I've learned over the years is that a great way to not actually publish my side projects is to overthink the name, so nowadays I just pick one and go with it when I'm ready to publish it.

markdown · on July 5, 2021

I can't disagree with that :)

chrismorgan · on July 5, 2021

I’m glad to see that you’re using Atom rather than RSS, but you’re still calling it RSS. Can I convince you to stop calling it that? It’s factually inaccurate, and keeps RSS, the inferior format, more popular by virtue of mindshare. Just call them feeds, the proper generic term.

nvr219 · on July 5, 2021

If OP didn't use the word RSS and said "atom feeds" I wouldn't know what they're talking about, so I recommend keep calling it RSS.

slightwinder · on July 6, 2021

The generic name for this is Newsfeed (or webfeed according to wikipedia, but never heard this in the wild).

Would you understand it with those names?

moehm · on July 5, 2021

Same problem with SSL encryption. But with both RSS and SSL, if the user use a modern client, he shouldn't bother if its internally using RSS, Atom, SSL or TLS.

I for one know most of my pals do know about RSS, but haven't heard the term 'Atom feed'.

chrismorgan · on July 5, 2021

“SSL” is finally dying as a term; the significant majority of what I see calls it TLS now.

But the problem didn’t exist in the same way with SSL/TLS: SSL was actively killed off in favour of TLS so that regardless of what you call it, you’re actually dealing with TLS. But with feeds, RSS is still supported, and so the mindshare problem happens: people hear about RSS and so implement the inferior and problematic RSS rather than Atom, because they’ve never even heard of Atom and so don’t know that it’s what they should have implemented instead, because it’s more robust. Calling feeds RSS perpetuates RSS, which is undesirable.

I say just advertise it as “feeds”, not “RSS feeds”. Atom is an implementation detail, just like RSS should be. (Well, RSS should be dead in general feeds, with only podcast feeds keeping it alive.)

bscphil · on July 5, 2021

> I say just advertise it as “feeds”, not “RSS feeds”.

I think that's even worse. For most ordinary people that has a bunch of unintended meanings. "My Facebook has a feed, do you mean subscribing to your site on Facebook? Why do I see a bunch of code when I click the link?"

slightwinder · on July 6, 2021

I think it's the oposite. People continuing calling it RSS is limiting there understanding of this domain. Most people are all about RSS this, reviving that and talk about the specific format, instead of the purpose for which RSS is actually used.

This also leads to rather poor solutions and discussions which are just reinventing the old stuff again and again, instead of letting the domain evolve to match modern demand.

Take for example this project here. Is there any feedreader out there who has this built-in to make it comfortable for the user to use this? The state of tooling as I know it is still at the level were we have seperate tools which pushes out a newsfeed-format, which than must be maintainend seperatly in our feedreaders. Which is an extraordinary hazzle to maintain configurations on multiple places.

bscphil · on July 7, 2021

> Take for example this project here. Is there any feedreader out there who has this built-in to make it comfortable for the user to use this?

Maybe I'm misunderstanding the question. This project is easy to use in all feed readers precisely because it just generates a feed URL that you can subscribe to like any other feed.

On the other hand, maybe you're asking why the feed readers don't have this feature built in. Some of them do! The one I've used for years, Inoreader, has it, although it might be a premium feature.

On the other other hand, sometimes it seems from your comment like your issue is with the very idea of RSS. That a website can publish a feed in a well-defined / open format and that feed can then be consumed by any conforming reader. That's as I see it the beauty of the whole system. I don't need to use the site itself, if it sucks. I can subscribe to anything I like using whatever software I like, rather than relying on a centralized service. What's the alternative, something like Twitter where I subscribe to a bunch of accounts and they push whatever they like into my feed?

justshowpost · on July 6, 2021

SSL is TLS looks similar but aren't the same. Today, we use TLS.

alanfranz · on July 5, 2021

I do not care. RSS and Atom both win over closed platforms. If I could get back to an RSS-oriented internet, I would rejoice even if Atom didn’t exist.

account42 · on July 6, 2021

RSS is fine for most uses, and is a much better "brand" than Atom.

There never was a valid reason to define a separate format rather than extending RSS to clarify ambiguities - I think its time to drop this bikeshedding and just think of Atom as a slightly different type of RSS.

pmlnr · on July 5, 2021

Oh, yes, please revive the bloody format wars /s