ArchiveTeam needs OPMLs and feed URLs to grab cached data from Google Reader (ludios.org)
86 points by ivank on June 28, 2013 | 25 comments



ArchiveTeam has saved cached Reader feed data for 37.3M feeds so far, and even though this seems like a lot, it still doesn't include many of the feeds people are subscribed to. Hence the request for OPMLs/subscriptions.xml files.

If you're interested in being able to read old posts in some future feed reading software, or just like having the data preserved, you can upload your OPMLs and ArchiveTeam will make its best effort to grab the feeds.

More details: http://archiveteam.org/index.php?title=Google_Reader

7TB+ of compressed feed text: http://archive.org/details/archiveteam_greader

Also, if anyone has billions of URLs that I can query, I could use them to infer feed URLs and save an incredible amount of stuff. See email in profile if you do.
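
Not ArchiveTeam's actual inference code, but a minimal sketch of the obvious approach (feed autodiscovery via <link rel="alternate"> tags), assuming the requests library; the function name is made up:

    import re
    import urllib.parse
    import requests

    LINK_RE = re.compile(r'<link[^>]+type=["\']application/(?:rss|atom)\+xml["\'][^>]*>', re.I)
    HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.I)

    def discover_feeds(page_url):
        # Fetch the page and return any feed URLs it advertises in <link> tags.
        html = requests.get(page_url, timeout=10).text
        feeds = []
        for tag in LINK_RE.findall(html):
            m = HREF_RE.search(tag)
            if m:
                feeds.append(urllib.parse.urljoin(page_url, m.group(1)))
        return feeds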


I went to Google Takeout (https://www.google.com/takeout/#custom:reader), checked my feeds for private ones (didn't have any), and submitted them.
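
A quick way to eyeball the export before submitting it, assuming the standard Takeout subscriptions.xml layout (the filename is whatever Takeout gave you):

    import xml.etree.ElementTree as ET

    # Print every feed URL in the Takeout export so private ones stand out.
    for outline in ET.parse('subscriptions.xml').iter('outline'):
        url = outline.get('xmlUrl')
        if url:
            print(url)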


I wrote some bad python to save my own copy of all the cached content from the feeds I subscribe to:

https://github.com/epaulson/stash-greader-posts

It doesn't upload them anywhere, but at least I've got my own copy of them if I ever think of something I want to do with them.
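
The core of it, if I remember the unofficial Reader API correctly, is just paging through the stream/contents endpoint with a continuation token, roughly like this (the feed URL and output filename are placeholders, and the endpoint details are from memory, so treat it as a sketch):

    import json
    import urllib.parse
    import requests

    FEED = 'http://example.com/feed.xml'    # placeholder feed URL
    BASE = ('https://www.google.com/reader/api/0/stream/contents/feed/'
            + urllib.parse.quote(FEED, safe=''))

    items, continuation = [], None
    while True:
        params = {'n': 1000}
        if continuation:
            params['c'] = continuation
        data = requests.get(BASE, params=params).json()
        items.extend(data.get('items', []))
        continuation = data.get('continuation')
        if not continuation:
            break

    with open('stash.json', 'w') as f:      # keep a local copy
        json.dump(items, f)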


Thanks for the link. I highly recommend people do this for the feeds they care about, since ArchiveTeam cannot guarantee bug-free operation.

Also, it looks like there is another tool mentioned in https://news.ycombinator.com/item?id=5958188


Cool, installing the applications in Docker on my dedicated server.

I have a couple of questions though:

Will the data remain archived on my system after it is uploaded? And what format will it be in?

Will there be a public API to access this data once uploaded, or for services such as Feedly to import back entries from feeds? (I would hope they would support that, but the public API would be enough for me.)

Thank you for providing this service.


After the greader*-grab programs upload data to the target server, it is removed from your machine. All of the data eventually ends up in WARCs at https://archive.org/details/archiveteam_greader

As for an API, someone will hopefully write one to directly seek into a megawarc in that archive.org collection, or import everything into their feed reading service.
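
Reading records back out of those WARCs is straightforward with a generic WARC library; for example, a sketch using the warcio Python package (the filename is made up):

    from warcio.archiveiterator import ArchiveIterator

    with open('greader-grab.warc.gz', 'rb') as stream:   # made-up filename
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            if url and 'stream/contents' in url:
                # The payload is the Reader API JSON for one page of a feed.
                payload = record.content_stream().read()
                print(url, len(payload))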


Any way I could patch the program to stop it from deleting the data after it is uploaded?

Among other things I would like to set up an ElasticSearch cluster for my own feeds.

Is the WARC format defined somewhere? I haven't looked at any other ArchiveTeam projects so I'm not informed if this format is used elsewhere.
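
(For the Elasticsearch idea, a rough sketch of what indexing the kept data might look like, assuming a recent elasticsearch-py client and a directory of saved Reader-API JSON responses; the paths and index name are made up:)

    import json
    from pathlib import Path
    from elasticsearch import Elasticsearch

    es = Elasticsearch('http://localhost:9200')           # your own cluster

    for path in Path('kept-data').glob('**/*.json'):      # directory of saved API responses
        data = json.loads(path.read_text())
        for item in data.get('items', []):                # 'items' holds the individual posts
            es.index(index='greader-items', id=item.get('id'), document=item)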


Yeah, you could patch seesaw-kit to not delete local data. Note that greader-grab just gets a random work item from the tracker.

There's an ISO spec for WARC and tools linked at http://www.archiveteam.org/index.php?title=The_WARC_Ecosyste...


I haven't tried it, but the --keep-data option might work?

https://github.com/ArchiveTeam/seesaw-kit/blob/master/run-pi...


Yeah, that looks like the right thing.


No, the data will be uploaded, first to a staging server run by ivank/"ArchiveTeam", then to the Internet Archive (some has been uploaded already: https://archive.org/details/archiveteam_greader).

No, not currently. But the Internet Archive will provide the raw data, and anyone is free to set up such an API :-)

Thanks for helping out!


This isn't relevant (except tangentially to the Internet Archive's Wayback Machine), but I'm curious about the ethics (or legal standing) of rehosting the Google Reader app (the client-side portions) with a re-implementation of the internal Google Reader API, so that the app remains usable in an unchanging state.


The client code and assets are surely copyrighted by Google, no?

Also, I think the backend is the hard part.


I think there is at least one project out on GitHub that is trying to make a compatible backend API built upon Node.js and other technologies.


Haha, no one procrastinated on exporting their data now, did they?


I am sorry but I don't think people should upload their OPML like this.

Your feed collection is like your personal life. It should be private (even if the individual URLs are public and generic). By disclosing your collection, you are effectively living in a glass house.

Unless I'm missing something, I don't see a single helpful reason for this service.


Feel free to remove any feeds that make you uncomfortable about submitting the rest.

If you'd like to submit them but hide, say, the fact that certain feeds might be related to each other, you could split up your list of feeds and submit them in chunks that make sense to you, even from different IPs.

All of the historical data retrieved from the feeds through Google Reader will be uploaded to the Internet Archive.

That means it will be available to everyone.

This is a non-profit service run by volunteers who believe in saving data, because there are smart and creative people around the world (with a high concentration on HN) who can do good things with it.

I can think of one example: all of the new RSS reader services could slurp this data in and provide you with a better service (and they won't know the feed URLs came from you).


The point is that most feeds do not contain all of their historical posts. Google Reader preserved old posts from those feeds. ArchiveTeam is taking our OPMLs and getting a copy of Reader's archive before it is taken down.


If you plug a feed into InoReader, you get thousands of items from the past. It seems they are fetching the historical feeds (I can't tell how far back they go).

But why would ArchiveTeam want to preserve the historical items in a feed if the feed does not belong to them in the first place (and it did not belong to Google either)?


Isn't that like asking why you should preserve an old book even if you aren't its author?


No, there is a difference. An old book on the brink of extinction still belongs to you. You can get third-party services to preserve the book for you with the stipulation that your preservation work and the book carry decent privacy protections (it won't be broadcast to the world what you're doing). Remember the old days when you would go to a store to develop your camera roll? The service implied that your picture content was between you and the developer of the film.

It's fine if people don't see any privacy implication in submitting their reading collection. But as far as my single individual point is concerned, I don't see why I should upload my OPML for the sake of preservation. I have a hard time uploading it to any of the other Google replacements out there trying to compete.


Oh, so you're not really worried about people downloading the content of the blogs; you just don't want anyone to connect the list of blogs to you? Well, you can just take the URLs out of the OPML file and submit them one at a time. You can even use different IP addresses if you're that paranoid.


" you just don't want anyone to connect the list of blogs to you?"

That's exactly my point.


Several people think ArchiveTeam has grabbed a lot more historical data than what's available at, for example, InoReader. I personally think so as well.

There's also no telling whether InoReader would be open to letting others grab what they've already grabbed, meaning that data is potentially behind closed doors.

ArchiveTeam submits the data to the Internet Archive, which anyone can upload to and download from. This data is continuously being uploaded and made public and free. See https://archive.org/details/archiveteam_greader for example.

Anyone can do anything with that data. Your OPML files themselves are not being uploaded, though; that's also stated on the page linked for this item.


> I am sorry but I don't think people should upload their OPML like this.

It's not all or nothing. OPML is easy to edit, and I did just that before I uploaded mine: deleted anything that might be a security issue. It took like 10 seconds.



