Internet Archive Scholar (archive.org)
446 points by nabla9 on Dec 9, 2022 | 54 comments



Some days, nearly half the links I click are dead, so I've found myself relying on the Wayback Machine more and more over the past few months. It's really shocking just how fast digital obsolescence reared its ugly head. Of course Angelfire, GeoCities, etc. were a clear early blow, but nowadays...

I've started saving the HTML of every interesting article I find online (including the CSS seems like too much overhead, and it's often incomplete or still relies on external downloads; and screenshots are not searchable), and downloading quite a few videos too with yt-dlp. I'd long copy-pasted interesting comments into a txt file, but now it seems like data hoarding's the way to go - at least in moderation, focusing on things I'll actually refer back to.
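For the video half of that, this is roughly what a yt-dlp setup looks like through its Python API (a minimal sketch; the URL and output template below are placeholders):

    import yt_dlp  # pip install yt-dlp

    urls = ["https://www.youtube.com/watch?v=XXXXXXXXXXX"]  # placeholder URL
    opts = {
        "outtmpl": "archive/%(title)s [%(id)s].%(ext)s",  # tidy local archive layout
        "writeinfojson": True,     # keep the metadata as JSON next to the video
        "writedescription": True,  # keep the description as a text file
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download(urls)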

I remember 15 years ago, discovering pdf dumps on random sites like a kid in a candy store. Perhaps it'll be like that again, with people presenting museums of their favorite old pages.


Some people use this tool (of mine) for saving web content from either bookmarks or just everything you browse: https://github.com/crisdosyago/Diskernet

There are also plenty of other similar tools:

- https://github.com/ArchiveBox/ArchiveBox

- https://github.com/gildas-lormeau/SingleFile


> Coming to a future release, soon!: The ability to publish your own search engine that you curated with the best resources based on your expert knowledge and experience.

This would be fantastic, being able to browse a curated internet made of accumulated lists from other trusted users on the net, similar to how ad blocking lists work today.

You are genuinely trying to steer the internet into what it used to be: a museum of knowledge and expert discussion.

Edit: Ah but wow, Polyform license. Huh.


How have I never seen your tool before.

>22120 archives content exactly as it is received and presented by a browser, and it also replays that content exactly as if the resource were being taken from online.

I've been looking for this for a really long time.


Beware the very strange and bad license for Diskernet, which is "Polyform Strict License 1.0.0"


For people looking for more info on these strange licenses:

https://www.reddit.com/r/linux/comments/coazye/what_does_rli...


Thank you for raising concerns about the licensing of Diskernet. I understand that the Polyform Strict License 1.0.0 may be unfamiliar to some users, but I believe that it offers several benefits.

First, my licensing protects my rights as the creator of Diskernet. I want to ensure that my hard work is not used or modified without my permission, especially in commercial settings. I believe this is a fair and reasonable request.

Second, my licensing allows individuals to use Diskernet for free for personal use. This means that anyone can download and use the tool to improve their own browsing experience, without any cost or obligation.

Third, my licensing allows businesses and organizations to purchase a license and use Diskernet for their own purposes. This allows me to continue improving and supporting the tool, while also providing businesses with a valuable tool for archiving and organizing their online content.

I understand that the Polyform Strict License may not be the most common licensing approach, but I believe it offers a good balance between the interests of the open source community and my rights as the creator of Diskernet. I hope you will consider giving my tool a try, and I believe you'll find it to be a valuable addition to your online browsing experience.


That Polyform licence looks truly awful. There's no mention of how I could use it while working.


> how I could use it while working.

The readme clearly links where to buy a license for that.


I'd never heard of SingleFile before, but it looks excellent. It would be great if Firefox could incorporate it into its save function too. Firefox's "save page" works, but as shown in the SingleFile demo video, it's not really what a user would expect: it's often incomplete, and splitting the result across multiple files/directories isn't ideal either.


It's unfortunate, as Firefox used to have excellent MHTML support (which similarly achieves an all-in-one file) via addons, particularly the feature-rich UnMHT, while Chromium and its derivatives support MHTML saving natively (as did Opera Presto and IE in the past).

If they brought back MHTML saving support it'd be a great win.


Zotero also saves snapshots of pages if you already cite academic pages


While the author(s) are still alive, they are often a productive contact.

(In one case I was able to give back: bundling up the several scans an author had of a half-century old paper from their student days into a single, hopefully cromulent, PDF)

Edit: recall also that accepting that links are one-way and might be dead was the key simplification that allowed HTTP to take off after prior attempts at hypermedia had failed.


Good (Edit) point! It's good that the web accepts dead links by design; we can't expect perfection from our distributed information. But the rate of bitrot seems too high compared to the information-storage technologies available.

Two spinning-rust drives can store the Library of Congress; ~2,000 drives would store the web (1). How many millions of these drives get manufactured per year? Our technology systems are failing us - all those words are being lost, like tears in rain.

(1) Back-of-the-envelope estimate: https://www.worldwidewebsize.com/ puts the indexed web at roughly 50 billion pages. Let's say 1 MB per page on average; that's ~50 PB, so ~2,000 large drives would store the entire indexed web.
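Spelled out (per-page size and drive capacity are assumptions, not measurements):

    pages = 50e9        # ~50 billion indexed pages (worldwidewebsize.com estimate)
    page_size = 1e6     # ~1 MB per page on average (assumption)
    drive = 25e12       # one large ~25 TB spinning-rust drive (assumption)

    total_bytes = pages * page_size
    print(f"{total_bytes / 1e15:.0f} PB")        # ~50 PB
    print(f"{total_bytes / drive:,.0f} drives")  # ~2,000 drives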


2,000 drives x $100 = $200,000. Double that for backup: $400,000. Admin, maintenance, let's say $1M/year total. So Wikipedia could end its own dead-link problem (IF the reference sources would agree).

But stuff goes missing at the Wayback Machine because people don't agree to their pages being backed up. Copyright, whatever. So it's like global heating: the tech is there, but people just can't agree. So 'pirate' backer-uppers go to jail, and island nations and expensive ocean-side properties are being submerged. So it goes.


The Internet Archive has a bot that updates dead Wikipedia references to point to archived content.

https://meta.wikimedia.org/wiki/InternetArchiveBot


Even if that estimate is off by an order of magnitude, which given the weight of modern web pages it easily could be, 20,000 drives to store the entire web seems way more doable than I ever would have imagined.


I use the markdownload extension[1] on firefox and move the .md file into my notes folder (notable[2]). Works very well.

1. https://addons.mozilla.org/en-US/firefox/addon/markdownload/

2. https://notable.app/


I just save a PDF of any site that's really important to me.

When I did a major college project in 2003, I made sure to make pdfs of any academic article that I referenced. It actually saved me, because some articles disappeared by the time I went to revise my references.


> I've started saving

I do similar. I've had https://github.com/ArchiveBox/ArchiveBox bookmarked for a while as something to try to better organise all that, but like a great many things I haven't got around to it yet.


I use Raindrop for this. It’s a pretty great bookmark manager made by an indie dev, but it also can create archives of bookmarked pages.

https://help.raindrop.io/backups#permanent-library


“Only available in Pro plan”. No information on whether that is a general limitation or whether these features are available when self-hosted, so I assume the former. And there seems to be little obvious information about self-hosting.

So probably not one for my use case.


I use Raindrop but didn't know about that feature. Thanks.


Thanks, I'll add that to the list of things to try out.


I've been wanting to run my own search engine sorta thingy that indexes websites I feed it. I sometimes find little nooks of the net that post resources I may need in the future. Like my own mini-Google that indexes a list of sites.

How can I go about creating this? Are there off-the-shelf solutions, or will I need to, say, combine Scrapy with Elasticsearch? The links in this thread look promising.
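Not an answer to the off-the-shelf question, but the core "feed it pages, search them later" idea can be sketched with nothing beyond Python's standard library, using SQLite's built-in FTS5 full-text index (a toy sketch, not a crawler; the seed list and query are placeholders, and a real version would extract text from the HTML rather than indexing it raw):

    import sqlite3
    import urllib.request

    db = sqlite3.connect("myindex.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

    seeds = ["https://example.com/"]  # placeholder: your hand-picked pages
    for url in seeds:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        db.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, html))
    db.commit()

    # Query it like a tiny personal search engine:
    for (url,) in db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", ("example",)
    ):
        print(url)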


I wanted to say 'au contraire' to your 'screenshots are not searchable' and link this[0], but I don't actually see images in the readme... I swear it was there; maybe it's a buried extra flag.

[0] https://github.com/phiresky/ripgrep-all


I find the SingleFile extension superb for this.


I've already made my donation to IA this year but I might need to make another.

Somehow it's the IA's job to fix problems that we all know are problems, sadly.


If you're able/comfortable, please consider setting up a recurring donation. For long-term planning reasons, it's helpful for organizations to have a consistent recurring revenue stream that they can use to project assets further into the future. One-off donations are good, too! But if you're going to consistently send them money anyway, you may as well do it in a predictable manner to help their accounting.


Wikipedia reminded me multiple times to donate to the Internet Archive this year.


Another chance to upvote the donation link to the top of the thread on an IA story, since a direct submission got swallowed by the dupe detector! They are doing so many amazing things.

https://archive.org/donate

Your Donation Will Be Matched 2-to-1! [...] Right now, we have a 2-to-1 Matching Gift Campaign, tripling the impact of every donation. (from the home page)


Same. Oh hey these scammers are asking for money again? Wait, I haven’t given to IA in a while.


Haha, I know what you mean... my mind works the same way ;)


Is this so Wikipedia can be archived by the IA?


I always take the wikipedia donation drive as a reminder to donate to archive.org instead.


It's a joke. Hacker News doesn't like donating to Wikipedia; many choose to donate to Internet Archive instead.


HN doesn't mind donating to Wikipedia, they dislike donating to Wikimedia. That one letter is a big difference.


Is there really a way to donate specifically to Wikipedia? I've never seen it in any of the many, many threads.


As far as I know there isn't... that's the problem.


PSA: the Internet Archive is a 501(c)(3) non-profit (a library!) and survives on donations and grants.

A huge percentage of the operating budget comes from small donors. The funding is preposterously small compared to other public-interest services such as Wikipedia.

A lot of us take it for granted and assume there is, e.g., support from FAANG companies proportionate to the degree they lean on it.

This is 100% NOT THE CASE.

Please advocate for recurring institutional donations from your firm. The audience reading this has a lot of voice in a lot of organizations which could, without a thought, sign up to make annual $10K, $100K, or $1M donations...

...and essentially, none do.

Please help change that!!!

https://archive.org/donate/


Anyone who uses amazon.com can set the Internet Archive as their preferred charity and shop using smile.amazon.com. A percentage of your purchase amount will go to the IA.


Related:

Internet Archive Scholar - https://news.ycombinator.com/item?id=26419782 - March 2021 (3 comments)

Internet Archive Scholar: Search Millions of Research Papers - https://news.ycombinator.com/item?id=26401568 - March 2021 (47 comments)


This seems like the type of thing that will become the search engine of first resort in the future as AI-generated propaganda and nonsense pollutes the spectrum of websites and search results.


Not from the IA, but see https://scholia.toolforge.org for an especially nice presentation of freely-available scholarly metadata.


You may also be interested in OpenAlex.org, which also uses Wikidata (along with DOIs, ORCIDs, ISSNs and a few other standard identifiers) to classify publications.


After a little testing, this looks like a good information source, although the combination of Google Scholar and Sci-Hub is probably still the best option, i.e. I couldn't find anything with Internet Archive Scholar that wasn't available with the other options, and the quality of search results is somewhat higher with Google Scholar (this may be because Google Scholar uses citation count as a search parameter, which Internet Archive Scholar doesn't seem to do).

The Internet Archive is a great resource; it should arguably get state funding, as it provides a fundamentally important archival service. It's too bad it has to rely so heavily on private philanthropic donations (although state support comes with the risk of political interference, i.e. censorship of material some politician doesn't like; that may be less of a problem with private donations, though then you could have some billionaire doing the same thing).


The content in scholar.archive.org has been indexed into Google Scholar (and other indices are likewise welcome to crawl the sitemap). There was some content "only" in scholar.archive.org, but now it should basically all be in Google Scholar. We haven't gotten around to describing this publicly, but it was an explicit decision and partnership between the organizations.

Indeed, scholar.archive.org does not currently use citation count in search rankings. We have a decent citation graph, which we are working to expose in scholar (it is visible in fatcat.wiki today). We would probably only ever use citation count as a weak boost in search rankings (e.g., "any citations at all" or "more than 25 citations" as boosts, nothing beyond that); we don't want to create too strong a feedback loop influencing future citations.
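Purely to illustrate what a thresholded "weak boost" of that sort could look like (an invented sketch, not how scholar.archive.org is actually implemented; the factors are made up):

    def citation_boost(citations: int) -> float:
        # Small multiplicative bumps for "cited at all" and "well cited",
        # deliberately flat beyond that so raw counts can't dominate ranking.
        boost = 1.0
        if citations >= 1:
            boost *= 1.05
        if citations > 25:
            boost *= 1.05
        return boost

    # final_score = text_relevance_score * citation_boost(num_citations)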

scholar.archive.org specifically was partially funded by the Mellon Foundation (and partially through donations and other service revenue). IA overall has diverse funding, including grants and service revenue from the USA (Library of Congress, IMLS, etc); other national governments (paid crawl services); foundation grants; universities and libraries (crawl, preservation, and digitization services); and of course general donations. The last category of course has the fewest strings and lets us pursue new projects which might be hard to get traditional funding for. Remember that the whole premise of web archiving was considered radical and quixotic at the beginning!

(source: I work at IA on scholar)


I tried to search for scientific authors from the 1800s and they're there.

Google Scholar on the other hand brings me to paywalls, even though the articles are so old they should be out of copyright.


archive.org is an alternative, good internet as a giant library, as dreamed of in the early '90s: web archive, film archive, software archive, media archive... and now a research-paper archive.


This is amazing. Some of my grandfather's papers are on there. [1] He was a medical missionary in China and India. I doubt anyone in my family has had a chance to see these before. Quite a gift. I'm trying to get my grandmother's books onto there as well at some point.

[1] - https://scholar.archive.org/search?q=%22Frederick+G.+Scovel%...



I wonder if this includes anything from Sci-Hub, or whether they're unrelated.


Artifacts that are of questionable legality due to copyright are archived but not made public, for obvious reasons (this is typically referred to as being “darked”).


Agreed



