Hacker News

I've had a similar problem. In updating my portfolio site recently, I noticed that the vast majority of links were dead: not just live projects published three or more years ago (I expect those to die), but also links to articles and mentions from barely one year ago, links to award sites, and the like. With a site listing projects going back ~15 years, one can imagine how bad things were.

I ended up creating a link component that automatically points to the archive.org version of any URL I've marked as "dead". Dead links were so prevalent that it had to be automated like that.
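For anyone wanting to do something similar, here is a minimal Python sketch of the idea. It is not the actual component described above: `archive_url`, `render_link`, the `dead` flag, and the optional `timestamp` argument are all hypothetical names. The only real piece is the Wayback Machine URL pattern `https://web.archive.org/web/<timestamp>/<original URL>`; without a timestamp, the Wayback Machine should redirect to a capture it has, if any.

```python
from html import escape

WAYBACK_PREFIX = "https://web.archive.org/web"


def archive_url(url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine URL for `url`.

    `timestamp` is an optional capture time in YYYYMMDDhhmmss form.
    When it is omitted, the URL carries no timestamp at all.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}/{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}/{url}"


def render_link(url: str, text: str, dead: bool = False, timestamp: str = "") -> str:
    """Return an HTML anchor; dead links point at the archive instead."""
    href = archive_url(url, timestamp) if dead else url
    # Escape both the attribute value and the visible text.
    return f'<a href="{escape(href, quote=True)}">{escape(text)}</a>'


if __name__ == "__main__":
    print(render_link("http://example.com/project", "Old project", dead=True))
```

Marking a link as dead is then a one-flag change in the portfolio data, and the original URL stays in the source in case the site ever comes back.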

Another reason why I've been contributing $100/year to the Internet Archive for the past 3 years and will continue to do so. They're doing some often unsung but important work.




For portfolios, you should really also look into https://webrecorder.io/

It's _not_ a video recording service. It saves and can replay all network requests made during a session (including authenticated requests). It's open source and you can self-host it. I'm not affiliated, though I'm very happy that it exists.


Thanks for mentioning this site. I've done this with mitmproxy but it's complicated, and this was super easy -- I could recommend it to anyone.


That's impressive. I'll take a look, thanks for the tip.


I've updated my portfolio before and noticed that as well. I usually include a screenshot or two when I first add a project, so at least that remains.

If the site goes down later, I just remove the link and don't worry about it. My code from 15 years ago is probably atrocious, so I'll consider it a small blessing :P


I actually downloaded a copy of the 1996 NYT article I was quoted in, specifically because I feared it would fall off the internet at some point.

It's behind a paywall now, but at least I have a digital copy!


Isn't that illegal, since you don't own the copyright? Or are you not distributing it and keeping it for archive purposes?


It's on my website, which is a blatant copyright violation. So far the NYT hasn't asked me to take it down.

It could potentially be considered fair use, since I'm not making a profit and I provide commentary.


> It's on my website, which is a blatant copyright violation

Tangential, but I long for a world where we can all be as candid with each other.


I figure you're doing the same as someone that cuts an article that they are mentioned in out of a newspaper and frames it on their wall. I've seen plenty of restaurants and businesses do it.


Except that's a physical product that was purchased.


What if OP purchased a copy of that day's paper?


That's exactly the point I'm making.


Isn't the argument for adverts that they're somewhat equivalent to buying the website page?

And: Who has been harmed here?


> It could potentially be considered fair use, since I'm not making a profit and I provide commentary.

Although people throw that term around willy-nilly, in our current framework that means being sued for a minimum of $100,000 per supposed violation, and making your fair use defense in front of a judge.

Youtubers have reported spending $50,000 just to begin talking with lawyers and preparing briefs.

Maybe our ISPs can start offering us insurance.


If he is actually providing commentary and it does meet the fair use requirements, the EFF would probably end up representing him.

Remember to donate to the EFF, they're literally the only thing between you and a world where the corporations rule the world.


Don't downvote people for asking questions ffs!


beware: robots.txt can retroactively clear archive.org data



To clear things up: robots.txt can retroactively hide content from the archive. If it's changed back to allowing the archive's crawler, content from before the ban can be accessed again.
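To illustrate the rule, here is a small Python sketch using the standard library's `urllib.robotparser`. It assumes the archive's crawler identifies itself as `ia_archiver` (an assumption; check the Internet Archive's own documentation), and both robots.txt bodies are made-up examples. It only shows how a robots.txt change flips the answer for the same URL, not how the archive itself decides what to display.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent name for the archive's crawler.
ARCHIVE_AGENT = "ia_archiver"


def archive_allowed(robots_txt: str, url: str, agent: str = ARCHIVE_AGENT) -> bool:
    """Return True if `robots_txt` lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt after a domain changes hands: archive crawler banned.
BLOCKING = """User-agent: ia_archiver
Disallow: /
"""

# Hypothetical robots.txt after the ban is lifted again.
PERMISSIVE = """User-agent: *
Disallow:
"""

if __name__ == "__main__":
    page = "http://example.com/old-page"
    print(archive_allowed(BLOCKING, page))    # ban in place
    print(archive_allowed(PERMISSIVE, page))  # ban lifted
```

The point is that the current robots.txt is what gets consulted, which is why old captures can disappear from view and later reappear.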


another reason to use archive.is


Considering the topic of discussion, how sure can you be that archive.is will still be around in a year? Three years? Ten?

As much as I tried, all I could find about it is that it's run by one guy in the Czech Republic who's paying $2000/month out of pocket for hosting, and apparently dislikes Finland.


It's actually worse - http://blog.archive.is/post/151510917631/how-do-you-guys-kee... says it's $4k/mo.

http://archive.is/robots.txt doesn't seem too bad, it looks like you could slowly inhale everything... in theory. There are no sitemaps (they're there, but empty placeholders); you have to know the site name to be able to get a workable list.

The author hasn't ruled out/blocked archiving the snapshots, but apparently it's... big. http://blog.archive.is/post/154930531126/if-someone-was-will...


I think http://www.webcitation.org/ might be better in that regard since it's a consortium of "editors, publishers, libraries". See "How can I be assured that archived material remains accessible and that webcitation.org doesn't disappear in the future?" in their FAQ (http://www.webcitation.org/faq). Although from my perspective it seems to be more geared towards academic use.


archive.is is very nice, but they're a URL-shortener as well, so their links are utterly opaque strings of alphanumerics, whereas the Wayback Machine preserves both the full original URL and the date and time it was captured in the archival URL.
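Because the archival URL is self-describing, the original address and capture time can be recovered from it without any lookup. A rough Python sketch, assuming the usual `https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>` layout; `parse_wayback_url` and `Capture` are hypothetical names, and the optional short modifier after the timestamp (such as `id_`) is handled on a best-effort basis.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class Capture(NamedTuple):
    captured_at: datetime
    original_url: str


# 14-digit timestamp, optional two-letter modifier like "id_", then the original URL.
_WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)


def parse_wayback_url(archived: str) -> Optional[Capture]:
    """Split a Wayback Machine URL into capture time and original URL.

    Returns None if the URL does not match the assumed layout.
    """
    match = _WAYBACK_RE.match(archived)
    if match is None:
        return None
    timestamp, original = match.groups()
    try:
        captured_at = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid date and time.
        return None
    return Capture(captured_at, original)


if __name__ == "__main__":
    print(parse_wayback_url(
        "https://web.archive.org/web/20170315120000/http://example.com/page?x=1"
    ))
```

An opaque short link, by contrast, gives `None` here: nothing about the original page can be read off the URL itself.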


What is the difference between these two?


archive.is does not crawl automatically; it must be pointed at a page by a user. While this makes it particularly useful for snapshotting frequently-changed pages, it is not a replacement for the proper Internet Archive.


perma.cc is another option as well.



