Wayback machine gets a facelift, new features (archive.org)
206 points by anigbrowl on Nov 4, 2013 | 77 comments



In case anyone doesn't remember the old design, here is a link: https://web.archive.org/web/20131016082142/http://archive.or...


Did you just waybackmachine the waybackmachine?


Why didn't this cause an infinite loop?



Inception! :-)


Oh how I wish the wayback machine would ignore robots.txt... So many websites lost to history because some rookie webmaster put some misguided commands into the file without thinking about the consequences (e.g. block all crawlers except Google).


The worst part is when a site is stored in the wayback, then the domain expires and the new owner (or squatter) has a robots.txt that blocks everything, then all the old content becomes inaccessible.


They should store a history of WHOIS data for the site's domain and make separate archives for when the owner changes, I think. Also, why did anyone think that applying robots.txt retroactively is a good idea? :/


The worst part of this is that it's retroactive, so adding a robots.txt that denies the wayback machine access causes the machine to delete all history of the site. This is really annoying for patent cases where the prior art is on the applicant's own website: they can go and remove the prior art so it's no longer available (which is why examiners make copies of the wayback content before making their reports).


To be pedantic, they aren't lost. They are just unavailable until the robots.txt goes away. I'm fairly sure the Internet Archive aren't too keen on deleting things (unless you absolutely super duperly want it gone and you're the author/owner of the data).


I'm surprised that some upstart search engine hasn't made a selling point that they ignore robots.txt and claim they search the pages google doesn't or something.


Speaking as an upstart search engine guy (blekko) who also has a bunch of webpages and a huge robots.txt, that's a bad idea. Such a crawler would be knocking down webservers by running expensive scripts and clicking links that do bad things like deleting records from databases or reverting edits in wikis. You don't want to go there.


Really? I was always taught that search engines only do "get" requests, and anything that modifies data is in a "post" request. Are there really that many broken web sites out there that haven't already fallen victim to crawlers that ignore robots.txt?


Yes, there are a lot of broken websites out there.


I noticed this today. Googling "united check in" and clicking the "check" link took me to a page telling me the confirmation number I entered was invalid, though I never entered one.


If their IPs became known, they might get blocked?


IANAL. But although, in principle, providing an easy opt-out shouldn't really matter with respect to copyright and so forth, as a practical matter it seems as if it does--in that, if you at least vaguely care about your website not being mirrored, you have an easy way to prevent it. An organization like the Internet Archive simply can't afford (in terms of either time or money) to take a more aggressive approach to mirroring.

To be more specific--short of granting the Internet Archive some sort of special library exemption--what if I were to, say, create a special archive of popular cartoon strips? What's the distinction?

[EDIT: The retroactive robots.txt situation seems less clear but, like orphan works, also depends on the scenarios you care to devise.]


Robots.txt is the only way to opt-out of the Wayback Machine.


A serious question: why should you be allowed to 'opt-out' of history? Is this really your call, as a website owner?


Or, with a more historical lens, lots of history has been learned by poring over the intimate private correspondence of historical figures - most of whom I would imagine would feel quite perturbed to see their love letters on display in museums.

Should historians not read private letters sent long ago? Should they swear to some oath and take a moral stand that such things shouldn't be examined?

If the answer is "No, they should read them," then in that same way, why, for the historical record, should we observe robots.txt? Isn't it the same thing?


There's a technical reason - the blocked pages might open infinite URL spaces or bring the site down (crawler hitting /cgi-bin).


That is NOT a technical reason.

Technically speaking, a robots.txt that says

  User-agent: *
  Disallow: /

means that you should not crawl the site today. It should have no effect whatsoever on displaying pages that WERE crawled before the timestamp on the robots.txt file.


Actually, if you want to interpret robots.txt that way, it raises the problem of "how long can I consider a robots.txt valid for?"


Those corner cases can exist on sites that don't have a robots.txt and still have to be crawled correctly.


So I take it you're for Facebook documenting as much of everyone's lives as possible, for historical reasons?


Probably not.

I don't think there is an inarguable answer to my rhetorical question. People's intents and wishes do matter.

But there also is an idea from antiquity about the public good and the commons. I guess at some point my personal wishes get trumped by this overarching principle.

The whole point of the question was that someone would say "You may not read my love letters" and then society said "Too bad, we're doing it anyway. And reprinting them in high school textbooks."

Is that ok? I don't think there's a clear line and I do think there are probably moral boundaries.

I'm by no means Lawrence Lessig, and I'm really not experienced at this type of discourse. I do think there are many important questions here that may require us to rethink our positions.


One might nitpick that there was initially some distinction between the publicly available internet and a private Facebook, although the latter seems to be making strides to narrow that gap.


Yes, because secrets and forgetting can be important.

It's not our cultural tradition that every written work (train schedules, greeting cards, friendly notes, lolcats, etc.) must be archived at the Library of Congress. I'm not sure that it'd be a good idea.

Archive.org is a good idea.


Since it's your bandwidth and your content, sure.


Copyright law says it is.


No one is stopping you from archiving my websites if you think the data will have some importance. It seems like you're suggesting that archive.org is the universal keeper of history and everyone should agree with that idea.


No, you can just mail them.


I'd love to see this, even if they'd keep the content private for x number of years. Copyright runs out eventually and it would still be archived then.


We still have the NSA. They ignore everything.


The "Save Page Now" feature looks great. Hopefully this cures Wikipedia of its increasing link-rot.

Also, the Supreme Court will be happy: http://www.nytimes.com/2013/09/24/us/politics/in-supreme-cou...


I mentored a Google Summer of Code project to do just that - every citation on Wikipedia would be forwarded to Archive.org for permanent storage, and the citation link would be modified to offer the cached version as an alternative.

https://www.mediawiki.org/wiki/User:Kevin_Brown/ArchiveLinks

For various reasons this didn't get completed or deployed. It's still a good idea though. IMO it should be rewritten, but it wouldn't be a lot of code. I'd love to help anyone interested.

(French Wikipedia already does this, by the way. Check out the article on France, for example - all the footnotes have a secondary link to WikiWix. https://fr.wikipedia.org/wiki/France)


Alexis said (at the IA 10th Anniversary bash) that they are going to have this running very soon, using a bot to go over all of Wikipedia and insert archived links close to the dates of existing references (if available), and also capturing newly added links.


Excellent. Alexis rocks.


>For various reasons this didn't get completed or deployed.

Could you list why? It looks like a sorely needed feature!


I would just like to say that the Internet Archive is a pretty small bunch of people and they have a lot of never-ending work to do on a somewhat tight budget.

I would assume it's mostly that. They seem very accepting and willing to do a lot of things.

That's why I'm a "donation subscriber". If you'd like to know more about it, please visit: http://archive.org/donate/ - a subscription helps even more, because it's a constant flow of cash. But one-time donations of course help as well.


It wasn't the IA's fault. At the time, the IA was already working on an API to submit URLs and to rapidly cache items, so we just needed early access.

The GSoC student didn't follow up with the process of getting it adopted. I didn't either, which I regret. I left the WMF in early 2012 so I guess it was dropped on the floor for a while.

That said I have since found out that others have taken up the charge.


How difficult would it be to create a bookmarklet or a Chrome extension to save the current page you're on to Archive.org?

Bookmarklet (thanks to sp332):

javascript:void(open('//web.archive.org/save/'+encodeURI(document.location)))


Heck, the page has it half-written in the code for the "Save page now" button :)

  document.location.href='//web.archive.org/save/'+$('#web_save_url').val();


Sir, not all of us are javascript/HTML5 people. I do mostly operations, and can barely drag myself through Javascript until I get the chance to take some vacation time to concentrate on learning the web side (JS/HTML5/etc). I admit that I don't know what I'm doing sometimes.

Does this look right?

javascript:void(open('//web.archive.org/save/'+encodeURI(document.location)))

I hacked it together from what you posted and what my archive.is bookmarklet specifies: javascript:void(open('http://archive.is/?run=1&url='+encodeURIComponent(document.l...)

EDIT: I can confirm that the bookmarklet I provided above does work. sp332, thanks for your help.


Here's Wikipedia's blurb on the subject, fwiw: https://en.wikipedia.org/wiki/Wikipedia:Citing_sources/Furth...


I am incredibly pleased at the save-page-now feature. Before, there was a hack where using liveweb.archive.org might save a page on demand, but you had no way of knowing whether it had worked. I'm adding this to my archive-bot immediately.
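For anyone else wiring this into a script, here's a rough sketch hitting the same /save/ endpoint the bookmarklets above use; the function name is mine, and I'm assuming the endpoint redirects to the fresh capture:

  // Sketch only: request an on-demand capture via /save/ and return
  // whatever URL we end up on after redirects (presumably the new snapshot).
  async function archiveSave(url) {
    const res = await fetch('https://web.archive.org/save/' + encodeURI(url));
    return res.url;
  }

  archiveSave('http://example.com/').then(console.log);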


Here's a page with my bookmarklet, saved by itself :) http://web.archive.org/web/20131104224622/http://www.compone...


Alternative: archive.is


Recently "vanished" by Wikipedia in response to a bot "spamming" WP with valid links to archive.is and archive.org.

http://enwp.org/WP:Archive.is_RFC

http://enwp.org/WP:Archive.is_RFC/Rotlink_email_attempt <-- conversation with archive.is operator or representative

http://enwp.org/WP:Using_Archive.is <-- "corrected"

http://enwp.org/Archive.is <-- deleted as "non notable"


Glad to see they finally got an API; however, I'm a bit disappointed that it doesn't return the oldest archived date for a site, only the newest. I often need to check how long ago a site was originally archived. The API would have been very helpful for that, but the closest they provide is an option to query whether or not it was archived on a specific date, which is nowhere near as helpful.
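For context, a rough sketch of what I think the query looks like, assuming this is the new availability endpoint at archive.org/wayback/available (the function name is mine); note that the response only ever carries a single "closest" snapshot:

  // Sketch: ask the availability endpoint for the snapshot closest to an
  // optional YYYYMMDD timestamp. There's no parameter for "oldest".
  async function closestSnapshot(url, timestamp) {
    const qs = new URLSearchParams({ url });
    if (timestamp) qs.set('timestamp', timestamp);
    const res = await fetch('https://archive.org/wayback/available?' + qs);
    const data = await res.json();
    return data.archived_snapshots.closest || null;
  }

  closestSnapshot('example.com', '20060101').then(console.log);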


Their older CDX API provides that functionality:

https://github.com/internetarchive/wayback/tree/master/wayba...
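If it helps, here's a rough sketch of pulling the oldest capture out of it, going off my understanding that the /cdx/search/cdx endpoint returns rows in ascending timestamp order, so limit=1 gives the earliest (the function name is mine):

  // Sketch: fetch the first (oldest) CDX row for a URL as JSON.
  // rows[0] is the header row, rows[1] the first capture.
  async function oldestCapture(url) {
    const res = await fetch(
      'https://web.archive.org/cdx/search/cdx?url=' +
      encodeURIComponent(url) + '&output=json&limit=1');
    const rows = await res.json();
    if (rows.length < 2) return null;      // never archived
    const [, timestamp, original] = rows[1];
    return { timestamp, original };
  }

  oldestCapture('example.com').then(console.log);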


Wow, I never knew it existed, thanks for the link!


Funny, I just spent some time this weekend creating a json API wrapper around the little-known Memento API that wayback offers. My idea was to make a bookmarklet that would show prior versions of the page the user was visiting. (The backend is pretty trivial, actually, but I could use some help with the javascript/dom parts.)

(The CDX API linked below links to the actual warc/arc archive files, not the web-viewable versions.)

Here's some info on the Memento API that links to web-viewable versions of a given url: http://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine...

http://web.archive.org/web/timemap/link/{URI} will return a text stream of urls and dates.
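In case it saves someone a few minutes, a rough sketch of consuming that timemap; the link-format parsing here is a quick regex rather than a proper parser, and the function name is mine:

  // Sketch: fetch the timemap and pull out { url, datetime } for each memento.
  async function mementos(uri) {
    const res = await fetch('https://web.archive.org/web/timemap/link/' + uri);
    const text = await res.text();
    const captures = [];
    const re = /<([^>]+)>;[^\n]*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"/g;
    for (const m of text.matchAll(re)) {
      captures.push({ url: m[1], datetime: new Date(m[2]) });
    }
    return captures;
  }

  mementos('http://example.com/').then(list => console.log(list.length, 'captures'));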

Edit: formatting


I love the Wayback Machine (and all of Archive.org, really). I recently used it to reminisce about some old VRML-based chat communities that I frequented about 10 years ago. It had a record for every single one of them.


They still have the site where I hosted pre-release Warcraft 3 servers when I was young, until my parents got a call from Blizzard telling them to take it offline ;)


Cybertown, perhaps?


Yep! There was also GoonieTown, which didn't last very long and eventually became VR Dimension. Flatland Rover was another, but it used its own 3DML engine instead of Blaxxun Contact and VRML. Good times!


I remember both. I was active from '99 to about 2002, and was a City Councilor/Colony Leader at one point. Nice to run into someone else with a similar background, the internet just isn't the same as it used to be in those days.


I just launched a similar service called https://www.DailySiteSnap.com that screenshots, emails, and archives a specified web site on a daily basis. My use case is to be able to look back at any one day and see what my site looked like then, since Archive.org doesn't refresh my page as often as I update it.

Disclaimer: I'm really not trying to over-market myself, but I figured readers of this thread might be interested in my project. Happy to take down this post if it's read as too spammy.



Thanks for the info, I was unaware of the ARC/WARC formats. That said, I still think many people are looking for something simpler/easier, and a daily screenshot is good enough. In particular, it guarantees preserved formatting as browsers continue to evolve.


You can probably make your service do both screenshots and WARC: instead of loading a site directly, load it through WarcProxy (https://github.com/odie5533/WarcProxy), which will write out a WARC file while you still store your screenshot.

Once you have the WARCs you can upload them to Archive.org and they can be added to the wayback, or you can set up your own service for browsing them, built off something like warc-proxy https://github.com/alard/warc-proxy (Yeah, same name, different purpose...)

There is also a MITM version of WARCProxy that will let you store HTTPS sites: https://github.com/odie5533/WarcMITMProxy


As of version 1.14, wget natively supports WARC output (including built-in gzip compression and CDX index file generation).

http://www.archiveteam.org/index.php?title=Wget_with_WARC_ou...

This makes creating a browsable mirror of a site in WARC format fairly straightforward, as wget will automatically make links relative, as well as fetch the requisite files (CSS, JS, images) for each page.
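Something like this is what I'd try (the domain is just a placeholder; --warc-file names the example.warc.gz output and --warc-cdx writes the index alongside it):

  wget --mirror --page-requisites --convert-links \
       --warc-file=example --warc-cdx \
       http://example.com/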


Yeah, but as far as I can guess, derwiki's service doesn't use wget, so running a proxy to store the WARCs is the next-simplest thing.


If his service runs on any sort of Linux distro, it's stupid simple to call wget with a system call. Wget comes standard with all of the most popular distros.


I agree that screenshots are easier; WARC files are future-proof.


That would be nice if you mailed the screenshots, but it looks like you just send a link to the screenshot.


They still won't let you look at pages if some domainer has acquired the domain and installed a robots.txt that disallows crawling.

They really should look at the date on the robots.txt and only apply it to pages retrieved while it is in effect.

Show us the pages from before the robots.txt became so restrictive!


Now I have a real reason to search the Wayback machine on the Wayback machine! Then: https://web.archive.org/web/20131024095443/https://archive.o... Now: https://web.archive.org/web/20131029213051/https://archive.o...


The new "Save Page Now" feature is great, but there is still no way to add full sites to crawl. For example, I added: http://www.cgw.com/Publications/CGW.aspx

But it would take hours or days to add every article from every issue.


Thank god they didn't change much. I hate when extremely functional websites decide to 'revolutionize' their interface (I'm looking @ you, Google Maps).

I love this service.


On the purely aesthetic side, the new input form does clash with the old menu. The ~carousel seems a bit "CPU consuming"; maybe a simpler tile grid, as in Windows Phone 8, would work better. That said, I love the service, and the frontend is probably not the most important part of their system.


They accept donations, and they even take Bitcoin: https://archive.org/donate/index.php

Be sure to send them some!


Disregarding whatever the rules are about this in the TOS, is there a good way to download/scrape your old archived website?


Don't know how good they are, but waybackdownloader.com seems to provide that service.


Let me know if you find one


They need to fire their plastic surgeon if a font update and some spacing is what's considered a facelift.



