
I've got a set of small repos containing snapshots of small government sites, made with a combination of `wget`, `curl`, and other shell commands, mostly so I have a reliable mirror when teaching web scraping: https://github.com/wgetsnaps
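
The exact flags vary per site, but the snapshots are basically variations on a recursive wget mirror along these lines (the URL is just a placeholder):

  wget --mirror \
       --page-requisites \
       --convert-links \
       --adjust-extension \
       --no-parent \
       https://example.gov/   # placeholder; swap in the site being mirrored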

But as the submitted article points out, archiving the Web is much trickier these days, and wget is no longer sufficient for anything relatively modern. I've been impressed by what the Internet Archive has apparently been able to do, and I've wondered whether it's the result of improved techniques on their side or of certain sites following a standard that happens to make them more archivable.

For example, 538's 2018 election trackers are very JS-dependent, yet IA has managed to capture them in a way that not only preserves the look and content but keeps their widgets and visualizations almost fully functional:

https://web.archive.org/web/20181102125134/https://projects....

However, even this excellent archive of 538's site shows a big weakness in IA's approach: IA (quite understandably) aggressively caches a site's dependencies, such as external JS and JSON data files. If you scroll down the 538 example posted above, you'll see that despite the snapshot being from Nov. 2, 2018, many of its widgets only contain data from the last time IA fetched their external dependencies, which appears to have been August 16, 2018.
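
One way to see this for yourself: the Wayback Machine's CDX API lists every capture of a given URL along with its timestamp, so you can compare the page's snapshot date against the dates of its data files. Rough sketch (the asset path is a placeholder; use the JS or JSON file the widget actually loads):

  curl 'https://web.archive.org/cdx/search/cdx?url=projects.fivethirtyeight.com/path/to/data.json&output=json&limit=10'

The timestamps in the response are the same 14-digit YYYYMMDDHHMMSS values that show up in web.archive.org/web/... URLs, so the comparison is straightforward.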



