ArchiveBox: Open-source self-hosted web archiving (archivebox.io)
220 points by goranmoomin on April 19, 2021 | 59 comments



It saves pages to archive.org as well. You might want to be careful while using this to archive personal content.


Yes, I can write a long article someday about why it's the default. I've agonized over this decision for many, many months, and it's flip-flopped a few times as well.

The short version is that defaults in software are really important (90% of users won't change them), and I don't trust myself to code ArchiveBox 100% correctly so as to never lose data, or the majority of people to store their archives correctly so as to never lose data on their own. Archive.org is the redundant failsafe. Another good reason is that Archive.org is not the only way your archived content can leak: the security model means that archived pages can read each other's content, so I want to make it abundantly clear to users that by default it's designed to only archive content that's already public (in which case it's already fair game for Archive.org).

I've settled on leaving it on as the default, but I do mention 3 times in the README how to disable it, most notably in the CAVEATS section, which explains both the security model drawbacks and how to prevent your content from being leaked to Archive.org or other 3rd-party APIs.


Although I tend toward privacy-by-default for most deployed technologies, the context of archiving does change the criteria quite a lot; you've selected a sensible and reasonable default, I reckon. Hopefully integrity is a consideration too? I'd be glad to read that article one day :)


Integrity is absolutely paramount too of course, which is why I chose Django (because of the mature DB migrations system that makes upgrades deterministic, reversible, and relatively painless). Hand coding a schema migration system would be a recipe for disaster and an easy opportunity for users to lose data.

Nevertheless, no system is perfect, and even with Django helping guard database integrity and multiple redundant index files, it's possible I'll make a mistake someday that leads to data loss on upgrade. I don't want that to become the next (mini) Library of Alexandria, and saving copies to Archive.org serves as a last-resort backup.


Appreciate the project; web archives are becoming an important part of the internet ecosystem.

I personally have no issue with the defaults, but if you've agonized over the defaults, perhaps you should consider clearly documenting it in the main project README instead of leaving it for people to find in the config documentation.


As mentioned, it's in the README in 3 places, most notably here, where it has an entire section: https://github.com/ArchiveBox/ArchiveBox#archiving-private-u...



archive.org will only archive publicly visible content and it respects robots.txt


"and it respects robots.txt"

Not since 2017. https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

They now have a clunky manual process to exclude your site. https://help.archive.org/hc/en-us/articles/360004651732-Usin... ("How can I exclude...")

They don't spoof user agents, but actively blocking them doesn't remove what they've already archived.


When did this change? It used to be that adding robots.txt would retroactively remove archives for a domain.


2017.


Hide. Not remove.


Wow, great! Self-hosted, open-source, solid UI, tie-ins to the broader ecosystem... seems to check all the right boxes. Looking forward to trying it, and if all goes well, maybe seeing about integrating it into AthensResearch. Thanks for sharing!


I’ve been using ArchiveBox since the last time it popped up on HN and I like it a lot. It recently got a significant UI upgrade.


If you're interested in this sort of thing, you might also be interested in Archivy [1], which is somewhat similar but (thankfully) doesn't upload your stuff to archive.org.

[1] https://archivy.github.io/


> doesn't upload your stuff to archive.org

FYI, you just have to set the environment variable SUBMIT_ARCHIVE_DOT_ORG=False.
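
For example, a minimal sketch (assuming the variable name above matches your version; I believe it can also be set persistently in the ArchiveBox.conf file in your data directory):

    export SUBMIT_ARCHIVE_DOT_ORG=False
    archivebox add 'https://example.com'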


It's an accident waiting to happen.


See my answer here for why it's the default: https://news.ycombinator.com/item?id=26866689


Archivy is great too, in fact ArchiveBox sponsors Archivy development ;)


What’s the danger in uploading to archive.org?


Aside from the can of worms that is copyright infringement? There was a recent HN discussion about how much of a pain it is to get something removed from archive.org.

Not uploading people's stuff to permanent, public archives seems like a good rule of thumb.


Isn't this just passing the URL to archive.org which then does the actual archiving?

If it isn't already public (i.e., reachable by archive.org), it won't be afterwards?


Something can be public today and not tomorrow. Something can be made public by accident. Something can be publicly reachable (i.e. a private URL but one without a login) without the intention of being searchable.


> There was a recent HN discussion about how much of a pain it is to get something removed from archive.org.

I really wanna read this but can't find the thread; do you happen to have a link?


Last time I tried to do this same thing, I didn't know about these, and ended up spending a couple days on wget and httrack. Do all these alternatives work from the command line, or are they their own little proprietary ecosystem?


How does it differ from https://wallabag.org?


It is quite convenient to combine the two: you can export your wallabag list via RSS and import it into ArchiveBox on a schedule.
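
Something like this handles the scheduled import (the feed URL below is a placeholder; use the real RSS export URL from your wallabag instance):

    archivebox schedule --every=day --depth=1 \
        'https://wallabag.example.com/feed/user/token/archive.xml'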


https://www.linkace.org/

Just a heads-up. Found that a while ago and much prefer it over wallabag.


Thanks for sharing LinkAce! Maintainer here. If you or others have any questions, feel free to ask.


What makes it better?


I think it's primarily a matter of personal preference about features and how things are stored and presented. While Wallabag is more of a Pocket/Read It Later alternative, LinkAce does not save the website itself, but a reference to it, including the taxonomy you assign to it. It is intended to be a long-term bookmark archive, but without handling all the website archiving on its own.


Wallabag only stores links. ArchiveBox archives a snapshot of the actual content of a page at a specific point in time.


Wallabag extracts the content and stores it. Not just the links.


Didn't know that. Thanks for clarifying.


Even works on the Raspberry Pi, apparently. This would be nice in combination with a Pi-Hole.


Am I the only one who spins up a new VM in VMWare ESXi for things like this?


VIC containers for me - a full VM is a bit overkill for something this light :)


What's your setup like? I use VMWare workstation pro- I have Windows Enterprise 2019 LTSC N installed and snapshotted to a base VM.

Any time I need to do anything, I full-clone the base; with a decent SSD the clone takes maybe 10 seconds and I have a full OS.


Actual ESXi on a decently powerful (33 cores, 512 gigs) machine. VMware's been really good to me, minus some points for occasional stupidity on upgrades.


Yeah, why don’t you just use a container?


How do tools like this cope with pages that are rendered by JavaScript? What do they actually save? For instance, if I save a Quora page using Firefox, I can open it, but if Quora is not accessible it doesn't work.


ArchiveBox is a wrapper around ~12 different extractor modules, each of which saves the page or its assets in a different way. The most relevant to JS is SingleFile, which renders the page in headless Chrome and then snapshots the DOM with all assets inlined after a few seconds of JS execution. It's not perfect, but it works well even for the majority of JS-heavy sites.
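
If you're curious what that extractor produces on its own, the standalone SingleFile CLI can be run directly (an untested sketch; check the SingleFile docs for the exact invocation on your version):

    npx single-file 'https://example.com' example.html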

For the very complex sites that really rely on a ton of interactive JS or dynamic requests to APIs to render their content, check out https://ArchiveWeb.page + https://ReplayWeb.page by https://webrecorder.io.


I can't wait for the API to be completed. I want to build something to archive HN (article + comments) and turn it into an epub to read offline. That's hard to do currently.


I was trying to do something similar with walls bag, but no epub file to send :(


I assume it's autocorrected; do you mean Wallabag? https://github.com/wallabag/wallabag


Last time I saw this you could only view the archive via the UI. It's really come a long way.

Could be very useful now.


This is cool! Didn't know there were so many archiving options either, gonna check them all out.


No mention of search?


I literally just spun up a copy, and it looks like it has Sonic full-text integration. However, I'm not 100% sure it's working via the UI, as there isn't much feedback letting you know why a site has shown up in the results.


It has full-text search using ripgrep and sonic.
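
The backend is selectable via config if you want one or the other (option name as of recent versions; double-check it against the docs for yours):

    archivebox config --set SEARCH_BACKEND_ENGINE=sonic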


Would be great to allow it to save to WORM storage.


Worm storage?


write once, read many


You might like one of ArchiveBox's peers, https://ArchiveWeb.page by Ilya Kreymer / Webrecorder.io; it has an option to save directly to IPFS.


Can I export my history from my iPhone web browsing?


I'm not sure how iCloud sync works in this case, but if you use desktop Safari, which has your iOS history too, you might be able to get it out of the Safari SQLite DB on your computer:

https://stackoverflow.com/questions/28628385/sqlite-safari-h...
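
Something along these lines should dump the URLs (schema names borrowed from that SO thread; untested against newer Safari versions):

    sqlite3 ~/Library/Safari/History.db \
        'SELECT url FROM history_items;' > safari_history.txt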



If you can somehow extract your browsing history, ArchiveBox can ingest a list of links.
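
For example, a plain text file with one URL per line can be piped straight in:

    archivebox add < history.txt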


ArchiveBox comes with a script that exports Safari history to a text file (which can then be imported into AB):

    ./bin/export_browser_history.sh --safari
https://github.com/ArchiveBox/ArchiveBox/blob/dev/bin/export...





