ArchiveBox: Open-source self-hosted web archiving (archivebox.io)
220 points by goranmoomin on April 19, 2021 | 59 comments



It saves pages to archive.org as well. You might want to be careful while using this to archive personal content.


Yes, I can write a long article someday about why it's the default. I've agonized over this decision for many, many months, and it's flip-flopped a few times as well.

The short version is that defaults in software are really important (90% of users won't change them), and I don't trust myself to code ArchiveBox 100% correctly so as to never lose data, or the majority of people to store their archives correctly so as to never lose data on their own. Archive.org is the redundant failsafe. Another good reason is that Archive.org is not the only way your archived content can leak: the security model means that archived pages can read each other's content, so I want to make it abundantly clear to users that by default it's designed to only archive content that's already public (in which case it's already fair game for Archive.org).

I've settled on leaving it on as the default, but I do mention 3 times in the README how to disable it, most notably in the CAVEATS section, which explains both the security model drawbacks and how to prevent your content from being leaked to Archive.org or other 3rd-party APIs.


Although I tend toward privacy-by-default for most deployed technologies, the context of archiving does change the criteria quite a lot; you've selected a sensible and reasonable default, I reckon. Hopefully integrity is a consideration too? I'd be glad to read that article one day :)


Integrity is absolutely paramount too of course, which is why I chose Django (because of the mature DB migrations system that makes upgrades deterministic, reversible, and relatively painless). Hand coding a schema migration system would be a recipe for disaster and an easy opportunity for users to lose data.

Nevertheless, no system is perfect, and even with Django helping guard database integrity and multiple redundant index files, it's possible I'll make a mistake someday that leads to data loss on upgrade. I don't want that to become the next (mini) Library of Alexandria, and saving copies to Archive.org serves as a last-resort backup.


Appreciate the project; web archives are becoming an important part of the internet ecosystem.

I personally have no issue with the defaults, but if you've agonized over the defaults, perhaps you should consider clearly documenting it in the main project README instead of leaving it for people to find in the config documentation.


As mentioned, it's in the README in 3 places, most notably here, where it has an entire section: https://github.com/ArchiveBox/ArchiveBox#archiving-private-u...



archive.org will only archive publicly visible content and it respects robots.txt


"and it respects robots.txt"

Not since 2017. https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

They now have a clunky manual process to exclude your site. https://help.archive.org/hc/en-us/articles/360004651732-Usin... ("How can I exclude...")

They don't spoof user agents, but actively blocking them doesn't remove what they've already archived.


When did this change? It used to be that adding robots.txt would retroactively remove archives for a domain.


2017.


Hide. Not remove.


Wow, great! Self-hosted, open-source, solid UI, tie-ins to the broader ecosystem... seems to check all the right boxes. Looking forward to trying it, and if all goes well, maybe seeing about integrating it into AthensResearch. Thanks for sharing!


I’ve been using ArchiveBox since the last time it popped up on HN and I like it a lot. It recently got a significant UI upgrade.


If you're interested in this sort of thing, you might also be interested in Archivy [1], which is somewhat similar but (thankfully) doesn't upload your stuff to archive.org.

[1] https://archivy.github.io/


> doesn't upload your stuff to archive.org

FYI, you just have to set the environment variable SUBMIT_ARCHIVE_DOT_ORG=False.
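
For example, a minimal sketch (assuming the variable name above matches your version; I believe it can also be set persistently in the ArchiveBox.conf file in your data directory):

    export SUBMIT_ARCHIVE_DOT_ORG=False
    archivebox add 'https://example.com'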


It's an accident waiting to happen.


See my answer here for why it's the default: https://news.ycombinator.com/item?id=26866689


Archivy is great too, in fact ArchiveBox sponsors Archivy development ;)


What’s the danger in uploading to archive.org?


Aside from the can of worms that is copyright infringement? There was a recent HN discussion about how much of a pain it is to get something removed from archive.org.

Not uploading people's stuff to permanent, public archives seems like a good rule of thumb.


Isn't this just passing the URL to archive.org which then does the actual archiving?

If it isn't already public (i.e., reachable by archive.org), it won't be afterwards?


Something can be public today and not tomorrow. Something can be made public by accident. Something can be publicly reachable (i.e. a private URL but one without a login) without the intention of being searchable.


> There was a recent HN discussion about how much of a pain it is to get something removed from archive.org.

I really wanna read this but can't find the thread; do you happen to have a link?


Last time I tried to do this same thing, I didn't know about these, and ended up spending a couple days on wget and httrack. Do all these alternatives work from the command line, or are they their own little proprietary ecosystem?


How does it differ from https://wallabag.org?


It is quite convenient to combine the two: you can export your wallabag list via RSS and import it into ArchiveBox on a schedule.
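
Something like this handles the scheduled import (the feed URL below is a placeholder; use the real RSS export URL from your wallabag instance):

    archivebox schedule --every=day --depth=1 \
        'https://wallabag.example.com/feed/user/token/archive.xml'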


https://www.linkace.org/

Just a heads-up. Found that a while ago and much prefer it over wallabag.


Thanks for sharing LinkAce! Maintainer here. If you or others have any questions, feel free to ask.


What makes it better?


I think it's primarily a matter of personal preference about features and how things are stored and presented. While Wallabag is more of a Pocket/Read It Later alternative, LinkAce does not save the website itself, but a reference to it, including the taxonomy you assign to it. It is intended to be a long-term bookmark archive, but without handling all the website archiving on its own.


Wallabag only stores links. ArchiveBox archives a snapshot of the actual content of a page at a specific point in time.


Wallabag extracts the content and stores it. Not just the links.


Didn't know that. Thanks for clarifying.


Even works on the Raspberry Pi, apparently. This would be nice in combination with a Pi-Hole.


Am I the only one who spins up a new VM in VMWare ESXi for things like this?


VIC containers for me - a full VM is a bit overkill for something this light :)


What's your setup like? I use VMWare workstation pro- I have Windows Enterprise 2019 LTSC N installed and snapshotted to a base VM.

Any time I need to do anything, I full-clone the base; with a decent SSD the clone takes maybe 10 seconds and I have a full OS.


Actual ESXi on a decently powerful (33 cores, 512 gigs) machine. VMware's been really good to me, minus some points for occasional stupidity on upgrades.


Yeah, why don’t you just use a container?


How do tools like this cope with pages that are rendered by JavaScript? What do they actually save? For instance, if I save a Quora page using Firefox, I can open it, but if Quora is not accessible it doesn't work.


ArchiveBox is a wrapper around ~12 different extractor modules, each of which saves the page or its assets in a different way. The most relevant to JS is SingleFile, which renders the page in headless Chrome and then snapshots the DOM with all assets inlined after a few seconds of JS execution. It's not perfect, but it works well even for the majority of JS-heavy sites.
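
If you're curious what that extractor produces on its own, the standalone SingleFile CLI can be run directly (an untested sketch; check the SingleFile docs for the exact invocation on your version):

    npx single-file 'https://example.com' example.html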

For the very complex sites that really rely on a ton of interactive JS or dynamic requests to APIs to render their content, check out https://ArchiveWeb.page + https://ReplayWeb.page by https://webrecorder.io.


I can't wait for the API to be completed. I want to build something to archive HN (article + comments) and turn it into an epub to read offline. That's hard to do currently.


I was trying to do something similar with walls bag, but no epub file to send :(


I assume it's autocorrected; do you mean Wallabag? https://github.com/wallabag/wallabag


Last time I saw this you could only view the archive via the UI. It's really come a long way.

Could be very useful now.


This is cool! Didn't know there were so many archiving options either, gonna check them all out.


No mention of search?


I literally just spun up a copy, and it looks like it has Sonic full-text integration. However, I'm not 100% sure it's working via the UI, as there isn't much feedback letting you know why a site has shown up in the results.


It has full-text search using ripgrep and sonic.
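
The backend is selectable via config if you want one or the other (option name as of recent versions; double-check it against the docs for yours):

    archivebox config --set SEARCH_BACKEND_ENGINE=sonic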


Would be great to allow it to save to WORM storage.


Worm storage?


write once, read many


You might like one of ArchiveBox's peers, https://ArchiveWeb.page by Ilya Kreymer / Webrecorder.io; it has an option to save directly to IPFS.


Can I export my history from my iPhone web browsing?


I'm not sure how iCloud sync works in this case, but if you use desktop Safari, which has your iOS history too, you might be able to get it out of the Safari SQLite DB on your computer:

https://stackoverflow.com/questions/28628385/sqlite-safari-h...
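
Something along these lines should dump the URLs (schema names borrowed from that SO thread; untested against newer Safari versions):

    sqlite3 ~/Library/Safari/History.db \
        'SELECT url FROM history_items;' > safari_history.txt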



If you can somehow extract your browsing history, ArchiveBox can ingest a list of links.
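
For example, a plain text file with one URL per line can be piped straight in:

    archivebox add < history.txt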


ArchiveBox comes with a script that exports Safari history to a text file (which can then be imported into AB):

    ./bin/export_browser_history.sh --safari
https://github.com/ArchiveBox/ArchiveBox/blob/dev/bin/export...





