Yes, I could write a long article someday about why it's the default. I've agonized over this decision for many, many months, and it's flip-flopped a few times as well.
The short version is that defaults in software are really important (90% of users won't change them), and I don't trust myself to code ArchiveBox 100% correctly so as to never lose data, or the majority of people to store their archives carefully enough to never lose data on their own. Archive.org is the redundant failsafe. Another good reason: Archive.org is not the only way your archived content can be leaked. The security model means that archived pages can read each other's content, so I want to make it abundantly clear to users that by default it's designed to only archive content that's already public (in which case it's already fair game for Archive.org).
I've settled on leaving it on as the default, but I do mention 3 times in the README how to disable it, most notably in the CAVEATS section which explains both the security model drawbacks and how to prevent your content from being leaked to Archive.org or other 3rd party APIs.
Although I tend toward privacy-by-default for most deployed technologies, the context of archiving does change the criteria quite a lot; you've selected a sensible and reasonable default, I reckon. Hopefully integrity is a consideration too? Glad to read that article, one day :)
Integrity is absolutely paramount too of course, which is why I chose Django (because of the mature DB migrations system that makes upgrades deterministic, reversible, and relatively painless). Hand coding a schema migration system would be a recipe for disaster and an easy opportunity for users to lose data.
Nevertheless, no system is perfect, and even with Django helping guard database integrity and multiple redundant index files, it's possible I'll make a mistake someday that leads to data loss on upgrade. I don't want that situation to become the next (mini) Library of Alexandria, and saving copies to Archive.org helps serve as a last-resort backup.
Appreciate the project, web archives are becoming an important part of the internet ecosystem.
I personally have no issue with the defaults, but if you've agonized over the defaults, perhaps you should consider clearly documenting it in the main project README instead of leaving it for people to find in the config documentation.
Wow, great! Self-hosted, open-source, solid UI, tie-ins to the broader ecosystem... seems to check all the right boxes. Looking fwd to trying it and if all goes well, maybe see about integrating it into AthensResearch. Thanks for sharing!
If you're interested in this sort of thing, you might also be interested in Archivy [1], which is somewhat similar but (thankfully) doesn't upload your stuff to archive.org.
Aside from the can of worms that is copyright infringement? There was a recent HN discussion about how much of a pain it is to get something removed from archive.org.
Not uploading people's stuff to permanent, public archives seems like a good rule of thumb.
Something can be public today and not tomorrow. Something can be made public by accident. Something can be publicly reachable (i.e. a private URL, but one without a login) without the intention of being searchable.
Last time I tried to do this same thing, I didn't know about these, and ended up spending a couple days on wget and httrack. Do all these alternatives work from the command line, or are they their own little proprietary ecosystem?
I think it's primarily personal preference of features and how things are stored and presented. While Wallabag is more a Pocket/Read It Later alternative, LinkAce does not save the website itself, but a reference to it including the taxonomy you assign to it. It is intended to be a long-term bookmark archive, but without handling all the website archiving on its own.
Actual ESXI on a decently powerful (33 cores, 512 gigs) machine. VMWare’s been really good to me, minus some points for occasional stupidity on upgrades.
How do tools like this cope with pages that are rendered by JavaScript? What do the tools actually save? For instance, if I save a Quora page using Firefox, I can open it, but if Quora is not accessible it doesn't work.
ArchiveBox is a wrapper around ~12 different extractor modules, each of which saves the page or its assets in a different way. The most relevant to JS is SingleFile, which renders the page in headless Chrome and then snapshots the DOM with all assets inlined after a few seconds of JS execution. It's not perfect, but it works well even for the majority of JS-heavy sites.
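To illustrate the "all assets inlined" part: this is a minimal stdlib sketch of the idea (not SingleFile's actual code, which handles CSS, fonts, scripts, etc.), rewriting local `<img src>` references into base64 `data:` URIs so the saved HTML stands alone:

```python
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html: str, asset_dir: Path) -> str:
    """Replace local <img src="..."> references with base64 data: URIs,
    so the resulting HTML file has no external image dependencies."""
    def to_data_uri(match: re.Match) -> str:
        src = match.group(1)
        path = asset_dir / src
        if not path.is_file():
            return match.group(0)  # leave remote/missing assets untouched
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f'src="data:{mime};base64,{data}"'
    return re.sub(r'src="([^"]+)"', to_data_uri, html)
```

The real tool does this after the page's JS has run in headless Chrome, which is why it captures client-rendered content that a plain `wget` would miss.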
I can't wait for the API to be completed. I want to build something to archive HN (article + comments) and turn it into an epub to read offline. That's hard to do currently.
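In the meantime, the comments half is doable against the public Algolia HN API (`hn.algolia.com/api/v1/items/<id>`), which returns the whole thread as nested `children` arrays. A small sketch, assuming that response shape, that flattens the tree into rows you could render into epub chapters:

```python
from typing import Iterator

def walk_comments(item: dict, depth: int = 0) -> Iterator[tuple[int, str, str]]:
    """Depth-first walk over an HN item tree (Algolia /api/v1/items/<id>
    shape, where replies nest under 'children'), yielding
    (depth, author, html_text) tuples in thread order."""
    for child in item.get("children", []):
        if child.get("text"):  # skip deleted/empty comments
            yield depth, child.get("author") or "[deleted]", child["text"]
        yield from walk_comments(child, depth + 1)
```

From there, indenting each comment by `depth` and writing one XHTML file per story gets you most of the way to a valid epub.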
I literally just spun up a copy, and it looks like it has Sonic full-text search integration. However, I'm not 100% sure it's working via the UI, as there isn't much feedback letting you know why a site has shown up in the results.
I'm not sure how iCloud Sync works in this case, but if you use desktop Safari, where it has your iOS history too, you might be able to get it out of the Safari sqlite DB on your computer: