Archive.is owner on “continuity of his project” (archive.today)
190 points by spzx on Sept 8, 2021 | 76 comments



Archive.is is an incredible service, but the fact he's paying out $3k-4k/mo out of pocket in expenses & time doesn't strike me as sustainable for the long term.

I'm reminded of some non-profit organization that was forced to shut down their websites because they ran out of money. In retrospect they could've set up a trust fund early on, stuck all their money in there, and then had a perpetual annual income from that for operating costs, instead of spending down. All fundraising could've gone into the trust fund in order to boost the annual budget, etc.
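
A rough sketch of the trust fund arithmetic, in Python. The $3k-4k/mo figures are the ones mentioned above; the 4% annual withdrawal rate and the function name are assumptions for illustration, not anything a real fund promises.

```python
def endowment_needed(monthly_cost: float, annual_withdrawal_rate: float) -> float:
    """Principal whose yearly withdrawal covers twelve months of costs."""
    if annual_withdrawal_rate <= 0:
        raise ValueError("withdrawal rate must be positive")
    return monthly_cost * 12 / annual_withdrawal_rate


# Monthly costs are the figures quoted in the comment; 4% is an assumed rate.
for monthly in (3_000, 4_000):
    principal = endowment_needed(monthly, 0.04)
    print(f"${monthly:,}/mo -> about ${principal:,.0f} in the fund")
```

Under those assumptions the fund has to be on the order of a million dollars before it covers the running costs, which is why spending down is the tempting default.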


Also some interesting numbers in this post:

>How much does hosting cost you per month at the moment?

>about ~$2600/mo of pure expenses on servers/domains, not counting “work time”, “buying laptop/furniture”, etc. ($100…300/mo covered by donations + $300…500 by ads)

https://blog.archive.today/post/659383959382294528/you-said-...


As someone who has managed donation campaigns with more or less success in the past, here are my 2 cents.

There's a huge difference between "please donate to our project" and "it costs us $X/month to run it, we have Y users and managed to collect $Z so far, please donate".

The first one will get you about $1 per 10K-1M users. The second one will get your goal fulfilled, as long as you are reasonable, and have enough users. All it takes is a noticeable message and a way to update it automatically based on the money received.
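
A minimal sketch of that second kind of message, in Python. The function name and the user and donation numbers are placeholders; only the $2,600/month figure comes from the blog post quoted above. In practice the `raised` value would come from whatever payment processor is in use, which is left out here.

```python
def donation_banner(monthly_cost: float, raised: float, users: int) -> str:
    """Build the "it costs us $X, we have Y users, collected $Z" message."""
    if monthly_cost > 0:
        # Cap at 100 so an overfunded month does not show e.g. "130%".
        pct = min(100, round(100 * raised / monthly_cost))
    else:
        pct = 100
    return (
        f"It costs us ${monthly_cost:,.0f}/month to run this site. "
        f"We have {users:,} users and have collected ${raised:,.0f} so far "
        f"({pct}% of this month's costs). Please donate."
    )


# Placeholder numbers: re-render the banner whenever a donation arrives.
print(donation_banner(monthly_cost=2600, raised=650, users=1_000_000))
```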


This is a very fair answer, and all "permalinks" are lies. At the same time, I wonder if it might not be possible to have snapshots up on, I dunno, IPFS or torrent sites or something such that when the unavoidable happens, not all is lost.


Depending on the format of the archives (hopefully WARC), the owner could hand the entire archive over to the Internet Archive or Archive Team for ingestion by Wayback Machine.

If the concern is perpetual access to archived content but under your terms, that is where the cost comes in. Somebody somewhere is paying for power, cooling, connectivity, and disks. The Internet Archive estimates it costs them $2/GB to host data uploaded in perpetuity. Please consider donating if you're uploading content for permanent archival and/or deriving value from hosted content.
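
As a rough worked example of that $2/GB estimate, in Python. The function name, the decimal-gigabyte convention (1 GB = 10^9 bytes), and the sample upload sizes are assumptions for illustration.

```python
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def suggested_donation(upload_sizes_bytes: list[int], usd_per_gb: float = 2.0) -> float:
    """Donation that would cover perpetual hosting at the quoted per-GB estimate."""
    total_gb = sum(upload_sizes_bytes) / BYTES_PER_GB
    return total_gb * usd_per_gb


# Hypothetical uploads: a 3 GB WARC and a 2 GB WARC.
print(f"${suggested_donation([3 * 10**9, 2 * 10**9]):.2f}")
```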


Pretty much. If you really want to maintain access to old stuff in perpetuity, you have to pay for it yourself by either storing it on your own equipment or paying someone to store it on theirs.


Sure -- but at least with something p2p hooked up to it I can let someone else benefit from my desire to assure my own access.


Yep, I agree.


The Internet Archive isn't perfect either. They can be pressured to remove pages.


It's not perfect. If you desire to store content that the Internet Archive must dark for whatever legal or compliance reasons, you'll have to cover the cost for that storage. The cost will always be non zero to perform such an operation.


nothing is permanent in life. heaps of torrents i want have 0 seeders.

ipfs's economically incentivised hosting layer (filecoin) offers storage in monthly increments, not perpetual.


> nothing is permanent in life

practically yes, but

https://github.com/philipl/pifs


An image of a Web asset is useful for archival purposes and may be indexed using text recognition ML.


Images take up way more space than HTML documents… actually, what with how bloated most “websites” are these days, that's probably not true any more.


Yes, the site owner's response seems excessively didactic when a system of mirrors would solve the problem.


> a system of mirrors would solve the problem.

A system of mirrors prevents a single node going down from taking the whole system (in theory at least, we've all seen plenty of times where failover goes poorly), but it doesn't do anything to ensure the long term survival of the system as people lose interest, lose the ability to participate, and sometimes die.

If you were around certain internet forums in the late '00s you might have run into an image hosting platform called WaffleImages which was created in response to yet another popular free image hosting service locking down their embedding and ruining thousands of old posts. The goal was to distribute image hosting among community-operated mirrors, and it worked great for a few years. Over time though people lost interest while the rate of new mirrors getting added dropped to basically zero and eventually it fell apart.


Mirrors don't solve the problem, they move it.


How is a mirror not solving the problem of data loss if a single instance goes away?


The problem isn't a single instance going away. The problem is what happens when for whatever reason the owner stops maintaining the project. This is a common problem and despite all the bluster and buzzwords, the IT community hasn't really found a solution. Torrents are the only kind-of-solution, but they are not ideal for something that needs to be constantly updated.


Zeronet may help for content that needs to be updated


> This is a very fair answer, and all "permalinks" are lies

Cool URIs don't change: https://www.w3.org/Provider/Style/URI.html

You can believe that cool URIs don't change or you could go the IPFS route. Similar to the way torrents have a 'health' score based on the number of seeders, IPFS resources could live as long as people want that resource to exist (not sure if that is baked into IPFS, though).

Also: have you looked into Filecoin?[0]

[0] https://filecoin.io/


Archive.is still doesn’t work if you use Cloudflare DNS due to a spat between Cloudflare and the operator. So to me, the continuity and reliability are already a big question. Not only is it a question of sustainability economically, but also ideologically: what happens if another similar decision is made to lock out a portion of users?


For reference: the spat is that Cloudflare DNS does not leak geographic information of the querier through EDNS, and the archive.is fellow is requiring geographic information to provide valid DNS lookups. So he intentionally sends back bad results when it is 1.1.1.1 querying his nameservers.

I love the site, but his stance on this doesn't really make sense to me, and it's a shame that millions and millions of people use 1.1.1.1 daily and archive.is is the one website that doesn't work for those people.


I'm not sure the term 'leak' applies. It's an anti-cdn play. Refusing to use EDNS correctly makes the web slower for a lot of people. And it adds little to nothing to privacy since the answer IP is going to know your IP at the next step anyways...

As for why archive.is cares so much...that I don't know. Perhaps they rely on such data to give a fast experience, and are tired of this charade...but that's just speculation.


Cloudflare's edge network is sufficiently dense that ECS data is unnecessary in almost all cases. The requesting data center will be close enough to the client that doing geoip on the source IP will have the same results as using ECS.

There's nothing incorrect about what Cloudflare is doing, EDNS does not require ECS data to be included in requests, but for whatever reason the maintainer of Archive.is decided to block 1.1.1.1 over it.
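
For anyone wondering what the ECS data actually is: it's an optional EDNS option (code 8, RFC 7871) carrying a truncated client subnet. A minimal Python sketch of the option's wire format follows. The function name is made up, and this only builds the option bytes; it does not send a query or reflect how any particular resolver implements it.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS option code for Client Subnet (RFC 7871)


def encode_ecs_option(subnet: str) -> bytes:
    """Encode an EDNS Client Subnet option as it would appear in a query."""
    # strict=False masks off host bits, so only the network prefix is kept.
    net = ipaddress.ip_network(subnet, strict=False)
    family = 1 if net.version == 4 else 2  # address family: 1 = IPv4, 2 = IPv6
    source_prefix = net.prefixlen
    scope_prefix = 0  # queries send 0; the answering server fills this in
    # Only as many address bytes as the prefix needs go on the wire.
    n_bytes = (source_prefix + 7) // 8
    address = net.network_address.packed[:n_bytes]
    payload = struct.pack("!HBB", family, source_prefix, scope_prefix) + address
    return struct.pack("!HH", ECS_OPTION_CODE, len(payload)) + payload


# A resolver forwarding a /24 reveals the client's first three octets only.
print(encode_ecs_option("203.0.113.77/24").hex())
```

A resolver that leaves this option out, as 1.1.1.1 does, is still sending a valid EDNS query. The authoritative server then only sees the resolver's own source IP.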


But not sending ECS data harms non-cloudflare CDNs, right?

edit: from here https://news.ycombinator.com/item?id=19828702 I gather that this indeed harms CDNs outside the ones that Cloudflare has a business relationship with.

> EDNS IP subsets can be used to better geolocate responses for services that use DNS-based load balancing. However, 1.1.1.1 is delivered across Cloudflare’s entire network that today spans 180 cities. We publish the geolocation information of the IPs that we query from. That allows any network with less density than we have to properly return DNS-targeted results. For a relatively small operator like archive.is, there would be no loss in geo load balancing fidelity relying on the location of the Cloudflare PoP in lieu of EDNS IP subnets.

> We are working with the small number of networks with a higher network/ISP density than Cloudflare (e.g., Netflix, Facebook, Google/YouTube) to come up with an EDNS IP Subnet alternative that gets them the information they need for geolocation targeting without risking user privacy and security. Those conversations have been productive and are ongoing. If archive.is has suggestions along these lines, we’d be happy to consider them.


> I love the site, but his stance on this doesn't really make sense to me

would not be surprised if he has some personal axe to grind with cf (they are no sheep either).

also i would be wary of overestimating market penetration of any 3rd-party dns provider; iirc google has total dominance of this segment and is still below 10%.


HN discussion from 2019 on this topic: https://news.ycombinator.com/item?id=19828317


Millions use 1.1.1.1? Any source for that?


The Android app has 418,000 reviews and over 50 million installs, and the iPhone app has 230,000 reviews. The number of people who use it without an app is probably a lot higher.


Cloudflare also often blocks no-JS users.

All-around a huge accessibility impediment.


This is kinda irrelevant in a discussion about archive.is blocking cloudflare DNS users and cloudflare DNS servers/ips.


CloudFlare really is an enemy of the Free (Libre) open internet.

They are the next Google in terms of "evil companies"


What is a good alternative to Cloudflare DNS? I've been using 1.1.1.1 since it was launched, but now this makes me want to switch to something better.


Perhaps look at Quad9.


Why not your ISP?


Was MitMing half the web the giveaway?


Cloudflare is not neutral. They blocked 8chan after political pressure from MSM.


I lost faith in Cloudflare when they switched from being a neutral infrastructure service to yet another politically-motivated big tech company. In their blog post around the ban of 8chan (https://blog.cloudflare.com/terminating-service-for-8chan/), they acknowledged that they didn't know if 8chan broke any laws but they nevertheless decided to pass personal judgment based on a vague notion that 8chan "inspired" a shooting. That's quite an unprincipled way to operate a fundamental network utility that backs 10% of the Fortune 1000 and 20% of the top 10000 websites.


archive.is is my browser for text-heavy websites like blogs, news, twitter, and documentation (outline.com is another one, reader mode, yet another). It completely debloats a webpage as it archives (unlike web.archive.org, say). Suits my purposes just fine. I must note though, archive.is (from what I recall) forwards the IP address of whoever initiated an archive process to the origin.

archive.is was also a great mirror to instagram and linkedin for public profiles, but it doesn't archive instagram anymore.


Here's his quote on instagram archiving problems:

"There is no Instagram content which don’t need to login.

If you can access the page without login, it is sort of “promo preview“, after few pages accessed this way, they add your IP into “promo is over“ list and will redirect to /login on every future request.

I just have not enough fresh IPs to abuse this mechanism."

https://blog.archive.today/post/659927354404192256/instagram...


There is an interesting conundrum here, when we post to the internet do we also consent to having that information saved for all eternity?

Archiving everything is still a novel and not fully understood concept; it is not that clear that it is useful or beneficial over the long term.

Forgetting can be a bliss.


This is kind of an antithesis to his message in the post; that nothing is actually permanent, and while many people are concerned about continuity of service, ultimately perfect continuity is impossible, whether that's due to the organization running out of money, bad backup practices and a fire, global warming wiping Ashburn Virginia off the map, or the end of the human civilization by some other means.

I've helped with a couple risk registers at tech companies. Two things I've never seen appear in a risk register: The company runs out of money. Human society is wiped out. I've been laughed at once for bringing up variations on these. They're out of scope; risks stop being a threat when there's no one left to care about them.

I think the goal of keeping an internet-scale level of data accessible and searchable, for longer than one lifetime, is an impossible task. Maybe Archive.org/Archive.is can pull it off; I doubt it. It's an insane amount of data. Most of it is totally pointless, but it's really difficult to pick apart what's useful and what's useless, so you have to keep as much as possible without bias. All of that is on hard disks which violently spin around at 8 meters per second, accessed by software which we all know breaks every day but are too afraid to admit it, over a network of other computers with all the same flaws, distributed globally, yet can be significantly disrupted by one roadside construction worker and a jackhammer.

The internet didn't increase the lifetime of data; it decreased it. Sure, we have far more of it at our fingertips than any other point in history, but that's not lifetime; that's just volume. And that volume has desensitized us; it's fundamentally impacting our innate biological memory capacity, and the social structures we form around memory. We know the Library of Alexandria existed because people wrote about it; the pages lay for thousands of years; its memory passed verbally from person to person.

If all computers stopped functioning tomorrow (not disappeared, they're still there, they just don't work), would the memory of Stranger Things still be known in two thousand years? I doubt it, but if the only thing which offers us a satisfying "Yes" is "we keep the computers running, accessible, indexable, searchable", that seems, at the very least, given the extreme challenges we as a species will be facing over the next century, beyond the scope of human possibility.


The Sun’s ability to function like an Earth-wide Eprom eraser might cause some catastrophic disruption given our reliance on the Internet and computing devices. A large enough geomagnetic/solar storm is not unprecedented.

See: https://en.wikipedia.org/wiki/Carrington_Event


I think it’s one of those things modern parents are going to have to understand about their brave new world, and teach their modern children. Like look both ways before crossing the street, remember that everything you write on the Internet is permanent, so think before you write, or if you don’t want to do that then at least write it under a pseudonym.


why does it have to be that way though?

you are stating the current status quo as unavoidable - why is that?

Archiving everything has a massive cost and if it were illegal to archive unless people consent it would be much harder to get away with it.

When I post to a public forum I gave that forum rights to publish what I said, I did not give everyone else the right to store what I said


I am not a lawyer but I think it would be legally difficult to make it illegal to record what people say and do in public spaces. That was a right that the entire Western media depended on long before the Internet was invented.

The crux may be whether websites that require you to login to use them, like Facebook, are considered public or not. But anything you say or do in Facebook that you don’t restrict to being viewable only by 1st degree friends is probably considered public.


> why does it have to be that way though?

That'll be -5 points for questioning the social credit system, Citizen.


> There is an interesting conundrum here, when we post to the internet do we also consent to having that information saved for all eternity?

I'm not sure about consent, but presume it will be 'stuck' and unremovable from the net once it's out there. (So be careful what you disseminate.) Some people even go out of their way to make sure certain content will never be forgotten from the web.


Yes. I think people need to understand that anything on the internet is by default, there forever. Post something privately or behind a password if you don't want everyone seeing it.


A lot of people think that the current status-quo is the only way forward. Why? There is no reason for it.

There is no practical reason why a forum could not have an optin-optout-delete API/protocol/etc that all search engines/archivers should follow.

When I delete something all should follow the orders and those that do not need to be responsible for it.

Of course there are plenty of business reasons why engines don't want to do it.


This is the difference between "speech" and "text".

When you write a brief or a book, it's a performative act; you cannot pretend nothing was said.

The Internet is textual media, just like books, and unlike television or talking in the park.

If you don't like it, you can only burn books or drive modern technology in that direction


>There is no reason for it.

The reason is that bits are not physical. Anyone can copy them and re-upload them, without any cost.


> all should follow the orders and those that do not need to be responsible for it

Furthermore the tide will be ordered back, the contents returned to Pandora's box, and universal entropy decreased. Failure to comply will result in a fine.


there are plenty of human reasons as well, as indicated by your usage of the word "should"


What blissful times we lived in before the daily drumbeat of “Accomplished young professional discovered to have said offensive things on the internet when they were a dumb teenager, reputation tarred and feathered for the rest of their life.”


we don’t know if it’ll be a lifelong thing though. I suspect with all this pushback and emotional exhaustion (and I really do believe it’s emotionally exhausting to constantly be hounding people over their morality on the internet) people will just stop giving a shit about what someone said 5 years ago pretty soon.


It’s a numbers game. Billions of people won’t care, but a few dozen can make enough of a stink that a habitually risk averse institution would rather let someone be canceled than risk the controversy snowballing into something bigger.


Only if we let them.

There are all kinds of feedback cycles going on. Neither of these two are rational:

1. There are no ways to correct the problem you describe

2. The problem will correct itself without individuals and institutions making it happen


He had better also have built up some savings for legal defense, because he could easily be sued into oblivion, especially in Europe.


even if we archive everything, hundreds of years from now all of “the world’s information” could very well be unusable and unreadable for a variety of reasons (no one remembers how to deal with the file formats, EMP, bit rot). books however will continue to work just fine as they have for thousands of years


If we're talking that long of a timescale, how long does your typical book these days actually last? I'm no expert, but it makes me wonder how long consumer paper actually lasts. Reasonable(?) search result below.

https://www.quora.com/How-long-does-it-take-for-paper-to-dec...


adding paper to a compost pile will give different results than keeping a book stored in the proper conditions


About reliable email addresses: I'm using my university alumni "email forwarding for life", but this loses me some emails due to DMARC and friends. What are the alternatives?


Register your own domain and use that for email. Then you can move that domain's email service to different providers.

I use Gmail right now, but I could move my domain's email to Fastmail or another provider without _too_ much work.

Of course, it's also a good idea to back up your email archives too, since that _can_ go away with the loss of a provider's service.


Domain "ownership" is too fragile. Well, you do not lose the email archive, but a new domain "owner" could steal your identity.


I’ve recently migrated four different accounts, three of them Gmail with their own domains, to Fastmail and I was astonished by how easy the process (that I’d put off for _years_) was. Huge weight off my mind and I’ve been very happy with the service since then.


I see the answer to this question more around backups and helping future people overcome technical limitations (knowledge transfer + data archiving).

All things are ephemeral after a certain point but archiving typically lasts much much longer than the human operators. Likewise documenting the process and barriers to overcome will help people in the future solve the problem (and a broader amount of people).

This doesn't have to be public, just needs a way to become public.


Perhaps he can somehow join forces with archive.org? Maybe they take over when/if he is no longer able to do it?


On a related note, I guess Tumblr is still around to use as a blog?


Yes it has been around, but it's quite jarring to see anyone actually use it.


Didn't realize people used it for anything other than getting around paywalls.


That's my main use and sometimes feel bad about it, since I just want to see the content and not archive some mediocre article for “posterity”.


Does anyone know what's up with all the different domains for archive.is?

The blog is at blog.archive.today, it calls itself the "archive.is blog", but when I visit archive.is or archive.today, I'm brought to archive.vn. When I click the "archive.today" logo in the header, I'm taken to archive.ph


According to my understanding: archive.today/archive.is are the main domains the service is known under, others are mirrors selected depending on the country you are located in (because of domain bans in some countries).

Site owner said once archive.today is the domain to use when linking because it will automatically redirect to the correct one.


He's mentioned this before. He has multiple domains because they can go down sometimes.



