I'm curious if there's a point where a crawler is misconfigured so badly that it becomes a violation of the CFAA by nature of recklessness.
They say a single crawler downloaded 73TB of zipped HTML files in a month. That averages out to ~29 MB/s of traffic, every second, for an entire month.
Averaging 30 megabytes a second of traffic for a month is crossing into reckless territory. I don't think any sane engineer would call that normal or healthy for scraping a site like ReadTheDocs; Twitter/Facebook/LinkedIn/etc, sure, but not ReadTheDocs.
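For anyone who wants to check that figure, here's the back-of-the-envelope arithmetic (the 30-day month is an assumption; the post says May, i.e. 31 days, which only lowers the average slightly):

    # Rough sanity check of the quoted figure: 73 TB spread evenly over a 30-day month
    total_bytes = 73e12                          # 73 TB (decimal units)
    seconds_in_month = 30 * 24 * 3600            # ~2.6 million seconds (month length assumed)
    print(total_bytes / seconds_in_month / 1e6)  # ~28 MB/s sustained, i.e. the ~29 MB/s above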
To me, this crosses into "recklessly negligent" territory, and I think should come with government fines for the company that did it. Scraping is totally fine to me, but it needs to be done either a) at a pace that will not impact the provider (read: slowly), or b) with some kind of prior agreement that the provider is accepting responsibility to provide enough capacity.
While I agree that putting content out into the public means it can be scraped, I don't think that necessarily implies scrapers can do their thing at whatever rate they want. As a provider, there's very little difference to me between getting DDoSed and getting scraped to death; both ruin the experience for users.
"One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."
Wow. That's some seriously disrespectful crawling.
10TB/day is, roughly, a single saturated 1Gbit link. In technical bandwidth terms, that is the square root of fuck all.
The crazy thing here is that the target site is content to pay ~$700 for the volume of traffic that you can move through a single teeny-tiny included-at-no-extra-charge cat5 Ethernet link in one single day. And apparently, they're going to continue doing so.
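To put rough numbers behind both claims (a sketch only; the per-TB price here is just the quoted $5,000 divided by the quoted 73 TB, not ReadTheDocs' actual rate card):

    # What a saturated 1 Gbit link moves in a day, and what that costs at the quoted rate
    bytes_per_second = 1e9 / 8                    # 1 Gbit/s expressed in bytes per second
    tb_per_day = bytes_per_second * 86400 / 1e12  # ~10.8 TB/day
    usd_per_tb = 5000 / 73                        # ~$68/TB, implied by the quoted figures
    print(tb_per_day, tb_per_day * usd_per_tb)    # ~10.8 TB/day, ~$700-740/day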
Hosting documentation shouldn't need that much bandwidth. It's text and zip files full of text. Without bots, that's a very small cost even if the bytes are relatively costly.
Why is it "classy" to not name names when a business (which likely holds itself out as reputable) behaves badly, especially when it behaves in a way that costs you money? Everyone is so vague and coy. These companies are being abusive and reckless. Name and shame!
Naming names isn't really required. The hosts have a $5,000 bandwidth fee, but so do the consumers. There's maybe 10 companies with the financial & compute resources to let a $5,000-per-month-per-website bug run rampant before taking the harvesting service offline.
Meta/Google/Whoever may benefit from economies of scale, so they're not seeing the full $5,000 their side, but they're hitting tens of thousands of sites with that crawler.
You know you can hit the data rate they were complaining about by using a residential fiber connection, right? 10 TB per day is about 1 Gigabit continuous if I'm not mistaken. There are probably millions of people who could do this if they wanted to.
There's millions of people who could do that to an individual website. There are remarkably few organisations who could do that simultaneously across the top 100,000 or so sites on the internet, which is how readthedocs has encountered this issue.
Cloud hosts are a persistent source of spam email servers, scrapers, and bot probes.
The simple reason is that abusive operators quickly dump a host, and the next user is left wondering why their legitimate site instantly shows up on spam ban lists.
It is the degenerative nature of cloud services... and unsurprisingly we often end up banning most parts of Digital Ocean, Amazon, Azure, Google, and Baidu.
It's the degenerative nature of assuming an IP corresponds to a user. They have not corresponded to users for over a decade. I once discovered I'm banned on my mobile phone connection from at least one app which doesn't know that CGNAT exists (a very poor assumption for mobile phone apps in particular). If you must block IPs, do it as a last resort, make it based on some observable behavior, quickly instated when that behavior occurs, and quickly uninstated when it does not.
Really depends on the use-case, but yeah the response happens in a proportional manner.
We also follow the tit-for-tat forgiveness policy to ensure old bans are given a second chance. Mostly, we want the nuisance to sometimes randomly work, as it wastes more of their time fixing bugs.
And note, if a server is compromised and persistently causing a problem... we won't hesitate to black hole an entire country along with the active Tor exit nodes and known proxy lists (the hidden feature in context cookies).
I read that before, and read it again to make sure I didn't miss anything. The lack of clarity here is disappointing and raises more questions than it answers.
What I'm getting from this is that you hate having users almost as much as Reddit (which enshittified their website and banned all non-shit mobile apps and all search engines other than Google).
Imagine a world where people walk into your business with a mask over their face saying horribly abusive things... while pretending they are your neighbors... And poof... they automatically vanish along with their garbage content.
They may visit again, but are less likely to mess with the platform. Note, cons never buy anything... ever... it is against their temperament.
I find it interesting that several of the cons on YC are upset by someone else's administrative policies. Reddit should enforce these policies too, or at least drop a country or pirate flag icon beside nasty posts...
Have a great day, and don't fear the ban hammer friend =3
The more interesting thing for me is that the crawler didn't detect on its own that it had racked up 10TB from one site in one day.
If I were designing a crawler, I'd keep at least some basic form of tracking, if only to check for people deliberately trolling me by delivering an infinite chain of garbage.
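A minimal sketch of that kind of per-site accounting inside a crawler (the 10 GiB/day budget is an arbitrary illustrative number, not something from the post):

    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    DAILY_BYTE_BUDGET = 10 * 1024**3  # hypothetical per-host cap: 10 GiB/day

    class HostBudget:
        """Track bytes fetched per host and refuse to fetch once over budget."""
        def __init__(self):
            self.window_start = time.time()
            self.bytes_by_host = defaultdict(int)

        def record(self, url, num_bytes):
            if time.time() - self.window_start > 86400:   # reset the window daily
                self.window_start = time.time()
                self.bytes_by_host.clear()
            self.bytes_by_host[urlparse(url).netloc] += num_bytes

        def allowed(self, url):
            return self.bytes_by_host[urlparse(url).netloc] < DAILY_BYTE_BUDGET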
I've been running guthib.mattbasta.workers.dev for years, and in the past few months it's hit the free limit for requests every day. Some AI company is filling their corpus with exactly that: infinite garbage.
That's also $73 worth of bandwidth on another server host. Please stop using extremely expensive hosts and then blaming other people for the consequences of your decision.
Try page rate-limiting (6 hits a minute is plenty for a human), and then pop up a captcha.
If they keep hitting the limit 4+ times within an hour, then get fail2ban to block the IP for 2 days.
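For illustration only, here is roughly what that policy looks like expressed as application code (thresholds taken from the comment above; a real setup would do the blocking in the web server or fail2ban rather than in an in-memory dict):

    import time
    from collections import defaultdict, deque

    PAGE_LIMIT = 6           # pages per minute, per IP (from the comment above)
    STRIKE_LIMIT = 4         # violations per hour before a ban
    BAN_SECONDS = 2 * 86400  # 2-day ban

    hits = defaultdict(deque)     # ip -> timestamps of recent page hits
    strikes = defaultdict(deque)  # ip -> timestamps of rate-limit violations
    banned_until = {}             # ip -> unban time

    def check(ip):
        """Return 'ok', 'captcha', or 'banned' for this request."""
        now = time.time()
        if banned_until.get(ip, 0) > now:
            return "banned"
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > 60:      # keep a one-minute window
            q.popleft()
        if len(q) <= PAGE_LIMIT:
            return "ok"
        s = strikes[ip]
        s.append(now)
        while s and now - s[0] > 3600:    # keep a one-hour window
            s.popleft()
        if len(s) >= STRIKE_LIMIT:
            banned_until[ip] = now + BAN_SECONDS
            return "banned"
        return "captcha"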
73TB is a fair amount to have on a cloud... usually above 30TiB, firms must diversify with un-metered server racks and CDN providers (traditional for large media files, etc.)
Rate limiting firewalls and spider traps also work well...
There are page referral monitoring, context cookies, and dynamically created link chaff with depth-charge rules.
One can dance all day friend, but a black hole is coming for the entire IP block shortly. And unlike many people, some never remove a provider's leased block and route until it is re-sold. =3
I mean, blocking random IP ranges is your prerogative if you don't want to have customers. The scrapers will find ways around, while actual users will be unable to use your site. Residential proxies are something like $5 for 1000.
True, but note that domestic ISP IP ranges are published, and unless one deals internationally... don't bother serving people who will never buy anything from your firm anyway.
Domestic "Users" functioning as proxies will be tripping usage limits, and getting temporarily banned. Google does this by the way, try hammering their services and find out what happens.
Context cookies also immediately flag egregious multi-user routes, and if it is an ISP IP you know it's a problem user. If it is over 15 users an hour per IP, then you can be 100% sure it's a Tor proxy.
We ban over 243,000 IPs, and have seen zero impact to our bottom line.
Those of us running sites for public information rather than sales cannot make the simple cut-off that you do.
And again, none of this is simple. It has taken me a few weeks to establish a usage mechanism that does catch the worst feed pullers, but it still can hurt legit new users. That is an opportunity cost.
One must assume most user IP edge proxies are compromised hosts. If someone paid for that list they were almost certainly conned, as the black hats regularly publish that content on their forums. These folks want as many users as possible in order to hide their nuisance traffic origin in the traffic noise.
Allowing users known to have an active RAT or their "proxy friends" on a commercial site is not helping anyone.... especially the victims.
The captcha trigger events on many sites will often keep nagging/blocking people till they update.
Don't take this trend personally friend; if we see a fake iPhone sporting >1300 Mbps of bandwidth... then the host is getting permanently banned anyway.
1. Gets incrementally slower until the firewall's per-user rate-limiting tokens refill the bucket (chokes >6 MiB/min bandwidth use, and enforces abnormal-traffic ban rules; see the sketch after this list)
2. Pauses serving a page if you spider through 6+ pages a minute (chokes speculative downloading)
3. If you violate site usage rules 4+ times in the past hour, then you get a 2-day IP ban
4. If you trip a spider trap, then you get a 5-day ban
5. If you are issued more than 5 context cookies, then the IP will get spammed with a captcha on every page for 5 days
6. If you violate any number of additional signatures (Shodan etc.), then your IP block and route get permanently banned. There is only 1 exception to this rule, and we don't share that with anyone.
7. The site content navigation is programmatically generated in JavaScript
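As referenced in rule 1, a token-bucket style byte choke could look roughly like this (a sketch only; the commenter does this in the firewall rather than in application code, and the delay calculation is my own assumption):

    import time

    REFILL_RATE = 6 * 1024**2 / 60.0   # 6 MiB per minute, in bytes per second
    BUCKET_CAP = 6 * 1024**2           # burst allowance: one minute's worth of budget

    class ByteBucket:
        """Per-IP token bucket: spend tokens per byte served, slow down when empty."""
        def __init__(self):
            self.tokens = BUCKET_CAP
            self.last = time.monotonic()

        def throttle_delay(self, response_bytes):
            now = time.monotonic()
            self.tokens = min(BUCKET_CAP, self.tokens + (now - self.last) * REFILL_RATE)
            self.last = now
            self.tokens -= response_bytes
            if self.tokens >= 0:
                return 0.0                       # under budget: serve immediately
            return -self.tokens / REFILL_RATE    # over budget: wait for the deficit to refill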
Over 99% of the bandwidth (and CPU) taken by the biggest podcast / music services simply on polling feeds is completely unnecessary. But ofc pointing this out to them gets some sort of "oh this is normal, we don't care" response because they are big enough to know that eg podcasters need them.
I run pinecast.com. If there was a leaderboard for hn users serving XML, I'd almost certainly be in the top five.
I don't disagree with your post. But: RSS downloads are at an all time low, and that's a bad thing.
They're at an all time low because Spotify and Apple both fetch feeds from centralized servers. 1000 subscribers no longer means 24000ish daily feed fetches, it means 48. With keep alive or H2, these services simply don't reconnect. The number of IPs that hit me from Apple, for instance, is probably only double digits.
Since Apple and Spotify both sit between me and the listeners, they eliminate the privacy that listeners would otherwise enjoy. It also forces podcasters to go to them to find out how many people are subscribed, which means lots of big databases instead of one database that I host for my customers.
Centralization of feed checking carries huge risks, in my opinion, especially as both Apple and Spotify make moves to also become the hosting providers.
Do you have a CDN between you and Apple / Spotify? Because if you do I think that Apple/Spotify are polling that CDN every few minutes and the CDN is having its bandwidth wasted invisibly, but presumably priced in.
Also I agree that the re-centralisation is a bad thing, mainly.
(I'd like to move to email to discuss this further, if possible: I have an arXiv paper to write!)
The data I'm giving you is based on logs from the CDN. Most feeds are checked by Apple and Spotify every hour, but usually it's less frequently rather than more: shows that haven't been published to in a year or more might see very infrequent feed checks.
Not easily on a static server and not without the risk of annoying an actual real human listener!
If you look at the "Defences" section: https://www.earth.org.uk/RSS-efficiency.html#Hints you'll see there are some things that can be done, such as randomly rejecting a large fraction of requests that don't allow compression (gzip is madly effective on many feed files: it's rude for a client not to allow it). But all these measures take effort to set up, and don't stop the bad bots making the request 100s of times too often. Just responding to each stupid request forces a flurry of packets and wakes up and uses CPU...
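For what it's worth, the "randomly reject requests that don't allow compression" defence is only a few lines if you control the request handler (a sketch; the 80% rejection rate is an arbitrary example, not the figure used on earth.org.uk):

    import random

    REJECT_PROBABILITY = 0.8  # arbitrary example: drop most non-gzip feed requests

    def should_reject(headers):
        """Randomly reject a large fraction of requests that won't accept gzip."""
        accept_encoding = headers.get("Accept-Encoding", "")
        if "gzip" in accept_encoding.lower():
            return False          # polite client: always serve
        return random.random() < REJECT_PROBABILITY  # respond with e.g. 406 or 429 instead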
> Clients (hopefully bots) that disregard robots.txt and connect to your instance of HellPot will suffer eternal consequences. HellPot will send an infinite stream of data that is just close enough to being a real website that they might just stick around until their soul is ripped apart and they cease to exist.
My 100 mbps upload bandwidth at home is free (apart from the monthly 35€ payment). Useless bots will get stuck downloading from me instead of hogging readthedocs.
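HellPot itself is an existing Go project, but the general tarpit idea is easy to sketch: stream plausible-looking markup indefinitely to any client that ignored robots.txt (the word list, pacing, and endpoint wiring below are all made up for illustration):

    import itertools
    import random
    import time

    WORDS = ["docs", "build", "install", "config", "release", "api", "usage"]

    def endless_pages():
        """Yield an unending stream of vaguely page-like HTML chunks."""
        yield "<html><body>\n"
        for i in itertools.count():
            filler = " ".join(random.choices(WORDS, k=50))
            yield "<p>%s</p>\n<a href='/page/%d'>next</a>\n" % (filler, i)
            time.sleep(0.5)  # optional pacing so the bot wastes time, not just your bandwidth

    # Wire this generator up as a streaming response body for disallowed paths.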
I blocked Microsoft/OpenAI a few weeks ago for (semi) childish reasons. Seven months later, Bing still refuses to index my blog, despite scraping it daily. The AI scrapers and crawlers toggle on Cloudflare did the trick.
Not only that, even commoncrawl had issues (about a year ago) where AWS couldn't keep up with the demand for downloading the WARCs.
As someone who has written a lot of crawling infrastructure and managed large-scale crawling operations, I can say respectful crawling is important.
That being said it always seems like google has had a massively unfair advantage for crawling not only with budget but with brandname, and perceived value. It sometimes felt hard to reach out to websites and ask them to allow our crawlers, and grey tactics were often used. And I'm always for a more open internet.
I think regular releases of content in a compressed format would go a long way, but there would always be a race for the freshest content. What might be better is offering the content in a machine format: XML or JSON or even SOAP. That is usually better for what the crawling parties want to achieve, cheaper for you to serve, and less resource-intensive than crawling. (Have them "cache" locally by enforcing rate limiting and signup.)
> That being said it always seems like google has had a massively unfair advantage for crawling not only with budget but with brandname, and perceived value.
VCs and other startup culture evangelists are always challenging founders to figure out what their ‘unfair advantage’ is.
While the crawling is disrespectful, it seems RTD could find a cheaper host for their files. At my work we have a 10G business fiber line and serve >1PB per month for around $1,500. Takes 90% of the load off our cloud services. Took me just a couple weeks to set up everything.
Having built an AI crawler myself for first party data collection:
1. I intentionally made sure my crawler was slow (I prefer batch processing workflows in general, and this also has the effect of not needing a machine gun crawler rate)
2. For data updates, I made sure to first do a HEAD request and only access the page if it has actually been changed. This is good for me (lower cost), the site owner, and the internet as a whole (minimizes redundant data transfer volume)
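Point 2 is essentially conditional fetching; a minimal sketch using the requests library (the validator handling is simplified, and this is just one way to do it, not the parent's actual implementation):

    import requests

    def fetch_if_changed(url, last_etag=None, last_modified=None):
        """HEAD first; only GET the body if the page looks changed since the last crawl."""
        head = requests.head(url, allow_redirects=True, timeout=10)
        etag = head.headers.get("ETag")
        modified = head.headers.get("Last-Modified")
        if last_etag and etag == last_etag:
            return None, etag, modified            # unchanged: skip the download
        if last_modified and modified == last_modified:
            return None, etag, modified
        resp = requests.get(url, timeout=30)       # changed (or unknown): fetch the body
        return resp.content, resp.headers.get("ETag"), resp.headers.get("Last-Modified")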
Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:
- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them
- humans end up having to access them: this results in a given site either not being included at all, or accessed once but never reaccessed, causing aggregator data to go stale
- aggregators often outrank individual sites due to better SEO and likely human preference of aggregators, because it saves them research time
- this results in the original site being put at a competitive disadvantage in SEO, since their product ends up not being listed, or listed with outdated/incorrect information
- that sequence of events leads to negative business outcomes, especially for smaller businesses who often already have a higher chance of failure
Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.
The policy should be carefully balanced to consider all these factors as well as having a baked in mechanism for low friction amendment based on future emergent effects.
This would result in a much better internet, one that has the property of GINI regulation, ensuring well-distributed outcomes that are optimized for global socioeconomic prosperity as a whole.
Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.
Shouldn't all sites have some kind of bandwidth / cost limiting in place? Not to say that AI crawlers shouldn't be more careful, but there are always malicious actors on the internet; it seems foolish not to have some kind of defense in place.
The big three cloud providers (AWS/GCP/Azure) have collectively decided that you don't want to set a spending limit actually, so they simply don't let you.
The big three cloud providers are the most expensive by a factor of 10-100x, and shouldn't be used under any circumstances unless you really, really need specific features from them.
It's harder to do right than you think. The first dynamic bandwidth (and concurrent connection) limiter that I wrote was to protect a site against Google, in part!
> We have IP-based rate limiting in place for many of our endpoints, however these crawlers are coming from a large number of IP addresses, so our rate limiting is not effective.
Do you have something else in mind? Just shut down the whole site after a certain limit?
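One literal reading of "shut down after a certain limit" is a global monthly egress budget that flips the site into a degraded mode; a toy sketch (the budget number, the 503 response, and the reset mechanism are all arbitrary assumptions):

    MONTHLY_EGRESS_BUDGET = 20 * 1e12   # hypothetical: 20 TB of "normal" traffic per month
    bytes_served_this_month = 0         # persisted and reset by a scheduled job in practice

    def serve(request, body):
        """Serve normally until the monthly byte budget is spent, then start refusing."""
        global bytes_served_this_month
        if bytes_served_this_month >= MONTHLY_EGRESS_BUDGET:
            return 503, "Service temporarily reduced: monthly bandwidth budget exhausted"
        bytes_served_this_month += len(body)
        return 200, body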
The TikTok crawler fucked us up by taking product names (e-commerce) and recursively feeding them from the results page back into the search bar. Respect the game, but not respecting the robots.txt crawl delay is awful.
What amazes me is that none of this is surprising: all this behavior (not just what's described in the post) is on par with what these companies are doing, and have been doing, for decades... And yet there will be many people, including here on HN, who will just cheer these companies on because they spit out an "open source model" or a 10-dollars-a-month subscription.
The situation didn't change when it was search index crawlers being called-out. At the end of the day, this sort of "abuse" is native to the world of the internet; like you said, it's decades old at this point.
HN will cheer on a lot of things that run counter to their wellbeing; open-weight models don't feel like one of them. You can't protest AI (or search engines) because after long enough people can't do their jobs without them. The correct course of action is to name-and-shame, not write pithy engineering blogs begging people to stop. People won't stop.
And I'll be damned if Tanta isn't having his tweets used for training Elon's AI. The parochial circle regresses as it goes around, I'm done acknowledging the make-believe barriers we pretended the internet clung to.
You post it, others consume it. Same as it ever was.
Generally Googlebot is well behaved and efficient these days, though I have discovered that it is currently horribly broken around 429 / 503 response codes... And pays no attention to Retry-After either... Also Google-Podcast which is meant to have been turned off!
Emphasis on were, since they've made their search such utter shit in the quest for ad revenue that they're now going to have their own AI sum up your results (badly) instead to attempt to solve the problem they created.
Not originally since they just sent people to the site.
But these days where they just rip content from the site to give people as answers, completely depriving the site of traffic, yeah that seems basically just as bad as the AI bots.
Exploitation of the common man being a key ingredient of a product has never once inspired any actual consumer revolt. Fundamentally, no matter what they say, as long as they get their fleeting hit of dopamine from buying/using a thing, people just don't care.
As Squidward says: nobody gives a care for the fate of labor as long as they get their instant gratifications.
> "One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."
Invoice the abusers.
They're rolling in investor hype money, and they're obviously not spending it on competent developers if their bots behave like this, so there should be plenty left to cover costs.
Never going to happen due to misaligned incentives. In 2024 everyone wants to keep their data behind lock and key. The commons is gone. Just look at the Google/Reddit deal.
Had a conversation with a firm that wanted a distributed scraper built, and they really did not care about site usage policies.
You would be fooling yourself if you thought such a firm cared about robots.txt or page tags.
We warned them they would eventually be sued, told them to contact the site owners for legal access to the data, and issued a hard pass on the project. They probably assumed that if the indexing process ran out of another jurisdiction, their domestic firm wouldn't be liable for theft of service or copyright infringement.
It was my understanding AI/ML does not change legal obligations in business, but the firm probably found someone to build that dubious project eventually...
Spider traps and rate-limiting are good options too. =3
Robots.txt doesn't create a legal obligation. It's just a set of rules saying "if you don't follow these rules to politely crawl our site, we'll block you from crawling our site".
Obviously “anything goes” in civil suits however - if someone is being absurdly egregious with their crawling there’s usually some exposure to one tort or another.
Please review HiQ vs. LinkedIn - it hinged on the fact that HiQ hired crowdsourced workers (“turkers”) to create fake profiles through which to access LinkedIn’s platform (who had to agree to the ToS to create these accounts). The court found that hiQ expressly agreed to the user agreement when it created its corporate account on LinkedIn’s platform.
This doesn't apply if you don't ever agree to anything - which is the case if the information is not locked behind account creation.
If I recall correctly, it is considered theft-of-service if you bypass the posted site usage terms with an agent like a spider, and certainly a copyright violation for unauthorized content usage (especially in the context of a commercial venture).
One may be sued, but not because you parsed robots.txt wrong =3
> it is considered theft-of-service if you bypass the posted site usage terms
My understanding is that this is not accurate.
HiQ v LinkedIn established that this is only the case if you actually agreed to the terms of service. Such "agreement" only happens if the information is walled behind an account creation process, e.g. Facebook, Inc. v. Power Ventures, Inc. If it's just scraping publicly available webpages, the only legal issue with scraping would be unreasonably or obviously negligent scraping practices which lead to degradation or denial-of-service. And obviously the line for that would have to be determined in civil court.
eBay v. Bidder's Edge (2000) is the last case I could find which even considered violation of robots.txt, and only as a very minor part of the judgement; the findings were based far more on other things. Intel Corp. v. Hamidi also implicitly overruled that judgement (though Hamidi was not related to robots.txt, which was really just a very minor point in the first place).
Hard to say; I seem to recall it was because some spider authors used session cookies to bypass the EULA (the page probe auto-clicks "I agree" to capture the session cookie), and faked user-agent strings to spoof Googlebot to gain access to site content.
One thing is for certain: it's jurisdictional... and way too messy to be responsible for maintaining/hosting (the ambiguous copyright protection outside a research context looked way too risky). =3
Unlikely; site-generator hosts are still happily providing a limitless supply of remixed well-structured nonsense, random images with noise, and valid links to popular sites.
In this case, they showed up to the data buffet long after it went rotten due to SEO.
How is that supposed to work? Under which law will you force me to pay for using your public website, especially if I am not in your country? Just put up a captcha and block crawlers; you're not going to get them to pay you.
Everything old is new again. I remember when someone at Google started aggressively crawling del.icio.us from a desktop machine and I ended up blocking all of their employees...
Have capitalists ever stopped just because their actions (that make them money) hurt others? Because the consequences of the damage they cause might in the end hurt them, too?
Just 2 buggy crawlers doesn't seem like that many. Sure, they each had a large impact, but given that there are likely hundreds if not thousands of such crawlers out there, it's a rather small number. It seems that most crawlers are actually respectful.
I used to run a site with a huge number of pages that had high running costs but low revenue.
The only web crawler that did anything for me was Google, as Google sent an appreciable amount of traffic. Referrers from Bing were almost undetectable: the joke among my black hat SEO friends at the time was that you could rank for money keywords like "buy wow gold" and get 10 hits. Then there were the Chinese crawlers like Baidu that would crawl at 10x the rate of Google but send zero referrers. And then there were crawlers looking for copyrighted images that cost me money to accommodate even if they never sent me cease and desist letters.
As much as I hate the Google monopoly I couldn't afford having my site crawled like that without any benefit to me.
It's an awful situation for the long term though because it prevents new entrants. Right now I am thinking about a new search engine for a vertical where a huge number of products are available from different vendors and when you do find results from Google they are sold out at least 70% of the time. I hate to think it's going to get harder to make something.
There was a paper about webcrawlers circa 2000 that pointed out that the vast majority of academics who ran webcrawlers never published a paper based on their work.