Affected companies are becoming increasingly frustrated with the army of AI crawlers out there, because they won't stick to any scraping best practices (respecting robots.txt, using public APIs, avoiding peak load).
It's not necessarily about copyright, but the heavy scraping traffic also leads to increased infra costs.
What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.
The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so the pretend human is still rate limited. Eventually you need an account, that account also gets tracked, accounts matching suspicious patterns get purged, and so on. This will not stop scraping, but the point is not to stop it; the point is to make it expensive and slow. Eventually it becomes expensive enough that it's better to stop pretending to be human, pay for a license, and then the arms race goes away.
Can defenses be good enough that it's better not to even try to fight? That's a far harder question than whether a random bot can make a dozen requests while pretending to be human.
Make it easier to get the data, put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome than scraping, or if you refuse to offer a legitimate option at all.
Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.
Well, it isn't a case of piracy, is it? The data exists on the website, for free, under the assumption/social contract that you are a human, not an agent of a shady enterprise wasting the bandwidth. An analogy would be the game itself being put out for free on itch.io, but then downloaded and unpacked to make an asset flip.
Seems to me we might eventually hit a point where stuff like API access is whitelisted. You will have to build a real relationship with a real human at the company to validate you aren't a bot. This might include an in-person meeting, as anything else could be spoofed. Back to the 1960s business world we go. Thanks, technologists, for pulling the rug out from under us all.
Scraping implies not API - they're accessing the site as a user agent. And whitelisting access to the actual web pages isn't a tenable option for many websites. Humans generally hate being forced to sign up for an account before they can see this page that they found in a Google search.
Scraping often uses the same APIs that the website itself does, so to make that work a lot of sites will have to put their content behind authentication of some sort.
For example, I have a project that crawls the SCP Wiki (following best practices, ratelimiting, etc). If they were to restrict the API that I use it would break the website for people, so if they do want to limit the access they have no choice but to instead put it behind some set of credentials that they could trace back to a user and eliminate the public site itself. For a lot of sites that's just not reasonable.
You can't whitelist and also have a consumer-facing service. There is no reliable way to differentiate between a legitimate user and the AI company's scraper.
Yep, it reminds me of the Ferrari almost-scam that was thwarted because the target thought to verify by asking about something that was only shared in-person.
I could definitely see this. I worked for a company that had a few popular free inspector tools on their website. The constant traffic load of bots was nuts.
I don't know if the AIs have an endgame in mind. As for the humans, I think it's an internet built for a dark forest. We'll stop assuming that everything is benign except for the malicious parts, which we track and block. Instead we'll assume that everything is malicious except for the parts which our explicitly trusted circle of peers has endorsed. When we get burned, we'll prune the trust relationship that misled us, and we'll find ways to incentivize the kind of trust hygiene necessary to make that work.
When I compare that to our current internet the first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.
Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.
Right. If your use case for the internet is exerting influence over people who don't trust you, then it's past time that we shut you down anyhow.
For everyone else, this transition will not be a big deal (although your friends may ask you to occasionally spend a few cycles maintaining your part of a web of trust, because your bad decisions might affect them more than they currently do).
Websites previously would have their own in-house API to freely deliver content to anyone who requests it.
Now, a website should be a simple interface for a user: it communicates with an external API and displays the result. It's the user's responsibility to have access to the API.
Any information worth taking should be locked away behind authentication - which has become stupid simple using OAuth with major providers.
So these people trying to extract content by paying someone or using a paid service should rather use the API which packages it for them and is fairly priced.
Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.
AI (and greed) has killed the open freedoms of the Internet.
The open web is on a crash course. I don't necessarily believe in copyright claims, but I think it makes sense to aggressively prosecute scrapers for DDoSing.
An optimistic outcome would be that public content becomes fully peer-to-peer. If you want to download an article, you must seed at least the same amount of bandwidth to serve another copy. You still have to deal with leechers, I guess.
There is no reason to protect against bots using regular captchas (it seems I'm weaker than your average bot at passing those). Brave Search has a proof-of-work captcha, and every time I face it I'm glad it's not Google's pick-the-bicycle one. Having a captcha be a heavy process that runs for a couple of seconds is a minor nuisance to me, who needs to complete it once a day, but for someone who has to do it many times while scraping, the costs add up rather quickly. And its fundamental mechanism makes its effectiveness independent of how much progress AI has made.
Also, maybe the recent rise in captcha difficulty is not companies making them harder to stop bots, but bots skewing what counts as the right answer. As I understand it, captchas are graded against other users' answers, so if a huge portion of those other users are bots, they can fool the algorithm into accepting their wrong answer as the right one.
You can rather easily set up semi-hard rate limiting with a proof-of-work scheme. It only trivially affects human users, while bot spammers have to eat the cost of a million hash inversions per hour or whatever.
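For what it's worth, here's a minimal sketch of such a scheme (the difficulty constant and challenge format are illustrative assumptions, not any particular site's implementation):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # ~2**20 (~1M) hashes of work per challenge on average; tune per deployment


def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash has DIFFICULTY_BITS leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking costs one hash; producing a valid nonce costs ~2**DIFFICULTY_BITS."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)           # noticeable but tolerable once per session for a human
    assert verify(c, n)    # near-instant for the server, on every request
```

The asymmetry is the whole point: one hash to verify, roughly a million to produce, which a human pays a handful of times a day but a scraper pays on every single request.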
Many would oppose the idea, but if any service (e.g. eBay, LinkedIn, Facebook) were to dump the snapshot to S3 every month, that could be a solution. You can't prevent scraping anyway.
We publish a live stream of minutely-updated OpenStreetMap data in ready-to-digest form on https://planet.openstreetmap.org/ and S3. Scraping of our data still happens.
Our S3 bucket is thankfully supported by the AWS Open Data Sponsorship Program.
Would the snapshot contain the same info (beyond any doubt) that an actual user would see if they opened LinkedIn/Facebook/some service from Canada on an iPhone on a Saturday morning (for example)? If not, the snapshot is useless for some use cases and we are back to scraping.
How long before companies start putting AI restrictions on new account creation simply because of the sheer amount of noise and storage issues associated with bot spam?
> AI can already solve captchas, so the arms race for bot protection is pretty much lost.
Require login, then verify the user account is associated with an email address at least 10 yrs old. Pretty much eliminates bots. Eliminates a few real users too, but not many.
You can't cache this stuff for bot consumption. Humans only want to see the popular stuff. Bots download everything. The size of your cache then equals the size of your content database.
But you can still make sure that you save the data in a form where generating the served webpage takes the least amount of time. For most websites this means saving the HTML - in a giant cache or with a more deliberate pre-generation setup.
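A rough sketch of what that kind of pre-generation can look like (the article store and template here are made-up placeholders, not any particular CMS):

```python
from pathlib import Path

# Hypothetical content store: slug -> (title, body). In a real setup this
# would come from the CMS database at publish time.
ARTICLES = {
    "hello-world": ("Hello World", "<p>First post.</p>"),
    "about": ("About", "<p>Who we are.</p>"),
}

TEMPLATE = "<html><head><title>{title}</title></head><body><h1>{title}</h1>{body}</body></html>"


def pregenerate(out_dir: str = "public") -> None:
    """Render every page once when content changes; serving is then just a static file read."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for slug, (title, body) in ARTICLES.items():
        (out / f"{slug}.html").write_text(TEMPLATE.format(title=title, body=body))


if __name__ == "__main__":
    pregenerate()  # point nginx/Apache at ./public and the CMS never runs per request
```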
Clearly not the case with most websites. And "milliseconds" are already a huge amount of time. Video games simulate huge worlds and render complex 3D graphics within 16 ms, or even much less at the >60 fps frame rates expected these days.
You've gotten off topic here. The giant cache you speak of approaches the size of the content database when designing for the long tail of bots. A giant cache is uneconomic and thus not a solution, unless you're an AWS salesman.
Yes, the cache ends up being bigger than the content database, but for text content that's typically not a problem. The human effort to type some text always hugely exceeds the cost of a few kilobytes of flash to store what they typed in a ready-to-serve form.
The generation process of taking the raw text and assembling the page around it is typically rather expensive for most CMS systems. Sure, it isn't theoretically expensive, but unless you want to engineer a CMS from scratch, most people just pick one off the shelf and then end up paying the CPU-time overhead of WordPress etc.
Not at all. You can either design your website so that pages can be retrieved in sub-millisecond time (and that doesn't have to mean throwing money at cloud providers) or you can cry about bots.
I must be an outlier here, but I don't keep email addresses that long. After a couple of years they're on too many spam lists. I'll wind those addresses down, use them for a couple more years only for short interactions that I expect spam from, and ultimately close them down completely the next cycle.
You are definitely an outlier in that you abandon email addresses deliberately. But many people do not have an old address simply because they lost access to their previous ones for one of many possible reasons, the most common one being that it was provided with a business relationship (e.g. ISP contract) that no longer exists.
That's before even getting into how you'd possibly verify email address age, especially without preventing self-hosting.
That doesn't seem remotely compatible with modern privacy laws like the GDPR. And it certainly adds even more false negatives: people locked out because they didn't have their email leaked long enough ago.
Which doesn't include collecting information about your email and then handing it off to random third parties just because they want to use it for that.
And how exactly do you block scraping IPs? I suspect some of the scrapers are just confused and not aware of better ways to get OSM data.
Responding with a 403 error code will only lead to them changing their IP addresses.
A more effective approach might be to provide a response containing instructions on where to download data in bulk or a link to a guide that explains how to process OSM dumps.
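A minimal sketch of that idea, assuming a Flask front end and a placeholder looks_like_scraper() heuristic (both are illustrative; this is not how OSM actually handles it):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def looks_like_scraper(req) -> bool:
    """Placeholder heuristic; a real deployment would use rate counters, ASN lists, etc."""
    ua = req.headers.get("User-Agent", "").lower()
    return "python-requests" in ua or "scrapy" in ua


@app.before_request
def redirect_bulk_consumers():
    # Returning a response here short-circuits normal request handling.
    if looks_like_scraper(request):
        return jsonify({
            "error": "Please don't scrape this site page by page.",
            "bulk_downloads": "https://planet.openstreetmap.org/",
            "regional_extracts": "https://download.geofabrik.de/",
        }), 429
```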
You can literally set up your own OpenStreetMap instance in ten minutes. It's a simple 'docker run' command. Sure, indexing will take a bit, but even that can't take that long given their resources. That's just ridiculously greedy.
A while ago I very briefly tried Headway out of curiosity. This is the easiest Docker based option for the "full stack". It didn't work out of the box. Things went wrong. Which is no surprise, there's a ton of moving parts. And maybe it's not a big deal to work around but I highly doubt that it's 10 minutes of work to get everything working reliably.
I used OSMRouter maybe 7 or 8 years ago to process a couple billion routes, and it was about as simple as GP described. You just need Docker and an index file. The US one was massive, so I kept needing a bigger VM just to load it, but once I did I was able to make HTTP calls over the network to get what I needed. It took a few days to get a working setup but only a few hours to rip through billions of requests, and I was making them synchronously from R; it could have been much faster if I had been smarter back then.
I needed OSM data at one point and never managed to figure out how to do it the proper way. To get the data you need, you have to download massive 100 GB files in obscure formats and use obscure libraries. Info is scattered; there are HTTP APIs, but they're limited or rate-limited and it's not clear if you're supposed to use them.
I know I’m ignorant and I’m happy the project exists, but the usability in the era where devs expect streamlined APIs is not great.
I ended up using some free project that had pre-transformed osm data for what i needed.
That's kind of by design. Providing streamlined APIs requires a funding model to both host those APIs and pay an army of devops to maintain them. The OSM Foundation is intentionally small and doesn't do that. Rather, it encourages a decentralised ecosystem where anyone can take the data and build services on it - some commercial, some hobbyist, some paid-for, some free. It works really well, and IMO better than the big-budget maximalist approach of the Wikimedia Foundation.
If you're talking about the new-ish data dumps provided in protobuf format, this is a heavily optimised binary format. OrganicMaps uses these files directly to be able to store and look up whole countries locally. With this format, the dump for France is only 4.3 GB at the time of writing.
Also, instead of downloading the whole map, you can use one of the numerous mirrors like Geofabrik [0] to download only the part you're interested in.
[0] https://download.geofabrik.de/
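For what it's worth, reading one of those .osm.pbf extracts doesn't take much code. A minimal sketch using the pyosmium library (assuming `pip install osmium` and a downloaded extract; the filename and the amenity=cafe tag are just examples):

```python
import osmium  # pip install osmium


class CafeCounter(osmium.SimpleHandler):
    """Stream every node in the extract and count those tagged amenity=cafe."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def node(self, n):
        if n.tags.get("amenity") == "cafe":
            self.count += 1


handler = CafeCounter()
handler.apply_file("bremen-latest.osm.pbf")  # hypothetical extract downloaded from Geofabrik
print(handler.count, "cafes found")
```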
What non-obscure formats or libraries would you suggest for a planet's worth of geographic data?
I've also downloaded planet.osm before and parsed it on my desktop with, IIRC, Osmosis. I never used that format or tool anywhere else, but it's not like OSM has many competitors offering large amounts of geospatial data in a freely usable way. What do you consider established mechanisms for this?
On https://www.openstreetmap.org/, click "Export" (upper-left). It lets you choose a small rectangle (click "Manually select a different area"). It gives you a .osm right from the browser.
For a literal single point: among the map icons on the right, one is an arrow with a question mark ("Query features"). With this you can click on individual features and get their data.
> I ended up using some free project that had pre-transformed osm data for what i needed.
That seems close enough to "the proper way". The OSM core devs can concentrate on providing the data in the format that existing OSM front ends are optimised to work with; if you want it transformed into some other popular format then it's great that the ecosystem already has free projects that will do that for you.
13-15 years ago I was able to download the OSM data for my country, import it into Postgres (PostGIS), run GIS queries on it, then render and print my own maps. I don't remember it being difficult, though it did indeed require lots of disk space.
OP here. The toot was my sarcastic response after having to rate-limit and block another set of abusive scrapers aggressively hitting our website and mapping API. robots.txt be damned.
OpenStreetMap data is free to download. We publish minutely updates on https://planet.openstreetmap.org/ and the data is available via AWS S3 + torrent.
Not dissimilar to how, instead of just cloning my pre-compressed repos in a simple few-seconds operation, "AI" scrapers prefer to request every single revision of every single .c file through the web interface, with all the (for them) useless bells and whistles.
A web interface which I have set up as CGI, so it will take them longer than the age of the universe to finish scraping. But in the meantime they waste my power and resources.
Someone recently pointed out that Aaron Swartz was threatened with prison for scraping; meanwhile there are hundreds of billions of dollars right now invested in AI LLMs built from... scraping.
Put planet.osm on torrent. Allow "scraping" only through torrent. Now scrapers are sharing the network load between themselves. Not to mention improved network speed, as probably they all sit on the same AWS instance.
Our data is already published via torrent, see https://planet.openstreetmap.org
Our data, including minutely updates is available via public S3 buckets (EU & US) supported by the AWS Open Data Sponsorship Program.
As expected. Now for the hard part: how to redirect scrapers to use torrents, and keep them uploading data? Should we write some "data scraping tutorials" specifically targeting OSM and describe using torrents as the best method?
Once, during a tech interview, the interviewer asked me to design a system for daily scraping English Wikipedia. I started with "Let's download the gzipped archive...". It turned out that the interviewer didn't know about that possibility and was waiting for the description of a complex system to download it page by page with multi-threading, canonical URLs, checking of visited pages, retries and so on. To their credit, they gave me "A" for this assignment. In the end, I got the job.
The people working for these companies are just clueless, arrogant, ignorant, unaware of others, just trying to hit some productivity target to get promoted. Of course they're not going to bother checking whether there are other ways to do something that avoid annoying open source projects.
In my company it’s easier to buy commercial software for $10000 than it is to donate $100 for open source voluntarily. I think they need to open up a store where you can buy donations disguised as licenses so the bean counters don’t even realize this could be free.
Same here. We are all Linux users and people linked articles about openssl being underpaid and all that (back when that was a topic), but after migrating from a paid chat tool to Signal, nobody agreed with me that we should maybe donate to Signal now. Both chat solutions are open source SaaS, but the former tool has a paid subscription for commercial use (which does nothing more than the personal version) whereas Signal calls it a voluntary donation. I still don't understand my colleagues' opinion
Paying thousands of euros for some enterprise Java software as well as Microsoft Office licenses we barely ever use, meanwhile: no problem
Never heard of any NTA or of business expenses being converted into money to myself when it's not on my account. Looking up what an NTA is in relation to taxes, they're a nonprofit themselves and don't appear to have authority of any kind besides PR https://en.m.wikipedia.org/wiki/National_Tax_Association
Regardless, if this were a concern in Germany, I'm sure our boss/director would have mentioned it on that call as a simple reason rather than finding excuses and saying we could pick a different nonprofit who needs it more to donate to
Companies donate all the time... this argument about it being considered income makes no sense, and if it did, just donate {income tax rate}% less if the company can't afford more, and there's no problem either.
That’s not my experience. We have systems that cost way into the six digits. You get an account manager that follows up and schedules a lot of meetings. But this doesn’t mean that difficult problems actually get resolved. Just a lot of talk. Even MS super premium support so far has never worked for me. They want an enormous amount of data and then go quiet.
"We're not arrogant, we just can't be bothered to do things any way other than the way we expect them to be done, even if that annoys others, and volunteers working for free should also accommodate our expectations."
If it is Russian and Chinese AI companies, they are unlikely to care. If it is Western companies, they might care, because they have more respect for the rules. A faked user agent is unfortunately usually an indication of the former.
Kind of sad that CommonCrawl, or something like it, has not removed the need for tons of different companies hitting all the servers in the world.
I guess part of it is wanting more control (more frequent visits, etc) and part is simply having lots of VC money and doing something they can do to try and impress more investors - "We have proprietary 5 PB dataset!" (literally adds nothing to commoncrawl).
I guess since it's posted to osm.town Mastodon, this is assumed to be known. Was surprised to see it without context here on HN; I can understand the confusion. Apparently most people here are already aware that one can download the full OpenStreetMap data without scraping
How come, in this era of no privacy, pixel trackers, data brokers, etc can they not easily stop scraping? Somehow bots have a way to be anonymous online yet consumers have to fight an uphill battle?
Because - and I cannot stress this enough - they are both ignorant and greedy.
Whenever I've traced back an AI bot scraping my sites, I've tried to enter into a dialogue with them. I've offered API access and data dumps. But most of them are barely above the level of "script kiddies". They've read a tutorial on scraping so that's the only thing they know.
They also genuinely believe that any public information is theirs for the taking. That's all they want to do; consume. They have no interest in giving back.
I don't know that this take is wrong, per se, but I think it's possibly a situation where the "actor with a single mind" model of thinking about corporate behavior fails to be particularly useful.
Scraping tends to run counter to the company's interests, too. It's relatively time-consuming - and therefore, assuming you pay your staff, expensive - compared to paying for an API key or data dump. So when engineers and data scientists do opt for it, it's really just individuals following the path of least resistance. Scraping doesn't require approval from anyone outside of their team, while paying for an API key or data dump tends to require going through a whole obnoxious procurement process, possibly coordinating management of said key with the security team, etc.
The same can be said for people opting to use GPT to generate synthetic data instead of paying for data. The GPT-generated data tends to be specious and ill-suited to the task of building and testing production-grade models, and the cost of constant tweaking and re-generation of the data quickly adds up. Just buying a commercial license for an appropriate data set from the Linguistic Data Consortium might only be a fraction as expensive once you factor in all the costs, but before you can even get to that option you first need to get through a gauntlet of managers who'll happily pay $250 per developer per year to get on the Copilot hype train but don't have the lateral thinking skills to understand how a $6,000 lump sum for a data set could help their data scientists generate ROI.
> Scraping tends to run counter to the company's interests, too. It's relatively time-consuming - and therefore, assuming you pay your staff, expensive - compared to paying for an API key or data dump.
It's the other way around, isn't it? Scraping is fairly generic and requires little staff time. The servers/botnet doing the scraping need to run for a while, but they are cheap or stolen anyway. APIs, on the other hand, are very specific to individual sites and need someone competent to develop a client.
Torrent users who do seed (assuming it’s copyrighted material) are no better. They’re just stealing someone else’s content and facilitating its theft.
If a company scrapes data, and then publishes the data for others to scrape.. they are still part of the problem — the altruism of letting other piggyback from their scraping doesn’t negate that they essentially are stealing data.
Stealing from a grocery store and giving away some of what you steal doesn't absolve the original theft.
> assuming it’s copyrighted material [...] They’re just stealing someone else’s content and facilitating its theft.
All content created by someone is copyrighted by default, but that does not mean it is theft to share it. Linux ISOs are copyrighted, but the copyright allows sharing, for example. But even in cases where this is not permitted, it would not be theft, but copyright infringement.
> the altruism of letting other piggyback from their scraping doesn’t negate that they essentially are stealing data.
It does. OpenStreetMap (OSM) data comes with a copyright licence that allows sharing the data. The problem with scraping is that the scrapers are putting unacceptable load on the OSM servers.
> Stealing from grocery store and giving away some of what you steal doesn’t absolve the original theft.
This is only comparable if the company that scrapes the data enters the data centre and steals the servers used by the OpenStreetMap Foundation (containing the material to be scraped), and the thing stolen from the grocery store also contains some intellectual property to be copied (e.g. a book or a CD, rather than an apple or an orange).
“Last night somebody broke into my apartment and replaced everything with exact duplicates... When I pointed it out to my roommate, he said, "Do I know you?”
― Steven Wright
This is similar to the argument used against people stealing music and movies — people would pirate content that someone else invested money to create. But the dominant attitude prior to the ubiquity of streaming, among the tech "information should be free" crowd, was that torrents of copyrighted material were perfectly fine. This is no different. But it is different — when it's your resources that are being stolen/misused/etc.
My opinion is that if you are building a business that relies on someone else’s creation — that company should be paid. This isn’t just about “AI” companies — but all sorts of companies that essentially scour the web to repackage someone else’s data. To me this also includes those paywall elimination tools — even the “non profits” should pay — even if their motives are non-profit, they still have revenue. (A charity stealing food from the grocery store is wrong, a grocery store donating to a charity is a different thing.)
However another aspect of this is government data and data created with government funds — scientific research for example. If a government grant paid for the research, I shouldn’t have to pay Nature to access it. If that breaks the academic publishing model — good. It’s already broken. We shouldn’t have to pay private companies to access public records, lawsuit filings, etc.
Anything they can, judging by the fact that they're hitting random endpoints instead of using those offered to developers. Similar thing happened to readthedocs[1] causing a surge of costs that nobody wants to answer for.
In the readthedocs situation there was one case of a bugged crawler that scraped the same HTML files repeatedly to the tune of 75 TB; something similar could (partially) be happening here with OSM.
Street names and numbers, businesses etc. associated with those streets, and stuff like that.
Say you have some idea, like...you want to build a tool that aids cargo ships, fishing vessels, or other vessels with the most efficient route (with respect to fuel usage) between ports.
The first thing you need to do is to map all ports. There may not exist any such pre-compiled list, but you could always use map tools like OSM to scan all coastlines and see if there are any associated ports, docks, etc. there.
Then when you find one, you save the location, name, and other info you can find.
This is pure brute force, and can naturally be quite expensive for the providers. But since you're a dinky one-man startup with zero funds, that's what you do - you can't be bothered with searching through hundreds (to thousands) of lists in various formats, from various sites, that may contain the info you're looking for.
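For what it's worth, that particular brute-force scan usually isn't even necessary: harbour-like features are already tagged in OSM and can be pulled in bulk from the public Overpass API. A rough sketch (the tag choices harbour=yes and leisure=marina and the bounding box are assumptions; check the OSM wiki for the tags your use case actually needs):

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: harbours and marinas inside an illustrative bounding box
# (south, west, north, east) roughly covering the British Isles.
QUERY = """
[out:json][timeout:180];
(
  node["harbour"="yes"](49.0,-11.0,61.0,2.0);
  way["harbour"="yes"](49.0,-11.0,61.0,2.0);
  node["leisure"="marina"](49.0,-11.0,61.0,2.0);
  way["leisure"="marina"](49.0,-11.0,61.0,2.0);
);
out center;
"""

resp = requests.post(OVERPASS_URL, data={"data": QUERY})
resp.raise_for_status()
for element in resp.json()["elements"]:
    tags = element.get("tags", {})
    lat = element.get("lat") or element.get("center", {}).get("lat")
    lon = element.get("lon") or element.get("center", {}).get("lon")
    print(tags.get("name", "<unnamed>"), lat, lon)
```

A Geofabrik extract plus pyosmium (as sketched further up the thread) works just as well if you'd rather not lean on a shared public endpoint.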
And like most bad things that companies do, it happens inevitably. Anyone who objects to something that generates income will be fired and replaced with someone who won't object.
Meanwhile, the owners will maintain and carefully curate their ignorance about any of those subjects.
Counter-argument: it's just an excuse to get rid of scraping. Google and every search engine scrapes websites, the Internet Archive scrapes websites to archive stuff, and I scrape data when using Excel to import data. Also, there are people who want to archive everything.
I've had my own stuff scraped. My biggest issue was bandwidth, but I wasn't running a big site so it wasn't a big problem.
If their scraper is sufficiently diligent, they will also download the data dump. The ultimate hope is that when the AI wakes up, it will realize that its training data has both a fragmented dataset and also a giant tarball, and delete the fragments. This sounds like one of those situations where people prefer amortized cost analysis to avoid looking like fools.
(unfortunately the historical example is poor, since Philip II proved both able and willing to make it happen, whereas AI has no demonstration of a path to utility)
I mean, is it though? LLMs existed even before GPT-4... and not just from OpenAI.
I get not liking crawling, and I hate OpenAI for how they ruined the term open source, but this is not new.
I've had stuff scraped before and I've done web scraping as well. Hell, even Excel will help you scrape web data. While the increase in training data has helped models like GPT-4, it's not just a factor of more data.
From my perspective as a small blogger, the general sentiment is: yes, web scraping was a thing before the AI boom, but now it is an even bigger and far more careless thing. Just my two cents, though, mostly formed from chatting with other small bloggers about it. I suppose it could be argued that the size of the vehicle does not matter to the roadkill; they still get run over. But it honestly feels like the various AI platforms suddenly made web scraping much, much more fashionable.
I noticed too that now that the old dinosaur incumbents in various industrial complexes can no longer compete, they want to get rid of non-compete clauses - so they can instead poach talent, or at least those who had access to the latest technologies and processes of actually innovative companies.
It's been collapsing since digital piracy first appeared, then was perpetuated by countries being selective about what kinds of "intellectual property" rights they choose to respect.
"People with money" are a crucial milestone, because they were the ones who were actually actively benefiting from and upholding this institution.
Except it's not collapsing. The only legal changes have been to allow billionaires to do whatever they want whenever they want, and have been made by judges and not legislatively. You're still going to get sued to oblivion.
edit: if we let them, they're just going to merge with the media companies and cross-license to each other.
There is a whole wide world beyond just obsessing over "billionaires" in a tiny us-centric corner of the world.
One could point out China, which totally has so much respect for someone's notion of legality. Or France, which has never cared for software patents. Or one could point out good old pirates, who were always relatively successful at giving the middle finger to the notion of "intellectual property". "Billionaires" are simply another straw, peculiar only in the sense that they were the pillar upholding this institution.
And speaking of "not legislatively, but judges":
1. Not every country has a common law legal system
2. Just take a look at Japan's AI legislation.
Is it really? It all boils down to power play, really. And lately lots of powerful entities, from nations to corporations, seem poised to disrupt this institution.