Affected companies are becoming increasingly frustrated with the army of AI crawlers out there, because they won't stick to any scraping best practices (respecting robots.txt, using public APIs, avoiding peak load).
It's not necessarily about copyright, but the heavy scraping traffic also leads to increased infra costs.
What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.
The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so the pretend human is still rate limited. Eventually you need an account, that account also gets tracked, accounts matching suspicious patterns get purged, and so on. This will not stop scraping, but the point is not to stop it; the point is to make it expensive and slow. Eventually it becomes expensive enough that it's better to stop pretending to be human, pay for a license, and then the arms race goes away.
Can defenses be good enough that it's better not to even try to fight? That's a far harder question than whether a random bot can make a dozen requests while pretending to be human.
Make it easier to get the data, put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome than scraping, or if you refuse to offer a legitimate option at all.
Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.
Well, it isn't a case of piracy, is it? The data exists on the website, for free, under the assumption/social contract that you are a human, not an agent of a shady enterprise wasting the bandwidth. An analogy would be the game itself being put out for free on itch.io, but then downloaded and unpacked to make an asset flip.
Seems to me we might eventually hit a point where stuff like API access is whitelisted. You will have to build a real relationship with a real human at the company to validate you aren't a bot. This might include an in-person meeting, as anything else could be spoofed. Back to the 1960s business world we go. Thanks, technologists, for pulling the rug out from under us all.
Scraping implies not API - they're accessing the site as a user agent. And whitelisting access to the actual web pages isn't a tenable option for many websites. Humans generally hate being forced to sign up for an account before they can see this page that they found in a Google search.
Scraping often uses the same APIs that the website itself does, so to make that work a lot of sites will have to put their content behind authentication of some sort.
For example, I have a project that crawls the SCP Wiki (following best practices, ratelimiting, etc). If they were to restrict the API that I use it would break the website for people, so if they do want to limit the access they have no choice but to instead put it behind some set of credentials that they could trace back to a user and eliminate the public site itself. For a lot of sites that's just not reasonable.
You can't whitelist and also have a consumer-facing service. There is no reliable way to differentiate between a legitimate user and the AI company's scraper.
Yep, it reminds me of the Ferrari almost-scam that was thwarted because the target thought to verify by asking about something that was only shared in-person.
I could definitely see this. I worked for a company that had a few popular free inspector tools on their website. The constant traffic load of bots was nuts.
I don't know if the AIs have an endgame in mind. As for the humans, I think it's an internet built for a dark forest. We'll stop assuming that everything is benign except for the malicious parts, which we track and block. Instead we'll assume that everything is malicious except for the parts which our explicitly trusted circle of peers has endorsed. When we get burned, we'll prune the trust relationship that misled us, and we'll find ways to incentivize the kind of trust hygiene necessary to make that work.
When I compare that to our current internet the first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.
Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.
Right. If your use case for the internet is exerting influence over people who don't trust you, then it's past time that we shut you down anyhow.
For everyone else, this transition will not be a big deal (although your friends may ask you to occasionally spend a few cycles maintaining your part of a web of trust, because your bad decisions might affect them more than they currently do).
Websites previously would have their own in-house API to freely deliver content to anyone who requests it.
Now, a website should be a simple interface for a user: it communicates with an external API and displays the result. It's the user's responsibility to have access to the API.
Any information worth taking should be locked away behind authentication - which has become stupid simple using OAuth with major providers.
So these people trying to extract content by paying someone or using a paid service should rather use the API which packages it for them and is fairly priced.
Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.
AI (and greed) has killed the open freedoms of the Internet.
The open web is on a crash course. I don't necessarily believe in copyright claims, but I think it makes sense to aggressively prosecute scrapers for DDoSing.
An optimistic outcome would be that public content becomes fully peer-to-peer. If you want to download an article, you must seed at least the same amount of bandwidth to serve another copy. You still have to deal with leechers, I guess.
There is no reason to protect against bots using regular captchas (it seems I'm weaker than your average bot at passing those). Brave Search has a proof-of-work captcha, and every time I face it I'm glad it's not Google's pick-the-bicycle one. Having a captcha be a heavy process that runs for a couple of seconds is a minor nuisance to me, who needs to complete it once a day, but for someone who has to do it many times while scraping, the costs add up rather quickly. And its fundamental mechanism makes its effectiveness independent of how much progress AI has made.
Also, maybe the recent rise in captcha difficulty is not companies making them harder to stop bots, but bots skewing what counts as the right answer. As I understand it, captchas are graded against other users' answers, so if a huge portion of those other users are bots, they can fool the algorithm into accepting their wrong answer as the right one.
You can rather easily set up semi-hard rate limiting with a proof-of-work scheme. It only trivially affects human users, while bot spammers have to eat the cost of a million hash inversions per hour or whatever.
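For what it's worth, here's a minimal sketch of such a scheme (the difficulty constant and challenge format are illustrative assumptions, not any particular site's implementation):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # ~2**20 (~1M) hashes of work per challenge on average; tune per deployment


def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash has DIFFICULTY_BITS leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking costs one hash; producing a valid nonce costs ~2**DIFFICULTY_BITS."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)           # noticeable but tolerable once per session for a human
    assert verify(c, n)    # near-instant for the server, on every request
```

The asymmetry is the whole point: one hash to verify, roughly a million to produce, which a human pays a handful of times a day but a scraper pays on every single request.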
Many would oppose the idea, but if any service (e.g. eBay, LinkedIn, Facebook) were to dump the snapshot to S3 every month, that could be a solution. You can't prevent scraping anyway.
We publish a live stream of minutely-updated OpenStreetMap data in ready-to-digest form on https://planet.openstreetmap.org/ and S3. Scraping of our data still happens.
Our S3 bucket is thankfully supported by the AWS Open Data Sponsorship Program.
Would the snapshot contain the same info (beyond any doubt) that an actual user would see if they opened LinkedIn/Facebook/some service from Canada on an iPhone on a Saturday morning (for example)? If not, the snapshot is useless for some use cases and we are back to scraping.
How long before companies start putting AI restrictions on new account creation simply because of the sheer amount of noise and storage issues associated with bot spam?
> AI can already solve captchas, so the arms race for bot protection is pretty much lost.
Require login, then verify the user account is associated with an email address at least 10 yrs old. Pretty much eliminates bots. Eliminates a few real users too, but not many.
You can't cache this stuff for bot consumption. Humans only want to see the popular stuff. Bots download everything. The size of your cache then equals the size of your content database.
But you can still make sure that you save the data in a form where generating the served webpage takes the least amount of time. For most websites this means saving the HTML - in a giant cache or with a more deliberate pre-generation setup.
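A rough sketch of what that kind of pre-generation can look like (the article store and template here are made-up placeholders, not any particular CMS):

```python
from pathlib import Path

# Hypothetical content store: slug -> (title, body). In a real setup this
# would come from the CMS database at publish time.
ARTICLES = {
    "hello-world": ("Hello World", "<p>First post.</p>"),
    "about": ("About", "<p>Who we are.</p>"),
}

TEMPLATE = "<html><head><title>{title}</title></head><body><h1>{title}</h1>{body}</body></html>"


def pregenerate(out_dir: str = "public") -> None:
    """Render every page once when content changes; serving is then just a static file read."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for slug, (title, body) in ARTICLES.items():
        (out / f"{slug}.html").write_text(TEMPLATE.format(title=title, body=body))


if __name__ == "__main__":
    pregenerate()  # point nginx/Apache at ./public and the CMS never runs per request
```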
Clearly not the case with most websites. And "milliseconds" are already a huge amount of time. Video games simulate huge worlds and render complex 3D graphics within 16 ms, or even much less at the >60 fps frame rates expected these days.
You've gotten off topic here. The giant cache you speak of approaches the size of the content database when designing for the long tail of bots. A giant cache is uneconomic and thus not a solution, unless you're an AWS salesman.
Yes, the cache ends up being bigger than the content database, but for text content that's typically not a problem. The human effort to type some text always hugely exceeds the cost of a few kilobytes of flash to store what they typed in a ready-to-serve form.
The generation process of taking the raw text and assembling the page around it is typically rather expensive for most CMS systems. Sure, it isn't theoretically expensive, but unless you want to engineer a CMS from scratch, most people just pick one off the shelf and then end up paying the CPU-time overhead of WordPress etc.
Not at all. You can either design your website so that pages can be retrieved in sub-millisecond time (and that doesn't have to mean throwing money at cloud providers) or you can cry about bots.
I must be an outlier here, but I don't keep email addresses that long. After a couple of years they're on too many spam lists. I'll wind those addresses down, use them for a couple more years only for short interactions that I expect spam from, and ultimately close them down completely the next cycle.
You are definitely an outlier in that you abandon email addresses deliberately. But many people do not have an old address simply because they lost access to their previous ones for one of many possible reasons, the most common one being that it was provided with a business relationship (e.g. ISP contract) that no longer exists.
That's before even getting into how you'd possibly verify email address age, especially without preventing self-hosting.
That doesn't seem remotely compatible with modern privacy laws like the GDPR. And it certainly adds even more false negatives: people locked out because they didn't have their email leaked long enough ago.
Which doesn't include collecting information about your email and then handing it off to random third parties just because they want to use it for that.
And how exactly do you block scraping IPs? I suspect some of the scrapers are just confused and not aware of better ways to get OSM data.
Responding with a 403 error code will only lead to them changing their IP addresses.
A more effective approach might be to provide a response containing instructions on where to download data in bulk or a link to a guide that explains how to process OSM dumps.
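A minimal sketch of that idea, assuming a Flask front end and a placeholder looks_like_scraper() heuristic (both are illustrative; this is not how OSM actually handles it):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def looks_like_scraper(req) -> bool:
    """Placeholder heuristic; a real deployment would use rate counters, ASN lists, etc."""
    ua = req.headers.get("User-Agent", "").lower()
    return "python-requests" in ua or "scrapy" in ua


@app.before_request
def redirect_bulk_consumers():
    # Returning a response here short-circuits normal request handling.
    if looks_like_scraper(request):
        return jsonify({
            "error": "Please don't scrape this site page by page.",
            "bulk_downloads": "https://planet.openstreetmap.org/",
            "regional_extracts": "https://download.geofabrik.de/",
        }), 429
```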
You can literally set up your own OpenStreetMap instance in ten minutes. It's a simple 'docker run' command. Sure, indexing will take a bit, but even that can't take that long given their resources. That's just ridiculously greedy.
A while ago I very briefly tried Headway out of curiosity. This is the easiest Docker based option for the "full stack". It didn't work out of the box. Things went wrong. Which is no surprise, there's a ton of moving parts. And maybe it's not a big deal to work around but I highly doubt that it's 10 minutes of work to get everything working reliably.
I used OSMRouter maybe 7 or 8 years ago to process a couple billion routes, and it was about as simple as GP described. You just need Docker and an index file. The US one was massive, so I kept needing a bigger VM just to load it, but once I did I was able to make HTTP calls over the network to get what I needed. It took a few days to get a working setup but only a few hours to rip through billions of requests, and I was making them synchronously from R; it could have been much faster if I had been smarter back then.
I needed OSM data at one point and never managed to figure out how to do it the proper way. To get the data you need, you have to download massive 100 GB files in obscure formats and use obscure libraries. Info is scattered; there are HTTP APIs, but they're limited or rate-limited and it's not clear if you're supposed to use them.
I know I’m ignorant and I’m happy the project exists, but the usability in the era where devs expect streamlined APIs is not great.
I ended up using some free project that had pre-transformed osm data for what i needed.
That's kind of by design. Providing streamlined APIs requires a funding model to both host those APIs and pay an army of devops to maintain them. The OSM Foundation is intentionally small and doesn't do that. Rather, it encourages a decentralised ecosystem where anyone can take the data and build services on it - some commercial, some hobbyist, some paid-for, some free. It works really well, and IMO better than the big-budget maximalist approach of the Wikimedia Foundation.
If you're talking about the new-ish data dumps provided in protobuf format, this is a heavily optimised binary format. OrganicMaps uses these files directly to be able to store and look up whole countries locally. With this format, the dump for France is only 4.3 GB at the time of writing.
Also, instead of downloading the whole map, you can use one of the numerous mirrors like Geofabrik [0] to download only the part you're interested in.
[0] https://download.geofabrik.de/
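For what it's worth, reading one of those .osm.pbf extracts doesn't take much code. A minimal sketch using the pyosmium library (assuming `pip install osmium` and a downloaded extract; the filename and the amenity=cafe tag are just examples):

```python
import osmium  # pip install osmium


class CafeCounter(osmium.SimpleHandler):
    """Stream every node in the extract and count those tagged amenity=cafe."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def node(self, n):
        if n.tags.get("amenity") == "cafe":
            self.count += 1


handler = CafeCounter()
handler.apply_file("bremen-latest.osm.pbf")  # hypothetical extract downloaded from Geofabrik
print(handler.count, "cafes found")
```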
What non-obscure formats or libraries would you suggest for a planet's worth of geographic data?
I've also downloaded planet.osm before and parsed it on my desktop with, IIRC, Osmosis. I never used that format or tool anywhere else, but it's not like OSM has many competitors offering large amounts of geospatial data in a freely usable way. What do you consider established mechanisms for this?
On https://www.openstreetmap.org/, click "Export" (upper-left). It lets you choose a small rectangle (click "Manually select a different area"). It gives you a .osm right from the browser.
For a literal single point: among the map icons on the right, one is an arrow with a question mark ("Query features"). With this you can click on individual features and get their data.
> I ended up using some free project that had pre-transformed osm data for what i needed.
That seems close enough to "the proper way". The OSM core devs can concentrate on providing the data in the format that existing OSM front ends are optimised to work with; if you want it transformed into some other popular format then it's great that the ecosystem already has free projects that will do that for you.
13-15 years ago I was able to download the OSM data for my country, import it into Postgres (PostGIS), run GIS queries on it, then render and print my own maps. I don't remember it being difficult, though it did indeed require lots of disk space.
OP here. The toot was my sarcastic response after having to rate-limit and block another set of abusive scrapers aggressively hitting our website and mapping API. robots.txt be damned.
OpenStreetMap data is free to download. We publish minutely updates on https://planet.openstreetmap.org/ and the data is available via AWS S3 + torrent.
Not dissimilar to how, instead of just cloning my pre-compressed repos in a simple few-seconds operation, "AI" scrapers prefer to request every single revision of every single .c file through the web interface, with all the (for them) useless bells and whistles.
A web interface which I have set up as CGI, so it will take them longer than the age of the universe to finish scraping. But in the meantime they waste my power and resources.
Someone recently pointed out that Aaron Swartz was threatened with prison for scraping; meanwhile there are hundreds of billions of dollars right now invested in AI LLMs built from... scraping.
Put planet.osm on torrent. Allow "scraping" only through torrent. Now scrapers are sharing the network load between themselves. Not to mention improved network speed, as probably they all sit on the same AWS instance.
Our data is already published via torrent, see https://planet.openstreetmap.org
Our data, including minutely updates is available via public S3 buckets (EU & US) supported by the AWS Open Data Sponsorship Program.
As expected. Now for the hard part: how to redirect scrapers to use torrents, and keep them uploading data? Should we write some "data scraping tutorials" specifically targeting OSM and describe using torrents as the best method?
Once, during a tech interview, the interviewer asked me to design a system for daily scraping English Wikipedia. I started with "Let's download the gzipped archive...". It turned out that the interviewer didn't know about that possibility and was waiting for the description of a complex system to download it page by page with multi-threading, canonical URLs, checking of visited pages, retries and so on. To their credit, they gave me "A" for this assignment. In the end, I got the job.
The people working for these companies are just clueless, arrogant, ignorant, unaware of others, just trying to hit some productivity target to get promoted. Of course they're not going to bother checking whether there are other ways to do something that avoid annoying open source projects.
In my company it’s easier to buy commercial software for $10000 than it is to donate $100 for open source voluntarily. I think they need to open up a store where you can buy donations disguised as licenses so the bean counters don’t even realize this could be free.
Same here. We are all Linux users and people linked articles about openssl being underpaid and all that (back when that was a topic), but after migrating from a paid chat tool to Signal, nobody agreed with me that we should maybe donate to Signal now. Both chat solutions are open source SaaS, but the former tool has a paid subscription for commercial use (which does nothing more than the personal version) whereas Signal calls it a voluntary donation. I still don't understand my colleagues' opinion
Paying thousands of euros for some enterprise Java software as well as Microsoft Office licenses we barely ever use, meanwhile: no problem
Never heard of any NTA or of business expenses being converted into money to myself when it's not on my account. Looking up what an NTA is in relation to taxes, they're a nonprofit themselves and don't appear to have authority of any kind besides PR https://en.m.wikipedia.org/wiki/National_Tax_Association
Regardless, if this were a concern in Germany, I'm sure our boss/director would have mentioned it on that call as a simple reason rather than finding excuses and saying we could pick a different nonprofit who needs it more to donate to
Companies donate all the time... this argument about it being considered income makes no sense, and if it did, just donate {income tax rate}% less if the company can't afford more, and there's no problem either.
That’s not my experience. We have systems that cost way into the six digits. You get an account manager that follows up and schedules a lot of meetings. But this doesn’t mean that difficult problems actually get resolved. Just a lot of talk. Even MS super premium support so far has never worked for me. They want an enormous amount of data and then go quiet.
"We're not arrogant, we just can't be bothered to do things any way other than the way we expect them to be done, even if that annoys others, and volunteers working for free should also accommodate our expectations."
If it is Russian and Chinese AI companies, they are unlikely to care. If it is Western companies, they might care, because they have more respect for the rules. A faked user agent is unfortunately usually an indication of the former.
Kind of sad that CommonCrawl, or something like it, has not removed the need for tons of different companies hitting all the servers in the world.
I guess part of it is wanting more control (more frequent visits, etc) and part is simply having lots of VC money and doing something they can do to try and impress more investors - "We have proprietary 5 PB dataset!" (literally adds nothing to commoncrawl).
I guess since it's posted to osm.town Mastodon, this is assumed to be known. Was surprised to see it without context here on HN; I can understand the confusion. Apparently most people here are already aware that one can download the full OpenStreetMap data without scraping
How come, in this era of no privacy, pixel trackers, data brokers, etc can they not easily stop scraping? Somehow bots have a way to be anonymous online yet consumers have to fight an uphill battle?
Because - and I cannot stress this enough - they are both ignorant and greedy.
Whenever I've traced back an AI bot scraping my sites, I've tried to enter into a dialogue with them. I've offered API access and data dumps. But most of them are barely above the level of "script kiddies". They've read a tutorial on scraping so that's the only thing they know.
They also genuinely believe that any public information is theirs for the taking. That's all they want to do; consume. They have no interest in giving back.
I don't know that this take is wrong, per se, but I think it's possibly a situation where the "actor with a single mind" model of thinking about corporate behavior fails to be particularly useful.
Scraping tends to run counter to the company's interests, too. It's relatively time-consuming - and therefore, assuming you pay your staff, expensive - compared to paying for an API key or data dump. So when engineers and data scientists do opt for it, it's really just individuals following the path of least resistance. Scraping doesn't require approval from anyone outside of their team, while paying for an API key or data dump tends to require going through a whole obnoxious procurement process, possibly coordinating management of said key with the security team, etc.
The same can be said for people opting to use GPT to generate synthetic data instead of paying for data. The GPT-generated data tends to be specious and ill-suited to the task of building and testing production-grade models, and the cost of constant tweaking and re-generation of the data quickly adds up. Just buying a commercial license for an appropriate data set from the Linguistic Data Consortium might only be a fraction as expensive once you factor in all the costs, but before you can even get to that option you first need to get through a gauntlet of managers who'll happily pay $250 per developer per year to get on the Copilot hype train but don't have the lateral thinking skills to understand how a $6,000 lump sum for a data set could help their data scientists generate ROI.
> Scraping tends to run counter to the company's interests, too. It's relatively time-consuming - and therefore, assuming you pay your staff, expensive - compared to paying for an API key or data dump.
It's the other way around, isn't it? Scraping is fairly generic and requires little staff time. The servers/botnet doing the scraping need to run for a while, but they are cheap or stolen anyway. APIs, on the other hand, are very specific to individual sites and need someone competent to develop a client.
Torrent users who do seed (assuming it’s copyrighted material) are no better. They’re just stealing someone else’s content and facilitating its theft.
If a company scrapes data, and then publishes the data for others to scrape.. they are still part of the problem — the altruism of letting other piggyback from their scraping doesn’t negate that they essentially are stealing data.
Stealing from a grocery store and giving away some of what you steal doesn't absolve the original theft.
> assuming it’s copyrighted material [...] They’re just stealing someone else’s content and facilitating its theft.
All content created by someone is copyrighted by default, but that does not mean it is theft to share it. Linux ISOs are copyrighted, but the copyright allows sharing, for example. But even in cases where this is not permitted, it would not be theft, but copyright infringement.
> the altruism of letting other piggyback from their scraping doesn’t negate that they essentially are stealing data.
It does. OpenStreetMap (OSM) data comes with a copyright licence that allows sharing the data. The problem with scraping is that the scrapers are putting unacceptable load on the OSM servers.
> Stealing from grocery store and giving away some of what you steal doesn’t absolve the original theft.
This is only comparable if the company that scrapes the data enters the data centre and steals the servers used by the OpenStreetMap Foundation (containing the material to be scraped), and the thing stolen from the grocery store also contains some intellectual property to be copied (e.g. a book or a CD, rather than an apple or an orange).
“Last night somebody broke into my apartment and replaced everything with exact duplicates... When I pointed it out to my roommate, he said, "Do I know you?”
― Steven Wright
This is similar to the argument used against people stealing music and movies — people would pirate content that someone else invested money to create. But the dominant attitude prior to the ubiquity of streaming, among the tech "information should be free" crowd, was that torrents of copyrighted material were perfectly fine. This is no different. But it is different — when it's your resources that are being stolen/misused/etc.
My opinion is that if you are building a business that relies on someone else’s creation — that company should be paid. This isn’t just about “AI” companies — but all sorts of companies that essentially scour the web to repackage someone else’s data. To me this also includes those paywall elimination tools — even the “non profits” should pay — even if their motives are non-profit, they still have revenue. (A charity stealing food from the grocery store is wrong, a grocery store donating to a charity is a different thing.)
However another aspect of this is government data and data created with government funds — scientific research for example. If a government grant paid for the research, I shouldn’t have to pay Nature to access it. If that breaks the academic publishing model — good. It’s already broken. We shouldn’t have to pay private companies to access public records, lawsuit filings, etc.
Anything they can, judging by the fact that they're hitting random endpoints instead of using those offered to developers. Similar thing happened to readthedocs[1] causing a surge of costs that nobody wants to answer for.
In the readthedocs situation there was one case of a bugged crawler that scraped the same HTML files repeatedly to the tune of 75 TB; something similar could (partially) be happening here with OSM.
Street names and numbers, businesses etc. associated with those streets, and stuff like that.
Say you have some idea, like...you want to build a tool that aids cargo ships, fishing vessels, or other vessels with the most efficient route (with respect to fuel usage) between ports.
The first thing you need to do is to map all ports. There may not exist any such pre-compiled list, but you could always use map tools like OSM to scan all coastlines and see if there are any associated ports, docks, etc. there.
Then when you find one, you save the location, name, and other info you can find.
This is pure brute force, and can naturally be quite expensive for the providers. But since you're a dinky one-man startup with zero funds, that's what you do - you can't be bothered with searching through hundreds (to thousands) of lists in various formats, from various sites, that may contain the info you're looking for.
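For what it's worth, that particular brute-force scan usually isn't even necessary: harbour-like features are already tagged in OSM and can be pulled in bulk from the public Overpass API. A rough sketch (the tag choices harbour=yes and leisure=marina and the bounding box are assumptions; check the OSM wiki for the tags your use case actually needs):

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: harbours and marinas inside an illustrative bounding box
# (south, west, north, east) roughly covering the British Isles.
QUERY = """
[out:json][timeout:180];
(
  node["harbour"="yes"](49.0,-11.0,61.0,2.0);
  way["harbour"="yes"](49.0,-11.0,61.0,2.0);
  node["leisure"="marina"](49.0,-11.0,61.0,2.0);
  way["leisure"="marina"](49.0,-11.0,61.0,2.0);
);
out center;
"""

resp = requests.post(OVERPASS_URL, data={"data": QUERY})
resp.raise_for_status()
for element in resp.json()["elements"]:
    tags = element.get("tags", {})
    lat = element.get("lat") or element.get("center", {}).get("lat")
    lon = element.get("lon") or element.get("center", {}).get("lon")
    print(tags.get("name", "<unnamed>"), lat, lon)
```

A Geofabrik extract plus pyosmium (as sketched further up the thread) works just as well if you'd rather not lean on a shared public endpoint.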
And like most bad things that companies do, it happens inevitably. Anyone who objects to something that generates income will be fired and replaced with someone who won't object.
Meanwhile, the owners will maintain and carefully curate their ignorance about any of those subjects.
Counter-argument: it's just an excuse to get rid of scraping. Google and every search engine scrapes websites, the Internet Archive scrapes websites to archive stuff, and I scrape data when using Excel to import data. Also, there are people who want to archive everything.
I've had my own stuff scraped. My biggest issue was bandwidth, but I wasn't running a big site so it wasn't a big problem.
If their scraper is sufficiently diligent, they will also download the data dump. The ultimate hope is that when the AI wakes up, it will realize that its training data has both a fragmented dataset and also a giant tarball, and delete the fragments. This sounds like one of those situations where people prefer amortized cost analysis to avoid looking like fools.
(unfortunately the historical example is poor, since Philip II proved both able and willing to make it happen, whereas AI has no demonstration of a path to utility)
I mean, is it though? LLMs existed even before GPT-4... and not just from OpenAI.
I get not liking crawling, and I hate OpenAI for how they ruined the term open source, but this is not new.
I've had stuff scraped before and I've done web scraping as well. Hell, even Excel will help you scrape web data. While the increase in training data has helped models like GPT-4, it's not just a factor of more data.
From my perspective as a small blogger, the general sentiment is: yes, web scraping was a thing before the AI boom, but now it is an even bigger and far more careless thing. Just my two cents, though, mostly formed from chatting with other small bloggers about it. I suppose it could be argued that the size of the vehicle does not matter to the roadkill; they still get run over. But it honestly feels like the various AI platforms suddenly made web scraping much, much more fashionable.
I noticed too that now that the old dinosaur incumbents in various industrial complexes can no longer compete, they want to get rid of non-compete clauses - so they can instead poach talent, or at least those who had access to the latest technologies and processes of actually innovative companies.
It's been collapsing since digital piracy first appeared, then was perpetuated by countries being selective about what kinds of "intellectual property" rights they choose to respect.
"People with money" are a crucial milestone, because they were the ones who were actually actively benefiting from and upholding this institution.
Except it's not collapsing. The only legal changes have been to allow billionaires to do whatever they want whenever they want, and have been made by judges and not legislatively. You're still going to get sued to oblivion.
edit: if we let them, they're just going to merge with the media companies and cross-license to each other.
There is a whole wide world beyond just obsessing over "billionaires" in a tiny us-centric corner of the world.
One could point out China, which totally has so much respect for someone's notion of legality. Or France, which has never cared for software patents. Or one could point out good old pirates, who were always relatively successful at giving the middle finger to the notion of "intellectual property". "Billionaires" are simply another straw, peculiar only in the sense that they were the pillar upholding this institution.
And speaking of "not legislatively, but judges":
1. Not every country has a common law legal system
2. Just take a look at Japan's AI legislation.
Is it really? It all boils down to power play, really. And lately lots of powerful entities, from nations to corporations, seem poised to disrupt this institution.