So you want to scrape like the big boys (2021) (incolumitas.com)
321 points by aragonite 14 days ago | 159 comments



I'm a lawyer that works in the web-scraping space, and I always chuckle when I read threads like this. Almost every company that we now consider a monopolist (or their affiliates) in the tech space used scraping as part of their process to build their business, and almost every one of those same monopolists now prohibits startups and competitors from scraping their data (which, invariably, is not actually "their" data in any sort of legally cognizable sense). And so perhaps the ethics of web scraping are not so straightforward. And neither are the legal issues associated with it.

I wrote an article about that last fall that got some attention here.

https://news.ycombinator.com/item?id=37264676


Same thing with Facebook and identity. IIRC they leveraged Google’s address book to get traction, but will go after you if you try to store FB social graph data long term for anything outside their garden.

You try to block the tricks you used to get growth, basically.


> And so perhaps the ethics of web scraping are not so straightforward.

It strikes me that the _ethics_ of web scraping are extremely straightforward and cognizable with a terse analysis:

* You can respond however you like to my HTTP request, and I can parse your response however I like.

Simple, traditional, common. This is the way that conversations have occurred since the dawn of human communication, no?

> the legal issues associated with it.

But aren't these, without exception, fabrics spun out of the cloth that shields established players with the threat of state violence? This is not particularly new, and seems to fit in the pathetic-and-predictable file.

Moreover, the broader cheap attempt to cast this in "intellectual" property terms, and to attach that to protection of artists and creators, warrants a very particular eye-roll for its illogic.


Do you apply these ethics to web scraping only, or to all other network communications too?

Because if those are your general principles, you are making the internet much shittier. I still remember the old internet with open SMTP servers, easy-to-use comment forms, and forums which did not require emails and captchas. But people with the "You can respond however you like to my HTTP request" attitude ruined it with spam, scams and SEO.

If you only apply this to web scraping, then where do you draw the line and why? Can you scrape at the maximum rate the server can support? Can you scrape if it requires active action (like account creation)? As long as you scrape, can you also post some links to improve your SEO?


> But people with the "You can respond however you like to my HTTP request" attitude ruined it with spam, scams and SEO.

I don’t see how those things relate. They all have separate ethical issues. You can believe it’s ok to scrape whatever info you can find online at the same time as believing it’s not ok to scam people.


> Do you apply this ethics to webs scraping only, or to all other network communications too?

I mean... if you're keying in at 20MHz and blasting a gigawatt of noise, then yeah you've certainly run afoul of decency and just law. You're changing the physical shape of the network environment.

But if the concern is just that we don't like the bytes to which your signal decodes, or we don't like what you're doing with the response we give you, then it seems more like a speech/press issue.

The internet needs to grow resilience such that annoyances in the logical layers are easy to ignore if you have the will. But that almost certainly means that you don't get to police what people do with the content you willingly hand over, pursuant to the protocol in use.


If I say, “Hey, please don’t text me anymore. I’m going to block this number,” and you respond by buying 500 phones in five cities and text me nonstop, is that ethical?


Not sure the metaphor works here. For example most sites let Google scrape them as much as it likes, but go out of their way to block other robots. By doing so they are effectively forcing the whole world to use (or support, since smaller search engines have to piggyback on the big ones with special status, and pay them) proprietary spyware.

In your analogy, most websites block everyone except the biggest pervert known to man.
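
Concretely, the pattern is a robots.txt along these lines (a sketch, not any particular site's actual file):

  User-agent: Googlebot
  Allow: /

  User-agent: *
  Disallow: /

Googlebot gets the run of the place; everyone else is told to go away.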


Isn’t that a choice the website owner should be able to make?


Of course it's your choice to make.

Is someone forcing you to respond to requests you'd prefer to ignore?


Yes, people like the OP, who run farms of scrapers.

The website owners make their preferences clear with robots.txt, IP blocks and other anti-bot technology. Scrapers intentionally ignore owners' desires and force them to respond.


If crawlers are stealth DDoSing my site then I lose the ability to respond entirely.

It's your job to separate the wheat from the chaff at the boundary of your network interface. In fact, personal boundaries of all sorts, from informational to emotional to physical to economic, are of paramount importance in the information age.

Nobody (and certainly not the state) is going to erect your personal boundaries for you by ensuring justice in the face of spammy text messages (or, for that matter, hypnotic and manipulative social media). This is your job - maybe your most important job.

Just as it's your job to protect your personal health and safety. Nobody (and certainly not the state) is going to do that for you.

Is there something about the trajectory of evolution of the internet that suggests to you that this is incorrect?

I observe continually (seemingly perpetually) increasing traffic, and continually (seemingly perpetually) increasing capacity for general purpose computing. I also observe enormous empathy and cyberpunk traditions in our communities, protecting each other. Do my eyes and ears deceive me?


Restraining orders are a thing for a reason. It's cheaper to harass someone out of business (intentionally or otherwise) than to compete on a level playing field.

Being a good neighbor requires restraining oneself and making requests with consideration for the other party.

Full disclosure: I worked for a price monitoring service that prided itself on crawling up to every 3 hours. Steps were always taken to mitigate the impact. Sometimes even asking hosts to allow-list the crawlers.


> Restraining orders are a thing for a reason.

Sure, but for the purposes of this conversation, saying "for a reason" regarding a function which is presently delegated to the state is fraught with all sorts of future-proofing concerns.

It seems to me that, as a baseline, we have to agree to observe the apparent trend of the internet to supplant the state - to resist its censorship and influence almost entirely - as an indicator that our long-term thinking needs to put those relatively few state functions which are essential to a peaceful society (such as restraining orders) in the purview of the internet... somehow. Maybe that will prove to be unnecessary, but in the case that the state fades, we'll be happy we had the foresight.

Internet traffic is barely (and arguably, already not) under human control as it is. And in another century, it will almost certainly be impossible to tell the machines 'enhance your calm or else'. Or else what?

I agree wholeheartedly with your points about being a good neighbor. But I don't think they extrapolate the way you think they do.

Consider this: at every moment, your house - your literal dwelling - is bombarded with high-level, semantic radio traffic, from way down where the messages bounce off the ionosphere all the way up to 10GHz and beyond. But this doesn't bother you. You ignore what you don't need! You draw boundaries and personally work on strengthening them - with the help of your friends and neighbors.

The internet needs help taking this shape at the application layer (and really, at all layers). And that part is up to us. We can't just throw our hands up and say "<legacy state function> exists for some reason, doesn't it?"


The government is our tool for regulating society when self regulation fails. It may be a blunt instrument and a last resort. Yet there is a place for it. We cannot entirely outsource all boundaries to individuals and private institutions.

I agree it would be ideal if the Internet could be as opt-in and benign as you suggest. Though I'm not even sure such an architecture is possible. How do you drive down the cost of listening and filtering to near zero whilst still allowing the desired signal?

And even if it were possible, consider that we do rely on governments to regulate the limited radio spectrum that we all have to share. Otherwise it wouldn't be an option to opt in to. The signal would be drowned out by whoever has the strongest transmitters.


> The government is our tool for regulating society when self regulation fails. It may be a blunt instrument and a last resort. Yet there is a place for it. We cannot entirely outsource all boundaries to individuals and private institutions.

I don't know who "our" refers to here, but if humans are evolving into "the internet", or however you want to think of this creature which is emerging over the course of this century (and appears wont to accelerate over the next few centuries), then I don't think the state is "ours". We can't just cover our eyes when presented with the proclivity of the internet not to tolerate the state.

> I agree it would be ideal if the Internet could be as opt-in and benign as you suggest. Though I'm not even sure such an architecture is possible. How do you drive down the cost of listening and filtering to near zero whilst still allowing the desired signal?

Cryptography.

> And even if it were possible, consider that we do rely on governments to regulate the limited radio spectrum that we all have to share. Otherwise it wouldn't be an option to opt in to. The signal would be drowned out by whoever has the strongest transmitters.

...really? Do you really believe that the state is a force for coordination and openness in radio?

The only bands which reliably continue to have these characteristics are the amateur bands, which have been defended by users for decades against constant encroachment by a state which, if it had its druthers, would've sold these bands to AT&T a long time ago.

My sense is that, if the government thought we weren't watching, they'd simply cancel the amateur radio license program. It is people standing to be counted (by taking the test) that keeps these bands viable _despite_ the FCC, not the other way around.


I was a professional web scraper. I still keep up to date with the industry.

These days, you do not make money by doing web scraping; you make money selling services to web scrapers. There are tons of web scraping SaaS offerings and services out there, as well as dozens of residential proxy providers.

Most anti-bot mechanisms evolve so quickly that you can make a decent income just by working in a traditional software engineering role dedicated entirely to engineering anti-anti-bot solutions. As these mechanisms evolve rapidly, working for a web scraping company is more stable than pursuing web scraping as a profession.

Web scrapers get paid by the project, making it an unstable job in the long run. High-level web scraping requires operational investments in residential proxies and rented servers. Additionally, low-end jobs pay very little. Brightdata is hosting a conference on web scraping, which should indicate where the profit lies: selling services for large-scale web scraping.


I've long thought that the use of residential proxies for things like scraping and operating large-scale bot networks is a necessity, but I've never really dabbled in using them, so I've never confirmed my suspicions about how residential proxies are used at a scale like this. Do you know if insecure IoT devices and malware-infected consumer hardware are as common for this as one might think? I can't imagine it would either be profitable or even possible to work with an ISP to acquire residential IPs, which kinda leaves me thinking that the only option for a residential proxy service would be pretty clandestine.

If you just search for "residential proxy" you'll find a lot of them are basically Raspberry Pis or similar shipped to people who are then paid for the amount of traffic that goes through them. Others are agents running on users' computers; I suspect at least some of these proxy providers aren't overly thorough about due diligence on how that agent got installed.
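
For anyone curious what using one of these services looks like mechanically, it's usually just an authenticated HTTP proxy; the provider rotates the residential exit IP behind a single gateway. A minimal sketch in Python with requests (the gateway host and credentials here are made up):

  import requests

  # Hypothetical gateway; a real provider hands you a host:port plus
  # credentials and rotates the residential exit IP on their side.
  proxy = "http://customer-123:secret@gw.example-proxy.net:8000"

  resp = requests.get(
      "https://httpbin.org/ip",  # echoes back the IP the server saw
      proxies={"http": proxy, "https": proxy},
      timeout=30,
  )
  print(resp.json())  # a different residential IP per rotation

The hard part is on the supply side (where those exit IPs come from), not the demand side, which is about five lines of code.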

Is there a conference you would suggest that is the closest to scraping, generally speaking? As far as I know there isn't a scraping conference or strong community anywhere, and I'd like to learn and improve my skills.

The scientific aspects (algorithms, incl. implementations, performance evaluation) of Web crawling (including focused crawling) are covered by conferences like WWW, ACM SIGIR, BCS ECIR, ACM WSDM and ACM CIKM.

But you may refer to informal MeetUps or trade fairs; if so, google "Web Data Extraction Summit", "OxyCon Web Scraping Conference", "ScrapeCon 2024" (all past) or the forthcoming: https://www.ipxo.com/events/web-data-extraction-summit-2024/


The edge that every web scraper has is the knowledge they possess. In my opinion, conference presentations are usually too generalized or geared towards pitching services related to web scraping solutions.

There are some communities you can find on Discord and Telegram, and most professional web scrapers are pretty active on LinkedIn and Twitter. The fun communities are in fact small groups of people with shared values and interests.


I've been writing scrapers on Upwork for many years. I'm sick of doing project based work and want to work at/start a scraping SaaS. Any advice?


I would recommend checking Google to see if you can find any job openings. Please remember that it is a niche industry, so there may not be many companies currently hiring. But honestly, if you are looking to make a full-time living, consider choosing another niche, as web scraping jobs require you to consistently stay on top of your game. Most full-time jobs involve scraping data from big tech companies, and you are on your own to find solutions for bypassing anti-bot measures.

The irony is that before I realized it was so easy I would just open source the code - not on Github, mind you, since the likes of Akamai would DMCA pretty quickly, but playing a little bit of jurisdictional arbitrage I put it on Gitee - the Chinese copycat of Github. I don't have a background in any of this, but companies like to brag and it's not hard to put two and two together. It also was a practical way to enable me to place wagers on sports automatically - which was more or less my actual day job - and was pretty good for learning programming quickly in your late 20s.

Instead, almost immediately I got inundated by sneaker botters from China, and by others writing in English from somewhere that doesn't use it as a native language, judging from the idiosyncratic usage. I kept the code up for a bit but took it down - not because of any legal threats (good luck DMCA-ing a platform endorsed by the CCP; and even though I have no love for the party, I find the American attitude that in practice places intellectual property over real property - from my experience as a defense attorney - to be just as screwed up in terms of priorities, just a matter of degrees). What made me take it down was the fact that I did not want to work a customer service job, or really work for anyone, and judging by the requests, it mostly consisted of "you do the work but we'll split the profits", which I can't believe anyone would fall for.

But since the internet is forever, some parts of the code that specifically worked to emulate Cyberfed-Akamai from 0.8 to 2.3 are probably still floating around. My bad. I don't wear shoes normally - flip flops or nothing, after having to wear a suit to work for a decade - and have no idea about sneakers beyond what happens in NBA2K. That said, cybersecurity firms should be pretty ashamed of how much they charge, considering that their products were beaten by someone who learned to program in his mid-20s, put the result online within 3 years, and had it work - and I haven't even taken a math course since 11th grade, and had too much of an ADHD problem to watch videos or even read more than blog posts and documentation. Everything I learned, I learned by copying from Github and similar services until it worked. There must be a lot of snake oil being sold out there, maybe most of it, since the insidiousness of the whole thing is that selling bunk solutions seldom gets you in trouble anyway, while actual crime - rape, murder, robbery and the like - largely goes unaddressed because the police prefer to complain about culture war bs instead of actually, you know, doing their jobs. Who knew Judith Butler was THIS spot on.


Thank you very much for sharing your story. From what I know these days, sneaker bots as an industry have pretty much gone downhill. Not because of anti-bot measures, but because the entire industry has essentially shifted from retail stores to eBay resellers. Everyone is competing to buy the first batch to the point that it is not worth building a sneaker bot anymore.

How do you keep up with the industry?

It is kind of like Fight Club. There are 2-3 good communities that I lurk in. The people won't walk you through your scraping problems, but if you ask the right person politely, they often help.

Many residential proxy and scraping experts are pretty active on LinkedIn. But they do not talk about scraping data, just news around web scraping.


I’m really mixed on this. Anti-bot stuff is increasingly a pain point for security research. Working in this space, I have to work against these systems.

Threat actors use Cloudflare and other services to gate their payloads. That’s a problem for our customers who are trying to find/detect things like brand impersonation and credential phish. Cloudflare has been completely unhelpful. They just don’t care.


Seconding this. Evading detection has become a real cake-walk since threat actors are able to sign up for a free Cloudflare account and then put their phishing site on their 2-hour-old domain behind a level of protection backed by a $20B company. Funny that you almost never see phishing on Akamai ;)

Disclaimer: We operate in this space so we obviously have an interest in being able to detect these threats going forward.


Other than being the cheapest & easiest to use, is Cloudflare doing a particular evil here?

As a webmaster I don’t want non-user traffic except search engines. It’s a waste of money and often entails security, privacy and commercial risk.

Without Cloudflare I’d achieve only slightly less effective results using an AWS WAF, another CDN, or hand rolling solutions out of ipinfo etc.


Excuse my bias, as I work for IPinfo. Rolling your own bot detection service is something you should explore if you want near-absolute coverage.

We intentionally do not provide an IP reputation service, as many sophisticated bots mimic the "good reputation" aspects of IP addresses. Use of residential connections, or essentially being vetted by CDN/cloud services, makes bot detection ambiguous.

That is why we provide accurate IP metadata. Whenever you detect patterns of bot-like behavior, look up the metadata such as privacy service usage, ASN, or assigned company, and then start blocking them via the firewall.
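
A sketch of what that lookup looks like in Python (the token is a placeholder, and which fields come back depends on your plan):

  import requests

  TOKEN = "YOUR_TOKEN"  # placeholder

  def lookup(ip):
      # Fetch IP metadata; fields such as "privacy", "asn" and "company"
      # are plan-dependent.
      return requests.get(f"https://ipinfo.io/{ip}?token={TOKEN}", timeout=10).json()

  info = lookup("203.0.113.7")  # TEST-NET address, used as a stand-in
  privacy = info.get("privacy", {})
  if privacy.get("proxy") or privacy.get("vpn") or privacy.get("hosting"):
      print("candidate for a firewall block:", info.get("ip"))

The point is that the block decision stays yours; the metadata just informs it.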


They could police their content. Or if they don’t want to, they could meaningfully partner with the security industry - create a “security bots” program, respond to takedown requests in days not months, etc.

I suppose that Cloudflare scanning payloads for known malware could potentially be effective if they could make the performance work.

Closed partnerships programs are a bit concerning though. Once they’re up and running there’s an enormous economic incentive for CF to squeeze members with fees that capture the economic upside.


Cloudflare is the ultimate example of creating the problem and selling the solution.

I was under the (naive?) impression that Cloudflare was a SaaS startup poster child. Do you mind expanding on your comment?

Among other things, Cloudflare hosts DoS services while selling DoS protection.

Can you please elaborate with some examples?

I think you can get a bot allowed by all of Cloudflare at https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pR.... The blog post I read didn't make it clear if it would apply to all of Cloudflare or just customer sites though.

You can. Sort of. The good bots list is basically driven by a fixed user agent. And customers can set their preference to not allow “good bots”.

Not so good for security work.

It’s similar to their abuse reporting. They give your info to the site owner. Gee thanks, that’s just what I want to do.


Why not Akamai?


Cost.


I feel like we'll eventually arrive at some kind of micro-payment mechanism to solve this issue.

> Those companies employ ill-adjusted individuals that do nothing else than look for the most recent techniques to fingerprint browsers [...] When normal people are out drinking beers in the pub on Friday night, these individuals invent increasingly bizarre ways to fingerprint browsers and detect bots ;)

What's the deal with "ill-adjusted" and "normal people"? I'm gonna say it right now, the reason why these individuals do this is because it's way more interesting and fun than building some bullshit React website for some boring business for the 20th time (this is just an example, not attacking React here, no need to freak out)

It's fun because you get to solve an actual real-world challenge and find new ways to do something. Same with things like developing exploits. Those who do this are not "ill-adjusted", they are in fact normal people that do what they are passionate about.

The whole mentality of "anyone who does something I don't like is ill-adjusted" is just absolutely insane.


That entire paragraph is a joke. That’s why there is a little wink at the end.

It's not clear if it's a joke specifically because of the addition of "ill-adjusted"

That is the joke. It’s hyperbole.

Anti-bot stuff also seems to be a security and privacy threat: preventing users from accessing your site if they're using VMs, port scanning, various forms of fingerprinting.


I prefer the approach of an algorithmic challenge that forces the "new visitor" to spend some CPU cycles.

It's a clear process, doesn't involve privacy risks or strange sneaky games, and tends to fail in ways that a human can at least see and report, as opposed to mysterious outages.
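
A hashcash-style sketch in Python of what I mean (the difficulty number is invented; tune it so a phone solves it in well under a second):

  import hashlib, os, itertools

  DIFFICULTY = 20  # leading zero bits required of the hash

  def leading_zero_bits(digest):
      bits = 0
      for byte in digest:
          if byte == 0:
              bits += 8
          else:
              bits += 8 - byte.bit_length()
              break
      return bits

  def issue_challenge():
      return os.urandom(16).hex()  # server: fresh random nonce per visitor

  def solve(nonce):
      for answer in itertools.count():  # client: brute-force a suffix
          digest = hashlib.sha256(f"{nonce}:{answer}".encode()).digest()
          if leading_zero_bits(digest) >= DIFFICULTY:
              return answer

  def verify(nonce, answer):
      digest = hashlib.sha256(f"{nonce}:{answer}".encode()).digest()
      return leading_zero_bits(digest) >= DIFFICULTY  # server: one cheap hash

  nonce = issue_challenge()
  print(verify(nonce, solve(nonce)))  # True

The client does on the order of a million hashes; the server does one to verify. In practice you'd run the solver in the browser and tie the nonce to a session, but the asymmetry is the whole trick.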


... and also annoys people with slow hardware while costing serious scrapers very little?

How much CPU time can you burn so people on 3 year old phones can see it, and how much will it cost scrapers?


Even a very slight challenge is a problem for scrapers: they have to do it far more frequently.

It's better than captchas and whatever Cloudflare does, in terms of overall nuisance.
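
Back-of-envelope, with invented numbers: a challenge costing ~0.5 s of CPU once per session is imperceptible to a human, but a scraper fetching 10 million pages a day at one challenge per page burns ~5 million CPU-seconds, i.e. roughly 58 CPU-days, daily. That's a real line item for them and nothing for you.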


Discussed at the time:

Scrape like the big boys - https://news.ycombinator.com/item?id=29117022 - Nov 2021 (189 comments)


> Every website can access rotation and velocity data from Android data without asking for permission.

What????!!! That's nuts


Interesting. I'm busy building a project that requires scraping (at a pretty low rate).

I've been puzzling over what to do about the rejection cases. A single cheap Android might just fill the gap.


This tends to be a very unpopular opinion around here, but in almost all cases I find Internet scraping to be unethical and downright malicious. I'm not saying all cases, but I'm saying almost.

A lot of the actors involved tend to be hustle culture types who think they are OWED your data, regardless of the ethics, laws, being a good citizen, whatever. They will blatantly disregard terms of service and hide behind massive setups such as these to circumvent protection etc.

And the problem is, if you run any sort of business or service that is data oriented, there will be thousands of people that will do this, which will cause you to devote enormous amounts of time, effort, money, and infrastructure just to mitigate the issues involved with data scraping. That's before you are even addressing whether or not these people are "stealing" your data. People who feel they are entitled to the crux of your business aren't bothered by being nice in the way they take it - they'll launch services that will cripple infrastructure.

Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"

And if you think this was a problem before, it's exponentially worse over the past few months with every Tom, Susan, and Harry deciding they must have all your data to train their new LLM AI model. By the thousands.


I use web scraping to identify and monitor fraud.

Exhibit A: https://archive.ph/0ZUA8

This website is used to recruit people to set up "lead generation" Google Business Profiles and leave paid reviews.

Exhibit B: https://archive.ph/WWZuw

This is an example of the Craigslist ad used to initially attract people to the website above.

Exhibit C: https://archive.ph/wip/7Xig4

This is one of the Google Maps contributors which left paid reviews.

If you start with the reviews on that profile, you'll find a network of Google Business Profiles for fake service-area businesses connected through paid reviews.

Web scraping allows me to collect this type of data at scale.

I also use scraping to monitor the status of fake listings. If they are removed, the actor behind them will often get them reinstated. This allows me to report them again.


I don't care if you use Web scraping to solve the Israeli / Palestinian conflict. You're not entitled to anyone's data, computers, services, etc because you've decided for altruistic reasons that it is appropriate.

Cool use case. Love it. Fascinating stuff. But if Google told you to stop, would you? Or would you instead decide to build a 5 server cluster of 200 4G modems spread across continents to continue your work? Because if you did I would assume that you've decided to move on from a cute little altruistic process into a commercial use of someone else's data to make a profit.


Wait - so you are saying that information on the public internet isn’t public? Man, I wish people would remember the origin of the web and the entire reason it exists. If you don’t want information public, protect it - otherwise, I say it’s fair game.


Remember, the OP article is about a system that is designed to completely and directly circumvent protections.

If an organization puts a series of processes in place to prevent scrapers from wholesale taking data in violation of terms of service, and you develop a 5-server cluster of 200 4G modems, it's no longer "fair game" and you're directly being unethical in your use of someone else's services.


Yeah, I think it's fair to say that in the presence of anti-bot measures (whether they work or not), the content on the website isn't public anymore.

Available to someone meeting certain criteria (student discount, senior discount) doesn't mean available to anyone. I see no reason that "not available to be consumed by autonomous agents" is somehow invalid when "unlimited refills, available only to humans and not robots" is not.


I agree that there is a line at using someone else’s data to make a profit, but it is kind of ironic that you mention Google, because their exact business model is scraping websites to feed their search results and litter it with ads to make a profit. For me there is a big line between aggregating publicly available data (search results, reviews, news, job postings, etc.) and intentionally violating terms of service like signing up for fake accounts and harvesting user data. So entitled maybe not (sites can try to prevent you from scraping), but if you make something publicly available you shouldn’t be surprised when people use it in ways you may not originally have intended (within legal boundaries of course).


> I don't care if you use Web scraping to solve the Israeli / Palestinian conflict.

Maybe you should though. It's always worth it to think about which giant's shoulder you're standing on. It's giants all the way down.


> cute little altruistic process

Maybe it is not the opinion which is unpopular, but the way it is being presented.


> Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"

That's a lot of righteous anger for somebody building a business on top of other people's data.

"Broadcastify is the worlds largest source of public safety, aircraft, rail, and marine radio live audio streams."

I have no sympathy whatsoever. You're just complaining about the very thing you're doing. If it's fair for you to do that, it's fair for others to do it to you.


They volunteer to provide the data to us. Every single last one of them. Nowhere in our business model did we make the conscious decision to say "hey, look at that business, they have something, and I'm going to take it."


Reading public website data is not "taking it". It is still there.

Observing publicly available information is not theft, nor is it illegal.

Of course copyright rules apply, but those only apply if you reproduce something.


> reproduce something

No one is developing a 5-server cluster with 200+ 4G modems to observe publicly available information. They are using said cluster to deliberately work around blocks, rate limits, and restrictions on scrapers who are scraping content solely to reproduce the data and use it for commercial purposes (make money).


Aren't you also volunteering your data? Don't browsers just talk to your webserver and say "Hey, what do you have?" and your site responds in kind.

There's a lot of local history locked up in Facebook's nostalgia groups. I want to archive it in an open format.

I want to grab new rental listings and put them in an RSS feed, so I only look at each one once.

That's my uses for data scraping right now. If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.


Not that I think you shouldn't do it or you're doing something wrong, but describing it as a right rubs me the wrong way. You don't have any right to expect someone else's computers to work for you.


I'm not sure how to phrase it except in terms of competing rights, but I take your point.

At the point where I'm scraping, the data's on my computer though.


You could call them interests.

It's often in a business's interest to format data in a specific way to make money, for example interlacing it with ads.


Nice.


> If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.

Exhibit A


Yeah, it's as unsympathetic framing of my position as I can offer.

But it's basically the same question as adblockers: Can I do what I want with the 1's and 0's on my own machine?

I'm not going to accept that I owe anyone a business model.


I'm not going to disagree with your use case here.

But I'm going to assume that you have some level of a conscience and you don't really mean that you couldn't give three shits about someone else's hard work so long as you get some satisfaction at home. Because at face value that's exactly what you said.


No, I think that's fair. Unsympathetic framing, but not inaccurate. It's that whole "information wants to be free" thing.


BTW, kudos for presenting your point of view in a hostile forum and holding your own. I should have said that up front.

Is it unethical for a mouse to eat the cheese without triggering the trap?


> hustle culture types

It seems like you have this imaginary strawman that you hate and it seems like that's the foundation of why you dislike this.


No. The foundation of why I dislike it is simple. If I own some data, then I get to dictate the terms of how that data is used. Period.

“Hustle culture types” is simply a little anecdote about the types that would look you in the eye and tell you they are entitled to disregard what I said above. They’ll usually wrap it in some altruistic bs to justify as well.


Why do you put it on the open internet if you don't want machines to find and read it?

ToS is nice, but you can't expect that it applies - the user (of the machine doing the scraping) might be a child, which makes the potential contract automatically void, for example. Also, there are people under jurisdictions where such things have no power, or that don't recognize your rights to the data.

And the whole thing of putting data out publicly and then just expecting machines to see the pile of data and go "oh so where do I sign the ToS?" is weird...

Just put it behind a rate limited API key...


As an analogy, imagine that a gardener builds a beautiful flower garden, bisected by a cute stone path, which she invites the public to view freely, save for a single restriction; a sign reading "keep off the flower beds."

There is a well-understood social contract here. I should not drive my car along the path, even if I don't crush the flowers. I shouldn't walk on the flower beds, even if that sign isn't legally enforceable. And if a runaway lawnmower, RC car, or some other machine of mine does end up in the garden, I am responsible, because it was my machine.

With websites, there is even a TOS specifically for scrapers - robots.txt. The fact that it is easy to bypass or ignore is no excuse for actually bypassing or ignoring it.

The anonymity of the Internet functions as a ring of Gyges: since people don't face consequences (even social ones), they feel entitled to do as they will. However, just because you can do something does not mean you have a right to do something.
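
And honoring robots.txt costs a crawler almost nothing; Python even ships a parser for it in the standard library. A sketch:

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()  # fetch and parse the site's robots.txt

  # "MyCrawler" is whatever user agent string you crawl under.
  if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
      pass  # the owner permits this; go ahead
  else:
      pass  # the owner said no; stop here

The fact that compliance is this cheap makes willful circumvention all the harder to excuse.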


I think this analogy would be improved if the sign said "Please don't take any pictures." This is far more restrictive than a sign saying "Please don't take any seeds or cuttings." The latter is more understandable because such activity damages the flower garden (particularly if everyone starts taking seeds and cuttings).

Now let's say a photographer visits the flower garden, takes images, and sells them online as postcards. As long as the photographer is not hindering other people (flooding the site with repeat requests, in the analogy), it doesn't seem to be a problem.

On the other hand, let's say we don't have a flower garden, we have an art gallery or a street artist's display - or the pages of a recently published book. Now the issue is distributing copyrighted material without paying the creator... but what if there's a broad social consensus that copyright is out of control and should have been radically shortened decades ago?

The vast majority of data being scraped is not copyrightable creative work, however, so as long as you're not obnoxiously hammering a site, scraping seems perfectly ethical.


Robots.txt is definitely not any kind of ToS - some people (Google) said they will respect it. There's no reason to expect people to even know about the concept - practically nobody knows about it, not even most developers.

And again - there are countries where any ToS without explicit signature or other kind of legal agreement don't apply at all.

Just like writing "by using the toilet you agree to transfer your soul for infinity" on a piece of toilet paper taped somewhere in the vicinity of a toilet gives you nothing - even if it was a more reasonable contract, nobody agreed to anything.

As for your other point, I think this is more like standing next to a highway with a sign that reads "don't drive cars here" and expecting people to stop and turn around. They didn't even see your sign at their speed and it's kinda unreasonable to expect they would be checking for that kind of a sign on a highway. At least make it properly - big, red, reflective (e.g. a Connection Reset, or at least 403 Forbidden).


Yes, there is no legal enforcement mechanism behind robots.txt. Nor do I particularly want there to be. However, most people agree that reasonable requests made regarding the use of someone's property should be followed. The capability to do something without consequences is not the same as the right to do something.

Our gardener should not need to build a brick wall around their public garden to keep your lawnmower out.




Is it? Just ask around. I have web app devs around me, and they don't know it. Only those who actually specialize in web sites (for presentation) do.

I couldn't set up a web server to save my life and I know what robots.txt is.

Because you frequent this site where it's a very common topic. The devs around me often don't even speak English and don't care about Google that much.

What makes you think putting data on the Internet all of a sudden means I unilaterally surrender the rights to my intellectual property?

If I choose to make my data available to some businesses to make discovery of it easier, and I choose to decline to allow others to unilaterally copy my data to develop a different business, that's my right. And it is unethical and unreasonable for any other person to assume that they are entitled to the same rights I granted someone else.

If I own some data, I get to be the arbiter of the who/what/when/where on the use of the data. Period.


Sure, you can do whatever you like. Cut the connection if you don't like it. But I can do whatever I like too - read the data that your machine sent me, for example. If your machine sends my machine data it's IMHO reasonable to expect that you don't care about me having it unless we agreed otherwise. But in many countries ToS is not considered a legal contract at all - just having it on your site somewhere is not enough. Sometimes not even having users check the ToS checkmark would form a valid contract.

There are many kinds of data that can't be owned at all. Actually it's the other way around - there is a very small subset of data that can be owned. You can try to cover it under some kind of a non-disclosure clause in a contract, but again - a contract would have to exist.




What I'm saying is - your machine is fully capable of providing just the right amount of data to fulfill your purposes. If you don't like people taking it all, don't build a machine that gives it to them at 1 Gb/s. Stuff about some ToS or rights or IP ownership is just noise.


> What makes you think putting data on the Internet all the sudden means I unilaterally surrender the rights to my intellectual property?

Because intellectual property doesn't exist.


Scraping doesn’t imply IP violation.


Serving HTML will get you scraped. Your terms don't overrule fair use.


> If I own some data, then I get to dictate the terms of how that data is used. Period.

What if you got that data from me/users and I/we claim the same rights (like GDPR for example)? Will you still honour ownership as above?


If your business is just that you have a bundle of information and expose it over an open website, I’m not really sure how you’re able to maintain a mentality that you are somehow entitled to ownership of that information. You already put it out there, it’s now public, and any illusion of exclusivity is gone, because anyone could come along at any time and make a copy without your knowledge. A moral position on this issue is even more confusing to me. Do you think that you e.g. own the knowledge of which radio frequencies are used where? Do you think you have a moral claim on ownership of (presumably unpaid) user-submitted information? I think the only legitimate moral grievance you have is high traffic volumes from inconsiderate scrapers.


> Do you think you have a moral claim on ownership of (presumably unpaid) user-submitted information?

You're damn right I do. I own, develop, and maintain the entire system that enabled the body of works to exist in the first place.

Do you think that you have a claim on ownership of the data because you drove by, saw what you liked, and decided that now you'll just rip the baton out of my hand?


> You damn right I do. I own, develop, and maintain the entire system that enabled the body of works to exist in the first place.

I don’t think that meets the bar. Running a website is absolutely not equivalent to the collective effort people put in to populate that website with the information that actually gives the overall artifact its value. There is a large history of outrage when similar information repository websites with user-generated content violate expectations of openness. Nevermind the fact that the actual information itself isn’t even private or proprietary, just obscure and distributed.

> Do you think that you have a claim on ownership of the data because you drove by, saw what you liked, and decided that now you'll just rip the baton out of my hand?

I wouldn’t claim ownership nor want to, when I scrape stuff I usually just want information in a different format. I’m confused as to how you think you can even “own” data to begin with. Suppose that your users uploaded songs instead of RF info, do you believe you own their music solely because they chose to share it on your site? Do you think your users would believe that?


> I’m confused as to how you think you can even “own” data to begin with.

It's actually very simple. If I'm in a position to restrict access to the data, then I own it, unless there is some legal authority that has jurisdiction over me that says I must make it available to the public.


Given that you haven't fixed your problem with scrapers (given the complaints you're making right in this thread), it's obvious you're not in a position to restrict the data - otherwise you'd not be complaining about scrapers - and thus you don't own it.


Considering Walgreens is still fighting shoplifters, it’s obvious they’re not in a position to restrict their merchandise. They must not own it.

Well, exactly. blantonl claims that his ownership rights are based on his ability to restrict access to things, which is not a mainstream view.

Your example illustrates this nicely. Walgreens owns the goods on their shelves regardless of shoplifters.


I'm glad you agree with my point that Walgreens owns their merchandise not because they stop shoplifters and restrict access; it's because they purchased it and have title over it, and since GP has done no such thing they don't actually own it.

Operating a website doesn't automatically put you in that position, as evidenced by the fact that scraping does not require your consent to be possible. Ultimately there's little practical difference between someone's eyes viewing information and a program viewing that same information; a copy has been made in some form either way. Scraping a new site takes maybe a few hours of Python to accomplish; the barrier is low.
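
To make the "few hours of Python" point concrete, the hello-world of scraping is about this much (a sketch; the URL and CSS class are made up):

  import requests
  from bs4 import BeautifulSoup

  # A plain GET with a browser-looking User-Agent; to the server's logs
  # this is indistinguishable from a human visitor.
  html = requests.get(
      "https://example.com/listings",        # hypothetical target
      headers={"User-Agent": "Mozilla/5.0"},
      timeout=30,
  ).text

  soup = BeautifulSoup(html, "html.parser")
  for row in soup.select(".listing"):        # hypothetical CSS class
      print(row.get_text(strip=True))

Everything past that - proxies, headless browsers, fingerprint spoofing - is just an arms race on top of this core.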


I don't think you understand. If I decide as the owner of a site that I don't want you scraping my business and I block you, then I am in that position. I'm automatically in that position because I can implement the blocks necessary to uphold the terms of use of my business, or I can just do it for arbitrary reasons. Maybe you are hammering my server. Maybe I'm in a bad mood this morning and don't like that you're using Python.

I can unilaterally decide whether or not you use my business, in any way shape or form, even if I just don't like you, as long as I don't violate any laws (discrimination etc).


I absolutely understand, it's just not hard to make scraper traffic appear as (or be) legitimate browser traffic and/or simply distributed across numerous IPs. Other technical controls all have trivial circumvention methods. There is legal precedent (at least in the US) suggesting that scraping public information may be permissible under law (see HiQ Labs v. LinkedIn). Scrapers only ever need to succeed once.

Under these circumstances, how can a website operator feel any sense of practical control over scrapers?


This is kind of a silly argument. If a physical business trespasses me for shoplifting, I can just put on a disguise and go back and shoplift more. Why do business think they have control over shoplifters?

This is kind of a silly argument: for every item you shoplift, do you ask if you can take it without paying and then get granted permission?

> It's actually very simple. If I'm in a position to restrict access to the data, then I own it, unless there is some legal authority that has jurisdiction over me that says I must make it available to the public.

So, in which jurisdiction are you? Because in the US, courts have confirmed multiple times that scraping public websites is legal.

https://techcrunch.com/2022/04/18/web-scraping-legal-court/


I think your basic arguments are either:

- scraping is immoral

- we should bake DRM into the internet

There's no technical or legal difference between a scraping request and any other web request, and I can't really believe that you think that non-scraping web requests are immoral, so I think that probably isn't your argument.

Moving onto DRM, I think most people don't want it baked into the internet. I think individual entities can choose to use it if they want--that's basically how you protect against scraping, so I think people irritated by having their content copied and thus devalued (or their ads replaced) should probably just do that.


> Do you think that you have a claim on ownership of the data because you drove by, saw what you liked, and decided that now you'll just rip the baton out of my hand?

Are you just trolling at this point?

_You are handing the baton over_ in an HTTP response. If you don't want to do that, then change the logic of your server.

Good grief man.


Then any store is handing over the baton because you can walk in, take merchandise off the shelf, and walk out.

That's not at all what's happening here. This is me walking in, with a polite and well-formed request, regarding a piece of merchandise: "May I have <item>?"

And the store, clearly and with a signed receipt, saying, "Here is the item you requested. Have a nice day."


I find it aptly hilarious that your own business model at broadcastify.com is recording publicly accessible radio broadcasts and then selling access to those recordings for commercial gain.


Why is that hilarious? We developed an entire community, infrastructure, system, architecture, everything, from scratch, and provide access to something that never existed in the first place on the Internet. That's the key difference here.

This would be analogous to you thinking ancestry.com is "aptly hilarious" for arguing against someone just scraping their site for content.

What makes you think you should be entitled to drive by the very unique house that we built, and pointing right at that house and saying "I think I'll take that all of that for myself!"


Because you fail to see the very obvious parallels to scraping. I’m not criticizing your business (I think you provide a valuable service) but your hypocritical stance on what forms of publicly available information are allowed to be gathered and repackaged.

Google’s original (and OpenAI’s) business model was also building a scraping infrastructure, system, and architecture, from scratch — and providing access to something that never existed in the first place.


It's completely perpendicular, not parallel.

Public safety communications are radio waves that are broadcast, and the ability to passively monitor them is enshrined in United States law. That is a massive, key difference.

If I was sending data into your home from my infrastructure without any action from you whatsoever, and you were reaching up into the air and gathering it and repackaging it, AND the law said that I have no intellectual property rights to said data, then that's a whole different story.


Every time you use Google you benefit from scraping. Scraping is how the world has worked for the last 25+ years.

You are trying to draw a distinction between data that is pushed and data that is pulled, and maybe there is some economic argument there in terms of resource usage, but that is very context-dependent.

In the UK, listening to public radio broadcasts is illegal. I think this law is idiotic and I ignore it. It seems you do too, since there appear to be streams from the UK on your site :)


Google benefits from legal scraping - ban them in robots.txt and they'll stop.

Please don't mix consensual and non-consensual scraping, the difference is huge.


You are scraping radio signals and selling them. It’s an exact parallel, and if you fail to see this it is indeed hilarious.


It is difficult to get a man to understand something, when his salary depends on his not understanding it.



The analogy here is that a website that is connected to the internet is considered "free to browse" just as a radio signal is "free to listen to".

The issue isn't listening or browsing (so long as it's not DoS-ing), it's what you do with that information and whether you have permission to use the information (copyright of the host / broadcaster) in the way that you are and in the way that was intended.


So your only point is that scraping data is bad because of the cost? How do you know that the site someone is scraping doesn't have fixed costs?

No, scraping data is bad because it is against the owner's wishes.

In the US, if you broadcast, then by law you consent to being received and recorded.

If you scrape data, there is no such law. And if you get consent (say, by finding a permissive robots.txt), then go ahead and scrape.


> No, scraping data is bad because it is against the owner's wishes.

The broadcasters weren't happy about home cassette recording either, and the case went all the way up to the Supreme Court. If I can legally record cable, then it's not a stretch to say I can also "record" what's on the public Internet for my own use.

Morally speaking, we have to consider the other side of the equation - the operator may not be happy about being scraped, but as a user, is it okay for me to build or use a scraper-based price-comparison or price-tracking platform? I'd say yes, even though most sellers wouldn't want to have this data scraped.


I see a difference between "scrape for personal use", "scrape for public good" and "scrape to earn money from".

Everything is fine for personal use - you are choosing how to consume the websites, and if you choose to do it by extracting all the data into tables, that's fine.

Public good scraping is slightly murkier morally but I guess it's also fine? Similar to "fair use" copyright exceptions. (Unless it's commercial companies pretending to do "public good" solely for their own benefits, like AI "open dataset". Those should be banned.)

"Scrape to earn money from" is not OK. And sadly, this seems to be the majority of all scraping projects, such as: copy the sites wholesale and display your own ads on them, collect data to train AI on, for SEO (=make everyone's search results worse).

A good analogy is what you can do in a public place like a cafe: can you do your personal work there? No problem at all. Can you put up a non-commercial poster or sign? This may be OK. Can you earn money off it (say, sell your own stuff inside)? No way.


> "Scrape to earn money from" is not OK.

This is exactly how search engines like Google and Bing work though. So I don't think "earning money" is the right place to draw the line. Reproducing websites wholesale is also illegal and immoral IMO, but there's a lot of gray area between search engines and cloning the content of entire websites to slap ads on.


You realise web scraping is a legal right too?

Why is it ethical if you build upon other people's data, but unethical if others do it?

Nobody cares how valuable you think your service is. Who's the judge of who's entitled to scrape or not? If you think you're the judge, I find it somewhat arrogant.

It is even more hilarious that you defend a position that, to me, looks authoritarian and individualistic. Might not be your intention, but it's what I read.




> Because they GAVE IT TO ME, that's why.

You gave it to them when they visited you.


Look, when you publicize information that is not a human creation or art, you are GIVING IT TO THE PUBLIC.

The Berne Convention intentionally left database sui generis rights outside the scope of copyright. Only in the European Union do you have the kind of protection you're looking for. And even in the EU, I've never come across a case where the law was enforced in courts. Maybe because it's a ridiculous right, in my opinion, one that would make information flow dysfunctional in society.


They gave you a right to resell their broadcast content?

Yes, the US government did.

Please read other thread replies.


I absolutely agree. In fact, I think the problem is that like everything, there is an optimal point for efficiency, and crossing that line by making things "too easy" when it comes to data means too much power for one person to handle ethically. Absolute power may corrupt absolutely, but near-absolute power also corrupts quite nicely, too.

In short, we should have limits to the amount of scraping possible, simply because humans can never be trusted past a certain point to remain ethical. After all, ethics at its first approximation is only a mechanism to improve societal cohesiveness, and it only works as long as the person doesn't have enough power to "do away" with society.


Would you make the same argument of the inverse: data gathering?


Yes, I would. There is a law of diminishing returns for all technical and scientific inquiry.

> but in almost all cases I find Internet scraping to be unethical and downright malicious.

The Web (you said the "Internet", but you meant the Web) was not envisioned to be a commercial space. Your statement is antithetical to the original idea of the open Web. It's when the MBAs joined the party circa 2k and decided to profit out of it that all of these confused and wrong opinions about what the Web should be arose, and that led to the situation today. Your statement is a vast display of zero historical context. MBAs are obviously not very concerned with history. They just want to protect their own little turd for their own little profit and vanity, which is why they now put it behind a paywall, JS, and anti-bot proxies.


> Those companies employ ill-adjusted individuals that do nothing else than look for the most recent techniques to fingerprint browsers ... When normal people are out drinking beers in the pub on Friday night, these individuals invent increasingly bizarre ways to fingerprint browsers and detect bots ;)

Why not both on a Friday night?


A curious title.

"So you want to scrape like the unethical boys?" I guess doesn't scan so well. Bad boys maybe?

I'm pretty sure the Internet Archive, etc. don't in fact misrepresent what they are to crawl websites...


> "So you want to scrape like the unethical boys?"

What's considered ethical is a very debated topic.

An assertion that something is simply "unethical" should be seen as the starting point of a discussion, not as a self-evident fact.


If someone tells you to go away via the robots exclusion standard, and puts up bot mitigation to prevent you, blocks your IPs, etc. then clearly you do not have their consent to help yourself to the data.

I find it really hard to see how you could twist ignoring this clear lack of consent - going to great lengths to circumvent what was clearly put into place to prevent you from doing the very thing you are doing - into an ethical action.

It may or may not be technically illegal to do, but that is not a statement about what is ethical.


Ok, you’re building a service that scrapes e.g., property rental websites to find entries that are trying to scam naive renters.

The property websites are incompetent to solve the problem, or don’t care, but either way they sure don’t want you scraping their valuable data.

Is it still unethical?


That just makes both of you wrong.


Agreed. It's kind of like when a non-profit organization argues that they are entitled to someone's data because "we're not making a profit off of it." That's ridiculous.

Try asking a startup for free software licenses or seats or whatever as a non-profit. "We're entitled to 40 seats of your SaaS solution because we're a non-profit working to solve world peace." It's definitely within the startup's purview to respond with a no.


Surely the ethics are more complicated than just following robots.txt or not. The intended usage counts, and that isn't captured in robots.txt.


If you have a noble intent, you ask the webmaster for permission to use the data. Surely if they agree with your assessment that your intent is indeed noble, then you'll be given consent.

I run a search engine and an internet crawler. I do this all the time. To this date I've never had a webmaster that didn't permit my crawler access when I've asked nicely.


If you have a noble intent - identify members of fascist organizations - then obviously when you ask the top online fascist sites if you may scrape them to build up your list of online fascists - they will say no.

OK, less provocative: you have a new algorithm to identify inaccessible websites, and your automation is scary good - crawling a site, you can identify many issues that most sites would have to pay for a full audit to find. But now these sites have a problem: if you can identify their sites as inaccessible, they have to fix those issues due to various accessibility standards that apply in the regions they operate in. If they don't allow you access, they can maybe argue they are accessible based on the audit they did last year; at any rate, they don't want to be forced to spend money on accessibility issues right now, which it sounds like they might have to if they let you crawl their site.

Version 2 of the above: some years ago I interviewed for a job with a big-time magazine publisher in Denmark and said one of the things that would make me a good employee was my knowledge of accessibility. Their chief of development said they didn't have anyone with disabilities that used their site - so if I ask that guy for permission to crawl their site, why would he say yes? They have no users that would benefit!! Stop abusing our bandwidth, bleeding-heart guy.


All of these seem like variations of the-ends-justify-the-means, which generally tends to cut both ways in unanticipated ways.

Bullying websites into accessibility compliance will most likely lead to them following the letter of the standard without giving a second of thought as to whether the content is in fact actually accessible. It's very difficult to get someone on board with your cause if your initial contact is an antagonistic one.


This might work in cases where those with the data are engaged in noble acts, but not every actor is.

I scrape and process websites of actors engaged in fraud. I do this to make the data more presentable to the proper authorities and to help uncover further evidence of their activities.

I suspect that asking for consent would be quickly denied and the data/evidence would quickly become inaccessible.


> If you have a noble intent, you ask the webmaster for permission to use the data.

Is Marginalia opt in, then? Surely "not having a robots.txt" ("you didn't say no!") does not equal consent. And surely you could just ask all the webmasters you are scraping from for permission, since you have noble intent.

My point is that this is just hypocritical; you are placing the moral boundary right below what you are doing, while claiming moral superiority. If you ask others (e.g. anti-search Fediverse), they would think you are immoral too.


You really see no difference between following the robots exclusion standard, doing nothing to conceal your origins and intents, and respecting blocks when they appear; vs concealing your origins and intents, willfully ignoring the robots exclusion standard, and going to great lengths to circumvent IP blocks and other bot mitigation measures?

Both of these are the same?


Why do you think the consent of someone relaying information matters in the slightest when it comes to what people do with that information?

> unethical

Using and transforming information in useful ways is unethical if it results in a profit?

That's what our brains do, too.


No, destroying incentives to produce and share information is unethical (and more importantly, self-defeating).

Brains that consume information don’t destroy that incentive, they produce it.

Intermediating that and capturing all of the value for yourself is the unethical part, just like all forms of rent-seeking.


> destroying incentives

Internet usage and content creation are increasing, not decreasing.

I continue to publish comments, code, and images that presumably get used to train models. My incentive hasn't been destroyed.

> rent-seeking

Supply and demand set the prices.

Subscription services provide value and continue to invest in their product, catalog, and/or service. Property owners handle asset ownership and upkeep problems at scale.

Inefficiencies will be met with competition, and businesses not providing value will be out-competed.

Data under-availability is an inefficiency holding us back from bigger and better things.


Tell that to the most used website in the world, which is basically a scraping-and-sorting machine.


I can commit a code change in 2 seconds that would directly tell the most used website in the world to stop scraping and sorting my data, and they would honor it, and that would be the end of that.

I'm under no illusions about whether they will keep honoring that in the future, but that's the state today.
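
That change being, roughly, two lines in robots.txt:

  User-agent: Googlebot
  Disallow: /

Whether every crawler honors it is another matter, but Google's does.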




