Web Scraping Is Vital to Democracy (themarkup.org)
426 points by atg_abhishek on Dec 4, 2020 | 121 comments



I think of web scraping as nothing less than automation of human work. There really is no hacking or unauthorized access involved. It's either me, using a browser, to read something publicly accessible on the internet, or it's my computer, which I've programmed to read those things for me, so I can focus on creative work or spend time with friends and family.

This applies not only to personal life, but businesses as well. Instead of hiring 100 interns to collect some data manually, you hire one programmer to automate that data collection process.

I think that the ability to automate tasks on the internet is absolutely crucial to the further development of our society, and limiting it in any way will be detrimental to the world as a whole. The amount of information these days is so vast that no human labor force could possibly analyze it all and use it to drive our progress.

Disclaimer: We run a web scraping platform (https://apify.com)


> I think of web scraping as nothing less than automation of human work.

I agree that scraping should be allowed. We wouldn't have Google otherwise.

However, there's something to be said for the fact that some activities are okay at a small scale but become problematic at a large scale—particularly the scale which becomes possible when a task is automated.

For instance, I recently bought an expensive camera and have been having fun walking around the city and taking interesting photos. Many of the photos have people in them. I don't think there's any harm in this.

I store my photos in Aperture, which automatically performs facial recognition on everything in my library. Perhaps some day, if I take enough pictures, Aperture will notice that the same stranger is present in two completely different images. That might be kind of cool—I can see myself fancifully trying to imagine this person's life story. I don't think there would be any harm in that, either.

However, if I aggregated millions of photos from different sources, and used them to track people's movements across the city, that would clearly be a huge problem! Sure, it would be merely automating the work a human could do, but the scale just changes everything!


Can the case be made that recording in public is a right (as it always should be), but trying to track where everyone is at every point in time is stalking at a mass scale, which should be illegal just as stalking on a one-to-one scale is illegal?

For reference this is what one site (findlaw) has given as what constitutes the crime of stalking:

The crime of stalking can be simply described as the unwanted pursuit of another person. Examples of this type of behavior includes following a person, appearing at a person's home or place of business, making harassing phone calls, leaving written messages or objects, or vandalizing a person's property.


> The crime of stalking can be simply described as the unwanted pursuit of another person. Examples of this type of behavior includes following a person, appearing at a person's home or place of business, making harassing phone calls, leaving written messages or objects, or vandalizing a person's property.

Except for the last point, that sounds like ads :)


Ads vandalize property by causing unwanted images to appear on, and unwanted sounds to be emitted from, personal property.


Is a highway billboard vandalism when the owner of that billboard is allowing the ad?

Or are you referencing computers with ads in them (websites, etc.)? By that definition, anything disproving a conspiracy theorist is vandalism to them. That’s not how that works.


Vandalism usually means permanent damage. You can call the police on your neighbor for playing loud music at 3 am, but I don't think it would be correct to accuse them of vandalizing your property.


This sentiment exactly.

Sure, the information is free. But finding my property records and phone number and "aggregating them," as Spokeo, fastpeoplefinder, and similar sites do, is akin to digital stalking IMO.


Maybe I am wrong and my take is too simplistic/naive (as I admit, I haven't thought too much about this topic), but IMO aggregating publicly available data shouldn't count as stalking. Acting on that data in a way that could be construed as "stalking" definitely is real stalking, though, and should count as a crime.

For example, if someone's ex (who was explicitly warned that they were unwelcome) continues showing up on the person's doorstep because they remembered the address from long ago, then that is stalking. Them knowing the address isn't.

In light of this, I don't see how it matters whether they remembered the address from past experience or just found it through a website that aggregates publicly available info. As long as the data was obtained legally and without breaking any other harassment clauses, why would mere knowledge of something be a crime?

I see it just like firearms. Having a firearm (in a lot of US jurisdictions) is not a crime, as long as it was obtained legally. Doing harmful things with it (such as threatening people or shooting someone who didn't pose a threat to your life) is a crime. Having a firearm feels like just having data, in this scenario. As long as you don't use that data for criminal actions, why would the mere potential of being able to do something criminal with that data be a crime?


The case I thought of is how public records are great, but the way Spokeo and friends scrape them, post them, and sell them isn't.


Right! Ensuring a process has the right amount of friction can actually be super important, to ensure it's available but isn't used wantonly.


This is a great way to phrase it. Friction really is essential for so many things where we're gradually losing it.


I totally agree with the first part where you say it's fine to scrape. It is just automation.

But you don't mention what you do with this data afterwards: 1. You store it somewhere (not in your brain). 2. You extract value out of it (directly or indirectly).

That's why I understand why it is a problem for those who publish this data.

One thing I always try to say to everyone who argues about web scraping: web scraping is not the problem; the problem is what you do with the information you scraped.

Disclaimer: We crawl the web for news (https://newscatcherapi.com/)


> That's why I understand why it is a problem for those who publish this data.

I still don't understand why it's a problem for those who publish the data.

If the "data" is facts (e.g. lists of values associated with objects, like the colors available on automobiles), this data is not protected by copyright, there is no remedy if it's republished, and if a business relies on limiting access to this data, that business needs a new business plan. Charge for access to cover the costs of obtaining and organizing the data, but understand that clients are legally allowed to make copies and use them however they see fit, even if that impacts the supplier's ability to charge for access.

If the "data" is prose, it's under copyright and republishing without a license has remedies under law. Maintaining copies of articles for the purpose of processing them to obtain other data (e.g. how many nouns? what adjectives are near those nouns? etc...) isn't protected.


So if somebody compiles the data and charges an amount N for access, it's fine that some entity buys one-time access and then releases it for free?


This is usually explicitly prohibited.

If it's not prohibited for some reason, e.g. because the term of the prohibition, such as copyright, has expired, then why not?


Databases are actually protected under some copyright laws in some countries.


For copying the database wholesale, yes. But scraping the individual records, formatted on a web page, out with a bot is not copying the database files or its schema wholesale. They're still individual facts.


Is news just facts? Or does it also incorporate the bias of the author?

Sure, you could argue that news is facts, and that it should be free from copyright. But that would only work if your news were just bullet points of the factual stuff. If any element of bias is present, it becomes a distinct property of the author, hence it should be copyrightable.

Of course, with today's tech, it's easier to remove the bias element from the factual data.


I agree, 100%.


I think of placing a zillion cameras in public places as nothing less than automation of human work. Instead of hiring a million agents to record what the population is doing, you hire one programmer to automate that data collection process.

Totally different? Yes, but it shows that scale can affect whether something is OK to do or not, at least for some. (I accept that people watching the streets can be useful, but I also have my doubts about doing that at scale, whether by cameras or by hiring a million agents.)

With both cameras and copy-pasting web content, there's the issue of what you do with the data. If, for example, I start scraping all the articles on a newspaper’s web site, publish them on a web site, adding my own ads, most people would think that shouldn’t be legal.

If you agree, we’re now haggling over the price (https://quoteinvestigator.com/2012/03/07/haggling/). That’s where things get difficult, but I think the entire spectrum from white (scraping for this goal is fine) via grey to black exists.


Weird take, but I see this sort of thing show up a lot, so let's give it a shot:

> If, for example, I start scraping all the articles on a newspaper’s web site, publish them on a web site, adding my own ads

Using a Q-Tip to murder someone is illegal. We don't have special laws against it—it's just that murder is illegal, we have criminal punishment for murder, and the ordinary machinery of the courts and the law is sufficient to handle any instances of murder by Q-Tip. Because it being done with a Q-Tip is the least important part.

If you're scraping and republishing someone else's content, it's not the scraping part that's the problem.


> If, for example, I start scraping all the articles on a newspaper’s web site, publish them on a web site, adding my own ads, most people would think that shouldn’t be legal.

Publishing? Ads?


Republishing, duh.


OK, let's reformulate that: would people be opposed to just the scraping, or to the illegal republishing and adding ads?

Another similar trick: "If someone blocks ads and then murders the publisher and burns his house down, most people would find that it should be illegal."


Advertising isn't some zero-sum game that suddenly makes you lose because you happened to see it. Republishing is - the original publisher loses out on the value of his publication because that means either reduced ad revenue or reduced subscription. I'm guilty myself of reading republished articles, but it is a loss for the publisher.


I hope you realize that what you wrote isn't related to my point at all. It's fine, but why put that as an answer to my comment is beyond me.


While I have substantial agreement with your point of view, when web content is substantially ad-funded, automated scraping effectively bypasses the “payment”.

I say this as someone who runs an adblocker, installs them for family, and doesn’t derive income from hosting ads, so I’m not pro-ad; I just realize that while it’s not a crime for me to dump the whole bowl of waiting room candies into my backpack, that’s going to be frowned upon.


When you dump the bowl of candies into your backpack, there are no more candies left for anyone else. On the other hand, when you scrape, all the original data and its underlying use case remain just as they are. You simply having scraped took nothing away from anyone.


This is orthogonal to the purpose of scraping.

If I run a search engine which potentially links real human eyes back to you, then should I pay the "ad" toll as well?

I don't believe so.

I do believe there is an agreeable middle-ground, but Google walked away from that conversation years ago.


Advertisements are harmful to every layer of society, mostly because they prey on you to instill desires that you probably wouldn't have had on your own (it being the case that this is their entire added value proposition). They should not be tolerated, and the fact that they ever were is a travesty.

That being the case, any argument founded on "but advertisements" does not hold water.


It's not founded on "but advertisements"; it's founded on "but paying for content". The fact that websites are developed by people (wages), hosted on infrastructure (hardware rental), and require countless other jobs is somehow magically forgotten in these discussions.


It's not up to me or anyone consuming content to help with figuring out the right way to pay for it. What is up to me is the choice to not be potentially exposed to malware any time I go to read the news.

While you may have a point, that point just doesn't matter within the context that we're talking about. Making companies money is not my responsibility.


It absolutely is your responsibility if you are consuming the content.

Consuming the content means agreeing to the premise that the content is paid for using ads.

If you disagree with said premise you may happily browse another website.

Malware is already illegal to install and you are free to sue the website for damages.

I don't like ads either, but trying to justify that there should be a choice of browsing a website without the ads it hosts is ludicrous.


> Consuming the content means agreeing to the premise that the content is paid for using ads.

It is in fact not this way, because the content arrives with or without the ads. In the EU, EULAs (the dystopian construct that you'd expect to enforce that bit of lunacy) that purport to apply to content you have already accessed are not legally valid. Leaving the legal interpretation aside, me doing one thing doesn't mean consent for something else. Believing otherwise is both unethical and amoral, stances I don't hugely feel like interacting with.


> It is in fact not this way, because the content arrives with or without the ads

But that's simply a technical implementation detail.

If the articles you read had first-party ads or videos with embedded ads in them, that choice wouldn't exist, just as it doesn't when you watch TV.

> Leaving the legal interpretation aside, me doing one thing doesn't mean consent for something else. Believing otherwise is both unethical and amoral, stances I don't hugely feel like interacting with.

Yes, and you stealing content (consuming it without paying for it) is somehow moral?

It's easy to find all kinds of free content, be it music, movies, or series, but let's not kid ourselves into thinking that it is some great human right to have access to things we didn't pay for.


I wrote a scraper, just for myself, to keep an eye on several used-truck websites. I was able to do this for craigslist, so every couple of days I could run it against the nearest 75 cities, pull the list of truck ads, filter on/out certain keywords, and fetch the pages for the interesting trucks, producing a report to help me find the vehicle I want.
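
In Python, that kind of multi-city keyword filter is only a handful of lines; a rough sketch (the craigslist search-URL layout, listing selector, city list, and keywords below are illustrative assumptions, not the exact ones used):

    import requests
    from bs4 import BeautifulSoup

    CITIES = ["denver", "boulder", "cosprings"]   # stand-in for the "nearest 75 cities"
    KEEP = ["4x4", "diesel"]                      # keywords to filter on
    DROP = ["salvage", "parts only"]              # keywords to filter out

    def interesting(title):
        t = title.lower()
        return any(k in t for k in KEEP) and not any(d in t for d in DROP)

    report = []
    for city in CITIES:
        url = f"https://{city}.craigslist.org/search/cta?query=truck"  # assumed search URL layout
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for link in soup.select("a.result-title"):                     # assumed listing selector
            if interesting(link.get_text()):
                report.append(f"{city}: {link.get_text()} -> {link.get('href')}")

    print("\n".join(report))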

I could not do this for any of other several large sites advertising used trucks, like commercialtrucker.com, truckpaper.com, and machinio.com. I kept bumping into artificial javascript and captcha limitations that were not worth my time to try to work around.

The thing is that I will probably find what I want on a craigslist site that I can scrape; I'm getting a lot of great info from them, and I'm not going to bother with any of the ad-based sites since it takes too long to run all the manual searches. They have effectively done a disservice to their (presumably paying) customers who want to sell their vehicles.


If AutoTempest covers your search criteria, I’ve found them to be excellent to do repetitive searches. Somewhat fortunately for you, they suck for craigslist searches, but have great coverage for all the other major sites.


> I think of web scraping as nothing less than automation of human work.

I did this with Python regarding market prices for ETFs, Stocks, and Mutual Funds.

I wrote Python scripts, one of which was to web-scrape current market values, investment distribution, &c for the ETFs, stocks, and mutual funds in which I was invested. I then would have to manually port that into a spreadsheet for "my own special graphs" and such.
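
A minimal sketch of that fetch-and-record step (the quote-site URL, CSS selector, and tickers are placeholders, not an actual data source):

    import csv
    import requests
    from bs4 import BeautifulSoup

    def fetch_last_price(ticker):
        url = f"https://quotes.example.com/quote/{ticker}"             # placeholder quote site
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return soup.select_one(".last-price").get_text(strip=True)     # placeholder selector

    with open("portfolio_prices.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for ticker in ["VTI", "VXUS", "BND"]:                          # placeholder holdings
            writer.writerow([ticker, fetch_last_price(ticker)])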

I'm sure there are places online which would do this for me if I logged in and entered all of my information and more, and they would provide all of that for me ... but this story should sound familiar. And due to personal reasons, I've been lagging behind for far too long.

My Point: This information is publicly available and it serves my purposes. There's no reason why I should not be able to do this. Yes, it's my fault for not using the information [and yes, the information dies with me], but the point is that I automated the gathering of that information for myself -- and "everyone" else on the planet has that same information.

I never felt like I was doing anything illegal. I'd be happy to know if there is a "safe" way to do this type of aggregation otherwise.


> There really is no hacking or unauthorized access involved.

I'm not sure about that. If I had a phone book full of phone numbers (those heavy ones from the 80s), would calling every number in that book to find the one I'm looking for be legal/ethical?

P.S: I agree that "Web Scraping Is Vital to Democracy".


> If I had a phone book full of phone numbers (those heavy ones from the 80s), would calling every number in that book to find the one I'm looking for be legal/ethical?

Sure, why not?


I have no idea, I've never done it.


I can tell you haven't cold called ever in your life.


This article focuses a lot on the _good_ of web scraping, but I don't think we even need to go that far.

The fight against web scrapers just seems like a complete logical oxymoron: they want data to be public but also to select who gets to see it. Our whole web infrastructure is based around clearly distinct public/private exchanges - there's no middle ground - and yet people create these absurd hacks like captchas and fingerprints to fight the nature of the internet.

Finally, everyone wants the benefits of public data (search engine indexing etc.) but doesn't really want to give anything back to the ecosystem. It's just pure greed, and the law, our society, and our government shouldn't aid it in any way, shape, or form.


You just state that "there is no middle ground", then go on describing instances of what arguably is "middle ground" as "absurd hacks". What's it gonna be?

There's a POV that, on the internet, "anything goes", i. e. whatever you can do, you're allowed to do.

Then, there's a perspective that works a lot like the offline world, where any clear communication that a reasonable person would understand as denying them access becomes legally binding.

In the offline world, we derive great benefit from following the second model. Indeed, if only measures that successfully prevent people from entering your house without permission were to count, you wouldn't need laws in the first place! You would, however, need a bunker. Which is quite a bit more expensive than a functioning legal system.

In that offline world, we have created all sorts of additional rules to balance rights for specific situations, and we rely on a canon of expectations that say, for example, that it's usually not ok to enter a private house, but you don't need explicit permission to enter a supermarket.

These are still developing for the online world, and your idea of the "ecosystem" hints at that. But you're just taking from those contradictory ideas above to arrive at the outcome you intuitively feel is "just": a bit of might-is-right when it comes to "whatever is online is fair game", followed by principled ideas of rights and obligations when websites try to defend themselves in that jungle of yours.

What's really needed is something that can, for example, distinguish between a journalist scraping Facebook to map out a terror network vs. some other entity scraping Facebook to sell your embarrassing photos to the highest bidder ten years down the road.


> You just state that "there is no middle ground", then go on describing instances of what arguably is "middle ground" as "absurd hacks". What's it gonna be?

There is no middle ground _in the protocol_ and that's why I explicitly said hacks. Any system can be rehashed into anything else with unlimited extra layers on top of it - by your definition everything is everything.

> What's really needed is something that can, for example, distinguish between a journalist scraping Facebook to map out a terror network vs. some other entity scraping Facebook to sell your embarrassing photos to the highest bidder ten years down the road.

Sorry, but that sounds unenforceable and rather absurd. We have the framework in place already - if you don't want something to be public, don't put it out in public _explicitly_. In addition, we already have a legal framework in place for copyrighted and/or protected content like photos, and against any sort of malicious attack like DDoS.

Our web is getting extremely centralized, and most of these majors are natural monopolies: Google search becomes stronger the more data it has - Google can scrape the entire web freely yet its competition can't; Facebook becomes stronger the more data it has, etc. One way to restore balance is to ensure that public data remains public so the ecosystem can have healthy competition and growth; otherwise we're moving toward a very dystopian, corporate-owned world.


The offline world is the world you are in by default; you can't choose to put your house elsewhere. On the contrary, when you put information on the internet you choose to do it there (at least for website owners...).

If you want to put restrictions on the usage of your data, make people sign a contract before accessing it.

Also, a contract should not have "the public", or a subset of it, as a party; you should be able to identify the parties you have contracted with. Otherwise you may end up with warrants targeting everyone or a subset of...


A contract doesn't need to be signed. All it takes is a "meeting of the minds". Ever bought a coffee from a street vendor just by pointing at something / handing them some cash or similar? That's a valid contract.

As long as people keep noticing how stupid Ayn Rand is before they come of voting age, we do have some protection against surprises: you can't just sign over your house or your first-born by clicking on a cookie banner. But I'm pretty sure Facebook could make you type "I won't scrape Facebook" into a box and it'd be (civil-law) binding.


I think a closer "offline world" analogy would be if a magazine tried to enforce rules to what you can do with it after it was delivered.


I think web scraping is OK, but the scraped data should be treated as copyrighted. Let's take a real example: we've built a company that creates ratings for doctors via a private process. We'd like the doctors to be able to show their good rating on their website via a linked widget. How do we stop a competitor from not only scraping our data but using it as their own and/or selling it to someone else?


By network effects and brand loyalty.


We're small with no funding. Our competitors have $100Ms of VC funding. How do we even get network effects and brand loyalty?


>>> Fight against web-scrapers just seems like a complete logical oxymoron: they want data to be public but also select who gets to see it.

Wouldn't it be ironic if the non-selective people used this as leverage for discrimination?


>"a complete logical oxymoron: they want data to be public but also select who gets to see it."

I think you're proving too much here. Your argument applies to all published authors, and would strike a crippling blow to their copyright.


First of all I should think copyright only restricts publishing, not reading. Obviously if you put a price on your book I have to pay it, but that's a different issue.

Secondly the Internet is best viewed as a public noticeboard purely because of the way the protocol works. There's just no getting around that. I think you'd agree that putting up a notice on a street corner and then getting offended when people read it would be viewed as rather odd, if not something else.


But published authors don't have control over who gets to see their work -- only who gets to profit from it (distribution rights, not use rights).


"In both of those instances, the pages and data scraped are publicly available on the internet-no hacking necessary-but sites involved could easily change the fine print on their terms of service to label the aggregation of that information "unauthorized.""

They could. But if they filed a claim under the CFAA without ever sending a cease and desist letter to the alleged intruder, I think the claim would be dismissed.

Is simply changing the TOS enough for a CFAA claim to have a reasonable chance of success? I could be wrong, but I believe that for every CFAA claim we have seen so far based on "scraping", there was some notice to stop directed specifically at the respondent. If someone thinks just changing the TOS (public notice) is enough for a CFAA claim to survive a motion to dismiss, and there is no need to also contact the alleged offender asking them to stop, then let's hear about the precedent that supports that idea. I do not think there is any such precedent, but I could be wrong.


As to precedent, there is a decision by the 11th circuit asserting such an expansive reading of the law. It's the very decision that's the subject of this appeal.

Other than that, you're mixing civil and criminal law rather liberally. I agree that it would be insane to create criminal liability for run-of-the-mill violations of ToS, and there is a decision from the MySpace era saying as much (and not even involving any changes to those ToS).

But once you have been specifically asked to not do something, by any means that would reasonably get that message across (so, not just C&D), it becomes... murky?


We'll have to wait and see, but I believe there should be a relevant distinction between a publicly accessible database, like the backend to a public non-governmental website, and a non-public database being served by a protected government computer, such as the ones Van Buren (2019) or Rodriguez (2010) accessed. Not to mention there was an existing (employer-employee) relationship between either Van Buren or Rodriguez and the operator of the computer database: the US government. Compare that to scraping a public website; the person doing the scraping may have no relationship with the operator of the website.

I could set "Terms of Use" for the data stored on protected computers I own. I could place limits on "acceptable use" of this data. But can I really argue that I gave adequate notice to all the tech companies and partners that try to access this data?


So robots.txt has no meaning?


It's not binding, but it's a hint they may listen to. You can say things like: if there's a certain URL structure where your server would keep serving content infinitely and just waste the crawler's time and resources, you can hint that maybe they should avoid those URLs; but hey, if they want to waste their resources crawling that anyway, be my guest!
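
For example, a hypothetical robots.txt hinting crawlers away from an infinitely paginating path, and a crawler honoring it with Python's standard-library parser:

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt: the /calendar/ path pages forever,
    # so the site hints that crawlers should skip it.
    ROBOTS_TXT = """
    User-agent: *
    Disallow: /calendar/
    """

    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    print(rp.can_fetch("MyCrawler/1.0", "https://example.com/calendar/2099/01"))  # False
    print(rp.can_fetch("MyCrawler/1.0", "https://example.com/articles/42"))       # True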


I think the main issue is with scrapers that don't adhere to robots.txt/X-robot headers.


Google doesn't either; they just don't index the paths written in robots.txt, but their bots still hit them.


No, because robots.txt is a friendly request and not a demand. It can never be a demand.


Is it possible for private enterprise to file suit under the CFAA? I thought the CFAA was a criminal statute.


Okay, now is a good time to put my long-time idea on paper: we need a "cloud" power of attorney (not using the term digital so that people don't confuse it for a regular one just signed digitally).

Key idea is this: access to the internet and its services is essential (and at web scale, so is automation), but bot abuse is real. It should be possible for any company to ban bots, but then for a person with an account on the website to say: I am giving this tool my cloud power of attorney (best if signed through a digital govt ID system, e.g. https://en.wikipedia.org/wiki/BankID) and I take responsibility for what it does; you may not block it or erect CAPTCHAs for it, and it scrapes/takes actions on your website on my behalf for my personal needs. This would make running a bunch of scripts on your own NUC or Pi an inalienable right while still allowing companies to fight unfair competition and plain simple DoS attacks.


> It should be possible for any company to ban bots, but then for a person with an account on the website to say: I am giving this tool my cloud power of attorney

That's basically what API keys are, aren't they?


No: many providers require you to apply for API access, to pay for it, and to be bound by T&C (some APIs require you to cache data for a maximum of 24h), and they also provide only a crippled subset of actions through it. What I was saying is that none of that should be allowed once a power of attorney has been given to a tool (obviously unless the website access itself is paid; then it's perfectly fine to charge for API access, but it should be offered on something like FRAND terms). Think of Chrome extensions detected and blocked by some websites; this would not be allowed under my proposal.


Building a distributed crowd sourced screen crawler.


I am proposing to have laws that respect our rights, not to secure rights by circumventing and disrespecting the law.


https://packetstream.io is pretty close


We use scrapers to wrap and combine 3rd party tools and increase the usability and decrease amount of work required to get something done.

Sometimes doing a task in a 3rd-party tool requires many clicks and page loads. My scraper wrapping the tool automates most of the task and only keeps the manual parts, reducing the number of clicks (and the time) required to get the task done.

I use a mix of headless browsers (for JS-heavy apps) and raw API calls. Sometimes I even use the browser to log in, trigger a single sample API call, extract all headers and content, close the browser, and re-run that API call directly using an HTTP lib. The request body obviously gets modified based on the user’s requirements. We’re bypassing the slow login process as long as the session is valid. We’re also sharing sessions/logins/accounts this way without exposing credentials to users.
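
A rough sketch of that login-then-replay pattern, assuming Playwright's Python API plus requests (the URLs, selectors, credentials, and the /api/report endpoint are made-up placeholders):

    import requests
    from playwright.sync_api import sync_playwright

    captured = {}

    def on_request(request):
        # Capture the URL and headers (assumed to carry the session/auth tokens)
        # of the one sample API call we care about.
        if "/api/report" in request.url:                 # placeholder endpoint
            captured["url"] = request.url
            captured["headers"] = request.headers

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("request", on_request)
        page.goto("https://tool.example.com/login")      # slow, JS-heavy login
        page.fill("#user", "me@example.com")             # placeholder selectors/credentials
        page.fill("#pass", "secret")
        page.click("#submit")
        page.goto("https://tool.example.com/report")     # triggers the sample API call
        browser.close()

    # Re-run the call directly with an HTTP lib, reusing the captured headers.
    resp = requests.get(captured["url"], headers=captured["headers"], timeout=10)
    print(resp.status_code)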

It can also be done to bypass the only-one-session-per-user systems. This is done with permission from 3rd parties. They’re fine with it, they just didn’t want to provide a proper API or let us bypass the only-one-session rule because it requires code changes they’re not willing to make only for us.

Sometimes the tool breaks when the HTML or API changes, but it usually only takes a few minutes to modify the code to fix it.


Last weekend I made a little Playwright script to check the stock status of Decathlon products. The bike trainer ("Rodillo") has been out of stock for several weeks.

While I was coding it, it was briefly in stock, but I was too slow with my card input hahaha, so now I need to automate the buying part as soon as it's available, or make it run every 5 min and mail/notify me if it's available...

Btw this was super easy; Playwright as a Puppeteer successor really rocks, another thing from MS that's hard to hate, like TS or VS Code.

PS: I brought script execution down from 7 seconds to 2 seconds by not loading any unneeded stuff: CSS, images, external JS, fonts.

I felt like a good scraper citizen doing so; it was just two lines to block by request type on the network level, and it worked like a charm to speed up the process.
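
For reference, the block-by-request-type trick looks roughly like this in Playwright's Python API (the product URL is a placeholder):

    from playwright.sync_api import sync_playwright

    BLOCKED = {"image", "stylesheet", "font", "media"}   # resource types we don't need

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Abort any request whose resource type isn't needed for a simple stock check.
        page.route("**/*", lambda route: route.abort()
                   if route.request.resource_type in BLOCKED
                   else route.continue_())
        page.goto("https://www.decathlon.example/rodillo")   # placeholder product URL
        print(page.title())
        browser.close()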


If you post a file to the WWW (World Wide Web) do not be shocked that the whole world can read and copy it. That's how it works. You're welcome.


Clearly it's time to come up with rules for this activity. Picking an apple from a tree in a public park is clearly harmless. Bringing in a combine harvester to take every scrap is not. How do we set a boundary on scraping?


A website would be like an apple tree if there were an infinite number of apples on it, so you could bring a harvester and still leave enough apples for everyone.


They should sell scraping licenses, right? /s


The connection between web scraping and the case cited seems pretty tenuous. I can't say I know much about the definition of the Computer Fraud and Abuse Act, but employment agreements commonly stipulate that the employee can only access information to the extent needed for his or her job. So defining this as unauthorized use wouldn't at all be surprising given how that term is used in practice.

The article then goes on to defend web scraping which is very different because (1) the person in question accessed data manually, (2) the person in question had access to confidential information, (3) the person in question used the data in a way he must have agreed not to as part of his employment. It's hard to see how someone could connect whatever precedent this case sets to web scraping.


>The connection between web scraping and the case cited seems pretty tenuous.

The link is another case, Linkedin v hiQ, that is being held pending this case and presents the same question of what counts as "accessing without authorization or exceeding authorization" under the CFAA. The dispute there is whether hiQ could scrape public LinkedIn pages.

The crux of the issue is that if instructions on how to use data that someone has access to without "breaking and entering" don't count as revoking "authorization", this law doesn't cover this officer's actions (though other laws / job requirements may). If just breaking verbal or written terms of use is enough to criminalize it, that covers a whole bunch of things we'd think of as not federal crimes, like lying about your age to set up a facebook account.


We at https://VisualSitemaps.com feel the same way.

Tons of our customers use us not just to archive a snapshot of the site but also to keep tabs on any visual and scope changes over time. All without any coding. 100% Automated.


I use web scraping on my employer's Oracle EBS instance because coordinating with the battle-axe personalities that run ERP's is such a pain in the ass I would rather have a root-canal. With Web Scraping, I don't even have to talk to them.


For others: it looks like EBS in this context means "oracle e-business suite"

and ERP means "enterprise resource planning"

(maybe everyone else knows these, but just in case)

https://en.wikipedia.org/wiki/Oracle_Applications#Oracle_E-B...

https://en.wikipedia.org/wiki/Enterprise_resource_planning


Yes, that's right, and those of us that don't have to deal with these dreadful systems are better off :-)


Okay, great point, but who will pay for that? Serving content costs money: traffic, extra load on CPUs and storage, and now on serverless functions too.

One spider may be the equivalent of hundreds of real users. And it spoils caches too.


Crackpot regulation proposal: scraping is legal if you favor mirror sites and afterwards create a mirror yourself and maintain it to some arbitrary standard.


I was trying to write a scraping script for a nonprofit earlier this year, to automate the work they were giving to law students to do manually, but the court system had captchas that made it impossible. In my (legally uninformed) mind it felt like they were actively undermining people of lesser means by not allowing a simple script to run, essentially mandating an expensive legal team to do menial work.


It's good to see that web scraping is not associated only with "hackers" and spammy websites anymore.

I'm working on simplifying web scraping for developers with https://webscraping.ai and see how important it is for almost any business or researcher.


"ScrapingIsNotACrime" t-shirts caught my eye and shared the amicus brief a few days ago here

https://news.ycombinator.com/item?id=25254499

Can't really fault The Markup for looking out for their own techniques/strategies


Major platforms don't feel that way, and they plan on using feature detection to make sure that unauthorized, or competitors', browsers can't access the content that they host. These are the same companies that walk all over the ADA in an attempt to stop scraping.


Another proof that the semantic Web is dead - or was it really never there?

Now all we have is the SPA 30 MB main.min.js mess.


Why didn't data.gov usher in a new era of transparency?

I really thought Obama Admin's data.gov initiative signaled a sea change.

Share your data. Show your work.

Science, journalism, others, face this same crisis. Legitimacy, credibility, authenticity, accountability, etc.

It's this obvious? Why are we still talking about it?


I don't particularly have a problem with scraping, but I could really do without whoever keeps trying to scrape lots of pages very quickly from hundreds of IPs and random user agents.


I think as scrapers, we should generally respect rate limits. Generally the issue is that it's hard to enforce that against bad users.

The few times I've written scrapers, one of the things I've done is put contact information in the user-agent so that if there is an issue, the site admins can reach out. So far that's worked out ok for me.
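
Concretely, that can be as simple as a descriptive User-Agent plus a self-imposed delay between requests (the contact address and URLs below are placeholders):

    import time
    import requests

    session = requests.Session()
    # Identify the bot and give site admins a way to reach you if it misbehaves.
    session.headers["User-Agent"] = "example-research-scraper/0.1 (+mailto:me@example.com)"

    for url in ["https://example.com/page/1", "https://example.com/page/2"]:
        resp = session.get(url, timeout=10)
        # ... parse resp.text here ...
        time.sleep(2)   # self-imposed rate limit: stay well under the site's threshold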


Web-scraping can do harm to a business if the scraper does not implement things like timeouts or sleeps between processes.

Websites and businesses do not factor in bandwidth for the rogue scraper.


You can easily block a harmful scraper like that from the website owner's side.


You make the target. I’ll make the bot. Let’s see who wins.

In all seriousness, your suggestion is woefully naive in the face of any serious entity.


What if the scraper uses IP rotation or serverless workers?


This only applies to crawlers that are set to recurse into URLs.


Web scraping does for the web what TikTok does for music. It enables creativity.

You can take a meme (i.e., song/web site) and change/combine it to make a new meme.


and so is free (FSF-style) and open source software. but apparently a healthy economy (with a 'free' market) is more important... o̶h̶,̶ ̶a̶l̶s̶o̶ ̶n̶a̶t̶i̶o̶n̶a̶l̶ ̶s̶e̶c̶u̶r̶i̶t̶y̶ ̶a̶n̶d̶ ̶s̶o̶m̶e̶t̶h̶i̶n̶g̶ ̶a̶b̶o̶u̶t̶ ̶t̶h̶e̶ ̶c̶h̶i̶l̶d̶r̶e̶n̶.̶


The corollary would be that robots.txt exclusion for certain web crawlers should be removed.


Bonus, if your site is scrapable, it is easier to preserve.


With Puppeteer and Playwright, essentially every site is scrapeable. Doesn't mean it is easy, but it can be done.


It's scrapeable if you try hard, but it probably won't get archived well.


The internet is not vital to democracy, and perhaps as we are learning more now, antithetical to it. So then it logically follows that web scraping is not vital to democracy.


I don't think the internet is antithetical to democracy. Social media, on the other hand...


Web scraping == 1st Amendment.


Someone tell this to Bloomberg.


[flagged]


"Human right" is one of those sacred terms that are supposed to mean so much but mean nothing. "Human rights" are just superstitious conventions about what's good/bad. "No one can deprive you of housing" oh look I am gonna set up my tent here in this warm shop and claim human rights protection.


It's pretty typical and sad how the US is prosecuting a dirty cop who looks up confidential data in exchange for money: on behalf of the system itself rather than on behalf of the victims' privacy, any limitations on the actions of law enforcement officers in general, or public corruption. In libertarianism/liberalism's final form, the only two kinds of laws left are lèse majesté and violations of terms of service.


Not only is web scraping in danger, the trend towards HTTPS makes it harder to log traffic, and manage the flow of data to/from browsers on your own computer.


Can you please explain how you got to that conclusion?


Obviously makes it harder for others to log your traffic


Not OP, but it's very hard to intercept and monitor the traffic of an app that uses code signing, obfuscation and certificate pinning properly.

What used to be a five-minute "put tcpdump on the access point / router to work" job pre-HTTPS-everywhere is now a many-days-worth job of messing around - out of reach for all but the really dedicated - and as a result it is very hard for a user to have actual visibility over where their data flows.


I'd say this is a consequence of making our phones opaque rather than HTTPS-everywhere. Setting a HTTPS-connect proxy (I use Zap) is a 2 minute process on a desktop that lets you strip SSL easily.

Doing the same on our mobile devices is much tougher, because we've let our phones become walled gardens.


> Setting a HTTPS-connect proxy (I use Zap) is a 2 minute process on a desktop that lets you strip SSL easily.

Not if the app uses certificate pinning, ships its own version of a SSL library and uses code-signing and obfuscation to prevent you messing around with it.

As for the walled garden part: I agree with the general sentiment, but on the other hand I also see the lengths malware authors go to gather data from people. There really is no one-fits-all solution here, because anything that allows the user to intercept and monitor SSL communication can automatically be used by an attacker! :(


> Not if the app uses certificate pinning, ships its own version of a SSL library and uses code-signing and obfuscation to prevent you messing around with it.

Are there desktop apps that behave this way? At least in my experience, I haven't come across anything like this on Linux.


Dropbox does: https://knowledge.broadcom.com/external/article/169397/dropb...

And I seriously hope cert pinning gets adopted by more applications.


Many desktop apps behave this way, yes. Anything from a major tech co should at least


Does that really matter though? Isn't the proper way to evaluate what data is being shared to assume each app exploits its permissions up to the level of trust you have in the company that provided it? E.g., if you worry Facebook is stealing your contact list, then, if their app has that permission, you should assume they are stealing it. No need to bother checking whether they actually are. They might do it when you're not looking anyway.

As for data being sent from your device to 3rd parties: again, if you don't trust the app's developer not to do that, you also won't trust them not to be sharing it from their end, where you have no way to look at the traffic.


There's a bit of circular logic here because I can't evaluate who to trust if I can't effectively monitor what the developer is doing on my machine.




