I had a look at https://github.com/philippta/flyscrape/blob/master/scrape.go. It’s just using the built-in HTTP client to fire off requests, with an identifying user agent. Which means it’s useless for scraping most real-world sites you may want to scrape, unfortunately. You’ll get served a JavaScript challenge, or not even that (many sites will refuse to serve anything if they see a random user agent like flyscrape/1.0).
What are some examples of "most real world sites".
What are some examples of sites that are not "most real world sites".
Is HN a "real world site".
What percentage of sites submitted to HN are "most real world sites". (IME, it's a minority fraction.)
Why not just delete or change the user-agent line in scrape.go before compiling.
(Personal experience: I have been successfully retrieving information from the www for decades without including a user-agent header. The number of sites I have found that require this header is relatively small. It does not rise to the level of "most".)
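For concreteness, both options (changing the value or dropping the header entirely) are one line each with plain net/http. This is a generic Go sketch, not flyscrape's actual scrape.go, and example.com is a placeholder:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        req, err := http.NewRequest("GET", "https://example.com/", nil)
        if err != nil {
            panic(err)
        }
        // Option 1: replace the default "Go-http-client/1.1" with another value.
        req.Header.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0")
        // Option 2: set it to the empty string, which makes net/http omit the
        // User-Agent header entirely (the no-header approach described above).
        req.Header.Set("User-Agent", "")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status, len(body))
    }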
HN replies usually fail to include even a single example.
Similarly, do examples provided for scraper programs and libraries ever include "most real world sites".
“Most real world sites” and “most real world sites you may want to scrape” are different, especially when weighted by information volume, so your selective quoting doesn’t help. Alexa top 100 probably contain more information, especially new information, than the rest of Alexa top 10000 combined (random guess I pulled out of my ass, don’t quote me on that), so the overwhelmingly scraping-resistant Alexa top 100 is what most scraping effort is directed against.
Anyway, I’m not interested in a pedantic debate. Most (but not all) people who have attempted to scrape any popular site, or an unpopular site behind Cloudflare at an above-minimum protection level, in recent years should know exactly what I’m talking about.
Edit: In case you’re not aware, many of the sites I have in mind are/were capable of defeating puppeteer-extra-plugin-stealth. User agent is only the most basic thing, like level 1 in a hundred-level dungeon.
Yeah, nobody cares that most sites will work with it. The number of sites that require it might be small, but that number tends to include significant and notable ones - the ones that most people actually wind up wanting to crawl. Cloudflare & co make this more difficult with each passing year.
OP is making a very well known day-one point about this topic. It is somewhat surprising that the library doesn't offer a way to dynamically set it.
Edit: and this isn't even getting into stuff like TLS fingerprinting, header order, etc.
I've not looked at the source code, but if GP is correct, then the absence of JS rendering means there's little added value for me (a dude who scrapes a lot).
Real-world example: I was looking at scraping unjobs.org for a friend the other day. The need for JS rendering turned the job from 15 minutes of requests and BeautifulSoup into a full-blown session with Selenium, geckodriver, etc.
I'm not saying the linked framework isn't nice, I've not looked at it, but tools for simple scraping are plenty and easy to use.
There's a lot more that a new framework needs to do to distinguish itself. I'd love something that makes JS rendering, proxy rotation and captcha solving easier, in a nice package I can deploy myself.
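For reference, that kind of JS-rendered fetch can also be driven from Go with a headless Chrome wrapper such as chromedp. A rough sketch (not part of flyscrape; unjobs.org is just the example URL from above):

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/chromedp/chromedp"
    )

    func main() {
        // Launch a headless Chrome instance managed by chromedp.
        ctx, cancel := chromedp.NewContext(context.Background())
        defer cancel()
        ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
        defer cancel()

        var html string
        // Navigate and grab the DOM after the page's JavaScript has run.
        err := chromedp.Run(ctx,
            chromedp.Navigate("https://unjobs.org/"),
            chromedp.OuterHTML("html", &html),
        )
        if err != nil {
            panic(err)
        }
        fmt.Println(len(html))
    }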
Thank you for providing an example. As expected, retrieving the jobs listings from unjobs.org is trivial.
Below is a simple demonstration using only common UNIX utilities and minimising the number of TCP connections. No browser. No JavaScript. No Selenium. No Geckodriver. No proxies.
Step 1 requires 40 TCP connections and completes in under a minute. Step 2 requires one TCP connection and completes in 10min. (NB. The connection minimisation used here, what RFCs used to call "web etiquette", is not possible using a popular headless graphical browser.) 1.htm is 1.5M, 2.htm is 40M.
If provided with an example of what the formatted output should look like, I will demonstrate how to do it quickly and easily, without Python. Quite sure the text processing methods I use to extract data from HTML are faster than Python.
But the site remains unnamed. To prove it, why not let others test the theory. Tell us the WordPress site.
IME, WordPress sites do not require a user agent header. Contrast, for example, with Squarespace sites, which do require a user agent header.
IME, if I send a user agent header with a particular value, then some sites will block it, depending on the value. Whereas if I _do not send_ a user agent header, then almost all sites will accept the request.
The so-called "developers" who publish code and commentary about "scraping" almost always include a user agent header. Usually they use fake values. They try to guess the "correct" values to send.
I'm not referring to sending fake values. I'm referring to not sending the header at all. No "spoofing" is involved. This has worked for me, for decades, across thousands of websites.
It appears to accept a "starting URL" and to follow links.
Opening many TCP connections is arguably still a reason why website operators try to prevent crawling (except from Googlebot IPs). As for scraping, it can be done with a single TCP connection. Perhaps "developers" instead opt to use many TCP connections and then complain when they get blocked.
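For what it's worth, HTTP keep-alive gets you most of the way toward the single-connection approach. A sketch of the idea in Go, where one shared client reuses a persistent connection for sequential requests to the same host (the URLs are placeholders, not the actual site):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // One shared client: sequential requests to the same host are sent
        // over a single persistent (keep-alive) TCP connection.
        client := &http.Client{}

        urls := []string{
            "https://example.com/jobs?page=1",
            "https://example.com/jobs?page=2",
            "https://example.com/jobs?page=3",
        }
        for _, u := range urls {
            resp, err := client.Get(u)
            if err != nil {
                panic(err)
            }
            body, _ := io.ReadAll(resp.Body) // drain fully so the connection can be reused
            resp.Body.Close()
            fmt.Println(u, len(body))
        }
    }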
Colly is a great scraping library if you are a Go developer.
Flyscrape on the other hand is a ready-made CLI tool that aims to be easy to use even for someone who is a little familiar with JavaScript. It just happens to be written in Go, but that should not matter to the end user.
It does not have full feature parity with Colly but most use cases should be covered.
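For readers comparing the two, typical Colly usage is a small Go program rather than a config file. A minimal sketch using Colly's public API (the CSS selector for HN titles is illustrative and may not match the current markup):

    package main

    import (
        "fmt"

        "github.com/gocolly/colly/v2"
    )

    func main() {
        c := colly.NewCollector(colly.AllowedDomains("news.ycombinator.com"))

        // Print the text and link of every story title on the page.
        c.OnHTML("span.titleline > a", func(e *colly.HTMLElement) {
            fmt.Println(e.Text, e.Attr("href"))
        })

        if err := c.Visit("https://news.ycombinator.com/"); err != nil {
            panic(err)
        }
    }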
Looks like it doesn't have the option of running as a particular browser, etc. Which I guess makes it fine for a lot of pages, but a lot of scraping tasks would also be affected. Am I right, or did I miss something?
Yes, this is correct. As of right now there is no built-in support for running as a browser.
What is possible though, is to use a service like ScrapingBee (not affiliated) and set it as the proxy. This would render the page on their end, in a browser.
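In plain Go terms that is just a proxied transport. A generic net/http sketch (not flyscrape's actual configuration, and the proxy URL is a placeholder; a rendering proxy returns the JS-rendered HTML instead of the raw response):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
    )

    func main() {
        // Placeholder proxy endpoint.
        proxyURL, err := url.Parse("http://USERNAME:PASSWORD@proxy.example.com:8886")
        if err != nil {
            panic(err)
        }

        // Route all requests through the proxy.
        client := &http.Client{
            Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
        }

        resp, err := client.Get("https://example.com/")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(len(body))
    }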
What happens if 'find()' returns a list and you call '.text()'? Intuition tells me it should fail, but maybe it implicitly gets the text from the first item if it exists.
Either way, I think you should create a separate method 'find_all()' that returns a list, to make the API easier to reason about.
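For comparison, in goquery (which another comment below says this project builds on) a multi-element selection concatenates the text of all matches, and you have to ask for the first element explicitly; whether flyscrape's '.text()' mirrors this I haven't checked. A small sketch:

    package main

    import (
        "fmt"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<ul><li>first</li><li>second</li></ul>`
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            panic(err)
        }

        items := doc.Find("li")           // matches both <li> elements
        fmt.Println(items.Text())         // "firstsecond" - text of all matches, concatenated
        fmt.Println(items.First().Text()) // "first" - explicitly take the first element
    }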
This looks like something I could use. Maybe not revolutionary, but I do that from time to time, and even if only for organizational purposes it seems to make sense to store that stuff as a bunch of configuration files for some external tool, rather than a bunch of Python scripts that I implement somewhat differently every time.
Right now I'm just wrapping my head around how this works and haven't tried it hands-on yet, but I struggle to evaluate from the existing documentation how useful this actually is. All examples in the repository right now are ultimately one-page scrapers, which, honestly, would be quite useless to me. Pretty much every scraper I write has at least 2-3 logical layers. Like, consider your HN example, but you want to include the top 10 comments for each post. Is it even possible? Well, I guess for HN you could get by using allowedURLs and treating the default function as a parser for the comment page, but this isn't generic enough. Consider some internet shop. That would be:

(1) The product category tree, sometimes much easier to hard-code than to scrape every time; hard-coding is often generative (e.g. example.com/X/A-B-C, where X is a string from a list and A, B and C are padded numbers, each with a different range).

(2) You go into each category and retrieve either a sub-category list (possibly JS-rendered, multiple pages) or a product list (same applies).

(3) You open each product URL and do the actual parsing (name, price, specification, etc.). Each JSON object from (3) often has to include some minimal parsed data from level (2) (like the category name).
More advanced, but also way too popular to imagine a generic web scraper without it: in addition to the JSON metadata you download pictures, or PDF files, etc. (Sometimes you don't even need the metadata.) Maybe just text files, but the result is several GBs and isn't suitable to be handled as a single JSON object, but rather as a file/directory tree.
Is any of this possible with this tool?
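(I can't speak for flyscrape, but for reference, the multi-level pattern described above — category tree → product list → product page — typically looks like this in a Go library such as Colly. A rough sketch; the shop URL and selectors are made up:)

    package main

    import (
        "fmt"

        "github.com/gocolly/colly/v2"
    )

    func main() {
        c := colly.NewCollector(colly.AllowedDomains("shop.example.com"))

        // Levels 1-2: follow category, sub-category and pagination links.
        c.OnHTML("a.category, a.next-page", func(e *colly.HTMLElement) {
            e.Request.Visit(e.Attr("href"))
        })

        // Level 2 -> 3: follow links from a product list to product pages.
        c.OnHTML("a.product-link", func(e *colly.HTMLElement) {
            e.Request.Visit(e.Attr("href"))
        })

        // Level 3: parse the product page itself.
        c.OnHTML("div.product", func(e *colly.HTMLElement) {
            fmt.Printf("%s | %s | %s\n",
                e.ChildText("h1.name"),
                e.ChildText("span.price"),
                e.Request.URL.String())
        })

        if err := c.Visit("https://shop.example.com/categories"); err != nil {
            panic(err)
        }
    }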
Also, regardless of whether it's useful for my cases, some minor comments:
1. Links in docs/readme.md#configuration don't work (but the .md files for them actually exist).
2. I would suggest making "url" in the configuration either a list, or string|list. I suppose that pretty much doesn't change the logic, but it would make a lot of basic use cases much easier to implement.
These days, I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, though not strictly necessary.)
It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously: https://github.com/geziyor/geziyor.
My favorite stack as of 2023: TypeScript+Playwright+Crawlee(Optional)
If you're serious about scraping, you should learn JavaScript, so Playwright should be a good fit.
Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably only <5%.
How does that help you mitigate when a site changes? If you’re fetching some value in a given <div> under a long XPath and they decide to change that path?
I started putting data-testid attributes in my web app for automated testing using playwright. Prevents me from breaking my own script but it sure would make me more scrapable if anyone cared. Well.. I guess I only do it on inputs, not the rendered page which is what scrapers care most about.
Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)
Oh I know I couldn't prevent it. But if you wanted to scrape me, you'd have to pay the monthly subscription because everything is behind a pay wall/login. And then you'd only have access to data you entered because it's just that kind of app :-)
Don't know about the poster, but I try to find divs and buttons in a fuzzy way. Usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game. Especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with it by collecting dollar amounts or X x Y x Z from the raw text.
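Same here. For reference, the fuzzy "find by element text, then fall back to scanning raw text" approach can look roughly like this with goquery (a sketch; the HTML and patterns are made up):

    package main

    import (
        "fmt"
        "regexp"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<div><span class="x1">Price: $19.99</span>
                 <p>Size: 10 x 20 x 5 cm</p></div>`
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            panic(err)
        }

        // Find elements by their text instead of by a brittle class/XPath.
        doc.Find("span, p").Each(func(_ int, s *goquery.Selection) {
            if strings.Contains(s.Text(), "$") {
                fmt.Println("price-ish:", strings.TrimSpace(s.Text()))
            }
        })

        // Or fall back to scanning the raw text for dollar amounts / dimensions.
        text := doc.Text()
        fmt.Println(regexp.MustCompile(`\$\d+(\.\d{2})?`).FindString(text))
        fmt.Println(regexp.MustCompile(`\d+\s*x\s*\d+\s*x\s*\d+`).FindString(text))
    }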
Crul looks nice, though you can't imagine how many startups I've seen fail doing something very similar to Crul. Wouldn't rely on it.
The problem is complex: Humans generating messy pages
Thank you for the positive acknowledgment and insightful observation. As one of the creators of Crul, I fully understand the challenges inherent in this intricate business and software domain. Our initial emphasis on the browser abstraction layer, predating APIs such as SOAP, REST, GraphQL, etc., serves as a data driver and stateless cluster for interpreting DOM nodes. While we initially lacked programmatic extensibility for custom browser control, as you rightly pointed out, addressing complex edge cases often requires such a feature. Looking ahead, we are exploring the possibility of opening up the core, starting with "Krull," the browser cluster. We welcome feedback to gauge interest in this development.
I like web scraping in Go. The support for parsing HTML in x/net/html is pretty good, and libraries like github.com/PuerkitoBio/goquery go a long way toward matching the ergonomics of other tools. This project uses both, but then also goes on to use github.com/dop251/goja, which is a JavaScript VM, and its accompanying Node.js compatibility layer, and even esbuild, in order to interpret scraping instruction scripts.
I mean, at this point I am not sure Go is the right tool for the job (I am actually pretty confident that it is not).
A pretty neat stack of engineering, sure! This is cool, nicely done. But I can't help but feel disturbed.
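For anyone curious what that goja layer amounts to, embedding a JS VM to drive Go code is roughly this pattern (a toy sketch, not flyscrape's actual code):

    package main

    import (
        "fmt"

        "github.com/dop251/goja"
    )

    func main() {
        vm := goja.New()

        // Expose a Go function to the script, the way a scraper host might
        // expose fetch/parse helpers to user-provided config scripts.
        vm.Set("shout", func(s string) string { return s + "!" })

        v, err := vm.RunString(`shout("scrape " + (1 + 1) + " pages")`)
        if err != nil {
            panic(err)
        }
        fmt.Println(v.String()) // scrape 2 pages!
    }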
Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs that link to the project repos for the packages mentioned:
(Please do not reply to this comment of mine—if you do, I won't be able to delete it once the previous post is fixed, because the existence of the replies will prevent that.)
Looks interesting, and thank you for sharing this! One common issue with scraping web pages is dealing with data that is dynamically loaded. Is there a solution for this? For example, when using Scrapy, you can have Splash running in Docker via scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).
Thanks! As mentioned in another comment, there is currently no built-in support for this.
As a workaround, one could use a service like ScrapingBee (not affiliated) as a proxy that renders the page in a browser for you.
Of course, relying on a service for this is not always ideal. I am also working on a small wrapper that turns Chrome into an HTTPS proxy, which you could plug right into flyscrape. Unfortunately it is still very experimental and not public yet. I have not yet decided whether to release it as part of flyscrape or as a separate project.
Not only can you, but in my experience it is substantially less drama and arguably less load on the target system, since the full page may make many, many other requests that a presentation layer would care about and that I don't.
The trade-offs usually fall into:
- authing to the endpoint can sometimes be weird
- it for sure makes the traffic stand out since it isn't otherwise surrounded by those extraneous requests
- it, as with all good things scraping, carries its own maintenance and monitoring burden
However, similar to those trade-offs, it's also been my experience that a full page load offers a ton more tracking opportunities that are not present in a direct endpoint fetch. I mean, look how many "stealth" plugins are out there designed to mask the fact that a headless browser is headless.
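Concretely, hitting the data endpoint directly tends to look like a single typed fetch rather than a page load plus dozens of asset/analytics requests. A generic Go sketch; the endpoint URL and fields are placeholders for whatever the real API returns:

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Placeholder shape for whatever the endpoint actually returns.
    type Item struct {
        Name  string  `json:"name"`
        Price float64 `json:"price"`
    }

    func main() {
        // One request straight to the JSON endpoint the page's frontend calls.
        req, err := http.NewRequest("GET", "https://example.com/api/items?page=1", nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Accept", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var items []Item
        if err := json.NewDecoder(resp.Body).Decode(&items); err != nil {
            panic(err)
        }
        fmt.Println(len(items), "items")
    }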
But, having said all of that: without question the biggest risk to modern-day scraping is Cloudflare and Akamai gatekeeping. I do appreciate the arguments of "but ddos!11", and yet I would rather only actors that are actually exhibiting bad behavior[1] be blocked, instead of everyone with a copy of Python who has set reasonable rate limits.
[1] This is setting aside that "bad behavior" can be defined as "downloading data that the site makes freely available to Chrome but not freely available to Python".