
  /* XPM */
  static char * treachery_xpm[] = {
  "10 10 1 1",
  "       c #0000FF",
  "          ",
  "          ",
  "          ",
  "          ",
  "          ",
  "          ",
  "          ",
  "          ",
  "          ",
  "          "};
X Pixmap format. Save as treachery.xpm. (Made using Gimp, because it was faster than writing it by hand. Sorry. But I have generated large pixel files programmatically in the X Bitmap format, which is similar but carries only one bit of color information. Easier than writing a GIF.)
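
For what it's worth, a rough Python sketch of that kind of programmatic XBM generation (the 10x10 all-set image and the "treachery" name are placeholders, not taken from anything above):

  # Rough sketch: write an X Bitmap (XBM) file programmatically.
  # The 10x10 all-set image and the "treachery" name are placeholders.
  width, height = 10, 10
  pixels = [[1] * width for _ in range(height)]  # 1 = set pixel

  row_bytes = (width + 7) // 8
  data = []
  for row in pixels:
      for b in range(row_bytes):
          byte = 0
          for bit in range(8):
              x = b * 8 + bit
              if x < width and row[x]:
                  byte |= 1 << bit  # XBM packs pixels LSB-first within each byte
          data.append(byte)

  with open("treachery.xbm", "w") as f:
      f.write("#define treachery_width %d\n" % width)
      f.write("#define treachery_height %d\n" % height)
      f.write("static unsigned char treachery_bits[] = {\n")
      f.write(", ".join("0x%02x" % b for b in data))
      f.write("\n};\n")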


I admit being a bit disappointed that a well-known disadvantage of web scraping was not mentioned: Web scraping is fragile!

Web sites change, web frameworks evolve, and just a subtle reordering of some <div>s or renaming of CSS classes means your perfect scraping code from yesterday will break tomorrow -- maybe not leaving you empty-handed, but probably missing some data or delivering the wrong data.

If there is an API you can use, use it. If your budget allows paying for API access, buy it. APIs tend to be more stable than scraping, and the data provider will probably inform you if it changes. Contacting them might even get you more interesting data, since not every column in their database is necessarily published on the web site.


I've found the opposite to be true -- when an entity is maintaining an API and their website with the same data, the website is their core business. The API is prone to being incomplete, buggy, subject to sudden deprecation, unreasonably rate limited (crippling access to some objects below what a casual human user has), and so on.

Conversely, overall document structure doesn't change much over time. I know it _can_; there's a social contract that APIs should change slowly while documents can change whenever, but that isn't what I observe in the wild. Even on fairly major redesigns, the overall structure has minimal edits.

A technique I've used before (wasted effort in hindsight, since web pages are stable and I never have to update my scrapers) is to come up with several semantically different ways of accessing a piece of data on a page. It serves two purposes: you can recover from small page changes by having the different methods vote, and you can detect most kinds of page changes by noticing discrepancies and notifying yourself that the scraper needs to be updated soon.
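
A rough sketch of that voting approach (the selectors and the "price" field are invented for illustration, not taken from any particular site):

  # Sketch of the "multiple locators + vote" idea. The selectors below are
  # invented examples; swap in whatever is meaningful for the target page.
  from collections import Counter
  from bs4 import BeautifulSoup

  def extract_price(html):
      soup = BeautifulSoup(html, "html.parser")
      candidates = []

      # Method 1: a CSS class we expect to exist.
      node = soup.select_one(".product-price")
      if node:
          candidates.append(node.get_text(strip=True))

      # Method 2: structural position relative to a stable landmark.
      heading = soup.find("h1")
      if heading and heading.find_next("span"):
          candidates.append(heading.find_next("span").get_text(strip=True))

      # Method 3: a microdata attribute, if present.
      node = soup.find(attrs={"itemprop": "price"})
      if node:
          candidates.append(node.get_text(strip=True) or node.get("content", ""))

      if not candidates:
          raise RuntimeError("all locators failed -- page layout likely changed")
      votes = Counter(candidates)
      value, count = votes.most_common(1)[0]
      if count < len(candidates):
          print("warning: locators disagree, scraper may need an update:", votes)
      return value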


> the website is their core business

Granted, but there are lots and lots of ways they can break scrapers in the pursuit of their core business, such as a website redesign. For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in, and this is quite a lot more complicated than walking the static HTML.


It's not too complicated; you just need a headless browser. Having done a ton of web scraping projects, I'd recommend just starting with this approach, as even sites that look pretty static use JavaScript in subtle ways.
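
For example, a minimal sketch with Playwright (Selenium or Puppeteer would work just as well; the URL is a placeholder):

  # Minimal headless-browser fetch with Playwright.
  # Setup: pip install playwright && playwright install chromium
  from playwright.sync_api import sync_playwright

  def fetch_rendered_html(url):
      with sync_playwright() as p:
          browser = p.chromium.launch(headless=True)
          page = browser.new_page()
          page.goto(url, wait_until="networkidle")  # wait for JS-driven content
          html = page.content()
          browser.close()
      return html

  html = fetch_rendered_html("https://example.com/")  # placeholder URL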


When it's an SPA, the data is usually embedded in JSON or available from an internal API. Headless browsers are pretty heavy on resources; when doing large-scale scraping, a headless browser should be a last resort.
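
As a sketch, pulling the embedded state out of an SPA page often looks something like this (the "__NEXT_DATA__" script id is just a common convention, not something every site uses, and the URL is a placeholder):

  # Sketch: read the JSON an SPA embeds in its HTML instead of rendering it.
  import json
  import re
  import requests

  html = requests.get("https://example.com/some-page", timeout=30).text

  # Many SPA frameworks ship their initial state in a script tag like this one.
  match = re.search(
      r'<script[^>]+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
      html,
      re.DOTALL,
  )
  if match:
      data = json.loads(match.group(1))
      # `data` is now a plain dict, often richer than what the page renders.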


Using a headless browser for scraping is a lot slower and resource intensive than parsing HTML.


I don't find this to be a concern - in all the scraping I've done, the only bottleneck was the intentional throttling/rate limiting, not the speed and resources spent by the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.


Sure, but it might be the only way to get the data.


It might be, but _starting_ a scraping project with a headless browser might be excessively expensive if you don't need the additional features.


"only" is a bit of an overstatement. The data is always coming from somewhere, it just depends on how much effort needed to reverse engineer the JavaScript code path to the data


> For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in

Or, as is often the case, the content is already there or fetched via an API in far more easily-consumed JSON format that you can use directly.


That’s my point.

Granted, lots of APIs make it prohibitively difficult to authenticate, such that it's easier to simply scrape. Such is the case with just about every Microsoft product I've ever used, most recently the Xbox Live API. I genuinely wonder what kind of nonsense goes on in Microsoft design review meetings.


> moving from static HTML to a web framework

Looking at this sentence, I have the impression that it is nowadays taken for granted that "web framework" means "front end web framework". I come from a time in which it was perfectly fine to generate static HTML via a (server-side) web framework.


That's correct, I was referring to front-end web frameworks.


> this is quite a lot more complicated than walking the static HTML

Certainly more resource-intensive.


I've found that the breaking point is for websites that consume their own public APIs. On those, the API is usually very well maintained, documented, and stable.

Those that don't use their own APIs almost always end up with an open API in the state you describe (except maybe the very big players like FB, where the open API is overall good).


That is actually a good criterion for code quality in general. Don't prepare a Java method that isn't used, because it will _never_ be right. Just implement today's story, and leave the rest for another day. The same goes for rarely used functions: they are usually very buggy, whereas functions in the middle of everyone's workflow are flawless. Hence the work of a good product owner is to streamline everyone as much as possible onto a few central functions. But of course, in an enterprise environment, there are a few functions that are required to work (XML export, backup and restore...).


It's the why that holds that interests me.

The best explanation I've come up with is that as a naive developer, it's impossible to know the nuances of any sufficiently complex process or workflow.


I think it really depends on the type of product that the business makes money on. If one of the main products is data then I'd wager their API will have significantly more information compared to their website. If they make money via the website then yes, they're less inclined to spend resources on the API.

All of our own websites are built on top of the same public API that everybody else uses, and scraping used to be a nuisance. It was also confusing, because scrapers would be able to get more data with the same free account just by using the API instead of scraping. Exactly as the OP mentioned, we only show a small number of properties via the website, but most scrapers never took the time to actually compare the API against the website.


I think this depends entirely on what you are indexing. From my experience with some 100-ish scrapers for news sites, a few would break literally every day.

And the only thing we really wanted was article title and date.


My guess is that it depends if the API is seen as customer interface or implementation detail.

People are usually hesitant to constantly change how a customer interacts, but all too willing to change internal details.


About 15 years ago, I worked on an add-on for a large online service, using their official, versioned, documented API.

We had to build a test suite just to verify that the API was working as expected, because they would break it so often, and the documentation didn't always match reality.

This was a paid API we used on a pay-per-use basis, IIRC, and had official support for.

In the beginning, we had a false sense of security about the version numbers and such. The first couple of breaks seemed to be "just this one time". Then we realized that it was happening all the time, hence the test suite. (I was a junior, so I can't take credit for this work; I was just a witness.)

An API is often no more reliable than scraping the human UI, with the added disadvantage of being of second-level importance.

Personally, I've tried to combine human UI with API as much as possible. For example, I added a feature for being able to post via direct URL entry, like so: http://example.com/hello+world

Most browsers will convert the spaces for you too, so you can just type your text into the address bar.


Approaching the web as an end user, I have also found this to be true. Most websites rarely change their document structure in such a way that breaks simple text-editing scripts. Keyword: "Most". In most cases no specialised tools or libraries are needed for extracting text or other resources. Again, keyword: "most". Personally, just because there may be a few exceptions does not mean I am going to change a strategy that works almost 100% of the time.

Understanding "web APIs" (which did not exist when I first started using the www in 1993) as anything other than a way to try to control and/or monetise scraping continues to escape me. I do like the increased usage of "endpoints" though, serving only data with no markup. Although XML and JSON are too bloated compared to something sensible like netstrings.

Similarly, on the client side, I fail to understand all the parsing tools and libraries and related promotion; it is just as easy to break any solution that depends on them and in many cases they are obviously overkill, more brittle than simple scripts using generalised text-editing tools.

One example is "jq". In many cases it is clearly overkill and is slower than sed.

https://stackoverflow.com/q/59806699

As a data source, the web is messy. "Standards" cannot be relied on 100%. Some people try to pretend the web is clean and can be tamed, or they "give up" because it is not "perfect" and things can break. Getting your hands dirty works best, and most things do not break if kept simple, IME.


I find that's the difference between public API endpoints and those APIs written for the page/app itself... if it's tightly coupled and the specs aren't published to the public for general use, I treat it as breakable.

That's just my own take. I've worked in environments with stronger versioning, and depending on your data needs and structures it can work. It's usually not worth it for most use cases though.


I've long wanted a really robust way of defining page areas for scraping, that could handle even relatively major HTML shifts.

My best idea has been to simply maintain a collection of "reference" URLs (e.g. of different products or articles) and identify unique start/end text for those specific instances.

Then automatically extract as many possible different "rules" for locating the desired content (pure structure and ordering, class hierarchies, classes/ids, surrounding text, etc.) and find the ones that are consistent across different instances.

And then just use those rules until they break on the reference page... and when they break, develop new ones.

I'm curious if anyone's built this type of thing?


I've seen a few academic papers and a few closed products that convert your selection of content you care about into a scraper capable of acquiring that content in the future. Last I checked there wasn't anything readily available as a FOSS library for doing so.

I'm having trouble finding those papers at the moment, but here are a couple commercial products that sound similar in spirit to what you're describing.

https://scraper.ai/

https://www.diffbot.com/ (kind of)

Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice. https://github.com/alirezamika/autoscraper/


I've tried autoscraper now, and I don't like it (not yet).

(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.

(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).

(3) When things go wrong it strongly prefers returning no information and not throwing any errors. None of the examples in the README actually run (or rather, they give you a `None` response that's all but useless) without changes.


Despite being undocumented, it really works well. I tried the readme examples and they all work. Maybe you didn't update the wanted list in the examples after it changed on the page. IMO the biggest problem is the lack of support for JS-enabled content.


> it really works well

It works well enough. I tried a few other sites and had mixed results even when providing the raw HTML so that I knew its HTTP logic wasn't the issue.

> maybe you didn't update the wanted list in the examples

Yeah, that was my only real problem with the readme examples. Those could just as easily be provided as local data (e.g. how `sklearn.datasets` works) so that the end user starts with working code, especially since there are no errors/warnings/etc when anything goes wrong.

> IMO the biggest problem is the lack of js enabled content support.

Haha, unless I'm seriously misunderstanding you, this is one of the only things I don't mind :) Since you can pass raw HTML to the library, you can use your favorite headless browser to navigate (or, in the happy case, just load a non-interactive JS-enabled site) to your content and pass it through to this library to do the data extraction. I rather like those features being decoupled and kind of wish this library didn't attempt to do any of the crawling itself. I know that's just a personal preference, but it's my account, so I'll say what I like about it.


Didn't see that feature, makes sense now :)


This project was recently submitted to r/python: https://github.com/alirezamika/autoscraper/


I've had good luck just running this sort of thing as an offline process, especially for external dependencies. We used to get 'pink' for this lookup and now it's "<span>pink </span>" or somesuch.

It depends on the SLA, of course, but it's cheaper to check every few hours than on every request, and you get a couple of alerts instead of a constant stream of them.


Interesting approach using multiple locators, might use it in the future. Although in my limited experience I agree that the avg site which usually needs scraping keeps its "interface" mostly unchanged.

I guess a lot of the stuff I've needed to scrape is from old CMSes, or sites viewed as part of a cost center that they're unlikely to invest in and will just keep maintaining as-is.


I've actually had the opposite experience.

After 'scraping' some forums via their APIs for weeks, I ended up realising that the data and metadata given to me by the API was so restricted (e.g. providing 'recent comments' instead of all comments) that a pure vanilla web scraping approach became the preferred option.

I agree with all the points you mention about the shortcomings, though, and your argument is sound. This is just my experience in the other direction; APIs come with an element of trust.


You're both right: APIs are extortion and scraping is fragile. I once amused myself by creating random invisible divs when generating a server-side HTML page. It made scraping the resulting files impossible, and made no difference to the look of the page.


Eh, you might be surprised how effective it is to just use regular expressions instead of something that parses the DOM. Usually there is something to key off of, and while regular expressions aren't good for parsing HTML, they still work just as well as they always have for matching text patterns, which is often what scraping ends up being.
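
A toy sketch of what that tends to look like (the HTML and the pattern are invented):

  # Regex-based scraping: no DOM parser, just a pattern keyed to something
  # stable in the markup. The HTML snippet and pattern are invented examples.
  import re

  html = '<li class="item"><span class="title">Widget</span> $19.99</li>'

  pattern = re.compile(
      r'class="title">(?P<title>[^<]+)</span>\s*\$(?P<price>[\d.]+)'
  )
  for m in pattern.finditer(html):
      print(m.group("title"), m.group("price"))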


Somewhere, an evil pony is twitching his ears.



A better approach is dynamically generated CSS, absolute positioning, and randomized HTML output. In my SEO days, I wrote a tool to do exactly this, for a different reason.

We had a network of sites that all looked exactly the same to the user, but to Google each site had a completely different structure. That kept the network safe for years, until a Google employee (or so we assumed - @google.com email) signed up for the service without us knowing and discovered the entire network by placing a large order, which gave him links across all of it. Within one week of them signing up, our entire network of 10k domains was dead, and everything it linked to was delisted from Google. We had to shut down the network and refund all unused credits to our customers.


Is rule 1 of trying to game PageRank not to avoid Google email addresses?


I didn't say I was smart. Also, if you block @google emails, they will just sign up with another email address, so why even try to block them? And once they pay for a service, you can't not deliver, because that would constitute fraud.

We tried to just fly under their radar, and avoid any automated trigger that would arouse their suspicion.


And your conscience still allowed you to sleep at night?

All this seems like a nice illustration of how the web ecosystem encourages parasitic behavior on so many fronts. It's sad.


What do you mean? We sold links on websites filled with unique content we paid for...

We were masking those domains from Google because Google penalizes selling backlinks, to justify paying for their ads. My conscience is quite clear. When Google delisted our network, we refunded our customers and moved on to a smaller, invite-only network that ran well for years. I left that company 6 or 7 years ago, but I'm sure they are still making some money off hosting and managing private blog networks.


> And your conscience still allowed you to sleep at night?

The average American endorses slavery in their clothes, Christmas decorations and electronics. By comparison creating some bad links in a search engine is so low on my list of moral failings it doesn't even register.


"Endorses" is too strong a word here. I get your intent, but as an American I don't endorse anyone having slavery in their supply line.


I agree, perhaps "facilitates" might be a better fit


> And once they pay for a service, you cant not deliver, because that would constitute fraud.

Underperform, and then offer them a goodwill refund if they ask for it?


SEO is already too snake-oily to do something like that. We had principles; we did our best for our customers and our clients. If we did fail despite trying our best, at least we could sleep knowing we weren't actively trying to screw over our paying customers.


That's so awesome lol


I'm surprised I don't see this discussed more in the context of web scraping, but XPath is not only much more powerful, but can also be made robust against such techniques.

Sure, if you change the page structure enough you could defeat it, but it would require more than just adding a few divs. XPath easily lets you mix and match criteria: not just CSS classes, but also the page's structure itself, inner text, attributes, and so on. As a result, you can get some really powerful queries without any kind of complex post-processing of the results.
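
A small illustration of that kind of mixed query, using lxml (the markup, class names, and attributes are invented for the example):

  # One XPath query mixing structure, text, and attributes.
  # The markup, class names, and attributes here are invented.
  from lxml import html

  doc = html.fromstring("""
      <div class="listing listing--wide">
        <h2>Blue widget</h2>
        <span data-role="price">$19.99</span>
      </div>
  """)

  # "A price span inside any div whose class contains 'listing' and which has
  # an h2 mentioning 'widget'" -- no exact class names or fixed nesting depth.
  prices = doc.xpath(
      '//div[contains(@class, "listing")]'
      '[.//h2[contains(text(), "widget")]]'
      '//span[@data-role="price"]/text()'
  )
  print(prices)  # ['$19.99']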


XPath is one of those dinosaur technologies that I didn't learn until some time ago, and man, was it a great way to find the right element and pass it to the tool doing the traversing and other things - being able to find a div that contains a string in a forest of divs was so damn nice.


I doubt it was impossible. For example, the first step a scraper might perform is to delete all invisible divs.


Random invisible divs aren't likely to defeat a moderately motivated scraper, though. Depending on what they're looking for, it could be as simple as getting the inner text of a sufficiently high-up element and matching a regex.

More complex scrape-defeating measures I've seen are blobs of JS that need evaluating in order to generate URL parameters (all that needs doing is extracting the JS and running it in a JS engine, if you don't want to drive a headless browser - with care, of course!) or captchas that need defeating (just buy some deathbycaptcha API calls).


I've seen this approach backfire a bit too. Rather than having to scrape web content, my work is reduced to pulling out my favorite sandboxed JS interpreter bindings, running the snippet, and extracting the rich object they just created with exactly the data I wanted. You only need a headless browser if there's a meaningful interplay between the JS and the rest of the site.


My favorite is when they provide JSON structures of the data in the included page JavaScript. That's easy mode scraping. :)


Haha, that'd be even better for sure.


I add a delay on the server side for IPs that look like scrapers and serve heavy JavaScript to burn through their resources. So far, it seems to work well in some cases.


For the purposes of their research and study, web scraping is stable enough, given that many websites (especially government sites), aren’t overhauled frequently. And many government sites, like those run by coroners, aren’t likely to have APIs.

Contacting the agency for the data is always a good step, but even if they are responsive, they may not be willing to email you on a daily basis with data updates.


That is a good argument, and I should have mentioned it, yes. For a one-off job, web scraping will probably be the best choice, and maybe even the fastest to implement. I have done my own share of web scraping for personal projects (and thus know about the fragility), but I didn't care much about broken results in the long run.

But in the article they mentioned re-running their program to update their data, so it could be a long-term effort. And anyone planning a long-term project reading that article and taking their advice should at least be warned about this possible problem.


We used to despise web scraping and tried to dump data from an industrial system by listening to raw Modbus RS-485 traffic. It turned out that there was no way to get the register maps from either the vendor or the firmware.

We ended up writing a scraper for the main unit. Once they have been installed and configured, they're likely not going to be updated unless absolutely necessary (such as when adding new, unsupported controllers to the bus).


Going way back in time, well before www and soap, I recall having to develop an application that would talk to another system using serial comms, effectively pretending to be a VT52-type terminal. Had to send commands, verify responses, send subsequent commands based on available options, etc. That was used to integrate between two different types of Telco exchange equipment, for provisioning of X25 circuits on the national network in the UK. Supplier of the kit did not have any form of API, just a terminal-level application, so client-side code had to pretend to be an operator. Ahh, those old days, using green-screen terminals, eh


Been there, done that... even to hack old Ciscos into a bodged-up production/testing environment (take the config from production, change two addresses from production to testing values, and apply the whole config to a testing Cisco).

The good thing with Ciscos and most of the old technology is that once you write that script, it works for years... commands never change, outputs never change; some Perl, a regex or five, and you're done.

Doing the same with a webpage, where it's "pride month" today, "women's day" tomorrow, "day against AIDS" the day after, and each means a new div, a new popup, a new redirect, a backend update in between to make the new banner possible, etc., is a pain in the ass.


There are multiple ways of doing web scraping, and some are definitely more fragile than others.

I've found fully specified XPaths to be a mistake, for example: it only takes one tiny change on the page to mess up the script. On the other hand, despite numerous warnings that it would be a disaster, I've had a lot of luck maintaining regexes, even after major page reworks.


Sure, but A) as implemented in code, the path doesn't necessarily have to be fully explicit in terms of tags. You could look for a child whose text contains something, rather than a specific class or id, for example (or get fancier with semantic similarity on tags or text). B) There's a trade-off between speed and fragility that could make a difference if your tree is deep enough that traversing it iteratively is slow compared to a long XPath and becomes a limiting factor. Granted, you don't typically encounter this in a standard-issue HTML tree, but the lxml docs, for example, correctly note that XPaths can be way faster when the nesting is super deep.


Brittle, but in a way more stable, in the sense that the provider can't cut off access suddenly without also changing their website. Also, the breaks that do happen tend to be trivial to fix. The entire project just needs to be managed differently; I'd look at another frequently updated project (youtube-dl comes to mind) where breaks are _expected_.


Yep, we've had exactly this issue with the web portal on one of our applications. A customer used some 'quick and dirty' convenience tool that used web scraping to invoke the portal application pages, and wrapped that up with part of their admin workflow. When the pages changed, even slightly, it broke their admin tool. We had said not to do this, but the workflow/web-scraping tool was surely quicker and cheaper to implement than custom API development within that organisation.


Good point. Yet a good remedy is to add real-time asserts on your scraped data and get yourself notified as soon as you encounter unexpected data formats.
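
A rough sketch of what those asserts might look like (the field names and the notify() stub are placeholders for whatever data and alerting channel you actually use):

  # Sanity-check scraped records before trusting them; notify() is a stub for
  # whatever alerting you prefer (email, Slack, pager...). Fields are placeholders.
  import re

  def notify(message):
      print("ALERT:", message)

  def validate(record):
      problems = []
      if not record.get("title"):
          problems.append("missing title")
      if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", record.get("date", "")):
          problems.append("unexpected date format: %r" % record.get("date"))
      if problems:
          notify("scraper output looks wrong: " + "; ".join(problems))
      return not problems

  validate({"title": "Some article", "date": "2020-09-07"})  # passes quietly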


Keep in mind this is a research publication. In general research systems aren't meant to be maintained long term. You do the study, or make the proof of concept, and if it has long term value then you find a long term solution.

I know they mention maintenance, but having been in both fields, what a web person means by maintenance and what a research person means by maintenance are an order of magnitude apart.


Still, an excerpt from paragraph 7: "Once your program is written, you can recapture these data whenever you need to, assuming the structure of the website stays mostly the same."


Sounds like job security for us web scrapers... xD


The easy solution is to use unit testing on known URLs to ensure that all scraping cases are handled correctly.
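
Something along those lines, for instance (`parse_article`, the fixture path, and the URL are hypothetical placeholders):

  # Regression tests for a scraper: a saved HTML fixture catches parser bugs,
  # an optional live check catches site changes. parse_article, the fixture
  # path, and the URL are hypothetical placeholders.
  import requests
  from myscraper import parse_article  # hypothetical parsing function

  def test_parse_known_fixture():
      with open("tests/fixtures/known_article.html") as f:
          result = parse_article(f.read())
      assert result["title"] == "Known headline"
      assert result["date"] == "2020-09-07"

  def test_live_page_still_parses():
      html = requests.get("https://example.com/known-article", timeout=30).text
      result = parse_article(html)
      assert result["title"]  # layout may change; these fields must not vanish
      assert result["date"]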


Also consider rate limits. Some websites may limit by IP, by user agent, or by burst.


"Web scraping is fragile!"

Can you give any specific examples (sites and the data needed from them)?


One example, unfortunately not very scientific: Years ago, I had a bunch of Web comics I visited daily. To make this more efficient, I had a script on one of my servers which scraped the corresponding sites in the morning and produced the daily collection on a single page.

Every few weeks or months, one of the comics would just not appear or not be updated (i.e. there was the same strip day after day). I had to update the code each time, checking the source code and finding the new page and location of the image. Later, some pages started using JavaScript to load the image file, and I lost interest in making my script more sophisticated. So, no single-page comic collection for me anymore. :)


Have you tried Dosage? https://dosage.rocks/

Set up a cron job to run `dosage @` every day, it will check for new comics and download them.


FWIW, I’ve generally had good luck with RSS feeds for comics.


Can you name the sites?


'Brittle' for sure, if you require periodic updates.


The article seems to recommend using an api if possible.


I absolutely agree with you. I have more trust in my local European ISP than in Google, Cloudflare and such.


Nobody is asking you to trust Google more than you already do (if you don't run Chrome, this doesn't impact you at all). But the real question is: do you trust your European ISP more than anybody else that might ever run DNS for you? That seems like an extraordinary amount of trust. What DoH allows you to do is trust someone, anyone else to safely run DNS for you, without exposing anything to your ISP. That's a capability you think Europeans shouldn't want?


Same for me.

From Kromtech's article I deduced that this only happens when a Docker daemon (or Kubernetes interface) is exposed to the Internet and an attacker uses that to download and start a Docker image on the victim's host. Then they can bind-mount a host directory as described and attack the host computer.


