Hacker News

This was my approach too and it's been working great. Nowadays data isn't rendered directly into HTML anymore; it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then interface with the endpoint directly. I've essentially written custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt, but that's fine.
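The curl-piped-into-jq pattern is roughly the sketch below. The endpoint and field names are made up for illustration; in practice you'd substitute whatever URL the browser's network tab shows. The response is simulated here so the snippet runs without the real site:

```shell
# Hypothetical JSON response, as a stand-in for:
#   curl -s 'https://example.com/api/items?page=1'
response='{"items":[{"id":1,"title":"first"},{"id":2,"title":"second"}]}'

# Pipe into jq to pull out just the fields you care about,
# one tab-separated row per item.
printf '%s' "$response" | jq -r '.items[] | "\(.id)\t\(.title)"'
```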

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.




> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a JavaScript challenge that needs to be executed before the request is allowed through. Normal browsers follow through transparently; from a script it can be hard to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/
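One practical consequence for the curl-and-jq approach above: when the scraper gets challenged, the endpoint serves an HTML challenge page instead of JSON. You can at least detect that case cheaply, since the challenge page won't parse as JSON (the URL below is hypothetical):

```shell
# Hypothetical endpoint; the point is the check, not the URL.
body=$(curl -s 'https://example.com/api/items')

# A challenge page is HTML, not JSON, so jq fails to parse it.
# `jq -e .` exits non-zero on invalid JSON input.
if printf '%s' "$body" | jq -e . >/dev/null 2>&1; then
  echo "got JSON"
else
  echo "blocked or non-JSON response (possibly a bot challenge)"
fi
```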


I have a codebase that defeats Cloudflare protection. Felt like I had the keys to the kingdom.


So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.



