
Impressive feat! Does Reddit have rate limiting or other hurdles in place, similar to the hoops youtube-dl has to jump through? Curious what your thoughts are about maintaining a project like that.



As history has shown, you can only do so much to stop this. If you perfectly mimic Googlebot and use Google IP ranges by hosting on Google Cloud, the site either takes an SEO hit or lets you bot it at the end of the day. Googlebot itself looks like a DDoS attack a lot of the time, too.
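
Roughly, the user-agent half of that mimicry is just a header (the IP half means actually running on Google Cloud). A minimal sketch; the UA string is the one Google documents for Googlebot, and the target URL is a placeholder:

    import requests

    # Present as Googlebot via the documented UA string.
    # The target URL here is a placeholder.
    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")

    resp = requests.get("https://example.com/some/page",
                        headers={"User-Agent": GOOGLEBOT_UA},
                        timeout=10)
    print(resp.status_code)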

You can also go the route of looking like a pool of users; then it's just a game of cat and mouse, and one that providers don't really have time to play.
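
The "pool of users" approach is basically rotating proxies and user agents per request. A sketch, assuming you have a proxy pool to hand (the proxy addresses below are placeholders):

    import itertools
    import requests

    # Rotate each request through a different proxy and user agent.
    # Proxy addresses are placeholders for a real pool.
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    ]

    proxy_cycle = itertools.cycle(PROXIES)
    ua_cycle = itertools.cycle(USER_AGENTS)

    def fetch(url):
        proxy = next(proxy_cycle)
        return requests.get(url,
                            headers={"User-Agent": next(ua_cycle)},
                            proxies={"http": proxy, "https": proxy},
                            timeout=10)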


> As history has shown, you can only do so much to stop this.

History has shown you can stop this well enough. Try scraping e.g. Instagram; Bibliogram attempted it, and the project is now discontinued.


This is true for sites that don't care about SEO. Reddit cares very, very much about SEO, so it can never truly block bots.


The Google scraper IPs are very different, no?

> If you perfectly mimic the GoogleBot and use google IP ranges by hosting on google cloud


You are right, apparently they do publish the ranges: https://developers.google.com/search/docs/crawling-indexing/...
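
Those docs point at a machine-readable list of Googlebot ranges, so a site can check a client IP against them. A sketch; the exact JSON URL and its "prefixes"/"ipv4Prefix" keys are assumptions based on that page:

    import ipaddress
    import requests

    # Check whether an IP falls inside Google's published Googlebot ranges.
    # The JSON URL and key names are assumptions based on the linked docs.
    RANGES_URL = ("https://developers.google.com/static/search/"
                  "apis/ipranges/googlebot.json")

    def is_googlebot_ip(ip):
        prefixes = requests.get(RANGES_URL, timeout=10).json()["prefixes"]
        addr = ipaddress.ip_address(ip)
        # Mixed v4/v6 comparisons simply return False, so no filtering needed.
        return any(
            addr in ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
            for p in prefixes
        )

    print(is_googlebot_ip("66.249.66.1"))  # an often-cited Googlebot range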


I've only done some rudimentary rate-limiting checks, and it doesn't seem like they do. Though I haven't pushed it far (~1,000 rpm).
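
For context, a rudimentary check like that can be as simple as firing requests at a steady pace and watching for 429s. A sketch; the endpoint and rate are placeholders:

    import time
    import requests

    # Crude rate-limit probe: request at a fixed pace, stop at the first 429.
    # Endpoint and rate are placeholders.
    URL = "https://www.reddit.com/r/programming.json"
    RATE_PER_MIN = 1000

    for i in range(RATE_PER_MIN):
        r = requests.get(URL, headers={"User-Agent": "rate-probe/0.1"},
                         timeout=10)
        if r.status_code == 429:
            print(f"rate limited after {i} requests")
            break
        time.sleep(60.0 / RATE_PER_MIN)
    else:
        print("no rate limiting observed at this rate")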

In any case, my plan is to deal with it if it becomes a problem.



