
Does PadMapper respect the robots.txt file?



PadMapper uses 3Taps.com to get the data. 3Taps gets Craigslist data through Google. Google's bot respects robots.txt.

So, yes.

[Edited for clarity and to fix a misspelling.]


Well... "sort of", then. PadMapper respects the robots.txt for GoogleBot, not for PadMapper.


Presumably 3Taps violates Google's ToS to get the data from them?


I would really love to know what kind of shady proxy bot army they have implemented in order to scrape Google on such a scale.


This type of comment is common. Not only on HN but on forums in general.

While I agree the web is full of lowlifes engaged in web development, many of them in porn or some other area that appeals to base instincts, I find this comment perplexing, because it is so subjective, yet it tries to seem objective by focusing on some random criteria.

Google employs a "bot army" to scrape the entire web. So what?

If the comment were something like "I don't like Company X" or even "I don't like Company X because...", it would make sense to me.

But that is not how this common type of comment goes. Instead it suggests that bot=evil, i.e. any sort of automation or any sort of data collection by anyone other than [your favorite company] is "shady".

That's crazy. IMO.

It's what a company does with the data that matters.

Anyway, I'm not keen on 3Taps because they are not providing bulk data, only APIs that require "developer keys". Why?

Either you are going to democratise data, or you are just another schemer trying to find ways to collect information about people, in this case people using "your API".

I don't want APIs; I want the data. I can make my own interfaces, thank you.


I thought it was weird that CL didn't add a Disallow line for PadMapper to robots.txt from the start (just from a PR perspective).

But robots.txt has no special legal authority; it's just a convention used to communicate a publisher's intent. I'm pretty sure the C&D letter made it 100% clear that CL did not want PadMapper crawling their site or using their data.
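
For anyone wondering what a per-crawler Disallow line looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `PadMapper` user-agent token, and the example.org URLs are all made up for illustration; none of it is taken from Craigslist's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "PadMapper" token and the paths are
# assumptions for illustration, not Craigslist's real rules.
ROBOTS_TXT = """\
User-agent: PadMapper
Disallow: /

User-agent: *
Disallow: /reply
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


LISTING_URL = "https://example.org/apa/123.html"

# The named crawler is shut out of the whole site...
print(can_fetch("PadMapper", LISTING_URL))
# ...while any other crawler falls through to the "*" rules.
print(can_fetch("Googlebot", LISTING_URL))
```

Note that this only tells a crawler what the publisher wants. Nothing enforces it, and a crawler that gets its data secondhand never evaluates the rule written for its own name.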


PadMapper doesn't crawl Craigslist. That's not how it happens.


...any more. Now they are using a third party, but at the time Craigslist sent the C&D they were scraping the site directly.


I know, but I thought the existence of robots.txt was why Google is allowed to crawl sites. If a site disagrees with the crawling they can add a robots.txt entry and Google will honor it. It at least shows that you are giving the publisher an option.



