Hello Facebook Crawler (mwmeyer.com)
72 points by _hzd1 on Jan 1, 2013 | 34 comments



This reminds me of a recent experience I had with the Bing bot.

This most recent YC round, my co-founder and I used Skydrive to edit our application. Skydrive integrates pretty nicely with Word, even on a Mac, to allow for collaborative editing. It's like the best parts of Sharepoint, minus all the crap, and inside of a modern UI. I'm a diehard Apple user, but I also subscribe to the "right tool for the job" principle ... in this case it worked pretty well.

Anyway, inside the document were links to some private areas of our website that contained demo materials for YC. As requested, they were not password protected, but they were also not linked from anywhere else. While submitting, I made sure our nginx logs would capture visits to these URLs in a separate log, so we'd know when they were being looked at (sidenote: seeing visitors coming from inside justin.tv + the rincon hill towers is kind of exhilarating).

What surprised me was that almost immediately after we began working on the document, the Bing bot was going apeshit exploring the domain and the 'private' URLs. I had to quickly add a robots.txt to deny all on the root. I thought it was pretty interesting. At first I felt almost violated. But then it seems logical that they'd be indexing every URL in every document stored in their datacenter; why not?
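
For anyone unsure what "deny all on the root" looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The two-line robots.txt body is the usual deny-all form; the `is_allowed` helper, the "bingbot" agent string, and the example.com URL are illustrative assumptions, not anything taken from the actual site. It only shows how a well-behaved crawler would interpret the file; nothing forces a crawler to check it.

```python
from urllib.robotparser import RobotFileParser

# A "deny all" robots.txt, as it would be served from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical agent name and URL, used only for illustration.
    print(is_allowed(ROBOTS_TXT, "bingbot", "http://example.com/private/demo"))
```

With the deny-all file above, every path on the host is off limits to any crawler that chooses to honor it.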


Eh, I'm pretty sure you should still feel violated. The fact that they are parsing your private documents for information that they can use to help another business unit is really sketchy. It would make me wonder what else they are scanning my data for.

Personally, I'll never use an MS cloud service because of this anecdote - not that it was that likely to begin with.


I'd feel extremely violated.

I use google docs, very sparingly. One of the spreadsheets there contains a URL that is not linked from anywhere else and impossible to guess. If that URL ever gets tripped it will send me an email and the day that happens is the day I'll stop using google services (so far so good, and of course I should say 'google drive' now instead of 'google docs').
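
A tripwire URL like this could be sketched as a tiny WSGI app. This is a minimal illustration under assumptions, not the setup described above: the `make_tripwire_app` name, the secret path, and the `alert` callback are all hypothetical, and in real use `alert` would be whatever sends the email.

```python
from typing import Callable, Iterable


def make_tripwire_app(secret_path: str, alert: Callable[[str], None]):
    """Return a WSGI app that calls alert() whenever secret_path is requested."""

    def app(environ: dict, start_response) -> Iterable[bytes]:
        if environ.get("PATH_INFO", "") == secret_path:
            who = environ.get("REMOTE_ADDR", "unknown")
            agent = environ.get("HTTP_USER_AGENT", "unknown")
            alert(f"tripwire {secret_path} hit from {who} ({agent})")
        # Answer 404 either way so the URL looks like any other missing page.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]

    return app


if __name__ == "__main__":
    # Simulate one request with a hand-built environ; print() stands in
    # for "send me an email". The path and address are made up.
    demo_app = make_tripwire_app("/unguessable-3f9a1c", print)
    demo_app(
        {"PATH_INFO": "/unguessable-3f9a1c", "REMOTE_ADDR": "203.0.113.7",
         "HTTP_USER_AGENT": "ExampleBot/1.0"},
        lambda status, headers: None,
    )
```

Recording the user agent and remote address along with the hit is what would let you tell a crawler from a person.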


How would you know it was google indexing your document vs., say, your browser prefetching the link?


Because my browser has never looked at the document with the link in it. Obviously that would defeat the purpose.


You assume they were indexing skydrive documents. It could well be that one of the people who visited the link had a Bing toolbar installed.

Either way, all publicly accessible documents will get indexed sooner or later.


This was before the document had been sent to anyone. It was still being edited, only my friend and I were working on it. Also, the documents were not public.


I would be surprised if Microsoft is intentionally indexing links in private documents, but my point stands: Google et al are remarkably good at indexing the web. If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.


>If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

Which only blocks bots that respect the file...


You may be right, but I can't help but smirk at the thought of PG or Buchheit downloading and installing the Bing Toolbar ;)


Why is this even news? Facebook has been crawling links for ages every time you post on the site. The crawler is how the link you paste gets a title, description, and sometimes a thumbnail.


+1. not sure how a post like this can make it to the front page.


There was a period where Hacker News consisted primarily of people on the right-hand side of the spectrum. People who were working inside of startups or had lots of experience with the web and our industry. Pretty much everyone knew what sharding was, and MongoDB wasn't very popular.

These days we've got a lot more people and they show up all across the board.

Clearly if this is on the homepage, it was voted there by your peers. This kind of knowledge is completely obvious to many of us, but not everyone is on your level. Cut 'em some slack.


Even so, for those who already know this, the headline could imply that Facebook is getting into the search business.


That's why you should read the articles and not just the headlines. Headlines more often than not give a wrong impression.


I'm just not sure the direction this is heading in is good. Also, the headline of that article is miserable (as omarchowdhury already pointed out).


Hmm, I thought most people knew about this. I researched it a lot while making http://2fb.me, and I actually witnessed the same thing with Google Docs but didn't think twice about it. I'm sure the same thought went through the head of everybody reading this who knew anything about how the Facebook sharer works. For those to whom this is news: it's beneficial to put Open Graph meta tags on your pages to control what the crawler picks up, and you can also invoke the crawler yourself (http://developers.facebook.com/tools/debug) when your page changes, rather than waiting for Facebook to refresh it automatically every 2 weeks or so.
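
To make the Open Graph part concrete, here is a small sketch of pulling `og:` meta tags out of a page with Python's standard `html.parser`, roughly the information a link-preview crawler would be looking for. The `OpenGraphParser` class, the `extract_open_graph` helper, and the sample page are made up for illustration; this is not Facebook's actual parsing logic.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: dict = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, content)


def extract_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Hypothetical page used only as sample input.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="Hello Facebook Crawler" />
<meta property="og:description" content="What shows up in the link preview" />
<meta property="og:image" content="http://example.com/thumb.png" />
<meta name="viewport" content="width=device-width" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, value in extract_open_graph(SAMPLE_PAGE).items():
        print(name, "=", value)
```

Pages without these tags leave the crawler to guess a title, description, and thumbnail from the rest of the markup.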


Do note (and I'm not sure whether this differs from what you're referring to) that the link was never even posted to the site; rather, it was pasted into a chat box but never sent. It's a small difference, but I think it's an interesting point.

> To my bemusement, not only was the friend I was messaging away, I also hadn't even sent the link; I pasted it into the chat window but forgot to hit enter.


You're right, but it also happens when you paste a link into the status box before hitting "Post." Try this: go to Facebook and paste a link into your status box. Wait a few seconds. Notice that it will populate the link information fields even before you submit the post. It has to get that info from somewhere. That's where the crawler comes in.


Yeah that's expected, but it's not expected (at least for me) to happen in a chat. But, I never use FB chat so I don't know -- does it also create thumbnails for links and such?


Hmm, I see what you're saying; no it doesn't appear to create the thumbnail in chat. It would be interesting if Facebook uses different crawlers depending on whether the link is posted in the status box or in a chat. That could lead to some interesting analytics such as "your website was chatted about x number of times and shared via a status update x times."


"What to Submit

On-Topic: Anything that good hackers would find interesting. That includes more than hacking and startups. If you had to reduce it to a sentence, the answer might be: anything that gratifies one's intellectual curiosity."


Wow I'm sick of people posting claims of off topic. And wow it's funny that someone has the patience to reply with the posting rules.


By looking at the headers you now have a great way of writing some analytics tools to see how much your website is shared on Facebook...
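
A rough sketch of that idea: scan access log lines and count requests whose User-Agent contains the crawler's token. This assumes the server writes the common "combined" log format and that the crawler identifies itself with a `facebookexternalhit` token; the regex, function name, and sample lines are illustrative, so check the agent string against your own logs before relying on it.

```python
import re
from collections import Counter

# Combined log format tail: "METHOD path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
CRAWLER_TOKEN = "facebookexternalhit"  # assumed User-Agent token


def count_crawler_hits(lines, token: str = CRAWLER_TOKEN) -> Counter:
    """Count requests per path whose User-Agent contains token."""
    hits: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and token in match.group("ua").lower():
            hits[match.group("path")] += 1
    return hits


# Made-up log lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2013:10:00:00 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
    '203.0.113.9 - - [01/Jan/2013:10:00:05 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0"',
    '203.0.113.5 - - [01/Jan/2013:10:01:00 +0000] "GET /post/2 HTTP/1.1" 200 734 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
]

if __name__ == "__main__":
    for path, count in count_crawler_hits(SAMPLE_LOG).most_common():
        print(path, count)
```

If the crawler caches pages, these counts are a lower bound on shares rather than an exact tally.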


I would imagine that they cache the page contents and hence hit a URL only once in a certain period of time, thus skewing any analytics built around this.


Yeah, it's cached by Facebook. That's why, if you want to change your meta and/or Open Graph tag info, you need to feed your page to Facebook's URL Linter (https://developers.facebook.com/tools/lint/).


you can also programmatically force it to refresh the cache via a POST to https://graph.facebook.com/?id=http://google.com&scrape=...
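
As a sketch of what building that request might look like with Python's standard library: the endpoint and the `id` and `scrape` parameter names come from the URL above, but the value of `scrape` is elided there, so this hypothetical `build_rescrape_request` helper takes it as an argument rather than guessing. Whether the endpoint still accepts this shape is also an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

GRAPH_ENDPOINT = "https://graph.facebook.com/"


def build_rescrape_request(page_url: str, scrape_value: str) -> Request:
    """Build (but do not send) a POST asking for page_url to be re-scraped.

    scrape_value is supplied by the caller because the original comment
    truncates it.
    """
    query = urlencode({"id": page_url, "scrape": scrape_value})
    # Parameters go in the query string, as in the URL quoted above;
    # the empty body plus method="POST" makes this a POST request.
    return Request(f"{GRAPH_ENDPOINT}?{query}", data=b"", method="POST")
```

Passing the returned object to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch has no network side effects.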


If you register your site with Facebook, you can get that information on Facebook's end:

https://developers.facebook.com/docs/opengraphprotocol/

under "Domain Insights."


I'm surprised this post made it to the homepage... They've been doing that forever; no need to look at your logs to figure this out. How else would they find and display an image from the page you're linking to?


12 lines of code instead of:

  tail -f /var/log/apache2/access.log


I would imagine they're checking the URL for malware as well.


Probably. I've seen them ban whole domains (droplr.com) previously for distributing malware.


Also, it would be smart to run a malware check on these URLs if they aren't already doing it.


I wish I had enough karma to down vote this.



