I understand the point of using a special user-agent to crawl webpages for indexing, but search engines should also occasionally crawl with a "regular" browser UA string (full JavaScript and all, to simulate an actual browser). From a different IP range too, of course. If the contents of the page are wildly different, penalise the site for cloaking.
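The comparison itself is trivial to script. A rough sketch of the idea (the URL, UA strings, and similarity threshold are all made up, and a real check would need a headless browser plus a separate IP range to actually execute JavaScript; plain HTTP GETs like these only catch the crudest server-side cloaking):

    import difflib
    import requests

    CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
    BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36")

    def fetch(url, user_agent):
        # Fetch the same page while identifying as a given client.
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        return resp.text

    def looks_cloaked(url, threshold=0.6):  # threshold is a made-up number
        as_crawler = fetch(url, CRAWLER_UA)
        as_browser = fetch(url, BROWSER_UA)
        # Crude similarity of the raw HTML; wildly different content
        # for the two user agents is the cloaking signal.
        similarity = difflib.SequenceMatcher(None, as_crawler, as_browser).ratio()
        return similarity < threshold

    if __name__ == "__main__":
        print(looks_cloaked("https://example.com/"))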
I've seen stuff that's disallowed in robots.txt get crawled anyway if enough people link to it. In Google's results, though, it will still only show up as a bare link without any contextual information.
Google won't crawl it, but they can still include the link in search results, usually with a title guessed from the anchor text on the pages linking to it, and no description.
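The distinction is that robots.txt only governs fetching, not listing. You can see the logic with Python's stdlib parser (the robots.txt contents and URL here are made up for illustration):

    from urllib import robotparser

    # A disallow rule like this stops a compliant crawler from *fetching*
    # the page, so the engine never sees its title or description...
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # ...but nothing in robots.txt stops the URL itself from being listed
    # if other pages link to it. Keeping a page out of the index entirely
    # takes a noindex meta tag or X-Robots-Tag header, which the crawler
    # can only see if it's *allowed* to fetch the page in the first place.
    print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False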