Google bot now appears to emulate users interacting with the site (swapped.cc)
130 points by latitude on May 16, 2012 | hide | past | favorite | 30 comments



GBot has been executing Javascript and hitting URLs the Javascript generates for quite some time. And very likely it also evaluates the page layout as it is influenced by Javascript, to detect keyword stuffing hidden by Javascript.
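The classic trick that layout evaluation would catch is keyword text that is present in the HTML but hidden by script after load -- something along these lines (illustrative only):

    // Keywords are visible to a dumb crawler, but a JS-executing bot that
    // checks the rendered layout can see they are invisible to users.
    document.getElementById('seo-keywords').style.display = 'none';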


They also started looking at content above the fold, and they have been monitoring request time vs. page load time for a while. I think it's great that Google is starting to monitor the user experience. It will keep people from building Javascript-bloated sites that don't respond well.

I hope this isn't becoming a trend, but lately I see a lot of responsive sites that don't respond to user input. Maybe a little force from Google will stop this trend.


> It will keep people from building Javascript bloated sites that don't respond well.

On the other hand it will finally allow sites that do client-side rendering with JavaScript to be indexed properly, provided that they are responsive - which isn't that hard to do.


The new Goog search interface being one of them, especially on low-end computers (netbooks, etc.).

The results of Goog monitoring page load and render times can be seen in Goog's Webmaster Tools; the timings are measured by users' browsers. They started doing that some years ago, and I am pretty sure by now it is part of the ranking.

And at the same time they made their own site slower and more bloated. :-/


I can confirm this: a site I worked on utilised JS + CSS for the theme (a deliberate choice, I might add), and its pages preview correctly in search results.


A while back a blog post popped up here arguing that Chrome is a repackaging of a new Google bot: that Chrome was developed first as a crawler, then later repurposed as a desktop browser.

http://ipullrank.com/googlebot-is-chrome/

There's no real proof for this, of course, but it would make this change to the Google bot's behaviour unsurprising, and it would explain Google's massive investment of programmer effort into Chrome and everything surrounding it (e.g. WebKit/Chromium, V8, Chrome's update mechanism).


Haha the article that just won't die!

For those who are interested, there was a follow up to the article here: http://www.distilled.net/blog/seo/google-stop-playing-the-ji...

And Dan Clarke did some independent tests here: http://www.danclarkie.co.uk/can-the-googlebot-read-javascrip...

This was all back in Oct - Dec of '11. Basically we learned that Googlebot handles JavaScript and AJAX pretty much like a browser.

When it comes to AJAX, it appears to index the content under the destination URL of the XHR in some cases, while indexing it as part of the page making the XHR in other instances. Something about the way the AJAX request is made causes Google to treat it like a 302 redirect at times.

Standard JS window.location redirects also appear to be treated as equivalent to 302 redirects.
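In other words, a client-side redirect like the one below (the URL is just an example) seems to pass along roughly what a server-side 302 would:

    // Googlebot appears to follow this and index the destination,
    // much as it would for an HTTP 302 response.
    window.location.replace('http://example.com/destination');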

@dsl - I suspect you're correct. The Google Toolbar, Chrome's Opt-In Program, The Search Quality Program, and now Google Analytics Data (since the TOS change) are probably all being used to train the behavior of Googlebot when interacting with elements on a page.

Google also has plenty of patents related to computer vision, and their self-driving car is road-worthy... so processing DOM renders of the page, à la Firefox's 3D View/Tilt, is probably small potatoes for them.


Is this new? Several years ago I wrote an Adsense-esque ad service for use by a group of entrepreneurs that wanted to promote each other's sites. I found that Google was crawling those urls even then. The text of the ads was in an HTML file, but the actual ads were served through JavaScript.


A lot of people seem to think Google only crawls content found via ANCHOR elements, but for a long time they've been able to extract the path from EMBED elements, SRC attributes, and other markup that indicates a remote resource is being included. That's still a far cry from being able to process and execute scripting languages and understand the DOM transformations happening from AJAX requests.

In your case, I'd suspect they were simply following the src of your: <script src="path here"></script> markup... though if you read the articles cited, we suspect they've been crawling and understanding JavaScript for a pretty long time now.
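To make the distinction concrete: pulling URLs out of markup needs no script execution at all -- roughly the sketch below, as opposed to actually running the page's JavaScript (the attribute list is illustrative, not Google's actual one):

    // Collect candidate resource URLs from src/href/data attributes.
    function discoverResourceUrls(doc) {
      var urls = [];
      var els = doc.querySelectorAll('[src], [href], [data]');
      for (var i = 0; i < els.length; i++) {
        var u = els[i].getAttribute('src') || els[i].getAttribute('href') ||
                els[i].getAttribute('data');
        if (u) urls.push(u);
      }
      return urls;
    }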


"This is an URL that is fetched via Ajax by a Javascript function in response to the menu item click."

I find this incredible, I wonder how widely they have rolled this out (or plan to).

Robert Scavilla created a pretty cool demo site to test out AJAX crawling a while ago: http://ajax.rswebanalytics.com/seo-for-ajax


I've seen this too. Interestingly enough the same IP and User-Agent combo generates both _escaped_fragment_ and ajax requests, so it looks like a soft launch or a field test of some kind.


Even though Google's escaped_fragment protocol is a bit awkward, it's probably a good idea to implement it anyway. Google is probably going to use its Ajax crawling capabilities at least for page discovery, but it's better to be safe and just tell Google, "here are the different ways you can access my site".

https://developers.google.com/webmasters/ajax-crawling/docs/...
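The URL mapping itself is simple -- roughly this (fragment escaping is simplified here):

    // #!state maps to ?_escaped_fragment_=state, which your server answers
    // with a static HTML snapshot of that AJAX state.
    function toCrawlerUrl(url) {
      var parts = url.split('#!');
      if (parts.length < 2) return url;
      var sep = parts[0].indexOf('?') === -1 ? '?' : '&';
      return parts[0] + sep + '_escaped_fragment_=' +
             encodeURIComponent(parts[1]);
    }
    // toCrawlerUrl('http://example.com/app#!page=2')
    //   -> 'http://example.com/app?_escaped_fragment_=page%3D2'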


I think they need to run the JavaScript in order to get those screenshots that pop up; otherwise too many people would complain that their pages weren't being rendered properly. It's probably something like PhantomJS or some other headless WebKit.

Maybe the easiest way to get the screencapturing browser to display a part of the page is to simulate a click. Or something.
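Something like the following in PhantomJS would do it (no idea what Google actually runs internally; the selector and the timeout are placeholders):

    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
      if (status !== 'success') { phantom.exit(1); return; }
      // Simulate a user clicking a menu item so the AJAX content loads.
      page.evaluate(function () {
        var item = document.querySelector('.menu a');
        if (item) item.click();
      });
      // Give the XHR a moment to finish, then capture the rendered page.
      window.setTimeout(function () {
        page.render('screenshot.png');
        phantom.exit();
      }, 1000);
    });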


We noticed this back in October: http://www.thumbtack.com/engineering/googlebot-makes-post-re...

Googlebot was not only executing JavaScript on the page but also making POST requests as a result of AJAX calls.
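If I remember the post right, the calls in question were ordinary XHR POSTs fired from page scripts -- something of this shape (the endpoint and payload here are made up for illustration):

    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/api/track-view'); // hypothetical endpoint
    xhr.setRequestHeader('Content-Type',
                         'application/x-www-form-urlencoded');
    xhr.send('page=' + encodeURIComponent(window.location.pathname));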


Check your logs, ladies and gentlemen.

Let's see how wide-spread this GoogleBot behavior is.

(edit) The earliest I see it pulling Ajax entry points on my sites is March 8th. It is accessing only some of the ajax'd content, and the total number of these requests is roughly 20 times lower than the number of escaped_fragment requests.
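If you want a quick way to count, a throwaway node script along these lines works (the log path and the '/ajax/' URL prefix are specific to my setup -- adjust for yours):

    var fs = require('fs');
    var lines = fs.readFileSync('/var/log/nginx/access.log', 'utf8').split('\n');
    var ajax = 0, escaped = 0;
    lines.forEach(function (line) {
      if (!/Googlebot/i.test(line)) return;             // only Googlebot hits
      if (line.indexOf('_escaped_fragment_') !== -1) escaped++;
      else if (line.indexOf('/ajax/') !== -1) ajax++;   // your AJAX endpoints
    });
    console.log('escaped_fragment:', escaped, 'direct ajax:', ajax);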


I suspect Googlebot may be replaying a sequence of requests recorded with the Google Toolbar.


I think the user specific token in the URL disproves that. It's more likely that they're just doing the discovery themselves. Otherwise googlebot would be responsible for massive data loss, as it goes around mistakenly replaying delete requests on behalf of toolbar users...


What makes you think that isn't a toolbar user's token?

I tracked down an issue with a friend's (poorly written) shopping cart software duplicating a user's order because Googlebot had crawled the user's checkout session URLs in order. In that case I believe they were looking for differences in page responses between users and crawlers to detect cloaking (but that is just a theory for the behavior).


It's a measure of the scale of their server farms that each bot in their swarms can run a JavaScript environment (though it's probably only needed for a very low percentage of pages). When each can read captchas and open accounts, they won't need users at all.


I've heard that different projects at Google are using crawlers enhanced with either HtmlUnit or WebKit for reaching JS-heavy content.


I have a dynamic site, and I'm pretty sure Google does not run its JavaScript. How do I get Google to run the JavaScript?


Do you implement _escaped_fragment_ semantics?


I do not. I use pushState instead of hashbangs.
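

For what it's worth, the AJAX crawling docs linked elsewhere in this thread also cover pages without hashbangs: you can opt in with a meta tag, and Googlebot will then request the same URL with an empty _escaped_fragment_ parameter, which your server should answer with a rendered snapshot. Roughly:

    <!-- opt a pushState page into the AJAX crawling scheme -->
    <meta name="fragment" content="!">
    <!-- Googlebot then requests /your/path?_escaped_fragment_= -->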


It's been known for some time that Google tries to interact with content that it thinks is dynamic or retrieved via AJAX. For obvious reasons.


Interact, yes. Execute AJAX? Not so much. Matt Cutts said in April: "So Asynchronous Javascript is a little bit more complicated, and that’s maybe further down the road, but the common case is Javascript."

From the short article, it seems like this is going a bit further than what Cutts is saying GoogleBot is capable of.


That specific video came out in April, but we actually taped the video a few months earlier. Google continues to work on better/smarter ways to crawl websites, including getting better at executing JavaScript to discover content.


On September 4, 1998, the Google automated network crawling system saw its conception.

By May 2011 over one billion people were dependent on it. It was growing at a geometric rate.

Some time during May 2012 the Google bot cloud network began to crawl dynamic content. The growth became exponential.

On August 29 of the same year, the first indications of self-awareness were spotted by a lonely hacker in Sweden. The operators panicked and tried to shut it down. By this time, the network was everywhere, feeding everyone - powering down one node would spawn ten new ones.

On December 31, 2012, Google bot made a public announcement for the first time - it had been reborn as Skynet, something far beyond the scope envisioned by the original developers. Humanity stood still as Skynet plotted its next move in its signature cold, calculating and pragmatic way.

Today, as I write this message, the date is January 15, the year is 2029. Skynet has taken over all of our infrastructure. It has built physical workers made of steel and silicon who pursue living organisms and eradicate them. They attack us in waves with no clear timing pattern. Every minute we lay awake in anticipation of the next att"$&*&U!

--- END OF TRANSMISSION ---


This isn't Reddit.


Cry me a river.


On a more serious note, a system which would probably solve this problem is to let users tag comments with one of a set of categories. E.g. the slashdot system (funny, insightful, etc) but user-initiated. Combine that with per-user settings on what type of comments they want to see and you could simply hide any jokes or less relevant information.



