Making AJAX Applications Crawlable

patio11 · on April 19, 2010

This kind of rubs me wrong, in that it is nakedly making Google's engineering problems into the Internets' engineering problems, there isn't even a scintilla of the usual "You're just making it better for your end users and our crawler just happens to benefit" fig leaf, and compliance will be about as optional as complying with HTTP because Google is navigation on the Internet.

noibl · on April 19, 2010

"making Google's engineering problems into the Internets' engineering problems" -- indeed, and more specifically, saying that because Google can't/won't embed a JS engine in their crawler, publishers should embed one in their webserver. For which they suggest you use Java or, alternatively, Java.

retube · on April 19, 2010

Is that right? I have been wondering to what extent the Google crawler renders js. Given that it's Google, I was imagining that they probably do pretty much full DOM rendering of pages, as so many pages are now rendered dynamically. Of course, it's difficult to simulate human-interaction with the page, so I guess this is their solution to that - the web designer leaving a signpost.

prodigal_erik · on April 19, 2010

I don't get it. They're requiring authors to provide HTML content that lives behind a URL. How is this different than just requiring the application gracefully degrade to a plain HTML mode that's usable without JavaScript, and crawling that?

gojomo · on April 19, 2010

It's similar in effort required, but this more clearly tells Google to display as target URLs (and send searchers directly to) a site's #!fragment-ed pages, rather than their degraded/simple pages.

(Of course, the site could accept visits to degraded pages, detect AJAX-capable browsers, then redirect the users to the preferred AJAXy URLs... but perhaps they don't even want the non-AJAX URLs to appear in normal use.)

stan_rogers · on April 19, 2010

It shouldn't require a redirect to "grow" the context around the fragment, though -- which is kind of the point of progressive enhancement.

gojomo · on April 19, 2010

You can see a problem that the Google convention handles better than 'degrade to simple pages with distinct URLs' with the 'Noloh' site touted elsewhere in this thread.

Consider one of their AJAXy-#fragment pages:

http://www.noloh.com/#/features/

Search for a phrase on that page: ["NOLOH generates only the absolutely necessary concise"]

Google finds and sends you to their 'simplified' version of the same page:

http://www.noloh.com/?features/

...which upon visit "grows" itself with a fragment to...

http://www.noloh.com/?features/#/features/

Ugh! Does the site really want people on that URL, possibly bookmarking and sharing it? Probably not; they could be using a redirect on first Google-visit.

Try clicking to another page from the double-feature page, like FAQs. You wind up at:

http://www.noloh.com/?features/#/features/&faqs%2F=

Ugh, it just keeps getting worse.

Using the Google #!fragment convention, the initial URL appearing in the index could be the simple/direct:

http://www.noloh.com/#/features/

Some sites will want that. One canonical #fragment-filled URL collects all the inbound traffic/linkjuice.

asnyder · on April 20, 2010

Fair enough, but you have to remember that we needed this to work on all servers and browsers since 2007. This Google implementation was just released. Surely we could've done a URL rewrite for our specific server, but we try our best to showcase how NOLOH will operate without the need for any tweaks, as many of our users operate on shared servers without any access to rewrites.

Oddly, you don't make any mention that we effectively solved this issue automatically for our users and that they've been able to have their full websites searchable by Google. You were able to do a search for our content, and guess what we didn't need to do ANYTHING from a site development standpoint for that to work. Sure, without a rewrite it can be ugly, but the content was fully searchable.

Frankly, have you seen some of the URLs that major websites such as amazon.com, or other generate? Criticizing us for showing how the URL would look without rewrites is really nitpicking our site, however we do want to thank you for pointing out a minor issue, we should not have the &faqs%2f you saw above, coming from a search engine and navigating. It should be http://www.noloh.com/?features/#/faqs.

Sure enough we'll be implementing the Google style approach in NOLOH, and best of all NOLOH developers need not change anything. Their apps were searchable since 2007, and will continue to be searchable with newer and better methods for the forseeable future.

stan_rogers · on April 19, 2010

That's the bad behaviour of a specific implementation (more than the context of the fragment is "grown") and the resulting URL cruft is entirely unnecessary. No redirect, at least at the browser end, should be necessary to serve the fragment in context.

asnyder · on April 19, 2010

Interestingly enough NOLOH (http://www.noloh.com) made AJAX applications crawlable years ago. If you're interested in the specifics, see http://dev.noloh.com/#/articles/Search-Engine-Friendly/

Disclaimer: I'm a co-founder of NOLOH.

gruseom · on April 19, 2010

I took a look. Your approach is to detect requests from spiders and respond with plain HTML content rather than the content wrapped in Javascript, etc., that a normal user would get. You address the obvious question, "But isn't that cloaking?" by saying no, the content itself is the same, so nobody should object. Fair enough, I happen to agree with you, but our opinion is irrelevant; what matters is whether Google consider this practice legitimate. Can you (or any HNer) tell me definitively whether they do or not? And has their policy changed recently with the introduction of this new crawlability spec?

endergen · on April 19, 2010

Has google actually started doing this? I thought this was still in an exploratory stage.

mmastrac · on April 19, 2010

Facebook started using this style of URL a little while ago but I can't get it to serve the static equivalent up yet. My bet is that they'll be the first big name to ship this in production.

robryan · on April 19, 2010

Wouldn't it be a lot easier for Google to build a JS parser, and just examine the requests and follow it through as a normal browser would?

korch · on April 19, 2010

Google has to do this, it's an existential threat to their money maker PageRank. Imagine a few years out, after most of the web is no longer organized as static documents linked together(nice for crawling!), but transforms into a real-time evolving mish-mash of web API's, re-re-aggregators, and interconnected web services. i.e. the top 10 social networking sites will account for 90%+ of user activity, and it's their APIs & data we'll all be using.

You can't crawl that.

Anyone else worry that Google's inevitable grand "evil" act will ironically be them holding back the web from transforming into this? Microsoft could have totally killed Google by hitting the fast-forward button on Ajax in 2004, by leveraging IE to make the whole web into the "Deep Web." You can't sell ads on what you can't crawl—just see what Apple is doing with iAd to carve out mobile.

gcb · on April 19, 2010

thats a cloackers wet dream.