Does Google execute JavaScript? (stephanboyer.com)
339 points by rrradical on Jan 1, 2017 | 106 comments



I'm convinced that Google has several Googlebots that are run depending on how popular a site is.

That is, new and low traffic sites are crawled by less intelligent bots, and as the site gets more visitors or better rankings, more complicated and resource intensive bots are deployed.

How this might work with the most popular sites out there, the Amazons and Wikipedias of this world - I'm not so sure about that. If I were in charge, I'd be tempted to have customised bots and ranking weights for each of these exceptional sites.

Sadly the chances of getting a real answer on this in my lifetime are close to zero.


Or perhaps there are also heuristics in place to determine which strategy to follow, i.e. heuristics to see whether executing JS would be worth it, whether it would yield additional content. So, say, when crawling documentation, where JS doesn't give any of that (e.g. Sphinx's JS search), it could decide - nah, not doing JS, not worth it.

I'd expect that there are also other heuristics and different strategies for crawling to better handle e.g. content presented by one of the popular CMSes.


    heuristics to see whether executing JS would be worth it, would yield additional content.
You are literally describing vanilla PageRank. If a large number of links are found to a page, but that page doesn't contain the content the link rate suggests it should contain... either link rate has failed, or JavaScript should be executed.


I recently put a little browser for themes for the hyper.app terminal online (https://hyperthemes.matthi.coffee). It was just something I used to try out Elm, and there's no reason to believe Google would regard it as anything different than a generic new site.

If you look at Google's cached version, you can see that the JS is executed (although it fails trying to download the actual data): https://webcache.googleusercontent.com/search?q=cache:hN5yCk...

Edit: as has been pointed out below, the cached version is just the same as the original and the JS gets executed on your end. This doesn't show whether Google also executes it during its crawl.


> If you look at Google's cached version, you can see that the JS is executed

Correct me if I am wrong but when I look at the cached version of the homepage (http://webcache.googleusercontent.com/search?q=cache:hN5yCky...), I don't see that the JS has been interpreted.


The "Network error" in the top left is a JS result. Also, since you linked to the source view, you'll see that there actually isn't anything in the body but the script being loaded, whereas "Full version" shows the user interface was correctly initialised.


Disable JavaScript in your browser when viewing the cached page. All you will see is an empty page.


Now I'm feeling a bit stupid, will slowly walk away and hope I can get by with "it's Jan 1st and I didn't get much sleep" as an excuse...


First, thank you for taking the time to stick your neck out a bit and share an example.

Second, thank you for responding to someone pointing out that you were wrong without putting yourself on the defensive.

I see this entire sub-thread as a positive; glad we all could learn along with you!


You're at least partly right about sites being crawled differently depending on popularity. I think the factors may not be limited to popularity alone, but we see this behavior documented in the crawling rate documentation Google provides its users/clients, so there is no reason it couldn't apply to other "expensive" actions their crawlers do.

> How this might work with the most popular sites out there?

We see it in on-page answers: extracts of pages that answer questions asked in search phrases, with a reference to the document they were sourced from.

Matt Cutts used to qualify sites like Wikipedia as "reputable" in the eyes of the search engine.


You could join Google :)


The chances of becoming a part of the search team are still very low. Even most Googlers won't know the exact details.


But within Google, you could ask somebody and get the right answer.


And outside Google, you could look for Google employees on linkedin who may be in position to have the answer and ask nicely ;)


no it is not :) you can move around in the company.


Another, faster way to see what JavaScript Google can crawl on your website is the Google Search Console (previously known as Google Webmaster Tools). It has a "Fetch as Google" button that allows you to enter a URL on a site you own and see a visual rendering of how Google's crawlers see your page. It even gives you a side-by-side comparison of what the crawler sees vs. what a user sees.


My company has seen major discrepancies between what this tool gives back vs how the page ultimately ends up getting indexed.


My pet theory is that Google actually developed Chrome as a web crawler and that the consumer release was to ensure that Google would always be able to crawl pages (since sites would always want to work properly with Chrome).

It also explains why they effectively killed Flash and Java applets. They were competing technologies that weren't owned by Google and weren't crawlable. If they had taken off, Google's position as top search engine could have been in danger.


Nope, Googlebot has been able to crawl Flash for almost a decade: http://searchengineland.com/google-now-crawling-and-indexing...


and Chrome supports Flash out of the box!


If that was true, you would think they would have responded better to the threat of Facebook than with Google+ and their distasteful pushing of it on all their platforms.

What do I mean with “the threat of Facebook”? In the old days, before today’s large “social media” sites, people made their own web pages on places like GeoCities or on simpler social-media-like sites like LiveJournal, etc. Those sites all had content and linked to each other. This is the web which the Google search engine and its algorithm was meant to find things in, and it worked very nicely, as it took advantage of the links other people had made to your site as a proxy for relevance in search results for your site. People making small web pages about their favorite topics (with lots of links to other people’s pages, since information was hard to find) could slowly and easily transition into making larger and larger reference web sites with lots of information, thereby attracting lots of incoming links from others, which in turn enabled people to find the information using Google’s search engine.

Compare this to now. Firstly, people having a Facebook account have no place to simply place information, no incentive to simply make a web page about, say, tacos or model trains, because that’s not what Facebook is about. Facebook is about the here-and-now, and whatever happened yesterday is forgotten. As I understand it, there is no real way, in Facebook, to make a continuously updated page with a fixed address for people to go to as a reference point about some subject, or at least people are not directed towards doing this as part of their online activity (as opposed to in the past, when it was basically the only thing which people could do). Secondly, this makes it so that people have no natural path going from using Facebook to creating a larger web site with information, and there are no smaller model train or taco Facebook “pages” which could have links to your larger site and thereby validate its relevance. Thirdly, even if this second point were false, Google could not use these Facebook pages, because Google cannot crawl them. These pages are all internal to Facebook, and Facebook has every incentive to not allow Google to crawl and search this information. Facebook would much rather people used their own site to search, thereby gaining all of Google’s sources of income: user monitoring and advertising.


The web closing off is very bad for Google, but I'm rooting for them to try to stem the trend, since it's also bad for users.

Putting walled gardens around large portions of the net will inevitably result in endless mergers and an eventual "website fee" that users pay to one of three highly entrenched monopolies that control all the content. It will be back to the cable TV days in no time


Very interesting points. On a bit of a tangent: I had felt nostalgic about the days of the personal/hobby websites you allude to, which seemed so prolific before Facebook etc., and wondered why things had changed so much.

It makes sense to me that someone who used to be motivated to build a site about their life or a topic of interest may now often just sign up for a service like FB and occasionally do a post with an article, picture, etc. about themselves/their interests. It seems to require much less effort for folks, which is perhaps why they do this. I lament the change somewhat and wonder what the future holds for this type of thing.


> It seems to require much less effort for folks, which is perhaps why they do this

It's also much easier to find readers.

Compare:

Facebook: become friends with your co-workers, now they see your picture

Blog: Please go to jupiter90000 (that's how many zeros?).blogspot.com to see my once a week pic updates!!


I agree, but this is not an inherent difference of the open web vs. a walled garden, only a difference in the implementation of the software involved. Indeed, the developers of the Web and its browsers were aware of this problem, and they thought that they had solved it, using something they called “Bookmarks”. Now, as implemented, bookmarks may not be easy enough, and there have been other ideas, like RSS feeds, which tried to improve upon the idea. Just don’t think that this difference is inherent and set in stone. New features could be developed.


Google did not kill Flash. Chrome came bundled with Flash already installed. Apple made the biggest push to kill Flash and once support for HTML5 video was solid and widespread the main use case for Flash was gone.

Java applets just always sucked and only masochists/sadists ever used them.


Adobe killed Flash itself, not Apple, not Chrome, not anti-Flash troll X or Z on the internet.


You're arguing from a different definition under which nobody but Adobe ever had the possibility to "kill" flash. I doubt there's much disagreement that Apple's refusal of flash on iOS was the most significant factor in its demise. Without it, Adobe would to this day happily rake in the cash.

It's also a good reminder of the positive influence Apple has often had. Remember that "no flash" was the "no Esc" of its time, and that they had to go all-out to defend it (http://www.apple.com/hotnews/thoughts-on-flash/). 6 years later, it stands as one of the rare cases where a proprietary technology was replaced by an open standard.


I read the comment as implying that Adobe killed flash by releasing it as a steaming pile of insecure resource-hogging shit, rather than as a secure, efficient plugin.

If Flash didn't kill your battery and expose you to new security vulnerabilities every few weeks, it likely would have continued to this day.


Eventually, they pulled the lever, yes.

But Adobe touted Flash on mobile as some kind of essential technology, when it was 100% crap.

If Apple offered it on iOS, Adobe would have been milking it to this day...


Flash actually runs on iOS:

http://www.adobe.com/devnet/air/air_for_ios.html

So iOS does support Flash, just not in its browser.


And here I was thinking that it was Apple and the iPhone that effectively killed Flash. Silly me.


They disliked Flash for different reasons: Google for search, and Apple because Flash games were too powerful; Jobs wanted those games turned into iPhone apps.

The official reasons were speed and security, which was obviously a farce, since all the major browser vendors wrote their own PDF viewers, which are both slow and known for tons of security issues as well.

They could just as easily have written their own fast and secure implementations of Flash and Java instead of disabling them. Especially for Java, given Google's deep experience with it and that they already have their own runtime.

Apple may have fired first on Flash, but Google recently disabled all NPAPI plugins in Chrome as well. The reason... "Security". I guess code not made by Google isn't secure enough to be used on the internet?


Well, let's remember that while Adobe claimed they could have gotten Flash to run on the original iPhone in 2007, when Flash finally came to Android it required a 1GHz processor and a gig of RAM. The first iPhone had 128MB of RAM and a 400MHz processor; it wasn't until 2012 that an iPhone had a 1GHz+ processor. The very reason Apple could get away with slower processors was that it wasn't dependent on slower, more processor-intensive VM solutions like Java and Flash.


I was maintaining a port of the Opera browser for Japanese feature phones in 2007. We supported flash in 11Mb of RAM: 6 for the browser, 5 for flash. That included the screen buffer IIRC. We ran regular desktop web sites under these conditions just fine (with 2007 expectations regarding interop). The experience may not have been great, but that was much more an issue of tiny screens without touch support than anything else. Had Apple wanted to, they would absolutely have been able to support flash on the iPhone. We could on much worse hardware. Not doing so was about controlling the platform, not about technical limitations.

http://www.operasoftware.com/press/releases/mobile/opera-del...


That was Flash Lite, not real Flash. If Adobe could get full-fledged Flash running on a 128MB/400MHz first-generation iPhone, then why couldn't they get it running on similarly equipped Android devices when they had the opportunity?


You can run real Flash Player 9 on a Nokia N800/N810, which had the exact same specs (128MB/400MHz). It ran Debian Linux with a Firefox-based browser.

https://en.wikipedia.org/wiki/Nokia_N800

It actually ran quite well! I used to play flash games all the time.


This is ridiculous. Flash is a hack on a kludge, 20+ years of legacy code, fiddling with raw chunks of memory for speed and with no type system enforcement. You couldn't do animation or FMV with JavaScript in the past; these were features only recently supported by browsers or standards. The only spec for Flash is the implementation.

Google went to significant efforts to sandbox it, but that still caused problems, not least with additional complexity. As for this idea that NPAPI should continue to live... Yes "security" is a problem with NPAPI.

As for Google fixing Java... You may have heard of Oracle? And the fact that they are .... not nice?


>They could have just as easily wrote their own fast and secure implementations of flash and Java instead of disabling them. Especially for Java given Google's deep experience with it and that they already have their own runtime.

Yup, Google should:

1. Reverse Engineer a proprietary language/API (Flash).

2. Write a secure VM for it (Flash and Java), trying to hit a moving target.

or

Get everyone to use an open, standardized tech (HTML5/js).

Can you give me one reason Google should prop up Flash?


>The offical reasons were speed and security, which was obviously a farce

Security wasn't a farce at all. "Flash is a big hole" (which is true) translates to "Chrome is a big hole".


I modded you up, not because I agree with everything you wrote, but because I want people to post counterpoints instead of brigading you. I think you're mistaken about Google not being able to index Flash; they've had that capability for about eight years to my knowledge.

I do agree with you that Google's security model is broken in many ways, both on the web and in Android.


Thanks, I might not be right about any of it, but saying bad things about Apple and Google, or both at the same time :), tends to result in downvotes.

As in most things there's probably some truth on all sides.


Steve Jobs killed Flash by refusing to put it on the iPhone. Google nailed down the coffin. (And we owe both our eternal gratitude.)


I'm using Google Chrome in an instrumented fashion on https://urlscan.io/ and I've always thought that using Chrome that way would make a pretty good crawler. Maybe this is me being naive, but it's certainly not surprising that crawlers do execute JavaScript at this point in time.


Google executes JS, but maybe not on every website. If you have a JS error reporting tool on such a site then you can get reports from Google IP addresses. I saw them first maybe 4 or 5 years ago.

Executing JS everywhere would require a lot of CPU time, and I think Google prefers not to do that when possible. And indexing a JS app is a very complicated task anyway (it is difficult for a robot to even find navigation elements if they are implemented as divs with onclick handlers instead of links), so you'd better use sitemaps to make sure the bot can find content.

And I don't think it is necessary to index rich apps. It makes no sense to index a ticket search app (the data become outdated too fast) or an online spreadsheet editor. Just make indexable pages as server-rendered HTML pages and put their URLs into a sitemap.

Also, Google looks for strings in JS code that look like URLs (e.g. var url = '/some/page') and crawls them later.
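
For illustration, here's a rough sketch of what such a heuristic could look like; the regex, the app.bundle.js filename, and the idea of queueing the results for a later crawl are all my own assumptions, not anything Google has documented:

    // Rough sketch of a "URL-looking strings in JS" heuristic.
    // The regex and app.bundle.js are assumptions for illustration only.
    const fs = require('fs');

    function extractCandidateUrls(jsSource) {
      // Match quoted strings that look like full URLs or absolute paths.
      const pattern = /['"`]((?:https?:\/\/[^'"`\s]+)|(?:\/[\w\-.\/?=&]+))['"`]/g;
      const urls = new Set();
      let match;
      while ((match = pattern.exec(jsSource)) !== null) {
        urls.add(match[1]);
      }
      return [...urls];
    }

    // Anything that comes out could be queued for a later crawl.
    const bundle = fs.readFileSync('app.bundle.js', 'utf8');
    console.log(extractCandidateUrls(bundle)); // e.g. [ '/some/page', ... ]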


Google's announcement when they started parsing javascript: https://webmasters.googleblog.com/2014/05/understanding-web-...


They had already been doing it for a while by 2014. For example, I saw them doing it on my site in January 2012: https://www.jefftk.com/p/googlebot-running-javascript

(Disclosure: I work for Google, on unrelated stuff, though I didn't at the time I wrote that blog post.)


This is a subject that really irks the engineering side of me. It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.

Why is it that Google doesn't get flak for not discovering content that's engineered to send the absolute minimum over the wire, cache intelligently in localStorage and IndexedDB, and scale well by distributing the appropriate amount of rendering work to the client agent? Why can't I expose a (JSON/)REST-API-to-deep-link mapping and have Google just crawl my JSON data and understand (perhaps verifying programmatically some percent of the time) that the links they show in search will deep link appropriately to the structured JSON content they crawled?

It's such a waste of talent and resources to force server-side rendering. There's obviously the resource cost of transmitting more repetitive content over the wire, and requiring servers to do more work that the client could do. (Yes, even with compression this will still be a higher cost, because more repeated sequences reduces the value of variable-length encoding). But more than that, what bothers me is that there's this false truth that server-side rendering is a requirement for modern architectures, which must result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side and client-side rendering with the same code.

This is not about time-to-first-byte either. Yes, the user-perceived latency matters, but the idea that server rendering even solves this problem is again utterly false. Sure, the time to very first byte ever may be faster, but that's not a winning long-term strategy unless you never expect your client to request the same content twice (or come back to your site at all). When properly cached and synchronized, the client-side-only app has many orders of magnitude faster TTFB, because it's coming from disk or even memory, and can be shown immediately. The only thing left to do is ask the server "what's new since my last timestamp?"
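
As a minimal sketch of that "what's new since my last timestamp?" pattern (the /api/items endpoint, its since parameter, and the localStorage cache shape are all hypothetical):

    // Minimal client-side delta-sync sketch. The /api/items endpoint, its
    // ?since= parameter, and the cache shape are hypothetical.
    function render(items) {
      // Stand-in: update the DOM however the app normally does.
      console.log('rendering', items.length, 'items');
    }

    function mergeById(existing, updates) {
      const byId = new Map(existing.map(item => [item.id, item]));
      updates.forEach(item => byId.set(item.id, item));
      return [...byId.values()];
    }

    async function loadItems() {
      const cached = JSON.parse(localStorage.getItem('items') || '[]');
      const lastSync = localStorage.getItem('lastSync') || '1970-01-01T00:00:00Z';

      render(cached); // show cached content immediately, no network round trip

      // Then ask the server only for what changed since the last visit.
      const res = await fetch('/api/items?since=' + encodeURIComponent(lastSync));
      const delta = await res.json();

      const merged = mergeById(cached, delta);
      localStorage.setItem('items', JSON.stringify(merged));
      localStorage.setItem('lastSync', new Date().toISOString());
      render(merged);
    }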

All of these benefits seem to be completely disregarded 99% of the time because the golden "SEO" handcuffs are already on. I really hope we can get away from this mindset as a community and instead let the better-engineered sites with the best and fastest UX start driving search engine technology over time, instead of the other way around.


> result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side and client-side rendering with the same code

Is this problem really that difficult? Why?

Why should your code care if it is running on my computer or yours?

Isomorphic JS has been around for years. Build your product on a bloated tech stack relying on an increasingly poorly planned web of dependencies, and I'll agree it could be challenging.

> Why can't I expose a REST API to deep link mapping and have Google just crawl my REST API

They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

> cache intelligently in localStorage and IndexedDB

Speaking of hundreds of thousands of wasted engineering hours... HTTP caches are simple and straightforward. IndexedDB leaks memory in Chrome so badly that Emscripten had to disable it (https://bugs.chromium.org/p/chromium/issues/detail?id=533648 https://github.com/kripken/emscripten/pull/3867/files). Mozilla advised developers not to adopt Local Storage due to the inherent performance issues. (https://blog.mozilla.org/tglek/2012/02/22/psa-dom-local-stor...) And how many wasted hours went into WebSQL?

> utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content

Actually, it makes a lot of sense. Content needs to be discoverable. Hosting a complex language in a VM where the slightest deviation from the 600-page specification (and that's just for the core language...not the browser APIs) causes failure -- that's not "discoverable". It's like putting up a billboard with one giant QR code, just because that makes it easier to develop the content.


Isomorphic. Thank you, I was searching my brain for that word for like half an hour. :-)

> Why should your code care if it is running on my computer or yours?

It shouldn't. But my users already care about perceived latency, and that is directly limited by the speed of light. My users want feedback as quickly as possible that their input has been received, and that something is happening in response. Thanks to the speed of light, this would ideally take place instantly right in front of their eyeballs. That can't happen yet, so as much as can realistically happen on my user's CPU, memory, and storage is the next best thing.

> They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

What I meant to say was JSON, so I'm contributing to my own pet peeve of saying "REST" and meaning "JSON." :-p

HTML is awesome and does a wonderful job of letting me mark content in a way that it can be efficiently rendered, semantically, and be both human-readable and marginally machine readable. There are two problems, though. The first is that full documents (since the article points out AJAX is not performed by Google) are incredibly repetitive and wasteful, especially when retrieving the same content fragments multiple times.

The second is that it is strongly coupling content and presentation, two orthogonal concepts, much earlier than is optimal. Sure, you can cache full documents and display them when requested again, but the more common case is that a large subset of what I just displayed to my user will be displayed again, with one new item, but has still invalidated my cache because the granularity is at the full-page presentation level, and not the business domain object level. If, instead, I cache and render business objects on the client side, I can be more intelligent and granular with my caching strategy, react much more quickly to my users' feedback, and have a much smaller impact on their constrained devices. Not only that, but transmitting structured business objects instead of presentation-structured content lets me more efficiently reuse that data across devices for which HTML may not be the most effective way to present the data to them.

My personal architectural bents aside, the truth remains that content discovery agents (e.g. indexers) should not be treated as content delivery agents with such a huge influence on content format. This ends up creating (IMO) too much influence over external engineering decisions, rather than allowing engineers to think critically about the right architecture that gives users the best possible experience.

Most importantly, I'm not saying that all the engineering effort should be placed upon the discovery agents. Of course there are limits on how much they can discover on their own, and (as always in matters involving many parties) there need to be good conversations about the state of things, and what we think is the right direction to go to support each other and our users. It's just been my opinion lately that this is not so much a conversation anymore as a unidirectional stream of "best practices" coming from a single group.


Yeah, I understand the server-side rendering vs. client-side updating, and the design benefits of API-driven development. And unfortunately, a lot of popular JS frameworks haven't done a great job about helping with these.

Closure Library/Templates was meant to render server-side and bind JS functions after render, or create client-side dynamically. (Interestingly, the historic reasons were performance, not SEO.)

React and Meteor have good server-side stories. Angular 2 is getting one.

I would say there is a lot of low hanging fruit in just avoiding most client-side JS. Take http://wiki.c2.com/ -- the "original" wiki. That should all be static. Same with blogs, documentation, and lots of other public, indexable content.


[Disclosure: I work at Google but don't work on anything related to the crawler]

All this anger aside, I'm actually pretty impressed with the world we live in and proud of my company. Think about how far we've come, that merely crawling and indexing the vastness of the internet is so mundane now. Now we should expect the whole internet to be downloaded and executed. That's got to be a great security and integrity problem. Surely someone has tried to break out of the sandbox. Can that be abused to affect the SEO of other sites? The easy answer is "spin up a new VM for each page", but that would slow the indexing process down by orders of magnitude.


I'm not sure where you're sensing anger. The thread so far is a pretty great example of the discourse I've come to really appreciate on HN. Sure, disagreement may be uncomfortable or feel awkward to read at times, but I think it's easily for the best. I'd much rather have somebody disagree with me and give good reasons than just blindly agree.


Yeah, I'm probably thinking mostly of the heavily data-driven, dynamic web application use case, since that's the kind of thing I've been working on for 5+ years now. I imagine that the vast majority of internet content actually consists of much more long-form prose that doesn't benefit quite so much from a deferred-rendering approach, since it varies little if any from user to user. In fact, that would probably be an overall systemic loss, since now the same work is being done many times to render the same content, when it could be done once and cached for all.

And I don't expect Google or anyone to be able to support every edge case, either. I really would just like some sort of better solution that involves a global minimum of effort to achieve the same thing -- indexing what the user actually sees (non-private info, at least), and helping users discover sites that will give them a great experience and not just sites that give indexers a great experience.


If you're developing websites which are inoperative if the user agent does not support JavaScript, your development practices are broken.

I browse without JavaScript by default, and if a page doesn't load properly because someone decided to implement not a web page but a web page viewing client-side web application, I usually just leave. Then there are terminal browsers like lynx.

Moreover, reimplementing a web browser's navigation logic for a specific site is silly. It will in all likelihood be less reliable than the browser's own navigation logic, and it will always be slower for the initial page load than just serving a normal web page.

Yes, maybe you'll make things slightly faster for subsequent page loads. But consider that initial page loads from search engine referrals may well be the most important case, latency-wise. And if you do server-side rendering with progressive enhancement, you can have your cake and eat it if you really want to implement your own navigation logic with pushState, etc.; serve the static page and enhance it with an async script.
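
A bare-bones sketch of that progressive-enhancement setup, assuming Express, a hypothetical /static/enhance.js script, and a stand-in lookupArticle() data layer:

    // Sketch of "serve a normal page, enhance it asynchronously".
    // Express, /static/enhance.js, and lookupArticle() are assumptions.
    const express = require('express');
    const app = express();

    // Stand-in for a real data layer.
    function lookupArticle(slug) {
      return { title: slug, bodyHtml: '<p>Server-rendered content goes here.</p>' };
    }

    app.get('/articles/:slug', (req, res) => {
      const article = lookupArticle(req.params.slug);
      res.send(
        '<!doctype html><html><head>' +
        `<title>${article.title}</title>` +
        // async: the content is readable before (or without) this script running
        '<script async src="/static/enhance.js"></script>' +
        '</head><body>' +
        `<article><h1>${article.title}</h1>${article.bodyHtml}</article>` +
        '</body></html>'
      );
    });

    app.listen(3000);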


Uhh, the answer to all your "why"s is pretty much "because that would be an open door to abuse"? For better or worse, Google has figured that the best way for them to have the most accurate indexing is to get things the exact same way browsers do, and figure it out from there.


On the point of indexing what the user sees, we agree completely. On the statement that Google does this now, that's not actually true. The current fact is, obviously from the article content, that my browser can do things that Google won't. Namely, AJAX, which is critical to truly scalable pages that cache and perform minimal delta requests. Even the JavaScript required using Google-specific webmaster tools.

It's fairly clear that just by developing those webmaster tools, Google is effectively (and understandably to a point) saying that they won't try to create engineering solutions to certain problems. Or that at this point it's not a sound financial investment because, after all, people will come to them because they're the biggest game in town.

If you're referring to my alternate crawling strategy suggestion of mapping a deep link structure into a structured REST URL structure, that's just an optimization I'd love to see. Really, I just think that search indexers should index what my users see, regardless of my engineering decisions.


    It's utterly ridiculous that engineering and efficiency
    decisions are so deeply affected by whether or not the
    largest search engine will properly index your content.
There being more sites that can only be crawled and indexed well if you run js is probably good for Google relative to competitors. Anything that makes crawling the web harder increases the barrier to entry, making it harder for sites like DuckDuckGo to serve search results from their own crawl.

(Disclosure: I work for Google, on unrelated stuff.)


Google brings something that no amount of fancy engineering and elegant solutions can provide: New Users. Because of that it is necessary to dance to their SEO tune.


Isn't it ironic, though? Google exists because they had the best solution in 1998 for helping users discover the content that already existed.

Of course, now that they're gigantic, they now are the primary force that's adding unneeded engineering complexity to keep content in a format they can already read.

So who's really helping whom?


>So who's really helping whom?

Off-topic, but Google's propensity to migrate anything popular to their own hosted content is a better example of this. They find ways to present the good stuff without the end user ever visiting your site. At some point, this starves out the sources.


"That's some nice data you go there, guys. It'd be a shame if... somebody scraped it and kept that page view and ad revenue for themselves."

Again ironically, this will put Google out of business if they keep it up, unless they can start to collect all that data on their own or otherwise incentivize content producers to allow them access.


You do know you can give Google an XML sitemap of your site.


> It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.

I would argue that it is nothing to do with Google.

Google's official line is that they will index your content, even if your entire frontend is a big client-side-only SPA. They explicitly tell you to focus on having good content that people actually want to read, and little else.

As others have suggested in this thread, it is likely that they developed Chrome for this reason. My understanding is that they basically crawl the web using a headless version of Chrome, and take a snapshot of the augmented HTML straight away, after 5 seconds, and after 10 seconds.

Other search engines aren't so clever, and so the work to have your important content available even when Javascript is disabled is done so you will show up on Bing, Baidu, Yahoo, etc also.

To go further, I reckon Google would love it if you didn't do this, as it would give them an edge over their competition which has less advanced capability to crawl the web.


Not everyone uses JavaScript, and as a user I prefer basic HTML websites over JavaScript bloat every day. Also note from the article that the conclusion seems to be that Google does in fact index JavaScript content, but only for trusted websites. I very much like this decision, because JavaScript (and Ajax in particular) is so easy to abuse, and it's never a win for anybody when you're redirected to a bloated website that downloads 100s of MB over 100s of requests and from multiple domains. Also, they're a billion dollar company, I'm sure they've already A/B tested enough to decide that indexing JavaScript is not good :)



Good, uncomplicated article.

If you can get Google servers to execute Javascript, that sounds like a possible attack vector. It's likely that Google runs these in a proprietary feature-sparse interpreter.

The lack of AJAX would make it difficult to leak information about the black-box interpreter.


For a more in-depth look at how Google treats JS, watch this talk https://youtu.be/JlP5rBynK3E by Google's John Müller at an Angular conference.


While Google is certainly the main search engine most people use, isn't it to some extent also very important what other engines such as Bing, Yandex, Baidu, etc. do?

If you have a professional website you want to be found also by these other engines. Until these also support JavaScript, you may end up with a hybrid SEO architecture anyway, which means nothing was gained?


>If you have a professional website you want to be found also by these other engines.

Do you really? Except if you are interested in the Chinese market.

In fact the inverse is probably more true: if you run one of those other search engines, then you want to be as good at indexing any particular site as Google is.


Well, first, Google is the major player but not at 100% market share; some figures say 80%. It differs by country and continent...

Second, supporting other platforms is also important; think of FB, Twitter, etc.

So either this really becomes the new standard that everyone supports or you may end up with a hybrid or traditional approach if you care about other platforms as well - imho you should.


I've checked a site I know, which uses nothing but Angular/JS on the frontend, against PageSpeed Insights [1], and it completely failed that test - no results visible. Also, the site is not indexed beyond the root URL itself. No page snippet preview, nothing.

[1] https://developers.google.com/speed/pagespeed/insights/


Nicely done! I have been writing crawlers for a while now and executing Javascript is very expensive and slow, even for Google. When I crawl the web I usually run javascript from a headless browser only on top priority sites.


What I find very useful is to run Firefox through Selenium with a plugin installed to disable pictures. Then it's blazing fast.
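
Roughly the same effect can be had without a plugin by flipping Firefox's image-loading preference; a sketch using Node's selenium-webdriver (assuming Firefox and geckodriver are installed):

    // Sketch: Firefox via Selenium with image loading disabled.
    // Uses a Firefox preference instead of the plugin mentioned above.
    const { Builder } = require('selenium-webdriver');
    const firefox = require('selenium-webdriver/firefox');

    async function fetchPageSource(url) {
      const options = new firefox.Options()
        .setPreference('permissions.default.image', 2); // 2 = never load images

      const driver = await new Builder()
        .forBrowser('firefox')
        .setFirefoxOptions(options)
        .build();
      try {
        await driver.get(url);
        return await driver.getPageSource(); // HTML after JS has run
      } finally {
        await driver.quit();
      }
    }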


Best post I've seen on Hacker News. I've always played it safe and never assumed Google would index content displayed dynamically with JavaScript, but now I know!


If I click the link from the article that leads to the webcache version, I get "yes, but embedded only".

If I click the link within that page that leads to the exact same webcache url, I get "yes, embedded and external but no ajax".

If I google the site, the preview text is the non-changing portion of the text only ("This is an experiment to...") - not even a "No".

I think Google is just trolling us.


Beyond the theory, the talks, and the articles: I have multiple 100% JS-rendered pages (blank page with no JavaScript).

Google is crawling and indexing them with zero issues.


I wonder what Google does to avoid indexing too many pages. There are a fair number of SPAs and software like shopping carts that have a large number of checkboxes, pulldowns, knobs, dials, etc. that both change the content and the current URL query params.


I think it avoids indexing too many pages by not triggering the checkboxes, pulldowns, etc.

Through Google Console/Webmaster Tools, you can tell Google which query params your website uses. But for my modest website, google only uses the page query (?page=3) and I don't notice it using the other queries.


I tried this on Bing and DuckDuckGo, and they only have the body text in the description.


I'm loading i18n before showing any content, and was afraid Google wouldn't index the content, but it didn't have any problem doing that.


This is a great way to test a hypothesis, and a good experiment.

I'll mention that there is a rule that was added a few years back, in the Backbone.js days, that URLs with /#!route anchors will enable (read: force) AJAX requests and JavaScript from the spider. It still remains a helpful way to force caching/indexing of JavaScript-only pages in Google.
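
For reference, the related _escaped_fragment_ convention from Google's old (now deprecated) AJAX crawling scheme mapped the #! fragment onto a query parameter the server could answer with an HTML snapshot. A sketch of that server side, assuming Express, a hypothetical index.html app shell, and a stand-in renderSnapshot helper:

    // Sketch of handling the old #! (escaped fragment) convention.
    // The crawler rewrites the #! fragment into an _escaped_fragment_
    // query parameter and expects an HTML snapshot in return.
    // Express, index.html, and renderSnapshot() are assumptions.
    const express = require('express');
    const app = express();

    app.get('/app', (req, res) => {
      const fragment = req.query._escaped_fragment_;
      if (fragment !== undefined) {
        // Crawler: serve a prerendered snapshot of the requested route.
        res.send(renderSnapshot(fragment));
      } else {
        // Regular browser: serve the JS application shell.
        res.sendFile(__dirname + '/index.html');
      }
    });

    // Stand-in for whatever prerendering mechanism the site uses.
    function renderSnapshot(route) {
      return `<html><body><h1>Snapshot of ${route}</h1></body></html>`;
    }

    app.listen(3000);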


One random thought... Google goes to SomeWebsite.com. The site has only enough HTML to load a big ol' JavaScript app, which Google slowly crawls. Well, that JS app makes a bunch of AJAX calls. There's no reason I can think of that would prevent Google from remembering which AJAX calls were made, and then just crawling the URLs for those calls on subsequent visits. Why load SomeWebsite.com's JavaScript every time you want to index the site, when you can just remember that the JS calls SomeWebsite.com/some-endpoint.json? Sucking the JSON out of an endpoint might even be faster than indexing regular HTML. Haven't written a lot of crawlers, so I'm mostly guessing here.


Crawling AJAX data alone makes no sense because it could be just a piece of JSON, and Google needs a rendered HTML page with a URL it can show in the results. If you have some data that are not available at a separate URL (e.g. they are loaded when the user presses a button), they will not be indexed.


Because then the crawler would be making the assumption that the main page with the JS loader on it would never change. If it never loads that page and always assumes it should be requesting the XMLhttpRequest endpoints, then changes to the site as a whole might never be indexed.


You still have to run the JS on the original page to know which text loaded from the JSON actually gets shown. You might have N pages loading the same file but showing different things.


We've got a website (React + Babel + AJAX), and we monitor those AJAX requests because of bad scrapers :) And we constantly see Googlebot: at least 1k requests per day, with a Googlebot agent from Google's IP range. So yes, Google does AJAX and does understand packed React as well.
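
A rough sketch of that kind of monitoring, assuming Express and a hypothetical /api prefix; the reverse-DNS check is the usual way to confirm an IP that claims to be Googlebot (Google also recommends a forward lookup to confirm):

    // Sketch: log AJAX requests that claim to be Googlebot and verify the IP
    // via reverse DNS (hostname should end in googlebot.com or google.com).
    // Express and the /api prefix are assumptions for illustration.
    const express = require('express');
    const dns = require('dns').promises;

    const app = express();

    app.use('/api', async (req, res, next) => {
      const ua = req.get('User-Agent') || '';
      if (ua.includes('Googlebot')) {
        try {
          const [hostname] = await dns.reverse(req.ip);
          const genuine = /\.googlebot\.com$|\.google\.com$/.test(hostname);
          console.log(`Googlebot-claimed request: ${req.originalUrl} verified=${genuine}`);
        } catch (err) {
          console.log(`Googlebot-claimed request: ${req.originalUrl} (reverse DNS failed)`);
        }
      }
      next();
    });

    app.listen(3000);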


My experience is that content hidden behind JS will get indexed later, if at all, and it will not be updated as often. Also, they will run JS for bigger sites first, and not so much for smaller sites.


I think one of the most frustrating things about indexing from Google is the complete lack of transparency. I understand that it helps Google slow down the arms race of search engines, but it also means that devs doing 100% banal work need to sift through mountains of rumors and spin up sites to test assumptions.

I have literally heard every combination of practices with regard to SEO and have no idea what is truly correct. Every source contradicts the others, Google employee statements contradict those, etc.


If you don't want to deal with the ambiguity of whether your AJAX will run or not, I'll shamelessly suggest https://www.prerender.cloud/ which is helping a few sites who couldn't get Google to execute their AJAX.


Here's an interesting experiment from a while ago: http://searchengineland.com/tested-googlebot-crawls-javascri...

tl;dr: Google indexes JS-generated content.


Is that safe? E.g. exploits, privilege escalation, etc.


My pet theory is that part of the anonymous usage data Chrome sends back, is digested page contents that go into pagerank. And such browser level digesting would be on rendered pages (after JavaScript execution.)

I have no other reason to believe it is true other than it's what I would do to distribute the job of crawling the web to my users if I were Google :-)


I think that is unlikely because then SEO specialists would generate and send such data to improve the positions of their websites.

But Google could use this to discover new URLs that are not linked anywhere.


Some URLs are private by design, so I think it would be an awfully bad idea to discover new URLs in this way.


There were cases when such private URLs got into search engines. You should protect such URLs with authorization or at least hide them behind a POST form. Or block them with robots.txt.

Also if a user follows a link from such page, the URL would be leaked via Referer header so it is not secure anyway.


I feel robots.txt is the most effective way to prevent a private site from being crawled by Google, followed by an inline <meta name="robots" content="noindex"> tag.


Chrome by default searches Google with whatever you type in the top bar - so unless you disable that, private URLs are sent to them and then scanned (however, if the correct robots.txt file is set up for those URLs, then Google's crawler would stop at that point).

I believe it was this setting (I have it disabled as well as other options - however I believe it's only this option that sends data back) [1].

[1] https://i.snag.gy/cmkGzp.jpg


Yeah, it would make sense. There has to be more reason to give people a free web browser than just that it uses Google search.


Chrome does send domains/URLs entered in the omnibar to Google (I think someone did the experiment to test it several years ago), but sending out page content to Google? If it were true, it would cause a huge legal and privacy problem, especially in places with tight privacy laws like Europe.


Exactly, imagine if Google sends the page content after you log into your bank account.


And yet this is exactly what happens with automatic translation.


Microsoft did exactly that with their Bing toolbar.

https://www.google.com/search?q=hiybbprqag


The reason is Google's interest in advancing the web platform. They're afraid of content migrating into either native apps or Facebook, and Chrome is their (somewhat successful) attempt to make the open web competitive.


> Does Google execute JavaScript?

Yes.

There are sites that can't be loaded without javascript that are indexed fine by Google. The only explanation is that they run some javascript.


They have been for almost three years...



