Rendering AJAX-crawling pages (googleblog.com)
116 points by gildas on Dec 8, 2017 | 32 comments



Todd from Prerender.io here. We always knew this day would come eventually :) We are currently serving around 60 million prerendered pages to crawlers every day, with Google being about half of those requests. We are re-caching around 1 billion pages every month with PhantomJS/Headless Chrome. Google is the only crawler executing a meaningful amount of JavaScript, so Bing, Baidu, Yandex, Facebook, Twitter, and other SEO crawlers still need prerendering.

For anyone who needs to update their own crawler to match Google's new JavaScript crawling, we've opened up our prerendering engine, which uses Headless Chrome, at https://prerender.com. You can capture HTML, screenshots, PDFs, or even HAR files from any web page with just an HTTP request to our service, so it's super easy to add JavaScript crawling to any crawler with Prerender.com (and it's open source: https://github.com/prerender/prerender).
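Calling it looks roughly like this; a minimal sketch against a self-hosted instance of the open-source server on its default port 3000 (the target URL is just an example, and the exact endpoint form may differ for the hosted prerender.com API, so check the README):

    import requests

    # Ask a self-hosted prerender instance (default port 3000) to render a
    # JavaScript-heavy page and return the resulting HTML. Endpoint form and
    # target URL are illustrative; adjust to your deployment.
    target = "https://example.com/ajax-heavy-page"
    resp = requests.get(f"http://localhost:3000/render?url={target}", timeout=60)
    resp.raise_for_status()

    rendered_html = resp.text  # HTML after client-side JavaScript has run
    print(rendered_html[:500])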

For our Prerender.io customers, this announcement just means that Google will stop crawling ?_escaped_fragment_= URLs, so they won't request prerendered pages anymore. Instead, Google will just execute the JavaScript directly and index the result.

We've always recommended that our customers use the escaped fragment protocol, so it will be a smooth transition as Google slowly stops crawling the ?_escaped_fragment_= URLs. No changes need to be made if you are currently using Prerender.io. Keep an eye on our Twitter (@prerender) and we'll post updates on Google's transition.

One thing to watch when Google starts executing your JavaScript: keep an eye on the number of pages crawled by Google in your Google Webmaster Tools. In the past, we've seen Google crawl much more slowly when executing the JavaScript itself. Hopefully JavaScript websites don't take a hit in pages crawled per day, since that can make it harder for large sites to keep all of their pages up to date in Google's index.


This isn't so much a service question as a technical bogglement, but I'm very curious what sort of heavy lifting it takes to do 60 million renders a day.

Assuming renders take 7-10 seconds at worst, that means (if I've got my math right!) that you need between (60m/(86400/7) ≈ 4,861) and (60m/(86400/10) ≈ 6,944) renders in flight at any given moment in order to keep up, i.e. roughly 694 renders completing every second. (86,400 = seconds in a day)
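Working that out explicitly (same figures as above; by Little's law, renders in flight ≈ completion rate × render time):

    # Back-of-envelope check using the 60M/day and 7-10s figures above.
    renders_per_day = 60_000_000
    seconds_per_day = 86_400

    completion_rate = renders_per_day / seconds_per_day  # ~694 renders finish per second

    for render_time in (7, 10):  # seconds per render, worst case
        in_flight = completion_rate * render_time  # Little's law: L = lambda * W
        print(f"{render_time}s renders -> ~{in_flight:,.0f} running concurrently")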

...Ahahahaha :)

Given that a single Chrome instance on my new-but-not-particularly-amazing i3 box can be sluggish at the best of times... I have no idea what sort of tolerance Xeon(?)-class hardware (possibly running Xen? :P) has for running multiple entire copies of Chromium... I initially wondered if you needed 1,000 compute instances, then I realized maybe you only needed 400, and now I honestly don't know at all.

--

I'm also curious how using Headless Chrome and PhantomJS is working out. As in, genuinely interested. My understanding is that PhantomJS has pretty much wound down, while Headless Chrome is just different enough from regular Chrome that it's possible to tell which one you're running on (https://news.ycombinator.com/item?id=14936025). I've been idly curious about "perfectly sandboxing" webpages so they honestly can't tell they're not in a "normal" PC/laptop/mobile environment, and my impression is that I'd have to start with a _very_ carefully configured copy of normal Chromium in order to do it.

--

I must admit I got curious about what 60m monthly renders would look like against the pricing structure... but couldn't really figure it out; it's not a simple enough exponential curve (and I can't math for nuts). Single-stepping through the pricing algorithm was very interesting though ($1522 for enterprise, huh, cool).

--

PS. The view-source link at the bottom is unfortunately broken; Chrome blocked opening such URLs recentlyish. Fixing it will likely require, ironically, a little server-side renderer :)

--

EDIT: One last thing, note https://news.ycombinator.com/item?id=15882066 from this thread


Yep, we have LOTS of servers :) We pretty heavily cache pages too.

Headless Chrome is great and we're super thankful that the Chromium team put the work in! PhantomJS is good... it just doesn't have all of the latest features, like ES6. So it was really helpful that headless Chrome came along right as people started using more ES6.

Yeah, Chrome did break the opening of view-source URLs a while back for our https://prerender.io/ buttons at the bottom of the homepage.


Todd,

Prerender is almost exactly what I've been looking for for a while! I am now having a great weekend!

I want to scrape a JSON object within a script tag in an AJAX-heavy webpage. If prerender removes all script tags, does that mean I can't use prerender for my project?

Is there a way to tell prerender not to remove certain scripts?

PS: I intend to use prerender to replace scrapy-splash middleware. Does your team have any plans to build a scrapy middleware?
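For reference, this is the kind of extraction I mean, here shown against the raw (unrendered) HTML; the URL and the script tag's id are made up for illustration:

    import json
    import re

    import requests

    # The page embeds its state as a JSON blob inside a <script> tag in the
    # HTML. Fetch the page and pull the blob out without rendering anything.
    html = requests.get("https://example.com/ajax-heavy-page", timeout=10).text

    match = re.search(
        r'<script[^>]*id="initial-state"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if match:
        data = json.loads(match.group(1))
        print(list(data.keys()))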


So do you have to pivot now? Seems like this will hugely impact your US/Europe customers


We'll keep working on improving Prerender.io and we'll continue serving our customers exactly as we are now. So no pivot for Prerender.io.

Prerender.com is sort of a pivot, but more like just opening up our infrastructure to other crawlers that would like to execute JavaScript too. So we're expanding our offerings :)


I'd be really interested in how you planned for this sort of thing. I'd imagine it would be quite scary to have a business model that you knew would probably disappear in an instant.


Interesting pivot! I built a similar set of services (page.rest, screen.rip & pdf.cool) on top of Chrome Headless too.

Do you think Google's indexing also uses headless Chrome now to index pages?


My read on their support for rendering JavaScript is that they have effectively turned Chrome's engine into their web crawler.

We see some signs of this with the headless Chrome project, and with the fact that Chrome, masquerading as a mobile viewport, can be used as an effective mobile crawler, even when the mobile UX is provided primarily by JavaScript.

I think they still use a plain HTTP request based crawler for "most" sites (mainly for speed), but then flip on Chrome-based crawling for popular sites and for sites that seem to be JS-heavy. I see no reason why, long term, Chrome wouldn't become the primary crawl/render engine for Google.
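You can see the mechanics of this with plain headless Chrome; a minimal sketch (the binary name, viewport size, and mobile user agent string are just examples for illustration):

    import subprocess

    # Render a page with headless Chrome posing as a mobile client and dump
    # the resulting DOM (the HTML after scripts have run, not the raw source).
    # Binary name, UA string, and viewport are illustrative.
    MOBILE_UA = (
        "Mozilla/5.0 (Linux; Android 8.0; Pixel 2) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"
    )

    dom = subprocess.run(
        [
            "google-chrome",
            "--headless",
            "--disable-gpu",
            "--dump-dom",
            f"--user-agent={MOBILE_UA}",
            "--window-size=412,732",
            "https://example.com/",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout

    print(dom[:500])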


It looks like they are currently using Chrome 41 (https://developers.google.com/search/docs/guides/rendering) when rendering pages. I agree that, with all of the work on headless Chrome, they should move towards using it in the future.


Iiiiinteresting.

I wonder if all. the. security. patches. from every subsequent Chrome milestone regularly get backported to M41?

Obviously it's sandboxed to the hilt. Poking the sandbox and seeing what it was made of would be VERY interesting though.


update: John Mueller from Google said [1]:

"If you can only provide your content through a 'DOM-level equivalent' prerendered version served through dynamic serving to the appropriate clients (ed. note: e.g. Google bot), then that works for us too."

[1] https://groups.google.com/d/msg/js-sites-wg/70GqODR-iN4/foUz...
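"Dynamic serving" here just means varying the response by client; a minimal sketch of the idea (the framework, bot list, and file names are my own illustration, not anything Google prescribes):

    from flask import Flask, request, send_file

    app = Flask(__name__)

    # User-agent substrings treated as crawlers; illustrative, not exhaustive.
    BOT_MARKERS = ("googlebot", "bingbot", "yandex", "baiduspider",
                   "facebookexternalhit")


    @app.route("/")
    def index():
        ua = request.headers.get("User-Agent", "").lower()
        if any(marker in ua for marker in BOT_MARKERS):
            # Crawler: serve the DOM-level-equivalent prerendered snapshot.
            return send_file("prerendered/index.html")
        # Regular browser: serve the normal JavaScript application shell.
        return send_file("static/app.html")


    if __name__ == "__main__":
        app.run()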


I love how the title of this article is an understatement. Google is abandoning the AJAX crawling scheme because... it will render JavaScript on all pages. This is awesome news :)


so, they're actually just removing the option for websites that still use a /#! URL structure to send a pre-rendered copy of the page.

not quite the same thing as abandoning AJAX crawling.


They are abandoning the AJAX crawling scheme. "Scheme" in this case is in the sense of "documented interface," not in the sense of "we plan to do things." Easy confusion though, since the scheme didn't see much uptake that I'm aware of.


Good thing they aren't abandoning AJAX Crawling Scheme support though. Specialized lisp dialects are the lifeblood of our industry.


Your joke either wasn't well received or went over people's heads, but I chuckled.


Thanks; now I went back and reread and noticed the capitalization :)


Never change, Hacker News.


Websites have never really had the option to send prerendered content on /#! URLs, since the part after the # is not sent to the server. The closest thing would be to have JavaScript request a prerendered view from another URL and replace the current view, but that doesn't sound useful.


I think that is what they meant by the ?_escaped_fragment_= URLs: they defined a way to map the #! URLs to URLs that could be sent to the server to get a prerendered page. But I may have guessed wrong; I don't really know.
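For reference, that is roughly how the scheme's mapping worked: a crawler that saw a #! URL requested an "ugly" URL with the fragment moved into an _escaped_fragment_ query parameter instead. A rough sketch of the transformation (example URLs; escaping details and existing query strings are simplified away):

    from urllib.parse import quote

    def to_escaped_fragment(url: str) -> str:
        """Map a #! ("hashbang") URL to its crawler-facing form.

        Simplified sketch: ignores URLs that already carry a query string.
        """
        if "#!" not in url:
            return url
        base, fragment = url.split("#!", 1)
        return f"{base}?_escaped_fragment_={quote(fragment, safe='')}"

    print(to_escaped_fragment("https://example.com/app#!/products/42"))
    # https://example.com/app?_escaped_fragment_=%2Fproducts%2F42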


If you're looking to roll your own HTML rendering, https://browserless.io/ is great for doing just that, and it allows you to use puppeteer to do the HTML generation.

You could even go so far as to remove all unnecessary JS and CSS, but that'd require a bit more elbow grease.


It's mildly amusing that this page itself doesn't render without JavaScript enabled.

JavaScript is killing the static Web, and IMHO that's a terrible thing. In part, I imagine Google likes JavaScript pages because they're a barrier to entry for any other search competitor.


The content is hidden inside a noscript tag inside a div.loading with visibility: hidden. However, setting visibility: visible doesn't help. The quick workaround is to edit the noscript section from the developer tools: delete the noscript tag and keep its contents.

By the way, if Google executes the JavaScript in pages, does that mean it could end up mining coins for sites running coin-hive and the like? They probably check what they run and how long they run it, but people can be very clever when the goal is making money.


> The content is hidden inside a noscript tag inside a div.loading with visibility: hidden.

Why would someone do that? I'm reminded of this exchange from the Hitchhiker's Guide to the Galaxy:

> “But the plans were on display…”

> “On display? I eventually had to go down to the cellar to find them.”

> “That’s the display department.”

> “With a flashlight.”

> “Ah, well, the lights had probably gone.”

> “So had the stairs.”

> “But look, you found the notice, didn’t you?”

> “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.”


I think the good people at Scrapy [1] have done a great job of keeping pace by making open-source crawler components that handle JavaScript. It's called Splash [2], and I use it for rendering out screenshots.

[1] - https://scrapy.org

[2] - https://github.com/scrapy-plugins/scrapy-splash
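A minimal scrapy-splash spider looks roughly like this (a sketch assuming a Splash instance on its default port 8050 and the middleware settings from the scrapy-splash README; the URL is just an example):

    import scrapy
    from scrapy_splash import SplashRequest


    class RenderedSpider(scrapy.Spider):
        # Sketch: fetch a JavaScript-heavy page through Splash. Assumes
        # SPLASH_URL = "http://localhost:8050" and the scrapy-splash
        # middlewares are enabled in settings.py per the README.
        name = "rendered"

        def start_requests(self):
            yield SplashRequest(
                "https://example.com/",  # example URL
                callback=self.parse,
                args={"wait": 2.0},      # give client-side JS time to run
                # endpoint="render.png" returns a screenshot instead of HTML
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}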


There is, incidentally, a "render this page" link at the bottom of the page.

A meta tag to autodetect the lack of JS and auto-redirect could be an interesting feature...

(The idea being the JS deletes the tag from the page)


It's so great to read an article like that one and realize halfway through that none of it applies to you. [Smug mode on] Having your site/app in ClojureScript+React+Rum, with flawless server-side rendering, is quite nice.


Google itself uses the fragment meta tag scheme in Google Groups (DejaNews) so users can access certain newsgroups without JavaScript. Will this be deprecated too?


Haha I long for the days of DejaNews! Imagine StackOverflow but without the bullshit. And Google Groups is obviously terrible.

I'd say that Google totally ruined Deja but if I remember correctly it had already declined before the acquisition.


> without the bullshit

Save for the endless cross-posting, out-of-control trolls, and the FTDSOJ thread.


Google's giving me OCRed PDFs, Chinese text, JSON dumps and license plate numbers when I google that acronym. Think I fell off the end of the index there.

What's it mean?



