Rendering AJAX-crawling pages (googleblog.com)
116 points by gildas on Dec 8, 2017 | 32 comments



Todd from Prerender.io here. We always knew this day would come eventually :) We are currently serving around 60 million prerendered pages to crawlers every day, with Google being about half of those requests. We are re-caching around 1 billion pages every month with PhantomJS/Headless Chrome. Google is the only crawler executing a meaningful amount of JavaScript, so Bing, Baidu, Yandex, Facebook, Twitter, and other SEO crawlers still need prerendering.

For anyone who needs to update their own crawler to match Google's new JavaScript crawling, we've opened up our prerendering engine, which uses Headless Chrome, at https://prerender.com. You can capture HTML, screenshots, PDFs, or even HAR files from any web page with just an HTTP request to our service, so it's super easy to add JavaScript crawling to any crawler with Prerender.com (and it's open source: https://github.com/prerender/prerender).
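Calling it looks roughly like this; a minimal sketch against a self-hosted instance of the open-source server on its default port 3000 (the target URL is just an example, and the exact endpoint form may differ for the hosted prerender.com API, so check the README):

    import requests

    # Ask a self-hosted prerender instance (default port 3000) to render a
    # JavaScript-heavy page and return the resulting HTML. Endpoint form and
    # target URL are illustrative; adjust to your deployment.
    target = "https://example.com/ajax-heavy-page"
    resp = requests.get(f"http://localhost:3000/render?url={target}", timeout=60)
    resp.raise_for_status()

    rendered_html = resp.text  # HTML after client-side JavaScript has run
    print(rendered_html[:500])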

For our Prerender.io customers, this announcement just means that Google will stop crawling ?_escaped_fragment_= URLs, so they won't request prerendered pages anymore. Instead, Google will just execute the JavaScript directly and index the result.

We've always recommended that our customers use the escaped fragment protocol, so it will be a smooth transition as Google slowly stops crawling the ?_escaped_fragment_= URLs. No changes need to be made if you are currently using Prerender.io. Keep an eye on our Twitter (@prerender) and we'll post updates on Google's transition.

One thing to watch when Google starts executing your JavaScript: keep an eye on the number of pages crawled by Google in your Google Webmaster Tools. In the past, we've seen Google crawl much more slowly when executing the JavaScript itself. Hopefully JavaScript websites don't take a hit in pages crawled per day, since that can make it harder for large sites to keep all of their pages up to date in Google's index.


This isn't so much a service question as a technical bogglement, but I'm very curious what sort of heavy lifting it takes to do 60 million renders a day.

Assuming renders take 7-10 seconds at worst, that means (if I've got my math right!) that you need between (60m/(86400/7) ≈ 4,861) and (60m/(86400/10) ≈ 6,944) renders in flight at any given moment in order to keep up, i.e. roughly 694 renders completing every second. (86,400 = seconds in a day)
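Working that out explicitly (same figures as above; by Little's law, renders in flight ≈ completion rate × render time):

    # Back-of-envelope check using the 60M/day and 7-10s figures above.
    renders_per_day = 60_000_000
    seconds_per_day = 86_400

    completion_rate = renders_per_day / seconds_per_day  # ~694 renders finish per second

    for render_time in (7, 10):  # seconds per render, worst case
        in_flight = completion_rate * render_time  # Little's law: L = lambda * W
        print(f"{render_time}s renders -> ~{in_flight:,.0f} running concurrently")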

...Ahahahaha :)

Given that a single Chrome instance on my new-but-not-particularly-amazing i3 box can be sluggish at the best of times... I have no idea what sort of tolerance Xeon(?)-class hardware (possibly running Xen? :P) has for running multiple entire copies of Chromium... I initially wondered if you needed 1,000 compute instances, then I realized maybe you only needed 400, and now I honestly don't know at all.

--

I'm also curious how using Headless Chrome and PhantomJS is working out. As in, genuinely interested. My understanding is that PhantomJS has pretty much wound down, while Headless Chrome is just different enough from regular Chrome that it's possible to tell which one you're running on (https://news.ycombinator.com/item?id=14936025). I've been idly curious about "perfectly sandboxing" webpages so they honestly can't tell they're not in a "normal" PC/laptop/mobile environment, and my impression is that I'd have to start with a _very_ carefully configured copy of normal Chromium in order to do it.

--

I must admit I got curious about what 60m monthly renders would look like against the pricing structure... but couldn't really figure it out; it's not a simple enough exponential curve (and I can't math for nuts). Single-stepping through the pricing algorithm was very interesting though ($1522 for enterprise, huh, cool).

--

PS. The view-source link at the bottom is unfortunately broken; Chrome blocked opening such URLs recentlyish. Fixing it will likely require, ironically, a little server-side renderer :)

--

EDIT: One last thing, note https://news.ycombinator.com/item?id=15882066 from this thread


Yep, we have LOTS of servers :) We pretty heavily cache pages too.

Headless Chrome is great and we're super thankful that the Chromium team put the work in! PhantomJS is good... it just doesn't have all of the latest features, like ES6. So it was really helpful that headless Chrome came along right as people started using more ES6.

Yeah, Chrome did break the opening of view-source URLs a while back for our https://prerender.io/ buttons at the bottom of the homepage.


Todd,

Prerender is almost exactly what I've been looking for for a while! I am now having a great weekend!

I want to scrape a JSON object within a script tag in an AJAX-heavy webpage. If prerender removes all script tags, does that mean I can't use prerender for my project?

Is there a way to tell prerender not to remove certain scripts?

PS: I intend to use prerender to replace scrapy-splash middleware. Does your team have any plans to build a scrapy middleware?
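For reference, this is the kind of extraction I mean, here shown against the raw (unrendered) HTML; the URL and the script tag's id are made up for illustration:

    import json
    import re

    import requests

    # The page embeds its state as a JSON blob inside a <script> tag in the
    # HTML. Fetch the page and pull the blob out without rendering anything.
    html = requests.get("https://example.com/ajax-heavy-page", timeout=10).text

    match = re.search(
        r'<script[^>]*id="initial-state"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if match:
        data = json.loads(match.group(1))
        print(list(data.keys()))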


So do you have to pivot now? Seems like this will hugely impact your US/Europe customers


We'll keep working on improving Prerender.io and we'll continue serving our customers exactly as we are now. So no pivot for Prerender.io.

Prerender.com is sort of a pivot, but more like just opening up our infrastructure to other crawlers that would like to execute JavaScript too. So we're expanding our offerings :)


I'd be really interested in how you planned for this sort of thing. I'd imagine it would be quite scary to have a business model that you knew would probably disappear in an instant.


Interesting pivot! I built a similar set of services (page.rest, screen.rip & pdf.cool) on top of Chrome Headless too.

Do you think Google's indexing also uses headless Chrome now to index pages?


My read on their support for rendering JavaScript is that they have effectively turned Chrome's engine into their web crawler.

We see some signs of this with the headless Chrome project, and with the fact that Chrome, masquerading as a mobile viewport, can be used as an effective mobile crawler, even when the mobile UX is provided primarily by JavaScript.

I think they still use a plain HTTP request based crawler for "most" sites (mainly for speed), but then flip on Chrome-based crawling for popular sites and for sites that seem to be JS-heavy. I see no reason why, long term, Chrome wouldn't become the primary crawl/render engine for Google.
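You can see the mechanics of this with plain headless Chrome; a minimal sketch (the binary name, viewport size, and mobile user agent string are just examples for illustration):

    import subprocess

    # Render a page with headless Chrome posing as a mobile client and dump
    # the resulting DOM (the HTML after scripts have run, not the raw source).
    # Binary name, UA string, and viewport are illustrative.
    MOBILE_UA = (
        "Mozilla/5.0 (Linux; Android 8.0; Pixel 2) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"
    )

    dom = subprocess.run(
        [
            "google-chrome",
            "--headless",
            "--disable-gpu",
            "--dump-dom",
            f"--user-agent={MOBILE_UA}",
            "--window-size=412,732",
            "https://example.com/",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout

    print(dom[:500])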


It looks like they are currently using Chrome 41 (https://developers.google.com/search/docs/guides/rendering) when rendering pages. I agree that, with all of the work on headless Chrome, they should move towards using it in the future.


Iiiiinteresting.

I wonder if all. the. security. patches. from every subsequent Chrome milestone regularly get backported to M41?

Obviously it's sandboxed to the hilt. Poking the sandbox and seeing what it was made of would be VERY interesting though.


update: John Mueller from Google said [1]:

"If you can only provide your content through a 'DOM-level equivalent' prerendered version served through dynamic serving to the appropriate clients (ed. note: e.g. Google bot), then that works for us too."

[1] https://groups.google.com/d/msg/js-sites-wg/70GqODR-iN4/foUz...
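"Dynamic serving" here just means varying the response by client; a minimal sketch of the idea (the framework, bot list, and file names are my own illustration, not anything Google prescribes):

    from flask import Flask, request, send_file

    app = Flask(__name__)

    # User-agent substrings treated as crawlers; illustrative, not exhaustive.
    BOT_MARKERS = ("googlebot", "bingbot", "yandex", "baiduspider",
                   "facebookexternalhit")


    @app.route("/")
    def index():
        ua = request.headers.get("User-Agent", "").lower()
        if any(marker in ua for marker in BOT_MARKERS):
            # Crawler: serve the DOM-level-equivalent prerendered snapshot.
            return send_file("prerendered/index.html")
        # Regular browser: serve the normal JavaScript application shell.
        return send_file("static/app.html")


    if __name__ == "__main__":
        app.run()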


I love how the title of this article is an understatement. Google is abandoning the AJAX crawling scheme because... it will render JavaScript on all pages. This is awesome news :)


so, they're actually just removing the option for websites that still use a /#! URL structure to send a pre-rendered copy of the page.

not quite the same thing as abandoning AJAX crawling.


They are abandoning the AJAX crawling scheme. "Scheme" in this case is in the sense of "documented interface," not in the sense of "we plan to do things." Easy confusion though, since the scheme didn't see much uptake that I'm aware of.


Good thing they aren't abandoning AJAX Crawling Scheme support though. Specialized lisp dialects are the lifeblood of our industry.


Your joke either wasn't well received or went over people's heads, but I chuckled.


Thanks; now I went back and reread and noticed the capitalization :)


Never change, Hacker News.


Websites have never really had the option to send prerendered content on /#! URLs, since the part after the # is not sent to the server. The closest thing would be to have JavaScript request a prerendered view from another URL and replace the current view, but that doesn't sound useful.


I think that is what they meant by the ?_escaped_fragment_= URLs: they defined a way to map the #! URLs to URLs that could be sent to the server to get a prerendered page. But I may have guessed wrong; I don't really know.
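For reference, that is roughly how the scheme's mapping worked: a crawler that saw a #! URL requested an "ugly" URL with the fragment moved into an _escaped_fragment_ query parameter instead. A rough sketch of the transformation (example URLs; escaping details and existing query strings are simplified away):

    from urllib.parse import quote

    def to_escaped_fragment(url: str) -> str:
        """Map a #! ("hashbang") URL to its crawler-facing form.

        Simplified sketch: ignores URLs that already carry a query string.
        """
        if "#!" not in url:
            return url
        base, fragment = url.split("#!", 1)
        return f"{base}?_escaped_fragment_={quote(fragment, safe='')}"

    print(to_escaped_fragment("https://example.com/app#!/products/42"))
    # https://example.com/app?_escaped_fragment_=%2Fproducts%2F42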


If you're looking to roll your own HTML rendering, https://browserless.io/ is great for doing just that, and it allows you to use puppeteer to do the HTML generation.

You could even go so far as to remove all unnecessary JS and CSS, but that'd require a bit more elbow grease.


It's mildly amusing that this page itself doesn't render without JavaScript enabled.

JavaScript is killing the static Web, and IMHO that's a terrible thing. In part, I imagine Google likes JavaScript pages because they're a barrier to entry for any other search competitor.


The content is hidden inside a noscript tag inside a div.loading with visibility: hidden. However, setting visibility: visible doesn't help. The quick workaround is to edit the noscript section from the developer tools: delete the noscript tag and keep its contents.

By the way, if Google executes the JavaScript in pages, does that mean it could end up mining coins for sites running coin-hive and the like? They probably check what they run and how long they run it, but people can be very clever when the goal is making money.


> The content is hidden inside a noscript tag inside a div.loading with visibility: hidden.

Why would someone do that? I'm reminded of this exchange from the Hitchhiker's Guide to the Galaxy:

> “But the plans were on display…”

> “On display? I eventually had to go down to the cellar to find them.”

> “That’s the display department.”

> “With a flashlight.”

> “Ah, well, the lights had probably gone.”

> “So had the stairs.”

> “But look, you found the notice, didn’t you?”

> “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.”


I think the good people at Scrapy [1] have done a great job of keeping pace by making open-source crawler components that handle JavaScript. It's called Splash [2], and I use it for rendering out screenshots.

[1] - https://scrapy.org

[2] - https://github.com/scrapy-plugins/scrapy-splash
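A minimal scrapy-splash spider looks roughly like this (a sketch assuming a Splash instance on its default port 8050 and the middleware settings from the scrapy-splash README; the URL is just an example):

    import scrapy
    from scrapy_splash import SplashRequest


    class RenderedSpider(scrapy.Spider):
        # Sketch: fetch a JavaScript-heavy page through Splash. Assumes
        # SPLASH_URL = "http://localhost:8050" and the scrapy-splash
        # middlewares are enabled in settings.py per the README.
        name = "rendered"

        def start_requests(self):
            yield SplashRequest(
                "https://example.com/",  # example URL
                callback=self.parse,
                args={"wait": 2.0},      # give client-side JS time to run
                # endpoint="render.png" returns a screenshot instead of HTML
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}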


There is, incidentally, a "render this page" link at the bottom of the page.

A meta tag to autodetect the lack of JS and auto-redirect could be an interesting feature...

(The idea being the JS deletes the tag from the page)


It's so great to read an article like that one and realize halfway through that none of it applies to you. [Smug mode on] Having your site/app in ClojureScript+React+Rum, with flawless server-side rendering, is quite nice.


Google itself uses the fragment meta tag scheme in Google Groups (DejaNews) so users can access certain newsgroups without JavaScript. Will this be deprecated too?


Haha I long for the days of DejaNews! Imagine StackOverflow but without the bullshit. And Google Groups is obviously terrible.

I'd say that Google totally ruined Deja but if I remember correctly it had already declined before the acquisition.


> without the bullshit

Save for the endless cross-posting, out-of-control trolls, and the FTDSOJ thread.


Google's giving me OCRed PDFs, Chinese text, JSON dumps and license plate numbers when I google that acronym. Think I fell off the end of the index there.

What's it mean?



