Google bot delays executing JavaScript for days (ifixit.com)
128 points by dbeardsl on Oct 22, 2012 | 37 comments



    > If you're removing code or changing an endpoint,
    > be careful you don't screw the Google bot, which
    > might be "viewing" 3-day-old pages on your
    > altered backend.
An interesting proposition. Personally, unless I was operating in some sector where keeping Googlebot happy was key to staying competitive and there was solid evidence it could hurt my page rank, I don't think I'd be prepared to go to this length. Google is doing quite an atypical thing here compared to regular browsers and I'd like to think Google engineers are smart enough to account for this type of thing in the early stages of planning.

They have a difficult cache invalidation problem here. The only way to find out if the Javascript in use on a site has changed is by checking if the page HTML has changed. And on top of that, the Javascript can change without any noticeable change to the HTML.


obligatory: "There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors."


Googlebot also does some other crazy stuff, like looking at URL patterns and then trying out variations... they're almost trying to sniff URLs!

For example, if I have a page: www.domain.com/xyz/123

Googlebot, without any links to those pages, will actually try URLs like www.domain.com/xyz/1234, www.domain.com/xyz/122, www.domain.com/xyz/121, and so on...

It's crazy how much 'looking around' they do these days!


I believe that one's mostly a search for duplicate content - looking for URL parameters that don't make a difference.


I'm not too surprised. I've got Googlebot still requesting old URLs even though there are no incoming links to them (that I know of) and they've been returning either 404s or 301 redirects for six months. I even tried using 410 Gone instead of 404, but it made no difference.


To reiterate this further, I am still 301'ing URLs that have been dead for nearly 5 years, and I still get requests for them. I don't want to 404 them for fear of losing that slight bit of traffic, so I just 301 them. I am really surprised they don't remove these URLs from their cache, and I can't for the life of me think why they don't.


They might have some obscure incoming url from somewhere else on the net.


Webmaster Tools should show that. If you 404 the page, it'll appear in the errors pane after some time and show the incoming sources.


It sounds like the only thing requesting the 404'ed page was Googlebot, which I do not believe tells you the referrer. If this is true, then it would mean either that Google does not clear their cache (which I doubt), or that the link exists somewhere on the net, but in a place where no human would find it. I've done some work with web crawlers, and you fall into that type of hole a lot more often than I would expect.


I'm not sure I understand. Why wouldn't Webmaster Tools show that one hard-to-find link if Googlebot found it?


I removed a whole section from the site at the same time. Webmaster Tools shows the incoming links for every page in the section as other pages in that section. It's a whole loop of pages linking to each other and generating inbound links, even though none of them exist any more and haven't for many months.


Yeah, Webmaster Tools is really slow to update. Thankfully they offer a way to delete old pages. If they come back, though, then it should show the source of the link.


Same here.


Your users may be, too. It's not unusual for me to open my sleeping laptop several days later and expect the open web pages to work without refreshing them.


What might be a good idea for JavaScript-heavy web apps is to make an Ajax call to the server to check whether a refresh of the page is required.
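
A rough sketch of what that check could look like (everything here is illustrative: the /version endpoint and the data-version attribute are invented, not any particular framework's API):

  // the server stamps its current version onto the page at render time,
  // e.g. <html data-version="201210221559">
  $.getJSON('/version', function (data) {
    if (data.version !== $('html').data('version')) {
      // the page is stale; decide what to do about it
      console.log('A newer version of this page is available.');
    }
  });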


Please don't even call out to the server unless I actively interact with the page. I sometimes open 50 or 60 browser tabs at once, and when I unsleep my laptop or connect to a new wifi hotspot, many of them try to make such Ajax calls simultaneously, which prevents any other web pages I want to open from loading until those calls either make it through or time out. Occasionally I have to SIGSTOP my main browser and open a different one if I need to access something online right away. Even pages of ostensibly static content, like years-old news articles, are now littered with "web 2.0" doodads which do this kind of crap.


Please don't take this the wrong way, but why do you have 50-60 browser tabs open? Are you using browser tabs as some sort of todo list? Is it effective?


I currently have 42 tabs open in Chrome (in two separate windows). Yes, it's like a todo list. It would work better if I had a mechanism to de-duplicate tabs that were open to the same URL - sometimes I find that I've got three tabs open all monitoring the same CI build. But I also keep several JIRA bugs open in tabs so I don't forget them, two Gmail accounts and twitter, I've got five-ish articles that I was in the middle of reading, some reference documents for a few projects I'm in the middle of, a couple of youtube videos that I haven't found time to watch yet...


There have been days when I have had more than 100 browser tabs open. Users should be able to have as many tabs as they want.


That was not the question. I am paying for a small office to reduce my distractions and help me focus. I am just interested in how people manage the distractions; with 100 tabs open, I would presume they are not all related to the single task at hand.


They aren't.

Most of them are open from other tasks. But that doesn't make them a distraction. They're there for when the current task finishes and I can go back to them.

The alternative is to close them all down and reopen them, which is vastly more time consuming.


Please don't do that. I left that tab open on purpose and I'm halfway through reading the page. If the page refreshes, I'll likely lose my position.


You don't have to refresh the page; you could make it so that your next page click loads the full page instead of using Ajax/pjax.

A quick pjax example:

  <!-- the server stamps the last-updated time onto the page -->
  <html data-lastupdated="1234567890"...

  // ask the server when the content last changed
  $.getJSON('/lastupdated.json', function(lastupdated) {
    if (lastupdated > $('html').data('lastupdated')) {
      // stale page: strip the pjax attribute so the next
      // click does a normal full page load instead
      $('a[data-pjax]').removeAttr('data-pjax');
    }
  });


This is exactly what I meant by my initial comment; I should have been clearer.


You don't have to force a reload; you can just suggest that the user reloads. E.g.: "Please reload the page to get the latest version."
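
If you went that route, the prompt itself could be as simple as this (a sketch; the class name and wording are made up, and it assumes jQuery is already on the page):

  // non-blocking notice; the user decides when (and whether) to reload
  $('body').prepend(
    '<div class="reload-notice">A new version of this page is available. ' +
    '<a href="javascript:location.reload()">Reload</a></div>'
  );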


It's a call that was being made (only) when the page loaded.


I wonder if it's Google's visual site previews (the thumbnails you get when you click the arrow at the side of a search result) that are doing this.

Perhaps Google fetches the crawled page from the cache and then renders that for the previews?


This was my first thought, and seems likely. They do several forms of analysis on their cache. It could even be some engineers running tests or queries that require rendering the page or at least bootstrapping the DOM.


Is this surprising? I'd expect the possibility of this sort of behavior from any system that was vaguely Map-Reduce-y and operated on the scale of data that Google's indexing does.


I'm wondering if some of the simpler cache-busting tricks would force Google to update its cache. For example, somescript.js?v=201210221559.


That's not the issue here; we include the MD5 hash of the content in the URL of every JavaScript/CSS asset. The new pages had all the correct (brand new) URLs. The issue is that Google is executing JavaScript on HTML pages they downloaded days ago. The only solution I can see is to fire off CloudFront cache expiration requests for all the old assets, but that negates the simplicity of including the hash of the content in the URL.
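
For anyone unfamiliar with that fingerprinting pattern, it looks roughly like this (a Node-style sketch; the paths and the /assets prefix are invented and not iFixit's actual setup):

  var crypto = require('crypto');
  var fs = require('fs');

  // build the asset URL from a hash of the file's own contents,
  // so any change to the file yields a brand-new URL
  function assetUrl(path) {
    var hash = crypto.createHash('md5')
                     .update(fs.readFileSync(path))
                     .digest('hex');
    return '/assets/' + hash + '-' + path.split('/').pop();
  }

  assetUrl('js/app.js');  // e.g. '/assets/<md5 of app.js>-app.js'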


Is it possible that people are looking at the page from Google's cache? I'm thinking of the 3taps kind of "web site scraping that doesn't look like web site scraping".


Hmm, that's interesting. I don't think so, though, because the user agent on the requests is Googlebot's:

    From: googlebot(at)googlebot.com
    User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


Well, an interesting check would be to look at one of your pages in the cache that fires an Ajax call and see where that call comes from. I agree it would be "weird" if it came from Googlebot instead of the browser looking at the cache.

At Blekko we post-process extracted pages of the crawl, which, if a site puts content behind JS, could result in JS calls offset from the initial access; but three days seems like a long time. Mostly, though, the JS is just page animation.


Would it make sense that loading from the cache makes a call to the origin server?

I just checked one of my sites, which loads available delivery dates via Ajax, through the Google cache, and yep, it caches that too: the dates shown are from when the cache snapshot was taken.


Did anyone else get really bad font rendering running Chrome on Windows 7?


Yes, it's a known issue and has been a problem for a long time. Quite ironic that it's particularly bad with Google Web Fonts.

http://code.google.com/p/chromium/issues/detail?id=137692



