> If you're removing code or changing an endpoint,
> be careful you don't screw the Google bot, which
> might be "viewing" 3-day-old pages on your
> altered backend.
An interesting proposition. Personally, unless I was operating in some sector where keeping Googlebot happy was key to staying competitive and there was solid evidence it could hurt my page rank, I don't think I'd be prepared to go to this length. Google is doing quite an atypical thing here compared to regular browsers and I'd like to think Google engineers are smart enough to account for this type of thing in the early stages of planning.
They have a difficult cache invalidation problem here. The only way to find out if the Javascript in use on a site has changed is by checking if the page HTML has changed. And on top of that, the Javascript can change without any noticeable change to the HTML.
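To picture the problem (purely a hypothetical sketch of that heuristic, not anything known about Google's actual pipeline): if all the crawler keeps is a fingerprint of the page HTML, then an unchanged fingerprint looks like an unchanged page, even when the JavaScript or the endpoints it calls have changed underneath.

```python
import hashlib
import urllib.request

def page_fingerprint(url: str) -> str:
    """Hash the raw HTML; a crawler comparing fingerprints like this
    sees 'no change' even if the JS behind the page, or the backend
    endpoints that JS calls, changed in the meantime."""
    html = urllib.request.urlopen(url).read()
    return hashlib.sha256(html).hexdigest()

def needs_recrawl(url: str, last_fingerprint: str) -> bool:
    return page_fingerprint(url) != last_fingerprint
```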
Googlebot also does some other crazy stuff, like looking at URL patterns and then trying out variations... it's almost like they're trying to sniff URLs!
For example if I have a page:
www.domain.com/xyz/123
Googlebot (without any links pointing to these pages) will actually try URLs like:
www.domain.com/xyz/1234
www.domain.com/xyz/122
www.domain.com/xyz/121
and so on...
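To make the pattern concrete, here's a toy sketch (my own guess at what the probing looks like from the server's side; Google hasn't published anything about this) that generates the kind of neighbouring URLs shown above:

```python
import re

def guess_neighbour_urls(url, spread=2):
    """For a URL ending in a number, generate the nearby numeric variants,
    the sort of probing described above: /xyz/123 -> /xyz/121 ... /xyz/125."""
    match = re.search(r"(\d+)$", url)
    if not match:
        return []
    base = int(match.group(1))
    prefix = url[:match.start(1)]
    return [f"{prefix}{n}" for n in range(base - spread, base + spread + 1) if n != base]

print(guess_neighbour_urls("http://www.domain.com/xyz/123"))
# -> the /xyz/121, /xyz/122, /xyz/124 and /xyz/125 variants
```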
It's crazy how much 'looking around' they do these days!
I'm not too surprised. I've got Googlebot still requesting old URLs even though there are no incoming links to them (that I know of) and they've been either 404'd or 301-redirected for six months. I even tried using 410 Gone instead of 404, but it made no difference.
To reiterate this further: I am still 301'ing URLs that have been dead for nearly 5 years, and I still get requests for them. I don't want to 404 them for fear of losing that slight bit of traffic, so I just 301 them. I am really surprised they don't remove these URLs from their cache, and I can't for the life of me think why they don't.
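For what it's worth, keeping a blanket 301 map around is cheap. A minimal sketch of the idea (Flask used purely as an example; the routes and the mapping are made up):

```python
from flask import Flask, redirect

app = Flask(__name__)

# Hypothetical mapping of long-dead URLs to their closest current equivalents.
LEGACY_REDIRECTS = {
    "/old-section/widget-guide": "/guides/widgets",
    "/2008/some-ancient-post": "/blog/some-ancient-post",
}

@app.route("/<path:path>")
def legacy_redirect(path):
    target = LEGACY_REDIRECTS.get("/" + path)
    if target:
        # Permanent redirect keeps whatever residual traffic and link value exists.
        return redirect(target, code=301)
    # Per the thread, crawlers keep coming back whether this is a 404 or a 410.
    return "Gone", 410
```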
It sounds like the only thing requesting the 404'ed page was Googlebot, which I do not believe sends a referrer. If this is true, then it would mean either that Google does not clear their cache (which I doubt), or that the link exists somewhere on the net, but in a place where no human would find it. I've done some work with web crawlers, and you fall into that type of hole a lot more often than I would expect.
I removed a whole section from the site at the same time. Webmaster Tools shows the incoming links for every page in that section as other pages in the same section: a whole loop of pages linking to each other and generating inbound links, even though none of them have existed for many months.
Yeah, Webmaster Tools is really slow to update. Thankfully they offer a way to delete old pages. If they come back, though, it should show the source of the link.
Your users may be, too. It's not unusual for me to open my sleeping laptop several days later and expect the open web pages to work without refreshing them.
Please don’t even call out to the server, unless I actively interact with the page. I sometimes open 50 or 60 browser tabs at once, and when I unsleep my laptop or connect to a new wifi hotspot, many of them try to simultaneously make such ajax calls, which prevents any other web pages I want to open from loading until those calls either make it through or time out. Occasionally I have to SIGSTOP my main browser and open a different one if I need to access something online right away. Even pages of ostensibly static content, like years-old news articles, now are littered with "web 2.0" doodads on them which do this kind of crap.
I currently have 42 tabs open in Chrome (in two separate windows). Yes, it's like a todo list. It would work better if I had a mechanism to de-duplicate tabs that were open to the same URL - sometimes I find that I've got three tabs open all monitoring the same CI build. But I also keep several JIRA bugs open in tabs so I don't forget them, two Gmail accounts and twitter, I've got five-ish articles that I was in the middle of reading, some reference documents for a few projects I'm in the middle of, a couple of youtube videos that I haven't found time to watch yet...
That was not the question. I am paying for a small office to reduce my distractions and allow me to focus. I am just interested in how people manage the distractions; with 100 tabs open, I would presume they are not all related to the single task at hand.
Most of them are open from other tasks. But that doesn't make them a distraction. They're there for when the current task finishes and I can go back to them.
The alternative is to close them all down and reopen them, which is vastly more time consuming.
I wonder if it's the visual site previews/thumbnails that you get when you click the arrow next to a Google search result that are doing this.
Perhaps Google fetches the crawled page from the cache and then renders that for the previews?
This was my first thought, and seems likely. They do several forms of analysis on their cache. It could even be some engineers running tests or queries that require rendering the page or at least bootstrapping the DOM.
Is this surprising? I'd expect the possibility of this sort of behavior from any system that was vaguely Map-Reduce-y and operated on the scale of data that Google's indexing does.
That's not the issue here: we include the MD5 hash of the content in the URL of every JavaScript/CSS asset, and new pages had all the correct (brand-new) URLs. The issue is that Google is executing JavaScript on HTML pages they downloaded days ago. The only solution I can see is to fire off CloudFront cache expiration requests for all old assets, but that negates the simplicity of including the hash of the content in the URL.
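To make that setup concrete, here's a rough sketch of the general technique (not the poster's actual build; the file names and the CloudFront distribution ID are placeholders): each asset gets its content hash baked into its filename, so a changed file is a brand-new URL, and the only way to deal with already-served HTML that still references the old URLs is an explicit invalidation.

```python
import hashlib
import time
from pathlib import Path

import boto3  # needed only for the CloudFront invalidation part

def hashed_asset_name(path):
    """app.js -> app.<first 12 hex chars of md5>.js, so the asset URL
    changes whenever (and only when) its content does."""
    p = Path(path)
    digest = hashlib.md5(p.read_bytes()).hexdigest()[:12]
    return f"{p.stem}.{digest}{p.suffix}"

def expire_old_assets(distribution_id, paths):
    """The workaround mentioned above: explicitly expire old asset paths
    at the CDN, because HTML fetched days ago still points at them."""
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            # CallerReference just has to be unique per invalidation request.
            "CallerReference": f"purge-old-assets-{int(time.time())}",
        },
    )
```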
Is it possible that people are looking at the page from Google's cache? I'm thinking of the 3taps kind of 'web site scraping that doesn't look like web site scraping'.
Well an interesting check would be to look at one of your pages in the cache that fires an AJAX call and see where that call comes from. I agree it would be 'weird' if it came from Googlebot instead of the browser looking at the cache.
At Blekko we post-process extracted pages from the crawl, which, if a site puts content behind JS, could result in JS calls offset from the initial access, but 3 days seems like a long time. Mostly, though, the JS is just page animation.
Would it make sense that loading from the cache makes a call to the origin server?
I just checked one of my sites, which loads available delivery dates via AJAX, through the Google cache, and yep, Google caches the AJAX result too: the dates shown are from when the cache snapshot was taken.