Subtle bug in Google can get you banned
61 points by jacquesm on Oct 3, 2010 | 28 comments
I just had my IP banned from Google after enabling the 'Google toolbar', and at first I couldn't understand why.

The toolbar is useful because it allows you to see the PageRank of your pages; after enabling it, the browser will restart.

And that's where the problem occurs: if you have a large number of browser windows or tabs open when you do this (or if you do a session recovery after a browser crash), Google will interpret the flurry of toolbar requests when the browser comes back up as automated requests to their servers and will block your IP accordingly.

Highly annoying! Effectively, the use of one (luxury) Google service disables the use of another that is far more essential.

I hope there is a way out of 'toolbar induced google purgatory'.

update: I can use google again (after 15 minutes), but the toolbar still does not function.




There are a lot of people that scrape Google pretty badly, so we do need to have protection against bots, including ones that look like the Google toolbar. If you're resuming ~50 tabs, I can believe that might look like a scraper to us for a while. I'm glad you could do regular Google searches again after 15 minutes or so.


So, are you seriously telling me that google can't tell the difference between their own toolbar used by a logged in user and a bot?

How about changing the toolbar code so it paces the requests to something that sits below the frequency of the 'ban for bot use' trigger? That would seem to me to be an obvious fix.


Bots can probably perfectly duplicate the behavior of a toolbar. Only the rate and volume of requests would be different.

I'm assuming the toolbars can't communicate between each other. On toolbar launch, it should pick a random number between 1 and x and wait that many ms before contacting google. Pick x by looking at the number of req/sec that trigger a ban and the high-end number of tabs a power user might restart with. This would spread the requests out over that time period and keep it under the ban.
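A minimal sketch of that idea, assuming the toolbar code runs once per tab and can use window.setTimeout; MAX_JITTER_MS and fetchPageRank are hypothetical names and the numbers are made up:

    // Hypothetical per-tab startup jitter: each toolbar instance waits a
    // random delay before its first PageRank lookup, so a 50-tab session
    // restore doesn't hit Google all at once.
    var MAX_JITTER_MS = 60000; // spread the burst over ~1 minute (assumption)

    function scheduleInitialLookup(tabUrl) {
      var delay = Math.floor(Math.random() * MAX_JITTER_MS) + 1;
      window.setTimeout(function () {
        fetchPageRank(tabUrl); // hypothetical helper that does the actual request
      }, delay);
    }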


The toolbar should really only pull the page rank for active tabs. This would effectively only require a single request initially.


> Bots can probably perfectly duplicate the behavior of a toolbar. Only the rate and volume of requests would be different.

If they do, isn't that precisely what Google would want? Isn't it only the rate and volume of requests that are a problem?


There is plenty of malware that finds victims by looking at the results of google searches. Google seems to think that they have an obligation to prevent the indiscriminate spread of self-replicating infovores. Fucking Censorship if you ask me.


They might not want someone to build a large database of pageranks.


OK, wouldn't that take a very, very long time? Google has a database that's probably terabytes big, and if someone really wants that data, can't they do something like what DDG does? I believe they get their search results from Yahoo for free.


Of course they can communicate with each other; they're extensions, not web pages. The first one could act as the 'master' and proxy all the requests.

It's obvious the limiting is rate-based, otherwise this would never have happened; and if it is rate-based, then the toolbar could pace itself to stay below that rate. Of course, that would 'give away' the rate to anyone observing the toolbar during a browser restart, but they could discover it just the same by checking when they get blocked, so that's no loss.

The toolbar knows I'm logged in, knows that the browser has just restarted, and presumably can see how many instances/tabs are open (after all, that's what it provides the info on), so it has all the data at its disposal to make the right decision. This seems like a simple oversight to me (that a user installing the toolbar on a machine with a large number of tabs open would land in this situation).


> Of course they can communicate with each other; they're extensions, not web pages.

Firefox extensions are JavaScript, CSS and XUL, so I don't think that's obvious. I think it's entirely reasonable to assume that they might be sandboxed and have no awareness of each other. Is it one instance of the toolbar per "page-opened" event? Is it one instance per window? What I was describing was a way to stay under the limit without centralized, state-aware rate-limiting code. If that's possible, then yeah, sure, do it that way.

> It's obvious the limiting is rate-based, otherwise this would never have happened; and if it is rate-based, then the toolbar could pace itself to stay below that rate.

It's not obvious to me. I think the issue is that the OP opens 50 tabs simultaneously after a crash and each one opens a connection to Google without a rate limit of any kind. My idea was a way to do it without centralized state.


> Firefox extensions are JavaScript, CSS and XUL, so I don't think that's obvious. I think it's entirely reasonable to assume that they might be sandboxed and have no awareness of each other.

Multiple Mozilla extension instances are indeed able to communicate via some centralised code.


It's possible to do with JavaScript modules: "JavaScript code modules are a concept introduced in Gecko 1.9 (Firefox 3) and can be used for sharing code between different privileged scopes." https://developer.mozilla.org/en/Using_JavaScript_code_modul...
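For what it's worth, a rough sketch of how that could be applied here; every name is hypothetical (the module would have to ship with the extension and be mapped to a resource:// URL via a 'resource' line in chrome.manifest), and this is not how the real toolbar works:

    // rank-queue.jsm -- hypothetical shared module; Gecko loads it once,
    // so every window that imports it sees the same RankQueue object.
    var EXPORTED_SYMBOLS = ["RankQueue"];

    const Cc = Components.classes;
    const Ci = Components.interfaces;

    var RankQueue = {
      _pending: [],
      _timer: null,
      _intervalMs: 2000, // pacing interval; the real 'safe' rate is a guess

      enqueueLookup: function (url, callback) {
        this._pending.push({ url: url, callback: callback });
        if (!this._timer) this._start();
      },

      _start: function () {
        var self = this;
        this._timer = Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer);
        this._timer.initWithCallback({
          notify: function () {
            var job = self._pending.shift();
            if (job) job.callback(job.url); // the caller performs the actual fetch
            if (self._pending.length === 0) {
              self._timer.cancel();
              self._timer = null;
            }
          }
        }, this._intervalMs, Ci.nsITimer.TYPE_REPEATING_SLACK);
      }
    };

Each toolbar overlay would then just do Components.utils.import("resource://mytoolbar/rank-queue.jsm") and push its tab's URL onto the shared queue, so the pacing state lives in one place without any master/slave negotiation between instances.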


> Of course that would 'give away' . . .

Meh, I have trouble believing that spammers cannot experiment to find this number out themselves. A binary search on the rate would require only a handful of IPs to pin it down to sufficient resolution for working purposes.
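Roughly like this; probeTriggersBan is a stand-in for "send requests at a given rate from a fresh IP and see whether you get blocked", i.e. the experiment described above, not a real API:

    // Hypothetical sketch: bisect on requests/second to find the ban
    // threshold. Each probe costs at most one IP, so a handful of
    // iterations pins the rate down to working precision.
    function findBanThreshold(probeTriggersBan, lowRate, highRate, steps) {
      for (var i = 0; i < steps; i++) {
        var mid = (lowRate + highRate) / 2;
        if (probeTriggersBan(mid)) {
          highRate = mid; // banned at this rate: the threshold is lower
        } else {
          lowRate = mid;  // survived: the threshold is at least this high
        }
      }
      return lowRate; // highest rate observed not to trigger the ban
    }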


Wouldn't it be simpler to rate limit the toolbar? E.g. it would only send x requests/second (rough sketch below)? Then bots wouldn't emulate it because it wouldn't be able to provide a high enough rate to be really useful.

Of course, that solution is so simple, I'm sure there's a reason it's not possible.
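For concreteness, capping at x requests/second usually means something like a token bucket; this is a generic sketch (names and numbers are illustrative), not anything the toolbar actually ships:

    // Generic token-bucket limiter: roughly `ratePerSec` requests per
    // second on average, with short bursts up to `capacity`.
    function TokenBucket(ratePerSec, capacity) {
      this.ratePerSec = ratePerSec;
      this.capacity = capacity;
      this.tokens = capacity;
      this.lastRefill = Date.now();
    }

    TokenBucket.prototype.tryAcquire = function () {
      var now = Date.now();
      // Refill tokens for the time elapsed since the last check.
      this.tokens = Math.min(this.capacity,
          this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec);
      this.lastRefill = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return true;  // caller may send the request now
      }
      return false;   // caller should queue the request and retry later
    };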


You're oversimplifying a real problem. Would you "authenticate" your own toolbar somehow? That's probably excess effort for a temporary 15 minute ban. Would you rate limit it? Well that's a lost cause if I ever saw one; bots can rate limit themselves too.

It's not an easy problem to solve.


Other factors can also come into play, e.g. you could be sitting on an IP subnet where someone else has been scraping Google, or a worm has been sending automated queries to Google.


> you could be sitting on an IP subnet where someone else has been scraping Google

Unlikely; I'm in the sticks, and most people here are old and wouldn't know a mouse from a keyboard.

> or a worm has been sending automated queries to Google.

That would have to be a Linux-based worm then. Unless that suspected worm is sitting on another IP, of course.

Do you want me to try to make it reproducible? I'd happily spend the time if it would help to make this problem go away. I understand how hard it is to differentiate between bots and regular users, but you should be able to pick up the difference between your own toolbar in normal use and a bot.

And if that's not the case then either the battle is 'lost' or it might be better to simply only let the toolbar query the google servers when explicitly asked to do so.


I'd report that to Google as a bug.

http://www.google.com/support/toolbar/bin/request.py?contact...

The official toolbar should not exceed request limits that were designed to prevent PageRank scraping by third-party software.


It's now been over half an hour and the toolbar still doesn't function.

"We're sorry...

... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now."

Right...

I'll try to file a bug with them, but my experience with google and support issues so far does not lead me to believe that anybody will actually read the report.

"I know this form is used to track new issues so I won't receive a response"

Does not give me great hope.


Reminds me of how I used to get banned from reading Google Groups all the time just for opening a bunch of threads in tabs from Google Reader. Now I know that I have to open a couple, read a couple, go back to Reader and repeat. Kind of a shame that people have to act differently just to not be banned as robots.


Yeah, I usually need help to do Google's captchas too. Amazingly hard to convince them you're human.


Generally when this type of thing happens, Google will reply with a captcha that, if you correctly solve it, will let you keep going for a while. I guess toolbar requests might be a little different than web requests.


No captcha to be seen; the IP ban is still in place, and it's now 8 hours later.


Enabling instant search in Chrome did this for me. Pretty sad really.


OP, any chance you could say how many total tabs+windows you had open? That should be useful while this bug is open.

Thanks for the tip!


About 50 in all. And I'm on a 10 Mbit link; possibly on a slower link it would not have triggered. More than an hour has passed now, so I think I'll give up for the day (3:50 am here anyway) and hope that things will have normalized by tomorrow.

What a silly situation to be in.

I could change my IP by calling my provider but it is also entered in a fairly large number of ACLs that will not be updated automatically.


Same thing happened to me with the AutoPagerize extension while incrementally refining my search, because none of the results I was getting back were meaningful. I moved to Duck Duck Go and Bing. If they don't want me to use them for search, then fine; plenty of alternatives. Haven't missed Google so far.


Somebody from http://geotool.flagfox.net/?search=82.128.1.251 hacked into my Gmail a/c




