The day I was DoSed by Google (thekeywordgeek.blogspot.com)
152 points by thekeywordgeek on March 12, 2015 | 63 comments



Hi, I work with the Google crawling & indexing teams. Let me check with them to see what their recommendation would be. At first glance (based on the pages cached), it seems like we're just following the links within your site, like we would with other websites. In general, there are a few things one could do in a case like this (I might have more from the team later; these are in no particular order):

- Use rel=nofollow on links you don't need to have followed (this prevents passing of PageRank, which generally means we're less likely to crawl them)

- Use 503 for rate-limiting crawlers. 503 means we'll just retry later.

- Use the crawl rate limit in Webmaster Tools (I see you submitted the report there, so that should be active soon)

- If the content is fully auto-generated, you might choose to use a "noindex,nofollow" robots meta tag on these pages to prevent them from being indexed separately. It's hard for me to judge how useful your content would be in search directly.
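
Purely as an illustration of that last suggestion (not part of the original comment), the same "noindex,nofollow" signal can be sent as an X-Robots-Tag HTTP header instead of a meta tag. A minimal sketch assuming GAE's Python runtime with webapp2, where render_trend() is a hypothetical stand-in for whatever generates the page:

    import webapp2

    class TrendPage(webapp2.RequestHandler):
        def get(self, term):
            # X-Robots-Tag is the HTTP-header equivalent of the robots meta tag,
            # so auto-generated pages can be excluded without touching templates.
            self.response.headers['X-Robots-Tag'] = 'noindex, nofollow'
            self.response.write(render_trend(term))  # render_trend() is hypothetical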


A big thank you for enquiring on my behalf!

A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

I have seen "noindex nofollow" kill a site stone dead in the past so I am very wary indeed of using it. In my experience once you've noindexed a page it is nigh-on impossible to get the engine to index it again.

My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.


FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
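
A stop-gap robots.txt along those lines might look like this (the path is only a guess, based on the trend URLs quoted elsewhere in the thread):

    # Hypothetical: keep all crawlers out of the auto-generated trend pages
    # until there's a strategy for what should be crawled and indexed.
    User-agent: *
    Disallow: /politics/uk/trends/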


And one more thing ... you have some paths that are generating more URLs on their own without showing different content, for example:

http://www.languagespy.com/politics/uk/trends/70th/70th-anni... http://www.languagespy.com/politics/uk/trends/70th/70th-anni... http://www.languagespy.com/politics/uk/trends/70th-anniversa...

I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
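
The usual fix for that kind of duplicate-path bug is to 301-redirect every variant onto one canonical URL. A minimal sketch (webapp2 again; the URL scheme is guessed from the examples above, and the route setup and render_trend() are omitted/hypothetical):

    import webapp2

    class TrendHandler(webapp2.RequestHandler):
        def get(self, term):
            canonical = '/politics/uk/trends/' + term
            if self.request.path != canonical:
                # Collapse /trends/<anything>/<term> variants onto one URL so the
                # crawler stops discovering "new" pages with identical content.
                self.redirect(canonical, permanent=True)
                return
            self.response.write(render_trend(term))  # hypothetical renderer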


> A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

And persistence to track how many crawl requests have been served in the last N minutes. Even blindly serving a million 503's an hour could get really expensive.
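
On App Engine the bookkeeping itself can at least be cheap -- a best-effort memcache counter rather than datastore writes -- though, as noted above, it still takes a running instance to serve each 503. A sketch, with the budget figure made up:

    import time
    from google.appengine.api import memcache

    CRAWL_BUDGET_PER_HOUR = 10000  # hypothetical figure

    def over_crawl_budget():
        # One counter per clock hour; memcache may evict it, which just means
        # the budget occasionally resets early -- acceptable for rate limiting.
        key = 'crawler-hits-%d' % int(time.time() // 3600)
        hits = memcache.incr(key, initial_value=0)
        return hits is not None and hits > CRAWL_BUDGET_PER_HOUR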


Having a page that goes nofollow/noindex and back is fine, when we recrawl it, we'll take the new state into account.


Wouldn't "429 Too Many Requests" be more appropriate than 503? Or maybe Google doesn't respect 429?


It's weird that Google is charging for the GoogleBot bandwidth on its own services. Of course they absolutely don't have to do it, but stories like this make me worry about using Google Cloud Storage.

[edit] That being said, the issue would be the same if it was another hoster or another search engine. I guess the real solution would be to be able to limit the crawl rate, as the OP said.


I thought about that too (I'm a Googler working on cloud), but then a colleague mentioned that this would become a way to get free computation from Google.

So, while I agree with the sentiment that it sucks that this crawling eats the quota, the solution is not to simply bypass the quota.


> but then a colleague mentioned that this would become a way to get free computation from Google

I'm a bit confused. What computation does the GoogleBot cause to be performed that benefits the Google service user? (Not Googlebot related stuff like indexing).

EDIT: Thanks kyrra!


Have a bunch of pages with no real content (but millions of them). Every time someone tries to load a page, do some intensive task (e.g. mining bitcoins). If you just make it appealing to GoogleBot and no one else, you get free computational resources.


Sorry, I still don't get it.

How does not charging for outgoing network traffic make computation free? You'd still be paying for everything else, eg the instances themselves, datastore storage, read/write datastore calls, using the logs API, which means mining bitcoins wouldn't be free.


OP's concern isn't with network traffic, it's with GAE compute time. Googlebot keeps causing instances to run.

If requests initiated by Googlebot were free to run, you could make a giant website full of garbage and use each free request to spend 50ms mining bitcoin.


If the mining is done on Google Cloud Storage, initiated by a Google search bot, can't Google then identify and handle such abuse? I assume Google already scans for multiple types of abuse, such as sites that spread malware.


Google shouldn't really have to do this. Replace GoogleBot with BingBot or GCE with AWS and you still have the same problem. A website operator should be working to make sure search crawlers don't consume too many resources given that the bots follow rules.

Otherwise you'd have a team at every cloud provider trying to figure out how to manage bots.


Now you're suggesting that Google basically devotes a team to detecting "crawler-free-quota abuse", when the real solution needs to handle crawlers from many different sources that aren't all Google.


Is that really how it works?


What about discounted? It is rather unfair for Google to be eating both the funds and the service here.

That, and adding proper support for tuning the crawl rate.


Absolutely, even I can't say I'm deserving of a free lunch.


Absolutely, it's only coincidence that Google are both the host and the spider. I really would appreciate the ability to throttle it though.


When your cloud provider lets a client run their own source code, how can you REALLY determine that incoming traffic is even from a crawler? Do you want them to spawn specific instances of the app just for Googlebot and then use a load balancer to redirect those requests to those instances?

The more you think about this, the more insanely complex it gets.


Google crawlers come from well-known IPs, especially well-known to Google. Appengine requests come through reverse proxies, and there is no fundamental difficulty in not counting requests from crawlers towards the quota. That said, see my other descendant comment.
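
For anyone who isn't Google, the documented way to verify Googlebot is a reverse DNS lookup followed by a forward confirmation. A sketch (result caching is omitted, and the GAE sandbox of the time may not have allowed these lookups at all):

    import socket

    def is_googlebot(ip):
        # Reverse-resolve the client IP, check the name is under googlebot.com
        # or google.com, then forward-resolve it and confirm it maps back.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False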


I once had a site with over a million auto generated pages. I thought if even 1 user visited 1 page a day, I'd be rich.

I no-indexed all of them because they had thin content. Guess what happened to my traffic? Almost no effect.

Stop assuming Google will send you traffic for auto generated pages - do you really think Google will even display them in the first few pages over quality content that actually is written by human beings?

Allowing those auto generated pages to be indexed will do you more harm than good. Noindex them.


Googlebot respects cache control. Have you tried more aggressive caching? There are free solutions out there, like CloudFlare. You can even throttle your site, block some bots, etc., so at least your backend doesn't get shut down.
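
For a cache like CloudFlare to actually absorb repeat crawler hits, the pages need to be marked cacheable. A minimal webapp2 sketch, with an arbitrary max-age and a hypothetical render_trend():

    import webapp2

    class TrendPage(webapp2.RequestHandler):
        def get(self, term):
            # Let an edge cache serve repeat requests for an hour instead of
            # spinning up an App Engine instance for every crawler hit.
            self.response.headers['Cache-Control'] = 'public, max-age=3600'
            self.response.write(render_trend(term))  # hypothetical renderer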


What's the point of letting Google index a site that nobody can get to? I'm pretty sure you're going to tank in their rankings anyway for your site being so sporadically available.

I'd probably go for not allowing spiders to crawl more than a few chosen pages (home, about, etc) until you have enough revenue to support it going to other pages.


Rather than having it hit quota error pages, would it be feasible to give Googlebot a 503 back after a certain number of pages? Setting a Retry-After header to the next day should let it know when to come back for more.

From https://plus.google.com/+PierreFar/posts/Gas8vjZ5fmB (Not sure how official this is but Pierre appears to work for Google)

Primarily the section

"2. Googlebot's crawling rate will drop when it sees a spike in 503 headers. This is unavoidable but as long as the blackout is only a transient event, it shouldn't cause any long-term problems and the crawl rate will recover fairly quickly to the pre-blackout rate. How fast depends on the site and it should be on the order of a few days."

Edit: Looks like the over-quota page is a 503. It couldn't hurt to do it early yourself; Googlebot will see the 503 the same way regardless of what serves it.
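
A sketch of that idea on the GAE Python runtime (webapp2; over_crawl_budget() is a hypothetical budget check like the counter discussed further up, and render_trend() is whatever generates the page):

    import webapp2

    class ThrottledTrendHandler(webapp2.RequestHandler):
        def get(self, term):
            ua = self.request.headers.get('User-Agent', '')
            if 'Googlebot' in ua and over_crawl_budget():   # hypothetical check
                self.response.set_status(503)
                # Ask the crawler to come back in a day rather than hammer on.
                self.response.headers['Retry-After'] = '86400'
                return
            self.response.write(render_trend(term))         # hypothetical renderer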


Silly question, but wouldn't putting something like Crawl-delay in the robots.txt file (somewhat) help alleviate the situation (if it is respected)?

Or maybe even just block crawling of the entire site except for the homepage?


Google doesn't respect crawl-delay, sadly. They rely on the Webmaster Tools setting, which is unavailable to me as I've described.


Restrict it 100% in robots.txt for now so it at least works? And then once you've managed to get through to a human (lol, good luck :( ) at Google, you can go from there.


See my reply to jacquesm above. Very wary of blocking, as sometimes persuading the engine you've unblocked it afterwards is nigh-on impossible.


Do you have any idea if this service will attract any actual users? Based on what I've read, I can't even figure out what it does, so I am certainly not a potential user. But, do you have any actual demand??

What I'm hearing is that you built a massive application, you've run into a technical problem and now you would rather wait on Google to fix it than to take any suggestions on how to get it up for actual users to use. Seriously, don't do this - at your stage, it would be better to have 10 real users than a site that has been fully indexed by Google.

On your note about persuading Google to index your site after being excluded, do you have any actual experience with this happening?? I've been doing this kind of stuff for years and years and have never had a problem. It can take five or six weeks at the outside, but that is still less of a problem than a product that can't be accessed...


I don't think you understand that your site is completely useless right now. Avoiding the death-by-bot-blocking isn't worth it if the site isn't there when you need it.


Unblocking via robots.txt is fine and won't cause problems.


You can easily tell Google not to index your site with a robots.txt -- although you probably don't want to.

You can also tell Google to index your site more slowly, in Google Webmaster Tools, although if I remember right the setting expires every few months and needs to be reset. (It would also be nice if you could restrict the crawl rate in robots.txt, not just Webmaster Tools.) The odd thing here is that Webmaster Tools won't let the author restrict the crawl rate at all. You could always rate-limit Googlebot with your own firewall-ish tools, but that might make Googlebot mad and get you even less indexing than you wanted.

In the end though, if your business depends on Google indexing it, you don't really want to tell Google not to -- or even to index more slowly -- except maybe as a temporary measure via robots.txt while you figure out what to do. If you want Google to index your site, you've got to make it able to stand up to Googlebot traffic; that's kind of just the way it is, nothing too shocking. Caching is often pretty helpful there, and can improve your site's reliability and performance beyond Googlebot issues. People are usually complaining about how to make sure Googlebot comes to _more_ of their site _more often_, not the reverse!


I guess if I accepted the restrictions you've laid out (which I wouldn't necessarily do), I would add a check at the beginning of each request: if it's Googlebot, give it back a 503 or something similar. You can then condition that check on other parameters, for example the time of day, or maybe open up only the parts you want indexed at certain times, as others have suggested.


I'd be concerned as to adverse effects on my indexing. But it wouldn't fix my problem as it would still be a request that would require a GAE instance to handle it.

(edit) Yes, the bot is still hitting the GAE site atm even though it's returning a quota error.


But would the bot do the same number of requests if it got an error?


You could simply limit the pages through your robots.txt and slowly expose them in your sitemap. That would give you control over how many pages are spidered. If you want all your pages spidered and you dump millions of pages into the system the bot will hit you hard, but that's only because you offer a lot of pages to begin with, so that's where you can throttle the accesses.


I am a little concerned about doing that though. Having seen sites killed stone dead by people blocking stuff by mistake in robots.txt and the engines then never looking at them again I'm very wary indeed of blocking stuff I intend later to unblock.


It sounds like you care more about Google traffic than about real users. If you want to do this without having more than a few entries in your robots.txt and your sitemap, then you could simply remove the other pages until you're ready to have them spidered; alternatively, put them behind a login and hand out invitations.

In a nutshell: if you put up millions of pages and tell Google about them, it will index you. If you don't want that, you'll have to make choices about the quantity and/or switch to a different kind of host.

Also, this kind of 'bot trap' tends to attract penalties, so if this is not some ploy to get traffic out of Google you may want to reconsider how you've laid things out. The difference between a legitimate site with a lot of generated pages and a page-spammer is hard to determine, and Google tends to err on the side of caution.


Since your site is down I can't see how it's organized, but I would think hard about having two separate parts to it. One would be a site that changes slowly, perhaps a descriptive page, maybe with a 'best of' or examples that you cull. Then have the main page that is rapidly updated and keep the 'bots out of that. There's no reason to index the rapidly changing map, is there? Just index a slowly changing pointer to it.


But don't you think the bot has already stopped crawling the site due to it being unavailable? That's probably gonna hurt too.


We have a medium-sized website with lots of user-generated content. I think Google has indexed ~2M unique pages and they crawl ~420K every day, which costs us about 1.5 GB/day in traffic. Dedicated servers with a volume traffic option usually bring you a long way in terms of cost if you have this type of traffic pattern. For your case, a cheap E3-1220 at €49 from Leaseweb would go a long way.


As a short term solution, have you considered Cloudflare? If Google is repeatedly crawling the same static pages, you can have them served from the Cloudflare cache instead, and the free version should still have the features you need.

Another option would be to move to a dedicated server. You can get quite a powerful server from a company like LiquidWeb for a few hundred dollars a month. (A "managed" server, so although a bit of know-how is needed to get it performing optimally, they can help you with the basics at least.) I expect with a bit of tuning of your web server (nginx or even apache with mpm_event or worker) you could handle that level of traffic even without caching, but you could also use something like Varnish to do even better.


I'm not familiar with GAE and I'm suspecting this won't work there. But we've recently helped a client "deal" with enormous bot traffic. I dubbed the whole thing the "bot-split approach". One LB routes based on user agent. One traffic line goes to a heavily-cached server running Varnish and is hooked to a "stale" DB (updated hourly). And one traffic line dedicated to realtime/user traffic with almost zero caching. Heavy caching keeps the bot box alive with an OK load while the real user node has plenty of wiggle room.
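
The parent did the split at the load balancer; purely for illustration, the same routing idea expressed as a WSGI middleware might look like this (the bot list and the two backend apps are made up):

    # Route known crawler user agents to a heavily cached app, everyone else
    # to the real-time one -- the application-layer version of the LB split.
    BOT_TOKENS = ('Googlebot', 'bingbot', 'YandexBot')

    def bot_split(realtime_app, cached_app):
        def app(environ, start_response):
            ua = environ.get('HTTP_USER_AGENT', '')
            backend = cached_app if any(t in ua for t in BOT_TOKENS) else realtime_app
            return backend(environ, start_response)
        return app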


How big is the site? I have rented a dedicated server with 100 Mbit unmetered, 1 TB storage, 16 GB RAM and 2 Xeon CPUs for under €100 a month; maybe it's better to look at such an option instead.


Really big :)

Its source is the English language, so if there's a word or phrase that gets used, it has a result. Corpus linguistics is fun like that.


If you've got such a big site, you should avoid resource-priced cloud and go with your own VPS. It might take you more time to set it up, but it will definitely be much cheaper ($50/month or less), and it can surely handle all your traffic until it grows really big.


Yeah, see here for example for a 1 TB/month traffic VPS: http://iwstack.com/, or unmetered traffic with a dedicated box: http://www.online.net/en/dedicated-server/dedicated-server-o...

The cost effectiveness really depends on whether your data would fit into that 1TB or it'd require much more.


Even if his data doesn't fit into that 1 TB, he can always delegate storage to a different VPS and use a load balancer.

Saying "I have infinite data" is not an excuse for not looking for alternatives.



>Your site has been assigned special crawl rate settings. You will not be able to change the crawl rate.

That's the messed up part. I guess the question is, does robots.txt override that or not? If it does, fine. If not, all you need to do is make a few "google ignores robots.txt" posts and the problem solves itself.


Googlebot doesn't use the crawl-delay line in robots.txt, if that's what you're asking. I guess theoretically that's because you can normally change the rate in the Webmaster Tools interface.

It's not really ignoring robots.txt either, as crawl-delay is not an 'official' setting.


We had a similar experience at marine.travel a couple weeks ago.

Google would crawl our GAE site at bursts of about 30,000 requests in 4-minute periods. We had some quota exceeded moments.

On the other hand we got to load test our MongoDB backend in GCE without writing gatling tests. The results weren't promising for our ~$180/month VM.


'I made a site generating infinite pages after pages filled with auto-generated /dev/urandom. It's so precious I want GoogleBot to index ALL THE THINGS!

...and so GoogleBot indexes ALL THE THINGS, eating my quota; if only it indexed my garbage slower.'

Cool story brah.


How about GAE learns to recognize google web crawlers and does not penalize GAE users for that traffic?


"The fact that it's Google who are causing me to use up my budget with Google is annoying but not sinister..."

Actually it is; Google should detect its own GoogleBot and not charge people for using up all the traffic, because this can be used on purpose to have people pay more. Very interesting article. Thank you, Jenny.


Could you maybe start serving 503 to googlebot after hitting a threshold?


When I worked in web hosting, I noticed Googlebot would take down many poorly optimized websites. Your site needs to be able to absorb a simple search indexer crawling it. Next the OP will be writing about Yandex, Yahoo or Bing DDoSing his site.


I would definitely restrict Googlebot from accessing the site, as it's acting like a "bad bot". Besides, does your business model rely on search engine traffic? I hope not.


Most websites rely on search engine traffic. It's how most people access websites these days - type "hacker news" into Google/Bing/Yahoo and click on the top link.

Especially if they haven't been there before.


Google Webmaster Tools allows you to control crawling, or you can just use a simple robots.txt.


He mentioned in the article why neither of those suffice.



