“Fetch as Googlebot” tool helps to debug hacked sites (mattcutts.com)
65 points by Garbage on Aug 3, 2012 | 24 comments



I know one approach from over 10 years ago was to have your editable site on an isolated server that would periodically copy its content to the hosting site. This meant that any changes to the hosted site would get stamped over by the master copy.
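A minimal sketch of that kind of setup, assuming rsync over ssh; the hostname and paths are invented, and the original poster doesn't say what tooling was actually used:

    # Cron entry on the isolated editing server: mirror the master copy
    # onto the web host every hour, deleting anything that isn't in the
    # master -- injected files included. Hostname and paths are made up.
    0 * * * * rsync -az --delete /srv/site-master/ deploy@webhost.example.com:/var/www/site/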

Still, nothing is perfect, and this is a good read about an issue a lot of people are not aware of.


What's frustrating to me -- as someone who's getting closer and closer to publicly launching a shared hosting service -- is that this problem should be solved already: we have twice-daily backups for our websites, going back up to a year. It's all automated, and customers can access the backups directly. If your site's been hacked, you can log in to the backup server, view the changes for your site's directory, and download the most recent good copy.

For most sites that store their templates as regular files and their contents in a database, this is plenty good enough. For sites that store their content as regular files too, it only takes a few extra minutes to separate the good stuff from the bad stuff.

This is super easy to implement. Every web host should be doing it.
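For what it's worth, a rough sketch of the kind of snapshot rotation described above, assuming rsync with --link-dest; hostnames and paths are illustrative, not our actual setup:

    #!/bin/sh
    # Twice-daily snapshot backup. Each snapshot is a full directory tree,
    # but unchanged files are hard-linked against the previous run, so
    # keeping up to a year of history stays cheap.
    SRC="backup@webhost.example.com:/var/www/"
    DEST="/backups/webhost"
    NOW=$(date +%Y-%m-%d_%H%M)

    rsync -az --link-dest="$DEST/latest" "$SRC" "$DEST/$NOW"
    ln -sfn "$DEST/$NOW" "$DEST/latest"

    # "View the changes" is then just a recursive diff of two snapshots:
    #   diff -r "$DEST/2012-08-01_0600" "$DEST/latest"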


I would urge some caution with regards to backups. If they are easily accessible from the site, then once your site is hacked they are easily accessible to the hacker as well. It is a commonly overlooked area, and can also be a weakness: I have seen many hosted/colo sites where all the servers are isolated and yet still linked via some all-singing, all-dancing backup server.


Yeah, that's something I considered pretty carefully. The backup server is completely isolated from the rest of the network, it pulls the backups via ssh/rsync using a special user account that only has sudo permissions for the rsync command (and can only authenticate via ssh certificate). The only way to break the backup server from a compromised server would be to replace OpenSSH on the compromised server and then wait for the backup server to connect -- and then try to somehow break rsync.
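For the curious, that sort of lockdown usually comes down to two small pieces of config. A hypothetical sketch, with usernames, the IP, and paths all invented:

    # On each web server, ~backup/.ssh/authorized_keys: pin the backup
    # server's key so it gets no tty and no forwarding, and can only
    # connect from the backup server's address.
    from="10.0.0.5",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-rsa AAAA... backup@backupserver

    # /etc/sudoers.d/backup: the backup account may escalate to rsync only.
    backup ALL=(root) NOPASSWD: /usr/bin/rsync

    # The backup server then pulls with something like:
    #   rsync -az --rsync-path="sudo rsync" backup@webhost:/var/www/ /backups/webhost/...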

Thanks for thinking of that though.


What would be the point? If you got hacked once, after you restore the files you'll just get hacked again.

As a reporting tool it could be interesting though.


I agree with Zenst below that it's a separate (though not less important) issue. The backups are there to get the customer back on their feet as quickly as possible, and to protect their data. A surprising number of our clients have had problems with websites related to not having any backups or local copies ... web-based CMSs have made this a remarkably easy trap to fall into.

The nature of the compromise is something we'd be interested in, and hopefully something that they'd bring to our attention. I'd like to be able to have our log monitoring software watch for attempts at common exploits and automatically block them, but it doesn't do that yet. Which is one reason why it's still not ready for launch yet. :-)


Content and software are separate. We're talking about the content, not the software. Sure, you fix the flaw in the software that allowed the exploit, or the weak password, or however the site was taken over, but that is a separate issue.


In PHP they aren't always separate, and PHP is what the vast majority of hacks are on...


They are separate in the PHP software that gets targeted by "the vast majority of hacks". Those hacks are against popular CMS packages that can be scanned for and exploited in an automated fashion. In a CMS, the software being exploited and the content are separate, in PHP as in every other language.

The PHP files where the content and the system are one and the same (hand written pages not using a packaged CMS) aren't part of "the vast majority of hacks" category. Compared to exploiting a WordPress vulnerability across 50+ million installs, someone trying to mess with the black box that is someone's custom-written page happens vanishingly rarely. Your retort doesn't hold water.


> The PHP files where the content and the system are one and the same (hand written pages not using a packaged CMS) aren't part of "the vast majority of hacks" category.

Speaking from experience, this is simply not true. There are automated scanners in the wild which will attempt to detect and exploit common vulnerabilities in simple PHP templating systems and CMSes. One frequently exploited vulnerability is in applications which use URLs of the form:

    index.php?page=foobar
With supporting code along the lines of:

    $page = $_GET["page"]; /* if register_globals isn't set */
    include("pages/$page.html");
Until relatively recently, when PHP started rejecting filenames with embedded null bytes, code like this was vulnerable to input such as:

    index.php?page=../../../../../../proc/self/environ%00
Applications like this are relatively easy to detect in an automated fashion, and were for a time being exploited on a very large scale.


... Wait. Your solution to secure a website against being hacked is to automate reverting any hacked content?


Another thing to do: the webserver should not have any write access to the files it serves.

The files must be created by a different account. For certain setups this can be problematic, but it's a good idea for most.
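A minimal sketch of that split on a typical Linux setup; the user names and paths here are assumptions, not a universal recipe:

    # Content is owned by a separate deploy account; the webserver user
    # (www-data here) only gets group read access and owns nothing.
    chown -R deploy:www-data /var/www/site
    find /var/www/site -type d -exec chmod 750 {} +
    find /var/www/site -type f -exec chmod 640 {} +

    # If the app genuinely needs a writable directory (uploads, cache),
    # grant it explicitly rather than opening up the whole docroot.
    chmod 770 /var/www/site/uploads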


This completely falls over for popular packages like Joomla, unfortunately, which have miserably bad caching systems and file upload mechanisms and web-based upgrade functions and module installers and the like.


Interesting to learn that there's a communication method one can use to directly speak to a human at google and get a helpful same-day reply.

How do I, as someone outside of the music industry, gain access to that communication channel?


Become a paying customer


    $ curl -v -A Googlebot example.org
    $ curl -v -e www.google.com example.org


Unfortunately this does not magically make your request come from Google's IP block for their web crawlers. Just as easily as the attacker put in User Agent checks, they could have put in source IP checks.
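For what it's worth, if you want to know whether a hit really came from Googlebot, the usual check is a reverse-then-forward DNS lookup on the source IP rather than trusting the user agent. Roughly (example IP from Google's published crawl range):

    $ host 66.249.66.1
    $ host crawl-66-249-66-1.googlebot.com

The first lookup should return a hostname ending in googlebot.com (or google.com), and resolving that hostname should give back the same IP.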


I was going to post the same thing. Fetch as Googlebot is nice but really nothing too special.


You are wrong. Stop guessing and then acting like your guesses are facts.


What am I guessing? Your comment does nothing to clarify any mistake I might have made, which I'll gladly admit to.

There are certainly differences between just setting the user agent and running Fetch as Googlebot. (The incoming IP address being an obvious one.)


For the curious, the site mentioned in the article appears to be either Alanis Morissette or The Doors: https://www.google.com/search?q=Generic+synthroid+bad+you


This may help clear up something that has puzzled me about a site I use and often search via Google.

https://www.google.com/search?num=100&hl=en&safe=off...

(In this case, I specifically put in "For Sale" to highlight the spammy drug ads, but they come up even without this).

Puzzle: I scan through looking for pages with "XYZ For Sale" in the title and then check out Google's cached version of the page. Sometimes, I see the spam in the cache, but often enough I don't.

So: how is it that the search result is different from Google's own cache for that page?


The sad thing is, even people who ought not to be amateurs, but are, fall for this trick. My mother's company's site was hosted on GoDaddy (I offered to move it and pay for the hosting as it's a non-profit, but she declined). They swore up and down for weeks and weeks that they were not responsible and there was nothing wrong, when it was a hosted WordPress instance.

Most of the time they (the hackers, sorry, pronoun overload) just naively check the referrer. Going to Google, searching for the site, and clicking the result is often sufficient.


Referrer header considered harmful yet? :-)



