[clug] googlebot doing funny things in logs

Scott Ferguson scott.ferguson.clug at gmail.com
Wed Jun 15 00:47:15 MDT 2011


On Wed, 15 Jun 2011 11:55:30 +1000 Hal Ashburner wrote:
> I changed my server on the weekend and after restoring the mythtv 
> database forgot to put the usual
> User-agent: *
> Disallow: /
> in a robots.txt file.
> I was just glancing through some logs and in amongst one or two 
> seemingly fairly unsophisticated attempts at entry, google bot made an 
> appearance.
> 

Right there - without reading any further - I can see you're heading for
trouble.... :-)

> It first asked for robots.txt, which seems good manners (and which 
> wasn't present)
> Then asked it for
> /mythweb/settings/database
> then
> /mythweb/settings/tv/screens
> then
> robots.txt again
> then
> /mythweb/tv/schedules/manual
> only then did it ask for
> /
> 

I mean no offence but... that's what Googlebot does. If you think that's
a problem, wait (could be months) until Bingbot finds you - and you get a
dozen Bingbots trying to index your site simultaneously.
And there are far worse things crawling my sites than Bing. At least
Bing and Google respect robots.txt.

It (Googlebot) can only index what it finds within the limitations of
your robots.txt - and with no robots.txt in place there were no
limitations. So if it requested those files (and I don't doubt that it
did) - then it did so because those files exist - and you allowed them
to be visible.


<snipped>
> I'm a little weirded out by it, in truth.

As you should be - it's a shock to any sane person to see things
publicly accessible that should be private.

On the plus side - you have now put robots.txt in place and this is a
good time to check all your file permissions *and install .htaccess into
every directory within your webserver's purview*. :-)

> 
> Hal
> (One of the dreaded CLUG "list-only" members ;-) The list is brilliant 
> imho).


I say that having recently cleaned up a number of sites that were
compromised because of a failure to use .htaccess.
Robots.txt restricts polite bots only. It is not something you want to
rely on.
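
To make the "polite bots only" point concrete, here is a rough sketch
using Python's standard urllib.robotparser. It shows what a
rule-following crawler decides when it reads the two-line robots.txt
quoted at the top of this thread. The function name is my own invention,
and the empty string is only a stand-in for a missing file. An impolite
bot never runs this check at all, which is exactly the problem.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted at the top of this thread.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"


def polite_bot_may_fetch(robots_txt, user_agent, path):
    """Return True if a crawler that obeys robots_txt would request path.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/", "/mythweb/settings/database"):
        # With the file in place a polite bot stays away ...
        print(path, polite_bot_may_fetch(BLOCK_EVERYTHING, "Googlebot", path))
        # ... with an empty file (standing in for a missing one) it does not.
        print(path, polite_bot_may_fetch("", "Googlebot", path))
```

Nothing here stops a client that simply ignores the answer, so treat
robots.txt as a courtesy notice and not as access control.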

As for Google recruiters - I've only experienced one (I'm probably not
that attractive!) - she was very patient and polite - and clearly has a
*very* tolerant sense of humour! ;-p (but that's another story).

At least you stopped the problem of your files becoming publicly
searchable - you never know when a MythTV vulnerability will let a
Google search lead miscreants to point Metasploit at your server.

But please do implement .htaccess files, and check those permissions -
there are enough malware servers out there infecting Windows machines that
then slow my firewall down with their incessant attempts to spread their
diseases. ;-p
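
If walking the tree by hand sounds tedious, something along these lines
could do the first pass for you. It is only a sketch under my own
assumptions: the function name is made up, the web root is whatever you
pass in (/var/www in the usage line is a guess, not your actual path),
and it only reports directories with no .htaccess plus anything that is
world-writable. It does not look at what the .htaccess files contain.

```python
import os
import stat


def audit_webroot(webroot):
    """Return (dirs_without_htaccess, world_writable_paths) under webroot.

    Hypothetical helper: reports problems, changes nothing on disk.
    """
    missing = []
    writable = []
    for dirpath, dirnames, filenames in os.walk(webroot):
        # Flag every directory that has no .htaccess of its own.
        if ".htaccess" not in filenames:
            missing.append(dirpath)
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode
            # Symlink modes are meaningless for this check, so skip them.
            if stat.S_ISLNK(mode):
                continue
            # S_IWOTH is the "anyone may write" permission bit.
            if mode & stat.S_IWOTH:
                writable.append(full)
    return sorted(missing), sorted(writable)


if __name__ == "__main__":
    no_htaccess, world_writable = audit_webroot("/var/www")  # assumed path
    for path in no_htaccess:
        print("no .htaccess:", path)
    for path in world_writable:
        print("world-writable:", path)
```

It deliberately fixes nothing. You would still want to eyeball each
entry before deciding what belongs there.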

You could also consider using https and a login to additionally secure
your site - and *if* it (sic) runs a CMS consider changing the admin
directory from halssite.com.au/admin/ to something less predictable like
halssite.com.au/8ur_dQ/
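
For picking such a name, Python's standard secrets module will hand you
one without you having to mash the keyboard. A tiny sketch follows. The
function name and the default length are my own choices, and an obscure
path is an extra on top of the login, not a replacement for it.

```python
import secrets


def random_admin_dirname(nbytes=6):
    """Return a random URL-safe directory name (8 characters for 6 bytes).

    Hypothetical helper for illustration only.
    """
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Output differs on every run, e.g. something shaped like 8ur_dQxZ
    print(random_admin_dirname())
```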

Regards

Scott Ferguson
(Another one of the CLUG list only members - unless ten years ago counts!)

