[clug] googlebot doing funny things in logs

Hal Ashburner hal at ashburner.info
Wed Jun 15 21:08:47 MDT 2011

On 16/Jun/11 9:54 AM, Alex Satrapa wrote:
> On 15/06/2011, at 16:47 , Scott Ferguson wrote:
>> But please do implement .htaccess files, and check those permissions -
>> there's enough malware servers out there infecting Windows machines that
>> then slow my firewall down with their incessant attempts to spread their
>> diseases. ;-p
> I'd also suggest the following advice:
> Don't leave stuff on an Internet-facing host that you don't want to be accessible over the Internet. Your home network is not too small to matter. Your home network is not too small to be noticed.
> It's really simple: someone out there already knows a vulnerability which you and your OS publisher haven't heard of yet. If you start putting complex applications intended for individual use on an Internet-facing host, chances are you're opening a vulnerability which will end up being exploited by someone like lulzsec. The more junk you have installed on the Internet-facing host, regardless of whether it's listening to connections or just installed and "doing nothing", the more opportunities an intruder has of using your machine for their own purposes.
> There's a lot more to securing a machine than simply installing a firewall and DROPping every packet you don't like.
Tee hee! There's even more to securing a machine than that! :P
You've actually got to unplug it completely from the network: as Pwn2own
has shown, just because you're running no services doesn't mean you
can't be hacked. (Ooohhhh sorry ESR, cracked - now let me explain to the 
world that you reckon cracked means hacked, while hacked means something 
that doesn't lead the conversation to ESR being a gun nut who writes 
articles with titles like "Sex tips for geeks") ;-)

Then, because of wireless, Bluetooth, IR, Van Eck phreaking etc., you've
actually got to switch the machine off, unplug it and take out the 
battery if there is one.
But this still leaves the physical access attack vector, so you've got 
to make sure it can't be switched on. With an axe.
Drives could be swapped out to another machine - so they'd better be 
unreadable too. Encrypted isn't good enough because they could torture 
the password out of you even if the encryption scheme is "technically" 

Or option B is to weigh a reasonable assessment of the risk and the
cost against the value of the service, trying to minimise the first
two to some reasonable degree, and then make your trade-off. There are
those who do this professionally, some on this list, who can answer
questions about the practicalities of it better than me. Reckon you're
one of them, Alex! So are you recommending nobody run services visible
to the web unless they are experts who are willing to spend more than
N hours a week securing it?
Mythweb == evil ? ssh tunnel and use a curses interface (write it if it 
doesn't exist) : ssh tunnel and use mythweb invisible to the web as 
htdigest isn't remotely good enough;
DMZ the machine from the lan and keep nothing on the disks other than tv 
Something else? What say you? I'd be interested to hear your (and
others') reasoning.


But back to the original point: how does Google even know that /mythweb
exists? Nothing links to it, it's not my usual location for it, it
is, and I believe always has been, behind a password, and until I forgot
it on the machine changeover on the weekend there was a robots.txt
disallowing everything from anyone if they're remotely polite - which
Googlebot claims to be and usually seems to be.
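
For what it's worth, the "disallow everything" rule is easy to sanity-check
offline. Below is a minimal sketch using Python's standard
urllib.robotparser. The robots.txt contents and the helper name is_allowed
are assumptions for illustration, not a copy of the file that was on the
machine, and the check only says what a crawler that honours robots.txt
should do, not what Googlebot actually did.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a robots.txt that disallows everything for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # With the blanket Disallow, a polite crawler should stay out of /mythweb/.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "/mythweb/"))
    # With no rules at all (the forgotten-file case), everything is permitted.
    print(is_allowed("", "Googlebot", "/mythweb/"))
```

Note that robots.txt only asks crawlers not to fetch; it does not stop a URL
that is already known from being remembered or requested again.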

1) I must have originally had it placed at /mythweb and linked from my 
front page and have forgotten I did this over 2 years ago while exposing 
it via the firewall.
2) I must have not had a robots.txt at that time as well as now.
3) I must have let the password protection down at that time.
4) Googlebot must have proceeded to do its normal practice of following
links and indexing pages with all of these things simultaneously in
effect, and also remembered all this about my domain for over 2 years,
then followed the memory of links rather than actual links, while
keeping the memory of links in its index when refused access.
And there's basically no enquiry I can make with google about it.
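
Short of an answer from Google, the server's own access log is the only
evidence to hand. Here is a rough sketch of pulling out the relevant
requests, assuming the log is in Apache's "combined" format. The sample
lines, the IP addresses (taken from the documentation range) and the
function name googlebot_hits are all invented for illustration. Matching on
the user-agent string alone also trusts whatever the client claims to be.

```python
import re

# Assumed Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Invented sample lines; the addresses are from the 192.0.2.0/24 doc range.
SAMPLE_LINES = [
    '192.0.2.10 - - [15/Jun/2011:10:00:00 +1000] "GET /mythweb/ HTTP/1.1" '
    '401 401 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '192.0.2.20 - - [15/Jun/2011:10:01:00 +1000] "GET /index.html HTTP/1.1" '
    '200 1234 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def googlebot_hits(lines, prefix="/mythweb"):
    """Collect (time, path, status, referer) for self-declared Googlebot
    requests whose path starts with `prefix`."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if "Googlebot" not in match.group("agent"):
            continue
        if not match.group("path").startswith(prefix):
            continue
        hits.append((
            match.group("time"),
            match.group("path"),
            int(match.group("status")),
            match.group("referer"),
        ))
    return hits


if __name__ == "__main__":
    for hit in googlebot_hits(SAMPLE_LINES):
        print(hit)
```

The status column is the interesting one here: a 401 would mean the password
prompt did its job and the request was refused.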

Any one of those I'd say, absolutely fair enough, I'm a goose and I make 
mistakes and the mythweb setup was an experimental diversion, much as 
the whole mythtv thing was and is.
Two of them, yeah why not, coins land heads 4 times in a row, sure.
Three, sure, slap my head about being a bigger goose than I thought but 
we all have bad days, right? And I probably was having at least one or 
two of those 2 years ago now I think about it.
3 + google having an index of the structure of links from over 2 years ago?
Well okay. It *is* possible. Interesting if that's what it is, though, 
huh? I'd also have thought all that occurring simultaneously unlikely. 
If there's no alternative explanation I guess I'd have been doing some 
wrong thinking about that too.
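
To put a number on the coin comparison: a rough sketch, assuming the events
are independent (which my four mistakes probably are not, since bad days
tend to cluster). The function name joint_probability is just for
illustration, and the only figure used is the fair coin from above.

```python
from math import prod


def joint_probability(probabilities):
    """Probability that all of the given independent events happen together."""
    return prod(probabilities)


if __name__ == "__main__":
    # Four fair coin flips all landing heads: 0.5 * 0.5 * 0.5 * 0.5
    print(joint_probability([0.5] * 4))  # 0.0625, i.e. 1 in 16
```

So four heads in a row is a 1-in-16 event: uncommon, but nothing to write
home about, which is about where the argument above lands.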

As it is, there's no damage done; presumably the Goog will "forget" the
no longer relevant links that are on its page for my domain one day,
given they're not even indexed, but it might take more than 2 years.
It's just a bit weird. So yeah, the "best" explanation I have is
"iSuck" and Google is odd.

And if nobody else sees anything like it in the world I guess we can 
safely say that Googlebot is not conducting "research" or I can have 
paranoid fantasies about being specifically targeted by googlebot which, 
I'd have to say, is very unlikely. A lot less likely than them 
acknowledging and apologising for their repeated telephone script "our 
engineers were brainstorming and suggested you'd be ideal for google and 
google wants you - but please read this ad and then passionately make 
your case why google should deign to consider you." Then going on to
refuse to put me in touch with one of these engineers who they claim
are friends of mine and who personally recommended me to discuss why 
they would think such a thing and /whether/ I'm any kind of fit for goog 
- which I might well not be at all. Common sense tells you it's just a 
bit of webstalking by paid placement consultants who are not google 
employees priming applications - paid on commission for applicants who 
get through weeks of interviews, still want the gig, and get hired - to 
the tune of 25% of salary. A good-probability, high-paying raffle ticket
for a couple of phonecalls and some web searching. If you toy with them 
long enough they'll admit to the initial scripted dishonesty. The 
placement consultant industry sucks very hard as most of us are aware - 
Google are no different to the norm for large, poisonous corporations on 
that front even if they do make claims to be different on other fronts. 
I'd love it if they took to that wart on their face, head on, as it were 
and disrupted the odiousness of the industry, wouldn't we all? Some also 
want ponies, ponies that defecate world peace... ;-)

Thanks all for your thoughts.
