[clug] googlebot doing funny things in logs

Thu Jun 16 01:24:26 MDT 2011

On 16/Jun/11 4:52 PM, Angus Gratton wrote:
> On Thu, 2011-06-16 at 15:32 +1000, Hal Ashburner wrote:
>> On 16/Jun/11 3:08 PM, Angus Gratton wrote:
>>> On Thu, 2011-06-16 at 13:58 +1000, Edward C. Lang wrote:
>>>> Truth be told, they were trying to find out if you were watching the
>>>> My Little Pony Friendship is Magic TV series.
>> Oh I do, as I mentioned for world peace, obviously.
>>> ;). At least they didn't erase all Hal's recordings,
>>>
>>> https://bbs.archlinux.org/viewtopic.php?id=63897
>>>
>>> Fight GoogleBot TV Censorship!
>> Someone else with a mythweb site with no links to the /mythweb path (if
>> I read that correctly) and yet googlebot somehow "found" it to index. So
>> I'm not totally alone, curiouser & curiouser.
>>
>> As far as I know mine was, is and will be secured with user name &
>> password and googlebot has, to my knowledge, not seen any of the pages
>> it has tried to index. I can't 100% swear to it as fact because I could
>> have made a few mistakes simultaneously that I missed - i.e. I make no
>> claims to perfection. We're just guessing as to why it's trying (and we
>> know it's failing) to index them at all.
> This is true, but I have to say I'm pretty disinclined to believe a
> targeted mythweb conspiracy over the other potential explanations.
> Firstly, what use would such a conspiracy be to Google? The info they'd
> get would be about programs watched by the subset of people nerdy enough
> to use mythtv, techy enough to run it on a publicly accessible linux
> server, but who accidentally don't secure it properly. What would they
> want that info for? Even if they did want it, why would they put it in
> their search cache?
I have no idea what Google would be looking for or how they would be
looking for it.
Having been one of those who scoffed and said "Damn unlikely" when the
whole wireless streetview story broke, and who were then proven very, very
wrong - once bitten.

> Secondly, if they were probing everyone's server for /mythweb someone
> else would have noticed (lots of paranoid security types out there.)
I have no idea how they might decide to try it, but you're getting to that.
> Thirdly, I think there are still many more possibilities for Google
> getting the URL via Chrome or some other vector - in addition to the
> explanation you described. Here are 3:
>
> * You went to ashburner.info/mythweb with Chrome while setting it up,
> and it 404ed because of some misconfig. Chrome sends the 404ed URL to
> Google as part of their smart 404 handling (as clearly stated in the
> privacy policy.) You fix the server and start using mythweb. Shortly
> thereafter, GoogleBot comes to see if there is anything actually at the
> was-404ing URL. Crawled!
"Clearly stated"? Well, hmm. I wonder how many Chrome users, in an
impartial survey, would answer yes to the question "have you consented
to Google keeping a record of all sites you visit that happened to be
down, for any reason?" It surprises me, for one. But evidently it's
clear to you, so who knows? I consider burying the important stuff in
the fine print, rather than explaining it up front in plain English, to
be plain evil. I thought Google would not do that. I hope Google would
not do that and this is merely a misunderstanding of some kind.

That creeps me out more than googlebot trying to find out how many
privately hosted linux boxes run myth and the details, maybe as research
for Google TV or something.
I'm far more relaxed about my TV box being examined by Google than my web
browsing, to be honest. Even that part of it that doesn't connect
for whatever reason.
> * You've associated one of your Android devices with a Google account
> and by default it's backing up your browser history and/or your
> bookmarks "to the cloud".
Didn't ask for it, not sure where they told me they're doing that or how 
to opt out. (Surely should be an "opt in" service anyway?) Is this the 
case for all android devices, given you essentially have to associate 
them with an account to use them?
> * You've had "Safe Web Browsing" and "send anonymous usage information"
> turned on in Chrome at some point, and it's decided (as allowed for in
> the privacy policy) to send all the links from your hitherto
> unknown-to-Google mythweb page back to Google, so they can be checked
> against its anti-phishing database. GoogleBot has dutifully checked
> them for phishing content, and come up against 401s.
Is there a moral here? Maybe: never use a google browser on any intranet 
ever as they're going to map your non-exposed network and increase the 
chances of intrusion?
> Those scenarios are all unlikely, but they seem simpler to me than
> someone at Google programming the crawler to trawl for mythweb URLs.
All seem creepier to me though.
> Like you say though, I don't think you'll ever know for sure. :/
Why is that?
Are these the sorts of things that should be known, for sure?
Scares me even though I don't think I'm involved in any political
activism of any kind. Did Google just hand over everything private they
could find on one Assange, J. to a secret US court or something, where
Twitter fought it and made it public?

"I do so solemnly swear, Senator, that I am not now, nor have I ever 
been, a member of the anti-google fraternity."

Anyway, this is way off topic and we'll start seeing swarms of googs
taking over, hiding under our beds soon.