[clug] googlebot doing funny things in logs
Hal Ashburner
hal at ashburner.info
Thu Jun 16 01:24:26 MDT 2011
On 16/Jun/11 4:52 PM, Angus Gratton wrote:
> On Thu, 2011-06-16 at 15:32 +1000, Hal Ashburner wrote:
>> On 16/Jun/11 3:08 PM, Angus Gratton wrote:
>>> On Thu, 2011-06-16 at 13:58 +1000, Edward C. Lang wrote:
>>>> Truth be told, they were trying to find out if you were watching the
>>>> My Little Pony Friendship is Magic TV series.
>> Oh I do, as I mentioned for world peace, obviously.
>>> ;). At least they didn't erase all Hal's recordings,
>>>
>>> https://bbs.archlinux.org/viewtopic.php?id=63897
>>>
>>> Fight GoogleBot TV Censorship!
>> Someone else with a mythweb site with no links to the /mythweb path (if
>> I read that correctly), and yet googlebot somehow "found" it to index. So
>> I'm not totally alone. Curiouser & curiouser.
>>
>> As far as I know mine was, is and will be secured with user name &
>> password, and googlebot has, to my knowledge, not seen any of the pages
>> it has tried to index. I can't 100% swear to it as fact because I could
>> have made a few mistakes simultaneously that I missed - i.e. I make no
>> claims to perfection. We're just guessing as to why it's trying (and we
>> know it's failing) to index them at all.
> This is true, but I have to say I'm pretty disinclined to believe a
> targeted mythweb conspiracy over the other potential explanations.
> Firstly, what use would such a conspiracy be to Google? The info they'd
> get would be about programs watched by the subset of people nerdy enough
> to use mythtv, techy enough to run it on a publicly accessible linux
> server, but who accidentally don't secure it properly. What would they
> want that info for? Even if they did want it, why would they put it in
> their search cache?
I have no idea what Google would be looking for or how they would be
looking for it.
Having been one of those who scoffed and said "Damn unlikely" when the
whole wireless streetview story broke, and who was then proven very, very
wrong - once bitten.
> Secondly, if they were probing everyone's server for /mythweb someone
> else would have noticed (lots of paranoid security types out there.)
I have no idea how they might decide to try it, but you're getting to that.
> Thirdly, I think there are still many more possibilities for Google
> getting the URL via Chrome or some other vector - in addition to the
> explanation you described. Here are 3:
>
> * You went to ashburner.info/mythweb with Chrome while setting it up,
> and it 404ed because of some misconfig. Chrome sends the 404ed URL to
> Google as part of their smart 404 handling (as clearly stated in the
> privacy policy.) You fix the server and start using mythweb. Shortly
> thereafter, GoogleBot comes to see if there is anything actually at the
> was-404ing URL. Crawled!
"Clearly stated"? Well, hmm. I wonder how many Chrome users, in an
impartial survey, would answer yes to the question "have you consented to
Google keeping a record of all sites you visit that happened to be down,
for any reason?" It surprises me, for one. But evidently it's clear to you,
so who knows? I consider burying the important stuff in the fine print,
rather than explaining it up front in plain English, to be plain evil. I
thought Google would not do that. I hope Google would not do that and this
is merely a misunderstanding of some kind.
Creeps me out more than googlebot trying to find out how many privately
hosted linux boxes run myth and the details, maybe as research for
Google TV or something.
To be honest, I'm far more relaxed about my tv box being examined by Google
than about my web browsing. Even that part of it that doesn't connect, for
whatever reason.
> * You've associated one of your Android devices with a Google account
> and by default it's backing up your browser history and/or your
> bookmarks "to the cloud".
Didn't ask for it, not sure where they told me they're doing that or how
to opt out. (Surely should be an "opt in" service anyway?) Is this the
case for all android devices, given you essentially have to associate
them with an account to use them?
> * You've had "Safe Web Browsing" and "send anonymous usage information"
> turned on in Chrome at some point, and it's decided (as allowed for in
> the privacy policy) to send all the links from your hitherto
> unknown-to-Google mythweb page back to Google, so they can be checked
> against its anti-phishing database. GoogleBot has dutifully checked
> them for phishing content, and come up against 401s.
Is there a moral here? Maybe: never use a google browser on any intranet
ever as they're going to map your non-exposed network and increase the
chances of intrusion?
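For what it's worth, whether googlebot only ever got 401s is something the access log can answer rather than something to guess at. A rough sketch of that check, assuming an Apache-style "combined" log format and the /mythweb prefix from this thread; the function names are made up for illustration, and since the user agent string can be spoofed this only shows what claimed to be Googlebot:

```python
import re

# Assumed: Apache "combined" log format. Adjust the pattern for other servers.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines, prefix="/mythweb"):
    """Return (path, status) for each request under prefix whose
    user agent claims to be Googlebot."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the assumed format
        if "Googlebot" in m.group("agent") and m.group("path").startswith(prefix):
            hits.append((m.group("path"), int(m.group("status"))))
    return hits

def not_refused(hits):
    """Hits answered with anything other than 401/403, i.e. worth a closer look."""
    return [h for h in hits if h[1] not in (401, 403)]
```

Feeding it the lines of the access log and getting an empty list back from not_refused() would at least back up the "it has not seen any of the pages" claim for the period the log covers.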
> Those scenarios are all unlikely, but they seem simpler to me than
> someone at Google programming the crawler to trawl for mythweb URLs.
All seem creepier to me though.
> Like you say though, I don't think you'll ever know for sure. :/
Why is that?
Are these the sorts of things that should be known, for sure?
Scares me even though I don't think I'm involved in any political
activism of any kind. Did Google just hand over everything private they
could find on one Assange, J. to a secret US court or something, where
Twitter fought it and made it public?
"I do so solemnly swear, Senator, that I am not now, nor have I ever
been, a member of the anti-google fraternity."
Anyway this is way off topic and we'll start seeing swarms of googs
taking over, hiding under our beds soon.