[clug] googlebot doing funny things in logs

Thu Jun 16 02:04:33 MDT 2011

... and we're back on the list! :P

On Thu, 2011-06-16 at 17:24 +1000, Hal Ashburner wrote:
> I have no idea what Google would be looking for or how they would be 
> looking for it.
> Having been one of those who scoffed and said "Damn unlikely" when the 
> whole wireless streetview story broke, and then were proven very, very 
> wrong - once bitten.
> 

The thing about the wireless Street View data is that it's potentially
very useful for locating people without GPS. People found it creepy and
Google probably shouldn't have collected it, but I can see why it's
useful and why engineers would go "yes, that's a brilliant solution to a
complex problem!" while ignoring the human aspects, of course. ;)

> > * You went to ashburner.info/mythweb with Chrome while setting it up,
> > and it 404ed because of some misconfig. Chrome sends the 404ed URL to
> > Google as part of their smart 404 handling (as clearly stated in the
> > privacy policy.) You fix the server and start using mythweb. Shortly
> > thereafter, GoogleBot comes to see if there is anything actually at the
> > was-404ing URL. Crawled!
> "Clearly stated" well, hmm. I wonder in an impartial survey of chrome 
> users how many would answer
> Yes to a question "have you consented to google keeping a record of all 
> sites you visit that happened to be down, for any reason" It surprises 
> me, for one. But evidently it's clear to you so who knows? I consider 
> burying the important stuff in the fine print rather than explaining it 
> up front in plain English to be plain evil. 

From http://www.google.com/chrome/intl/en/privacy.html (which I linked
in my very first email.)

"If you navigate to a URL that does not exist, Google Chrome may send
the URL to Google so we can help you find the URL you were looking for.
You can disable this feature as explained here."

Plain English as far as I can see. I found nearly the entire policy to
be quite clear.

> Far more relaxed about my tv box being examined by google than my web 
> browsing to be honest. Even that part of it that that doesn't connect 
> for whatever reason.

Well, at least you don't use IE w/ Bing Toolbar. ;) One of the reasons I
know a bit about this tracking-to-make-search-better idea is that I
looked into what IE sends during the whole "Bing steals Google search
results" spat earlier this year.

http://projectgus.com/2011/02/bing-google-finding-some-facts/

Once again though, the Toolbar doesn't do it unless you agree to it
(although of course no one reads the details before they click 'OK'.)

> > * You've associated one of your Android devices with a Google account
> > and by default it's backing up your browser history and/or your
> > bookmarks "to the cloud".
> Didn't ask for it, not sure where they told me they're doing that or how 
> to opt out. (Surely should be an "opt in" service anyway?) Is this the 
> case for all android devices, given you essentially have to associate 
> them with an account to use them?

OK, this one is speculation and I should have worded it less strongly. I
have no idea what Android shares. Like I said earlier, I couldn't find
an Android Privacy Policy posted online and I don't own an Android
phone. Google offer that service (Web History backup) in Chrome so I
supposed that they could offer it in Android, too.

> > * You've had "Safe Web Browsing" and "send anonymous usage information"
> > turned on in Chrome at some point, and it's decided (as allowed for in
> > the privacy policy) to send all the links from your hitherto
> > unknown-to-Google mythweb page back to Google, so they can be checked
> > against it's anti-phishing database. GoogleBot has dutifully checked
> > them for phishing content, and come up against 401s.
> Is there a moral here?

The "sends all the links" part of this is speculation, because the
Chrome Privacy Policy is vague on exactly what gets sent for
anti-phishing. Known specifics seem to be:

If you _don't_ have "anonymous usage information" turned on then it only
sends hashes of individual URL components. From looking at the Chromium
source, that includes hashes of the URL components of all links on the
page.

That's not enough to identify the page, unless they go out of their way
to reconstruct full URLs from common hashes (and that's back in
conspiracy land.)
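
To illustrate what "hashes of individual URL components" could look
like, here is a minimal Python sketch. It is not Chromium's code: the
function name, the choice of SHA-256, the 4-byte truncation and the way
the URL is split into host-suffix and path-prefix pieces are all
assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def url_component_hashes(url, prefix_bytes=4):
    """Map each host/path fragment of `url` to a truncated hash (hex).

    Hypothetical helper: the splitting scheme and hash are assumptions,
    not what any particular browser is confirmed to send.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"

    # Host suffixes: the full host, then drop leading labels one at a
    # time, stopping while at least two labels remain.
    labels = host.split(".")
    hosts = [".".join(labels[i:]) for i in range(max(1, len(labels) - 1))]

    # Path prefixes: "/", then each successively longer directory prefix.
    segments = [s for s in path.split("/") if s]
    paths = ["/"]
    for i in range(1, len(segments) + 1):
        prefix = "/" + "/".join(segments[:i])
        if i < len(segments) or path.endswith("/"):
            prefix += "/"
        paths.append(prefix)

    hashes = {}
    for h in hosts:
        for p in paths:
            fragment = h + p
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            hashes[fragment] = digest[:prefix_bytes].hex()
    return hashes


for fragment, short_hash in url_component_hashes(
        "http://ashburner.info/mythweb").items():
    print(short_hash, fragment)
```

A truncated hash on its own doesn't reveal the page, but anyone holding
a list of candidate URLs could hash them the same way and compare, which
is the "reconstruct full URLs" scenario.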

If you _do_ have "anonymous usage information" turned on, though, then
the privacy policy makes it clear that more detail gets sent (it says
the page URL plus 'other information'). They don't say exactly what, but
I'd assume that sending URLs linked from the page would also be useful,
as it lets them rapidly expand their anti-phishing database.

>  Maybe: never use a google browser on any intranet 
> ever as they're going to map your non-exposed network and increase the 
> chances of intrusion?

It's worth noting all the features I've mentioned are optional, and
straightforward to turn off.

I actually went and trawled the Chromium source last night cos I was
vaguely interested, and one thing I noticed is that private
(non-routable) IPs are immediately bypassed in the phishing filter. Only
addresses that could be on the public internet get the check.
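
The private-IP bypass can be sketched roughly as below. This is an
illustration in Python using the standard ipaddress module, not a
translation of the Chromium code; the function name is made up, and
treating plain hostnames as eligible is an assumption of the sketch.

```python
import ipaddress


def is_eligible_for_remote_check(host):
    """Return False for literal private/non-routable IP addresses.

    Hypothetical helper. Hostnames that are not IP literals are treated
    as eligible here; this sketch does no DNS resolution.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. "ashburner.info").
        return True
    # is_private covers ranges such as 10.0.0.0/8, 172.16.0.0/12,
    # 192.168.0.0/16 and loopback.
    return not addr.is_private


for host in ["192.168.1.10", "10.0.0.1", "8.8.8.8", "ashburner.info"]:
    print(host, is_eligible_for_remote_check(host))
```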

You have to admit, there are useful non-evil applications for all of
these things Google are doing:

* If a page 404s due to a typo or misspelling, or it has moved, Google
Chrome can suggest the correct option.

* If a web site actually is hosting phishing data, getting the URLs to
Google quickly allows them to expand their phishing database and stops
anyone else falling for the phishing site. This one seems particularly
likely to me - because a hitherto-unknown domain name like
ashburner.info is potentially a phishing site.

Google doesn't "know" if ashburner.info/mythweb is a new exciting web
page, a dodgy phishing scam, or a private thing that leaked out until it
goes and checks for itself. Especially if you don't have a /robots.txt
saying "go away".
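
For reference, a "go away" /robots.txt is just two lines, and Python's
standard urllib.robotparser can show how a well-behaved crawler would
read it. The helper name and the example URL usage below are
illustrative only. Note that robots.txt is a request that polite
crawlers honour, not access control.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt asking every crawler to stay out of the whole site.
GO_AWAY = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`.

    Hypothetical helper wrapping the standard library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(crawler_may_fetch(GO_AWAY, "Googlebot",
                        "http://ashburner.info/mythweb"))
```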

> > Those scenarios are all unlikely, but they seem simpler to me than
> > someone at Google programming the crawler to trawl for mythweb URLs.
> All seem creepier to me though.
> > Like you say though, I don't think you'll ever know for sure. :/
> Why is that?
> Are these the sorts of things that should be known, for sure?

It'd be great to know for sure, but you don't have logs and it's
probably not worth taking it up with Google. Nor will Google necessarily
have enough audit information to know (more speculation.) So it all
seems to be idle speculation at this point.

The only solid information to go on immediately, ironically, is Google's
own search results and the descriptions given in Google's Privacy
Policies.

And the good advice, dispensed all over this thread, that you should be
careful what goes on internet-facing boxen.

:)

- Angus