[clug] Another thought on Cleanfeed

Michael Cohen scudette at gmail.com
Tue Oct 28 11:41:38 GMT 2008

  Web based filtering is not only a hit and miss affair, it's actually
completely impossible to implement properly. Even if the page you are
trying to block actually contains the words you are trying to block on,
the filter can easily be bypassed. Here is an example (and that's
just the tip of the iceberg). Say you have the word "sex" in your
blacklist and I want to download a page containing this word. The
proxy is likely to intercept my request for the page and scan the page
for the word before it gives it to me.

That is actually much more difficult than it sounds.

Imagine if the page contained 2 MB of random data and then the word
"sex". What is the proxy to do? Should it allow the 2 MB through until
the word appears and then kill the connection? If yes, it may well be
too late, since most of the content has already been allowed through
to the client.

If not, the proxy needs to download the full page, scan it in its
entirety and then forward the whole thing back to the client. What
if the page takes 10 minutes to fully download (an HTTP object can be
unlimited in size)? Most browsers will give up on the connection if no
data comes back in such a long time. What is the maximum size
limit the proxy holds for scanning before releasing a partial page?

Many proxies therefore scan individual reads and pass them back to the
client one at a time. What if the server sends the string 's', waits a
second, sends 'e', waits again, and then sends 'x'? What is the proxy
to do?

Does it send each character on as it appears? If yes, it will fail to
match; if not, it will make the browser wait too long.
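The trickled-bytes trick above can be sketched in a few lines of Python. This is a hypothetical illustration, not any real proxy's code: a filter that scans each read in isolation never sees the blocked word, while one that carries an overlap buffer between reads does catch it, at the cost of keeping per-connection state.

```python
BLOCKED = b"sex"

def naive_scan(chunks):
    # Scans each read in isolation -- a word split across
    # two reads is never matched.
    return any(BLOCKED in chunk for chunk in chunks)

def overlap_scan(chunks):
    # Carries the last len(BLOCKED)-1 bytes of the previous read,
    # so a word split across two reads is still seen as one string.
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if BLOCKED in window:
            return True
        tail = window[-(len(BLOCKED) - 1):]
    return False

trickled = [b"s", b"e", b"x"]   # server sends one byte per read
print(naive_scan(trickled))     # False -- the filter is bypassed
print(overlap_scan(trickled))   # True  -- caught, but state must be held
```

The overlap version still has to decide how long to delay each byte before forwarding it, which is exactly the match-versus-latency dilemma described above.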

Another example is gzip encoding - clients can ask for pages to be
compressed by sending an Accept-Encoding: gzip header. If the proxy
allows this header through, the server is likely to send the page back
compressed. How should the proxy examine a compressed page?

Should the proxy uncompress the page before scanning it? The maximum
compression ratio for gzip is about 1000:1, so a 1 MB page can
decompress to 1 GB in the proxy's RAM before it can be scanned - does
this scale to a country?
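The ratio is easy to demonstrate with Python's zlib module (DEFLATE, the same algorithm gzip wraps). A megabyte of repetitive data packs down to roughly a kilobyte, and a cautious proxy has to cap how far it is willing to inflate a response before scanning it:

```python
import zlib

# A highly repetitive 1 MB body compresses at close to DEFLATE's
# maximum ratio, so a tiny response can hide a huge plaintext.
plain = b"\x00" * 1_000_000
packed = zlib.compress(plain, 9)
ratio = len(plain) / len(packed)
print(f"{len(packed)} compressed bytes, ratio roughly {ratio:.0f}:1")

# A cautious proxy bounds the expansion with max_length, so one small
# response cannot force it to allocate unbounded memory in one go.
d = zlib.decompressobj()
first = d.decompress(packed, 64 * 1024)  # inflate at most 64 KB for now
print(len(first))                        # capped at 65536 bytes
```

Even with the cap, the proxy is back to the earlier dilemma: it must either buffer and delay the response or release partially scanned data.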

Many proxies drop the Accept-Encoding header before sending the
request to the server. This makes bandwidth utilization even worse.
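That workaround is mechanically trivial - a sketch, with made-up request headers for illustration: remove the Accept-Encoding line before forwarding, and the origin server replies uncompressed, so the body can be scanned byte for byte.

```python
# Hypothetical client request headers (example.com is a placeholder).
request = [
    "GET /index.html HTTP/1.1",
    "Host: example.com",
    "Accept-Encoding: gzip, deflate",
]

# Strip the header so the server sends the page uncompressed --
# scannable, but every response now crosses the wire at full size.
filtered = [h for h in request
            if not h.lower().startswith("accept-encoding:")]
print("\r\n".join(filtered))
```

The cost is exactly what the paragraph above says: every page that would have been compressed now consumes full bandwidth for every user behind the proxy.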

Also you might notice that many sites these days use JavaScript
obfuscation - basically to stop simple Perl crawlers. The page is
obfuscated in some way and carries some JavaScript to de-obfuscate it
on the fly in the browser. (For example, the MSN TV guide does this,
and the O'Reilly bookshelf too.) Besides the fact that this
obfuscation is a pretty stupid idea for preventing crawlers (it's
trivial to bypass in about 20 lines of code), it's a killer for
proxies, because the JavaScript obfuscation algorithm can be as
complex as it likes and can change very often. Although a crawler can
easily de-obfuscate the page, a proxy cannot.
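To make the asymmetry concrete, here is a toy, invented example of the cheapest possible scheme - the page body shipped ROT13-encoded with a decoder script for the browser. A crawler that knows the scheme undoes it in two lines; a generic scanning proxy, which cannot execute the page's JavaScript, only ever sees the encoded form and its blacklist never fires.

```python
import codecs
import re

# Toy obfuscated page: the real text is ROT13-encoded in a comment and
# a script decodes it in the browser. (Entirely made up for illustration.)
page = ('<script>document.write(rot13(payload))</script>'
        '<!--payload:frk-->')

# What the proxy's blacklist sees: no match in the encoded bytes.
print("sex" in page)                       # False -- filter bypassed

# What a scheme-aware crawler recovers in two lines:
m = re.search(r"<!--payload:(.*?)-->", page)
print(codecs.decode(m.group(1), "rot13"))  # "sex"
```

Swap ROT13 for anything stronger or change it weekly, and the crawler keeps up by updating its 20 lines, while a general-purpose proxy would have to interpret arbitrary JavaScript to keep up.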

You don't need to go to SSL to disable a scanning proxy - it's very
difficult to write a scanning web proxy that works at all. The harder
you try, the more of an effect you have on connection speeds. The
proxies they trialled showed varying levels of performance
degradation. That's not because one product was written badly and
another was written well - it's because each product made a different
performance/security tradeoff. It's not a simple matter of choosing
the filter that performed best by slowing the network least, because
the tradeoffs adopted by that solution meant it was also the easiest
to bypass. By the same token, the filter which was hardest to bypass
(which is not that hard really anyway) was the one with the most
impact on network performance. Which should be adopted then?

In an enterprise setting the filter serves critical security goals.
The filter serves a function and the organization is prepared to take
the performance degradation hit for it. Mandatory filtering is a
political vote buyer and performance degradation is politically
costly - so I expect the solution with the least impact will be
adopted. Which means it should be very trivial to bypass.

Certainly the best solution is to require browsers to implement the
evil bit properly, and users to tick the evil checkbox when they
surf for pr0n.


On Tue, Oct 28, 2008 at 9:34 PM, Paul Matthews <plm at netspace.net.au> wrote:
> *If* we do see a 70% slow down as indicated in some papers, then given
> that my ISP and I had entered into a contract that he would provide
> access at a given rate, and not one that is 70% slower.... hmm ... I
> foresee a lot of ISP getting a lot of angry calls.
> *If* it is doing content filtering on the whole file, then what is the
> chance of *any* naughty four letter word appearing in a 4 Gb mostly
> compressed (ie random data) ISO. Not good if its 98% of the way though
> the download of course.
> We had this problem with the filter at work. One IBM DB/2 patch was not
> downloadable as the .zip binary had a four letter word in it.
> --
> Fools ignore complexity. Pragmatists suffer it.
> Some can avoid it. Geniuses remove it.
> --
> linux mailing list
> linux at lists.samba.org
> https://lists.samba.org/mailman/listinfo/linux
