[jcifs] Character Set discussions

Sat Feb 8 07:48:00 EST 2003

On Fri, 7 Feb 2003 06:15:20 -0500 
"Glass, Eric" <eric.glass at capitalone.com> wrote:

> jCIFS would probably never need to escape characters for representation; if
> I give it a Java string with unescaped characters, i.e.:
> 
> smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip
> 
> you would just handle it directly.  jCIFS (and other clients) may need to
> UNescape characters, however; if you are given:
> 
> smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip
> 
> that is a valid URL, and should work.  This would be especially important

Well, this does make sense. It doesn't make sense for the SMB URL itself,
but to be consistent and interoperable with other URL-aware software it
makes sense.

> for browsers supporting SMB.  If I load an HTML page containing an SMB URL
> like:
> 
> <a href="smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip">Check out my
> sweet slovak zip file</a>
> 
> that is currently illegal (this is one of the cases addressed by the IRI);
> the proper way to represent this is:
> 
> <a
> href="smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip">
> Check out my sweet slovak zip file</a>
> 
> For historical reasons, the HTTP URL does not specify that the %HH%HHs MUST
> represent UTF-8 encoded characters.  It is the recommended practice,
> however.  RFC 2718 recommends that new URL schemes (such as SMB) adopt UTF-8
> as the standard encoding in cases such as this, unless there is some
> compelling reason to do otherwise.

Let's be honest. They're going to use UTF-8. It's ugly but there would
be far too much resistance to use anything else.

> The only implication for jCIFS would be that if I choose to give you:
> 
> smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip
> 
> You would unescape the %HHs and interpret the result as UTF-8.  In Java 1.4,
> the java.net.URI class will do this for you automatically; I can do:
> 
> URI uri = new
> URI("smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip");
> 
> and get the properly decoded components from the URI object.  It is
> interesting to note that the java.net.URI class deviates from RFC 2396 in
> that it MANDATES the UTF-8 encoding recommendation; if a URL scheme did have
> a compelling reason to use an encoding other than UTF-8, such a URI would be
> unusable with the java.net.URI class.

Didn't we (I) conclude previously, however, that once we start accepting
escapes we must also return them? And if we return them, do we then escape
characters as we drill down doing list() operations? We just can't do
that. We might as well drop the URL status altogether and call
it URL-like. Does the URI class remove and discard the escapes? I don't
have Java 1.4 on my machines. Is there any way to get the original escaped
URL back from your uri object above? Like from toString()?
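
For reference, here is a minimal sketch of how the questions above could be
checked against java.net.URI (Java 1.4 or later). Going by the class
documentation, getPath() should return the path with the %HH escapes decoded
as UTF-8, while getRawPath() and toString() should keep the escapes exactly
as given, so the original form is not discarded. The class name
SmbUriEscapes and its helper methods are made up for illustration and are
not part of jCIFS; the non-ASCII letters are written as \u escapes so the
source compiles regardless of file encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only (not part of jCIFS).
public class SmbUriEscapes {

    static final String ESCAPED =
        "smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip";

    // Path with %HH escapes removed and the octets interpreted as UTF-8.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Path exactly as it appeared in the input, escapes included.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Full URI string as originally supplied to the constructor.
    static String original(String url) {
        return URI.create(url).toString();
    }

    // Rebuild an escaped, ASCII-only URL from an unescaped path, e.g. a
    // name obtained while drilling down with list().
    static String escape(String host, String path) {
        try {
            return new URI("smb", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        System.out.println(decodedPath(ESCAPED));
        System.out.println(rawPath(ESCAPED));
        System.out.println(original(ESCAPED));
        System.out.println(escape("svr", decodedPath(ESCAPED)));
    }
}
```

If this holds, a client could keep unescaped strings internally and only
produce the escaped form on request, rather than escaping at every list()
step.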

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.