[jcifs] Character Set discussions

Glass, Eric eric.glass at capitalone.com
Fri Feb 7 22:15:20 EST 2003


> 
> However, despite our recent discussion of HTTP and its handling of HTTP
> URLs, let's not forget that SMB is not HTTP and these SMB URLs will not
> be submitted in GET requests (not directly). It's the HTTP transport that
> is requiring normalization to ASCII. If (when) the browsers support the
> SMB URL they will use the local CIFS client directly, which again does
> not require escaping. Therefore I do not believe the jCIFS package should
> be encoding/decoding or escaping/unescaping Unicode characters in SMB
> URLs (if that's something being suggested; don't know).
> 

jCIFS would probably never need to escape characters for representation; if
I give it a Java string with unescaped characters, e.g.:

smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip

you would just handle it directly.  jCIFS (and other clients) may need to
UNescape characters, however; if you are given:

smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip

that is a valid URL, and should work.  This would be especially important
for browsers supporting SMB.  If I load an HTML page containing an SMB URL
like:

<a href="smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip">Check out my
sweet slovak zip file</a>

that is currently illegal (this is one of the cases addressed by the IRI);
the proper way to represent this is:

<a href="smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip">Check
out my sweet slovak zip file</a>
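As a rough sketch of how the escaped form could be produced from the
unescaped one (the class and method names here are made up for illustration,
and this is not something jCIFS itself would need to do), java.net.URI can
emit the %HH form via toASCIIString(), which escapes non-ASCII characters as
UTF-8 octets:

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
class SmbUrlEscape {

    /** Returns the URL with non-ASCII characters escaped as UTF-8 %HH octets. */
    public static String toEscapedForm(String url) {
        // URI.create throws IllegalArgumentException if the string does not parse.
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding does not matter;
        // this is the same Slovak path as in the example above.
        String raw = "smb://svr/slovak/m\u00f4\u017eem/jes\u0165/sklo/nezran\u00ed/ma.zip";
        System.out.println(toEscapedForm(raw));
    }
}
```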

For historical reasons, the HTTP URL does not specify that the %HH%HHs MUST
represent UTF-8 encoded characters.  It is the recommended practice,
however.  RFC 2718 recommends that new URL schemes (such as SMB) adopt UTF-8
as the standard encoding in cases such as this, unless there is some
compelling reason to do otherwise.
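To make the "unescape, then interpret as UTF-8" idea concrete, here is a
minimal sketch of a decoder written by hand.  The class and method names are
hypothetical, and it assumes a newer JDK than 1.4 (StandardCharsets); it is
only meant to show the two steps, not how any particular client does it:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class SmbUrlUnescape {

    /** Replaces each %HH escape with its octet, then decodes all octets as UTF-8. */
    public static String unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);   // one raw octet
                    i += 3;
                    continue;
                }
            }
            // Not a well-formed escape: copy the character through as UTF-8 octets.
            int cp = s.codePointAt(i);
            byte[] b = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
            i += Character.charCount(cp);
        }
        // The step RFC 2718 recommends: the octets are read as UTF-8.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(unescape(
            "smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip"));
    }
}
```

Malformed escapes are passed through unchanged here; a stricter client might
prefer to reject them.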

The only implication for jCIFS would be that if I choose to give you:

smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip

You would unescape the %HHs and interpret the result as UTF-8.  In Java 1.4,
the java.net.URI class will do this for you automatically; I can do:

URI uri = new URI("smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip");

and get the properly decoded components from the URI object.  It is
interesting to note that the java.net.URI class deviates from RFC 2396 in
that it MANDATES the UTF-8 encoding recommendation; if a URL scheme did have
a compelling reason to use an encoding other than UTF-8, such a URI would be
unusable with the java.net.URI class.
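A short sketch of what that looks like in practice (the class name and the
second, Latin-1-style URL are invented for the example): getPath() returns
the decoded component, getRawPath() keeps the escapes, and an escape that is
not valid UTF-8 comes back as the replacement character rather than as the
character another encoding would have meant.

```java
import java.net.URI;

// Hypothetical demo class, for illustration only.
class SmbUriComponents {

    /** Path with %HH escapes decoded as UTF-8. */
    public static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    /** Path exactly as written, escapes left in place. */
    public static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    public static void main(String[] args) {
        String url = "smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip";
        URI uri = URI.create(url);
        System.out.println(uri.getScheme());   // scheme component
        System.out.println(uri.getHost());     // server name
        System.out.println(uri.getRawPath());  // still escaped
        System.out.println(uri.getPath());     // decoded as UTF-8

        // %F4 would be "o with circumflex" in ISO-8859-1, but it is not valid
        // UTF-8 on its own; the URI Javadoc says such octets become U+FFFD.
        System.out.println(decodedPath("smb://svr/slovak/m%F4zem"));
    }
}
```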


Eric
 

