[jcifs] Character Set discussions

Michael B. Allen miallen at eskimo.com
Tue Feb 11 07:10:11 EST 2003


On Mon, 10 Feb 2003 05:17:40 -0500
"Glass, Eric" <eric.glass at capitalone.com> wrote:

> > > > What I'm concerned will happen is that an escape sequence like
> > > > %C5%A5 will be converted into the Unicode characters U+00C5 followed
> > > > by the character U+00A5 rather than being converted to the single
> > > > character U+0165 as we intended.
> > > 
> > > isn't a method to UNescape the entire URI and return it.  There are
> > > methods to access the different components in this fashion, however,
> > > and they do interpret %HH%HHs as UTF-8 characters; you would do
> > > 
> > > String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
> > > if (uri.getFragment() != null) {
> > >      str += "#" + uri.getFragment();
> > > }
> > > 
> > > Which will give you the input URI with all %HH%HHs unescaped and
> > > decoded as UTF-8 -- basically, a Java string with the Unicode characters.
> > 
> > Ok. So it works. I'm a little surprised but I'm glad I was wrong. However
> > now I wonder if this behavior is locale dependent. Meaning if you do the
> > same thing in a Latin1 locale the escapes *are* interpreted as individual
> > characters rather than a UTF-8 sequence. They should be and I suspect
> > they will because that's trivial by comparison.
> 
> The URI class always uses UTF-8 (regardless of the locale settings) to
> interpret the escapes.  This is in line with the draft W3C recommendations.
> Specifically, the java.net.URI javadoc states:
> 
> A sequence of escaped octets is decoded by replacing it with the sequence of
> characters that it represents in the UTF-8 character set. UTF-8 contains
> US-ASCII, hence decoding has the effect of de-quoting any quoted US-ASCII
> characters as well as that of decoding any encoded non-US-ASCII characters. 
> 
> Sun seems to be taking this stance on UTF-8 for most URL-related encoding
> issues; java.net.URLEncoder and java.net.URLDecoder allow you to specify a
> character encoding, but the javadoc warns that not using UTF-8 may introduce
> incompatibilities.
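
A minimal sketch of the behavior described above, using the %C5%A5
example from earlier in the thread. The class name, helper names and
the smb://server/share/ URL are made up for illustration; only
java.net.URI and java.net.URLDecoder are real API. It should show the
escapes decoded as UTF-8 by URI, and the two-character result only when
URLDecoder is explicitly told to use ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

final class EscapeDecodingDemo {

    // java.net.URI decodes escaped octets in a component as UTF-8,
    // regardless of the default locale.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    // URLDecoder lets the caller name the charset applied to the octets.
    static String decodeWith(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Render each UTF-16 char as U+XXXX so the result is unambiguous.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical SMB URL; only the last path segment is escaped.
        String path = decodedPath("smb://server/share/%C5%A5");
        String name = path.substring(path.lastIndexOf('/') + 1);
        System.out.println(hex(name));                                // U+0165
        System.out.println(hex(decodeWith("%C5%A5", "UTF-8")));       // U+0165
        System.out.println(hex(decodeWith("%C5%A5", "ISO-8859-1")));  // U+00C5 U+00A5
    }
}
```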

So you are pretty much locked into converting to Unicode (preferably
UTF-8) first (no problem for Java but I'm thinking of the SMB URL spec
in general too). That leaves non-Unicode locale-dependent applications
in an awkward position. You have to go through UTF-8 to get to the local
charset, although ASCII is a subset of UTF-8 and Latin1 matches the
first 256 Unicode code points, so that favors the
majority. Something like smbclient and libsmbclient might have a problem
with this. I guess these clients would be no worse off than they are
now with no Unicode support at all.
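
To make the awkward position concrete, here is a rough sketch of what a
locale-bound client would have to do: decode the escapes as UTF-8 first,
then re-encode into its own charset. It is written in Java only for
consistency with the rest of the thread; the class and method names are
hypothetical and this is not how smbclient or libsmbclient work. Latin1
stands in for the local charset.

```java
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LocalCharsetDemo {

    // True if every character of the decoded name exists in the target charset.
    static boolean fitsIn(String decoded, Charset target) {
        return target.newEncoder().canEncode(decoded);
    }

    // Decode the path as UTF-8 (what java.net.URI does), then re-encode it.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // ('?' for ISO-8859-1) for characters it cannot represent.
    static byte[] pathBytes(String uri, Charset target) {
        return URI.create(uri).getPath().getBytes(target);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // %C3%A9 is U+00E9, which Latin1 can represent as the single byte 0xE9.
        byte[] ok = pathBytes("smb://server/share/%C3%A9", latin1);
        System.out.println(fitsIn("\u00E9", latin1) + " "
                + Integer.toHexString(ok[ok.length - 1] & 0xFF));

        // %C5%A5 is U+0165, which Latin1 cannot represent; it degrades to '?'.
        byte[] lossy = pathBytes("smb://server/share/%C5%A5", latin1);
        System.out.println(fitsIn("\u0165", latin1) + " "
                + (char) lossy[lossy.length - 1]);
    }
}
```

A real client would probably want to check fitsIn() and report an error
rather than silently send a '?' in a file name.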

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 

