[jcifs] Character Set discussions

Tue Feb 11 07:10:11 EST 2003

On Mon, 10 Feb 2003 05:17:40 -0500
"Glass, Eric" <eric.glass at capitalone.com> wrote:

> > > > What I'm concerned will happen is that an escape sequence like %C5%A5
> > > > will be converted into the Unicode characters U+00C5 followed by the
> > > > character U+00A5 rather than being converted to the single character
> > > > U+0165 as we intended.
> > > 
> > > isn't a method to UNescape the entire URI and return it.  There are
> > > methods to access the different components in this fashion, however, and
> > > they do interpret %HH%HHs as UTF-8 characters; you would do
> > > 
> > > String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
> > > if (uri.getFragment() != null) {
> > >      str += "#" + uri.getFragment();
> > > }
> > > 
> > > Which will give you the input URI with all %HH%HHs unescaped and decoded
> > > as UTF-8 -- basically, a Java string with the Unicode characters.
> > 
> > Ok. So it works. I'm a little surprised but I'm glad I was wrong. However
> > now I wonder if this behavior is locale dependent. Meaning if you do the
> > same thing in a Latin1 locale the escapes *are* interpreted as individual
> > characters rather than a UTF-8 sequence. They should be and I suspect
> > they will because that's trivial by comparison.
> 
> The URI class always uses UTF-8 (regardless of the locale settings) to
> interpret the escapes.  This is in line with the draft W3C recommendations.
> Specifically, the java.net.URI javadoc states:
> 
> A sequence of escaped octets is decoded by replacing it with the sequence of
> characters that it represents in the UTF-8 character set. UTF-8 contains
> US-ASCII, hence decoding has the effect of de-quoting any quoted US-ASCII
> characters as well as that of decoding any encoded non-US-ASCII characters. 
> 
> Sun seems to be taking this stance on UTF-8 for most URL-related encoding
> issues; java.net.URLEncoder and java.net.URLDecoder allow you to specify a
> character encoding, but the javadoc warns that not using UTF-8 may introduce
> incompatibilities.
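
To make the difference concrete, here is a minimal sketch of what I
understand the behavior to be. The class name EscapeDecoding, the helper
decodeAs, and the smb://server/share URL are made up for illustration;
only java.net.URI and java.net.URLDecoder are real API. It reassembles
the URI from its decoded components as in the snippet quoted earlier,
then decodes the same escapes with URLDecoder under two charsets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

public class EscapeDecoding {

    // Rebuild the URI from its decoded components, following the quoted snippet.
    // The component getters interpret escaped octets as UTF-8.
    public static String unescape(URI uri) {
        String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
        if (uri.getFragment() != null) {
            str += "#" + uri.getFragment();
        }
        return str;
    }

    // Hypothetical helper: decode escapes with an explicitly named charset.
    public static String decodeAs(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The URL is an assumption, not taken from any real server.
        URI uri = URI.create("smb://server/share/%C5%A5.txt");

        // true: the two escaped octets became the single character U+0165
        System.out.println(unescape(uri).indexOf('\u0165') >= 0);

        // 1: decoded as UTF-8, the escapes yield one character
        System.out.println(decodeAs("%C5%A5", "UTF-8").length());

        // 2: decoded as Latin1, they yield U+00C5 followed by U+00A5
        System.out.println(decodeAs("%C5%A5", "ISO-8859-1").length());
    }
}
```

So the locale never enters into it for URI, and the Latin1
interpretation only shows up if you ask URLDecoder for it by name.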

So you are pretty much locked into converting to Unicode (preferably
UTF-8) first. That is no problem for Java, but I'm thinking of the SMB
URL spec in general too. It leaves non-Unicode, locale-dependent
applications in an awkward position: you have to go through UTF-8 even
to get to ASCII. ASCII is a strict subset of UTF-8, and Latin1 maps
directly onto the first 256 Unicode code points, so that favors the
majority. Something like smbclient and libsmbclient might have a problem
with this. I guess these clients would be no worse off than they are
now with no Unicode support at all.
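
For what it's worth, the round trip a locale-dependent client would need
can be sketched roughly as follows. This is only an illustration under
my own assumptions: the class LocalCharsetBridge, the method
pathInCharset, and the example URLs are hypothetical, and returning null
for unrepresentable names is just one possible policy. The escapes are
decoded as UTF-8 into a Unicode string first, and only then encoded
into the local charset if every character fits.

```java
import java.net.URI;
import java.nio.charset.Charset;

public class LocalCharsetBridge {

    // Decode the URI path (escapes are read as UTF-8), then re-encode it
    // in the target charset. Returns null when some character has no
    // representation there; that policy is an assumption of this sketch.
    public static byte[] pathInCharset(String escapedUri, String charsetName) {
        Charset target = Charset.forName(charsetName);
        String path = URI.create(escapedUri).getPath();
        if (!target.newEncoder().canEncode(path)) {
            return null;
        }
        return path.getBytes(target);
    }

    public static void main(String[] args) {
        // %C3%A9 is U+00E9, which Latin1 can represent as the single byte 0xE9.
        byte[] ok = pathInCharset("smb://server/share/%C3%A9", "ISO-8859-1");
        System.out.println(ok != null && (ok[ok.length - 1] & 0xFF) == 0xE9);

        // %C5%A5 is U+0165, which has no Latin1 representation.
        System.out.println(pathInCharset("smb://server/share/%C5%A5", "ISO-8859-1") == null);

        // Neither character fits in plain ASCII.
        System.out.println(pathInCharset("smb://server/share/%C3%A9", "US-ASCII") == null);
    }
}
```

A C client would have to do the equivalent by hand, which is where I
expect the trouble to be.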

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.