[jcifs] Character Set discussions

Mon Feb 10 21:17:40 EST 2003

> -----Original Message-----
> From: Michael B. Allen [mailto:miallen at eskimo.com]
> Sent: Sunday, February 09, 2003 2:37 PM
> To: Eric
> Cc: jcifs at lists.samba.org; crh at ubiqx.mn.org
> Subject: Re: [jcifs] Character Set discussions
> 
> 
> On Sun, 09 Feb 2003 06:09:23 -0500
> Eric <eglass1 at attbi.com> wrote:
> 
> > > What I'm concerned will happen is that an escape sequence like %C5%A5
> > > will be converted into the Unicode characters U+00C5 followed by the
> > > character U+00A5 rather than being converted to the single character
> > > U+0165 as we intended.
> > 
> > isn't a method to UNescape the entire URI and return it.  There are 
> > methods to access the different components in this fashion, however, and
> > they do interpret %HH%HHs as UTF-8 characters; you would do
> > 
> > String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
> > if (uri.getFragment() != null) {
> >      str += "#" + uri.getFragment();
> > }
> > 
> > Which will give you the input URI with all %HH%HHs unescaped and decoded
> > as UTF-8 -- basically, a Java string with the Unicode characters.
> 
> Ok. So it works. I'm a little surprised but I'm glad I was wrong. However
> now I wonder if this behavior is locale dependent. Meaning if you do the
> same thing in a Latin1 locale the escapes *are* interpreted as individual
> characters rather than a UTF-8 sequence. They should be and I suspect
> they will because that's trivial by comparison.

The URI class always uses UTF-8 (regardless of the locale settings) to
interpret the escapes.  This is in line with the draft W3C recommendations.
Specifically, the java.net.URI javadoc states:

A sequence of escaped octets is decoded by replacing it with the sequence of
characters that it represents in the UTF-8 character set. UTF-8 contains
US-ASCII, hence decoding has the effect of de-quoting any quoted US-ASCII
characters as well as that of decoding any encoded non-US-ASCII characters. 
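
As a minimal sketch of that behaviour (the class name, helper names and the
smb://server/share URL are made up for illustration; only java.net.URI itself
comes from the discussion), the decoded and raw forms of a path can be
compared like this:

```java
import java.net.URI;

class UriDecodeDemo {
    // Decoded path: %HH sequences are interpreted as UTF-8 octets.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    // Raw path: the escapes are returned exactly as written.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    public static void main(String[] args) {
        String uri = "smb://server/share/%C5%A5"; // hypothetical URL
        System.out.println("raw path: " + rawPath(uri));

        // Print code points rather than the characters themselves, so the
        // result does not depend on what the console can display.
        String path = decodedPath(uri);
        for (int i = 0; i < path.length(); i++) {
            System.out.println("U+" + Integer.toHexString(path.charAt(i)).toUpperCase());
        }
    }
}
```

The last code point printed should be the single character U+0165, not
U+00C5 followed by U+00A5, whatever the default locale is.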

Sun seems to be taking this stance on UTF-8 for most URL-related encoding
issues; java.net.URLEncoder and java.net.URLDecoder allow you to specify a
character encoding, but the javadoc warns that not using UTF-8 may introduce
incompatibilities.
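
To illustrate the difference the encoding argument makes, here is a rough
sketch using the two-argument forms of URLDecoder.decode and
URLEncoder.encode. The class and helper names are hypothetical, and the
ISO-8859-1 case is included only to show the two-character result Mike was
worried about:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecDemo {
    // Decode %HH escapes using the named character encoding.
    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset);
        }
    }

    // Escape a string using the named character encoding.
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset);
        }
    }

    public static void main(String[] args) {
        String escaped = "%C5%A5";
        // As UTF-8 the two octets form one character, U+0165.
        System.out.println(decode(escaped, "UTF-8").length());
        // As ISO-8859-1 each octet becomes its own character,
        // U+00C5 followed by U+00A5.
        System.out.println(decode(escaped, "ISO-8859-1").length());
        // Encoding U+0165 as UTF-8 round-trips to the original escapes.
        System.out.println(encode("\u0165", "UTF-8"));
    }
}
```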

Eric

> In theory I suppose
> this can work provided the escaping takes the locale and surrounding
> characters into consideration. But I bet that was a hairy piece of code. I
> don't think I'm up to reproducing this in the smb Handler. Perhaps we
> can backport the URI class.
> 
> > Whether you can do a System.out.println(str) successfully would depend
> > on console support, as you noted; obviously, the ability to output the
> > character is limited by the ability of the console to represent it.
> > Since I am able to do so, it looks fine on my screen.
> 
> Right. We are only concerned with how a Unicode string is handled
> internally. Getting one from the console or displaying one correctly on
> the console is a totally separable thing.
> 
> Mike
> 
> -- 
> A  program should be written to model the concepts of the task it
> performs rather than the physical world or a process because this
> maximizes  the  potential  for it to be applied to tasks that are
> conceptually  similar and, more important, to tasks that have not
> yet been conceived. 
> 
