[jcifs] Character Set discussions

Eric eglass1 at attbi.com
Sun Feb 9 22:09:23 EST 2003


> 
> Yes, that terminal is obviously in a UTF-8 locale. Did you set that
> up or are you using UTF-8 as the default locale? Is it Red Hat 8.0? RH8
> uses the UTF-8 locale by default.
> 
> Doesn't matter. Now that we have established you can properly display
> Unicode Strings here is the $24,000 question:
> 
> If you display these URIs unescaped (not clear to me how you do that)
> what do they look like? Are the characters properly converted? How about
> in the URL with both escaped and unescaped characters?
> 
> What I'm concerned will happen is that an escape sequence like %C5%A5
> will be converted into the Unicode characters U+00C5 followed by the
> character U+00A5 rather than being converted to the single character
> U+0165 as we intended.
> 
> Mike
> 

Okay, I think I see where you're coming from (Red Hat 8 is on this box, 
incidentally).

You're asking, if I enter a URL like:

smb://svr/slovak/m%C3%B4žem/jesť/sklo/nezran%C3%AD/ma.zip

(with a mixture of escaped and unescaped) or even just

smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip

(all escaped), how does it interpret the %HH%HHs -- as a single UTF-8 
encoded character or as two separate characters?
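
To make the two readings concrete, here is a minimal sketch in Java (the 
class and method names are mine, purely for illustration). It takes the 
two octets behind %C5%A5 and decodes them once as UTF-8 and once as 
ISO-8859-1, which is what a one-character-per-%HH reading amounts to:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jCIFS or java.net.
class EscapeReadingSketch {
    // Treat the octets as one UTF-8 encoded sequence.
    static String asUtf8(byte[] octets) {
        return new String(octets, StandardCharsets.UTF_8);
    }

    // Treat every octet as its own character (ISO-8859-1 maps byte N to code point N).
    static String asLatin1(byte[] octets) {
        return new String(octets, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] octets = {(byte) 0xC5, (byte) 0xA5}; // the bytes behind %C5%A5
        System.out.println(asUtf8(octets).length());   // 1 (U+0165)
        System.out.println(asLatin1(octets).length()); // 2 (U+00C5, U+00A5)
    }
}
```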

The URI.toString() method returns the raw URL as entered by the user -- 
in the first case above, a mix of escaped and unescaped.  The 
URI.toASCIIString() method escapes all the non-ASCII characters.  There 
isn't a single method that unescapes the entire URI and returns it.  The 
component getters (as opposed to their getRaw... counterparts) do return 
decoded values, however, and they interpret %HH%HHs as UTF-8 characters; 
you would do

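// The non-raw getters return decoded values, so this reassembles the URI unescaped.
String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
if (uri.getFragment() != null) {
    // The fragment is not part of the scheme-specific part; append it separately.
    str += "#" + uri.getFragment();
}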

Which will give you the input URI with all %HH%HHs unescaped and decoded 
as UTF-8 -- basically, a Java string with the Unicode characters.
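
For anyone who wants to check this without trusting a terminal, here is a 
small self-contained sketch. The class name and the unescape() helper are 
my own wrapper around the snippet above; only standard java.net.URI calls 
are used. It tests for code points rather than printing glyphs, so the 
result should not depend on the console:

```java
import java.net.URI;

// Hypothetical wrapper for illustration; not part of jCIFS.
class UriDecodeSketch {
    // Rebuild the URI from its decoded components.
    static String unescape(URI uri) {
        String str = uri.getScheme() + ":" + uri.getSchemeSpecificPart();
        if (uri.getFragment() != null) {
            str += "#" + uri.getFragment();
        }
        return str;
    }

    public static void main(String[] args) {
        // Mixed input: one unescaped z-caron (U+017E), the rest percent-escaped UTF-8.
        URI mixed = URI.create("smb://svr/slovak/m%C3%B4\u017Eem/jes%C5%A5/ma.zip");
        System.out.println(mixed.toASCIIString()); // every non-ASCII character escaped

        String decoded = unescape(mixed);
        System.out.println(decoded.contains("\u0165"));       // true: %C5%A5 became U+0165
        System.out.println(decoded.contains("\u00C5\u00A5")); // false: not two separate chars
    }
}
```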

Whether you can do a System.out.println(str) successfully would depend 
on console support, as you noted; obviously, the ability to output the 
character is limited by the ability of the console to represent it. 
Since I am able to do so, it looks fine on my screen.
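
One way to see the console dependence without relying on any particular 
terminal is to capture what a PrintStream would write under different 
encodings. This is only a sketch with a hypothetical helper name; it 
assumes the UTF-8 and US-ASCII charsets, which every Java platform is 
required to support:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration; not part of jCIFS.
class ConsoleEncodingSketch {
    // Returns the bytes a PrintStream using the given charset emits for text.
    static byte[] printedBytes(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A UTF-8 stream can represent U+0165; an ASCII one has to substitute.
        System.out.println(printedBytes("\u0165", "UTF-8").length);    // 2 bytes: C5 A5
        System.out.println(printedBytes("\u0165", "US-ASCII").length); // 1 substitute byte
    }
}
```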


Eric


