[jcifs] Character Set discussions

Sat Feb 8 22:37:43 EST 2003

Michael B. Allen wrote:
> On Fri, 07 Feb 2003 20:35:04 -0500
> Eric <eglass1 at attbi.com> wrote:
> 
> 
>>>Incidentally here's another problem I just realized. If we accept
>>>both unescaped URLs and URLs that have Unicode characters that have
>>>been converted to UTF-8 and escaped how will we know after unescaping
>>>them that they are really a sequence of UTF-8 bytes encoding a Unicode
>>>character? Ans: You don't. The only way to know is if you know the URL
>>>will always escape such sequences. It cannot be "optional".
>>>
>>

If you unescape the %HH%HHs, you will have to store the URI as a 
sequence of characters (i.e., a char[] or String in Java), not as a 
byte[] (sequence of bytes).  If you plan on storing them as a sequence 
of bytes, you have to escape everything outside of the ~60 characters 
allowed in a "real" URI.

Assuming you have a Unicode String containing some URI with exotic 
characters, and you want to get an ASCII representation, you would do 
something like:

StringBuffer b = new StringBuffer();
for (int i = 0; i < myUri.length(); i++) {
     char c = myUri.charAt(i);
     if (isValidASCIIURICharacter(c)) {
         // allowed in a "real" URI, copy through unchanged
         b.append(c);
     } else {
         // encode the character as UTF-8, then escape each byte as %HH
         byte[] bytes = Character.toString(c).getBytes("UTF-8");
         for (int j = 0; j < bytes.length; j++) {
             b.append("%");
             b.append(Integer.toHexString((bytes[j] >> 4) & 0x0f));
             b.append(Integer.toHexString(bytes[j] & 0x0f));
         }
     }
}
byte[] asciiUriBytes = b.toString().getBytes("US-ASCII");

Granted, the above is not very efficient.  But the point is you would 
never carry around

myUri.getBytes("UTF-8");

Basically, after unescaping you know the %HH bytes encoded Unicode 
characters because you decode them (as UTF-8) at that point and store 
the result as a sequence of Unicode characters.  If you need a byte 
representation again, you encode the characters and escape the bytes.
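
Going the other direction, here is a rough sketch of the unescaping step 
described above.  The class and method names (UriUnescapeSketch, 
unescape) are made up for illustration and are not part of jCIFS or 
java.net; it also assumes the escapes always carry UTF-8, which is the 
convention being argued for in this thread, not something the code can 
detect.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jCIFS or java.net.
final class UriUnescapeSketch {

    // Collects each run of %HH escapes into bytes, decodes the run as UTF-8
    // (an assumption), and returns the URI as a String of Unicode characters.
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                // Character.digit returns -1 when the character is not a hex digit.
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                run.write((hi << 4) | lo);
                i += 3;
            } else {
                // An unescaped character ends the current run of escaped bytes.
                flush(run, out);
                out.append(c);
                i++;
            }
        }
        flush(run, out);
        return out.toString();
    }

    // Decodes the pending bytes as UTF-8 and appends the resulting characters.
    private static void flush(ByteArrayOutputStream run, StringBuilder out) {
        if (run.size() > 0) {
            out.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
            run.reset();
        }
    }
}
```

With this, UriUnescapeSketch.unescape("nezran%C3%AD") gives the 
seven-character string "nezraní"; the two escaped bytes come back as the 
single character U+00ED.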

>>I'm not sure I follow some of the above... attached is an example of how 
>>the URI class interprets some of this.  The attached class takes 2 URIs 
>>as input, then resolves the second relative to the first and outputs the 
>>result.  I ran this with an absolute URI (containing a mixture of 
>>unescaped Unicode characters and escaped, UTF-8 encoded characters) and 
>>a relative URI (containing similar characters):
>>
>>java testuri smb://svr/slovak/m%C3%B4_em/ 
>>./jest(/sklo/nezran%C3%AD/ma.zip > output.txt
>>
>>Attached is the output -- this seems to be a reasonable interpretation.
>>Does it address any/all of the above?
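
The attached class itself is not shown in this message.  A minimal 
sketch of what such a test presumably does, assuming it relies on 
java.net.URI, is below; the class name UriResolveSketch is made up, and 
the relative URI uses U+0165 as the unescaped character, which is an 
assumption taken from the hex dumps in this thread.

```java
import java.net.URI;

// Hypothetical stand-in for the attached test class; an assumption, not the original.
final class UriResolveSketch {

    // Resolves 'relative' against 'base' using java.net.URI.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static URI resolve(String base, String relative) {
        return URI.create(base).resolve(URI.create(relative));
    }

    public static void main(String[] args) {
        URI result = resolve("smb://svr/slovak/m%C3%B4_em/",
                             "./jes\u0165/sklo/nezran%C3%AD/ma.zip");
        System.out.println(result.toString());      // non-ASCII characters left as entered
        System.out.println(result.toASCIIString()); // non-ASCII characters UTF-8 encoded and escaped
        System.out.println(result.getPath());       // escapes decoded back to characters
    }
}
```

For these inputs toASCIIString() yields 
smb://svr/slovak/m%C3%B4_em/jes%C5%A5/sklo/nezran%C3%AD/ma.zip, so the 
unescaped character and the already-escaped ones end up in the same 
form.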
> 
> 
> Well how are you entering the strings? If you are entering them on
> the console in a non-UTF-8 locale they will not be converted to
> Unicode. Each byte in the UTF-8 sequence will just be treated as
> an individual character.

Exactly -- any time you have a stream of BYTES representing a URI, it 
needs to be escaped.  I'm assuming that if you are using Unicode 
characters in your URI, you are using a character-based representation.  
If a user is going to enter a URI from a console which can't handle 
Unicode characters, they would need to escape any non-ASCII chars.
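
To make the byte-versus-character point concrete, here is a small 
sketch that decodes the same two bytes (C5 A5, the UTF-8 encoding of 
U+0165) once as ISO-8859-1 and once as UTF-8.  The class and method 
names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are for illustration only.
final class DecodeMismatchSketch {

    // Formats every UTF-16 code unit of s as U+XXXX, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Each byte becomes one character, as in a non-UTF-8 locale.
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Multi-byte sequences are combined into single characters.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC5, (byte) 0xA5 };
        System.out.println(codeUnits(asLatin1(raw))); // U+00C5 U+00A5
        System.out.println(codeUnits(asUtf8(raw)));   // U+0165
    }
}
```

The same bytes give two characters under ISO-8859-1 and one under 
UTF-8, which is why a URI arriving as raw bytes is ambiguous unless it 
is escaped.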

> This is actually the point I'm trying to
> make. The problem is evident if you use uri.getBytes("UTF-16") with the
> jcifs.util.Log.printHexDump method like this:
> 
>   import jcifs.util.Log;
>   ...
>   byte[] buf;
>   Log.addMask( Log.HEX_DUMPS );
>   <your code>
>   buf = uri.toString().getBytes("UTF-16");
>   Log.printHexDump("Hexdump: ", buf);
> 
> to show what the UCS code of each character is:
> 
> Feb 7 22:44:24.282 - Hexdump: 
> 00000: FE FF 00 6A 00 65 00 73 00 C5 00 A5 00 2F 00 73  |þÿ.j.e.s.Å.¥./.s|
> 00010: 00 6B 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72  |.k.l.o./.n.e.z.r|
> 00020: 00 61 00 6E 00 25 00 43 00 33 00 25 00 41 00 44  |.a.n.%.C.3.%.A.D|
> 00030: 00 2F 00 6D 00 61 00 2E 00 7A 00 69 00 70        |./.m.a...z.i.p  |
> 
> As you can see the C5 A5 was not interpreted as UTF-8 but rather each
> byte is considered a character in the ISO-8859-1 character set.
> 
> Run your test again but this time first create an input string like:
> 
>   uristring = new String( args[0].getBytes("ISO-8859-1"), "UTF-8" );
>   uri = new URI( uristring );
> 
> Also, if you're on Linux you'll have to run your test in a UTF-8 capable
> xterm and UTF-8 locale. To get a good UTF-8 xterm you need to use the -u8
> option and you need a good Unicode font. Try the following (also with
> nice colors):
> 
>   $ xterm -u8 -bg grey20 -fg grey80 -fn -*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1
> 
> [you might also try the 20pt font instead of the 12pt]
> 
> Now cat the output.txt you sent me and you'll see the non-ascii characters
> the way they're supposed to look. Then run your program again but you have
> to specify the UTF-8 locale like:
> 
> [miallen@nano miallen]$ LANG=en_US.UTF-8 java testuri jes/sklo/nezran%C3%AD/ma.zip
> Feb 7 23:01:41.453 - Hexdump: 
> 00000: FE FF 00 6A 00 65 00 73 01 65 00 2F 00 73 00 6B  |þÿ.j.e.s.e./.s.k|
> 00010: 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 00 61  |.l.o./.n.e.z.r.a|
> 00020: 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 00 2F  |.n.%.C.3.%.A.D./|
> 00030: 00 6D 00 61 00 2E 00 7A 00 69 00 70              |.m.a...z.i.p    |
> 
> So now you can see the C5 A5 was converted properly to U+0165 which is
> a small t with a little apostrophe above and to the right. Actually mine
> is a little upside-down caret but I think that's just a font problem.
> 
> Mike
> 

I attached a screen shot of what I am seeing (not sure if it will make 
it through to the list)... is this correct?

Eric
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.gif
Type: image/gif
Size: 27512 bytes
Desc: not available
Url : http://lists.samba.org/archive/jcifs/attachments/20030208/030f4f2e/output.gif