[jcifs] Character Set discussions
Eric
eglass1 at attbi.com
Sat Feb 8 22:37:43 EST 2003
Michael B. Allen wrote:
> On Fri, 07 Feb 2003 20:35:04 -0500
> Eric <eglass1 at attbi.com> wrote:
>
>
>>>Incidentally here's another problem I just realized. If we accept
>>>both unescaped URLs and URLs that have Unicode characters that have
>>>been converted to UTF-8 and escaped how will we know after unescaping
>>>them that they are really a sequence of UTF-8 bytes encoding a Unicode
>>>character? Ans: You don't. The only way to know is if you know the URL
>>>will always escape such sequences. It cannot be "optional".
>>>
>>
If you unescape the %HH%HHs, you will have to store the URI as a
sequence of characters (i.e., a char[] or String in Java), not as a
byte[] (sequence of bytes). If you plan on storing them as a sequence
of bytes, you have to escape everything outside of the ~60 characters
allowed in a "real" URI.
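As a rough illustration of the two representations, here is a minimal
sketch using java.net.URI. The class name UriViewsSketch, its helper
methods, and the smb://svr/slovak/ base are made up for the example;
only the java.net.URI calls (create, getPath, toASCIIString, resolve)
are real API.

```java
import java.net.URI;

class UriViewsSketch {
    // Character view: escaped octets in the path are decoded as UTF-8.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    // Byte-safe view: characters outside US-ASCII are UTF-8 encoded
    // and written as %HH escapes; existing escapes are left alone.
    static String asciiForm(String uri) {
        return URI.create(uri).toASCIIString();
    }

    // Resolve a relative reference against a base, then return the ASCII form.
    static String resolveToAscii(String base, String relative) {
        return URI.create(base).resolve(relative).toASCIIString();
    }

    public static void main(String[] args) {
        // Mixes an unescaped character (U+0165) with an escaped UTF-8 sequence.
        String mixed = "smb://svr/slovak/jes\u0165/nezran%C3%AD/ma.zip";
        System.out.println(decodedPath(mixed));
        System.out.println(asciiForm(mixed));
        System.out.println(resolveToAscii("smb://svr/slovak/", "./jes\u0165/ma.zip"));
    }
}
```

The decoded path is only meaningful as a String; the ASCII form is the
one that is safe to turn into bytes.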
Assuming you have a Unicode String containing some URI with exotic
characters, and you want to get an ASCII representation, you would do
something like:
StringBuffer b = new StringBuffer();
for (int i = 0; i < myUri.length(); i++) {
    char c = myUri.charAt(i);
    if (isValidASCIIURICharacter(c)) {
        b.append(c);
    } else {
        // Encode the character as UTF-8, then escape each byte as %HH.
        byte[] bytes = Character.toString(c).getBytes("UTF-8");
        for (int j = 0; j < bytes.length; j++) {
            b.append('%');
            b.append(Integer.toHexString((bytes[j] >> 4) & 0x0f));
            b.append(Integer.toHexString(bytes[j] & 0x0f));
        }
    }
}
byte[] asciiUriBytes = b.toString().getBytes("US-ASCII");
Granted, the above is not very efficient. But the point is you would
never carry around
myUri.getBytes("UTF-8");
Basically, after unescaping there is no ambiguity about whether you
have UTF-8 bytes or characters, because you are storing the URI as a
sequence of Unicode characters. If you need a byte representation, you
need to encode the characters as UTF-8 and escape the resulting bytes.
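For the opposite direction, a minimal sketch of unescaping might look
like the following. The class and method names are hypothetical, and it
assumes every %HH escape carries a UTF-8 byte; malformed UTF-8 is not
rejected but decoded to U+FFFD by the String constructor.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UriUnescapeSketch {
    // Collect all bytes first, then decode once, so that a multi-byte
    // UTF-8 sequence spread over several %HH escapes is decoded whole.
    static String unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("truncated escape at " + i);
                }
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Unescaped character: contribute its own UTF-8 bytes.
                int cp = escaped.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = unescape("nezran%C3%AD");
        System.out.println(s.length()); // prints 7: the two escapes became one char
    }
}
```

The result is a String, so from here on the URI is carried around as
characters and only re-escaped when bytes are needed again.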
>>I'm not sure I follow some of the above... attached is an example of how
>>the URI class interprets some of this. The attached class takes 2 URIs
>>as input, then resolves the second relative to the first and outputs the
>>result. I ran this with an absolute URI (containing a mixture of
>>unescaped Unicode characters and escaped, UTF-8 encoded characters) and
>>a relative URI (containing similar characters):
>>
>>java testuri smb://svr/slovak/m%C3%B4žem/
>>./jesť/sklo/nezran%C3%AD/ma.zip > output.txt
>>
>>Attached is the output -- this seems to be a reasonable interpretation.
>>Does it address any/all of the above?
>
>
> Well how are you entering the strings? If you are entering them on
> the console in a non-UTF-8 locale they will not be converted to
> Unicode. Each byte in the UTF-8 sequence will just be treated as
> an individual character.
Exactly -- any time you have a stream of BYTES representing a URI, it
needs to be escaped. I'm assuming that if you are using Unicode
characters in your URI, you are using a character-based representation.
If a user is going to enter a URI from a console which can't handle
Unicode characters, they would need to escape any non-ASCII chars.
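To make the byte-versus-character distinction concrete, here is a small
sketch that decodes the same two bytes (C5 A5, the UTF-8 encoding of
U+0165) under two charsets and dumps the resulting string as UTF-16.
The class and helper names are invented for illustration; the charset
constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class ByteInterpretationSketch {
    // The two bytes UTF-8 uses for U+0165.
    static final byte[] T_CARON_UTF8 = { (byte) 0xC5, (byte) 0xA5 };

    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Hex of the string's UTF-16 encoding; Java writes a big-endian BOM first.
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(asLatin1(T_CARON_UTF8))); // FE FF 00 C5 00 A5
        System.out.println(utf16Hex(asUtf8(T_CARON_UTF8)));   // FE FF 01 65
    }
}
```

Decoded as ISO-8859-1 the bytes become two separate characters; decoded
as UTF-8 they become the single character U+0165.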
> This is actually the point I'm trying to
> make. The problem is evident if you use uri.toString().getBytes("UTF-16")
> with the jcifs.util.Log.printHexDump method like this:
>
> import jcifs.util.Log;
> ...
> byte[] buf;
> Log.addMask( Log.HEX_DUMPS );
> <your code>
> buf = uri.toString().getBytes("UTF-16");
> Log.printHexDump("Hexdump: ", buf);
>
> to show what the UCS code of each character is:
>
> Feb 7 22:44:24.282 - Hexdump:
> 00000: FE FF 00 6A 00 65 00 73 00 C5 00 A5 00 2F 00 73 |þÿ.j.e.s.Å.¥./.s|
> 00010: 00 6B 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 |.k.l.o./.n.e.z.r|
> 00020: 00 61 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 |.a.n.%.C.3.%.A.D|
> 00030: 00 2F 00 6D 00 61 00 2E 00 7A 00 69 00 70 |./.m.a...z.i.p |
>
> As you can see the C5 A5 was not interpreted as UTF-8 but rather each
> byte is considered a character in the ISO-8859-1 character set.
>
> Run your test again but this time first create an input string like:
>
> uristring = new String( args[0].getBytes("ISO-8859-1"), "UTF-8" );
> uri = new URI( uristring );
>
> Also, if you're on Linux you'll have to run your test in a UTF-8 capable
> xterm and UTF-8 locale. To get a good UTF-8 xterm you need to use the -u8
> option and you need a good Unicode font. Try the following (also with
> nice colors):
>
> $ xterm -u8 -bg grey20 -fg grey80 -fn -*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1
>
> [you might also try the 20pt font instead of the 12pt]
>
> Now cat the output.txt you sent me and you'll see the non-ascii characters
> the way they're supposed to look. Then run your program again but you have
> to specify the UTF-8 locale like:
>
> [miallen at nano miallen]$ LANG=en_US.UTF-8 java testuri jesť/sklo/nezran%C3%AD/ma.zip
> Feb 7 23:01:41.453 - Hexdump:
> 00000: FE FF 00 6A 00 65 00 73 01 65 00 2F 00 73 00 6B |þÿ.j.e.s.e./.s.k|
> 00010: 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 00 61 |.l.o./.n.e.z.r.a|
> 00020: 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 00 2F |.n.%.C.3.%.A.D./|
> 00030: 00 6D 00 61 00 2E 00 7A 00 69 00 70 |.m.a...z.i.p |
>
> So now you can see the C5 A5 was converted properly to U+0165 which is
> a small t with a little apostrophe above and to the right. Actually mine
> is a little upside-down caret but I think that's just a font problem.
>
> Mike
>
I attached a screen shot of what I am seeing (not sure if it will make
it through to the list)... is this correct?
Eric
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.gif
Type: image/gif
Size: 27512 bytes
Desc: not available
Url : http://lists.samba.org/archive/jcifs/attachments/20030208/030f4f2e/output.gif