[jcifs] Character Set discussions

Michael B. Allen miallen at eskimo.com
Sat Feb 8 15:14:34 EST 2003


On Fri, 07 Feb 2003 20:35:04 -0500
Eric <eglass1 at attbi.com> wrote:

> > Incidentally, here's another problem I just realized. If we accept
> > both unescaped URLs and URLs that have Unicode characters that have
> > been converted to UTF-8 and escaped, how will we know after unescaping
> > them that they are really a sequence of UTF-8 bytes encoding a Unicode
> > character? Ans: You don't. The only way to know is if you know the URL
> > will always escape such sequences. It cannot be "optional".
> > 
> 
> I'm not sure I follow some of the above... attached is an example of how 
> the URI class interprets some of this.  The attached class takes 2 URIs 
> as input, then resolves the second relative to the first and outputs the 
> result.  I ran this with an absolute URI (containing a mixture of 
> unescaped Unicode characters and escaped, UTF-8 encoded characters) and 
> a relative URI (containing similar characters):
> 
> java testuri smb://svr/slovak/m%C3%B4žem/ 
> ./jesť/sklo/nezran%C3%AD/ma.zip > output.txt
> 
> Attached is the output -- this seems to be a reasonable interpretation.
> Does it address any/all of the above?

Well, how are you entering the strings? If you are entering them on
the console in a non-UTF-8 locale, they will not be converted to
Unicode correctly. Each byte in the UTF-8 sequence will just be treated
as an individual character. This is actually the point I'm trying to
make. The problem is evident if you use uri.toString().getBytes("UTF-16")
with the jcifs.util.Log.printHexDump method like this:

  import jcifs.util.Log;
  ...
  byte[] buf;
  Log.addMask( Log.HEX_DUMPS );
  <your code>
  buf = uri.toString().getBytes("UTF-16");
  Log.printHexDump("Hexdump: ", buf);

to show what the UCS code of each character is:

Feb 7 22:44:24.282 - Hexdump: 
00000: FE FF 00 6A 00 65 00 73 00 C5 00 A5 00 2F 00 73  |þÿ.j.e.s.Å.¥./.s|
00010: 00 6B 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72  |.k.l.o./.n.e.z.r|
00020: 00 61 00 6E 00 25 00 43 00 33 00 25 00 41 00 44  |.a.n.%.C.3.%.A.D|
00030: 00 2F 00 6D 00 61 00 2E 00 7A 00 69 00 70        |./.m.a...z.i.p  |

As you can see the C5 A5 was not interpreted as UTF-8 but rather each
byte is considered a character in the ISO-8859-1 character set.
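
To see the difference without jcifs, here is a minimal sketch that decodes
the same two bytes both ways using only the standard library. The class name
DecodeTwoWays and the helper codeUnits are made up for illustration, and
StandardCharsets assumes Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

public class DecodeTwoWays {
    // Hypothetical helper: formats each UTF-16 code unit of s as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes that follow "jes" in the hexdump above.
        byte[] raw = { (byte) 0xC5, (byte) 0xA5 };

        // One character per byte: what a non-UTF-8 locale effectively does.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        // Both bytes together as a single UTF-8 sequence.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        System.out.println(codeUnits(asLatin1)); // U+00C5 U+00A5
        System.out.println(codeUnits(asUtf8));   // U+0165
    }
}
```

The first line matches the 00 C5 00 A5 pair in the dump; the second is what
a correct UTF-8 decode should produce.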

Run your test again but this time first create an input string like:

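  // assumes args[0] was decoded one byte per char (ISO-8859-1): recover the bytes, then decode as UTF-8
  uristring = new String( args[0].getBytes("ISO-8859-1"), "UTF8" );
  uri = new URI( uristring );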

Also, if you're on Linux you'll have to run your test in a UTF-8-capable
xterm and a UTF-8 locale. To get a good UTF-8 xterm you need to use the -u8
option and you need a good Unicode font. Try the following (also with
nice colors):

  $ xterm -u8 -bg grey20 -fg grey80 -fn -*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1

[you might also try the 20pt font instead of the 12pt]

Now cat the output.txt you sent me and you'll see the non-ASCII characters
the way they're supposed to look. Then run your program again, but you have
to specify the UTF-8 locale like:

[miallen at nano miallen]$ LANG=en_US.UTF-8 java testuri jesť/sklo/nezran%C3%AD/ma.zip
Feb 7 23:01:41.453 - Hexdump: 
00000: FE FF 00 6A 00 65 00 73 01 65 00 2F 00 73 00 6B  |þÿ.j.e.s.e./.s.k|
00010: 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 00 61  |.l.o./.n.e.z.r.a|
00020: 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 00 2F  |.n.%.C.3.%.A.D./|
00030: 00 6D 00 61 00 2E 00 7A 00 69 00 70              |.m.a...z.i.p    |

So now you can see the C5 A5 was converted properly to U+0165, which is
a small t with a little apostrophe above and to the right. Actually mine
is a little upside-down caret, but I think that's just a font problem.
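
Coming back to the escaping question, here is a rough sketch of why the
unescaped bytes alone cannot tell you which encoding was meant. The class
UnescapeAmbiguity and its unescape helper are hypothetical, not part of jcifs
or java.net.URI, and the helper assumes every non-escaped character is ASCII:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UnescapeAmbiguity {
    // Hypothetical helper: replaces each %XX escape with the byte it names.
    // All other characters are assumed to be ASCII and are copied as one byte.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = unescape("nezran%C3%AD");

        // The same eight bytes decode without error under either charset.
        String utf8 = new String(bytes, StandardCharsets.UTF_8);
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length());   // 7: C3 AD became one char, U+00ED
        System.out.println(latin1.length()); // 8: C3 and AD became two chars
    }
}
```

Both decodes succeed, so nothing in the bytes says which reading is right.
That is why the rule about escaping has to be fixed in advance.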

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 

