[jcifs] Character Set discussions

Sun Feb 9 07:37:54 EST 2003

This did get axed by the list software. Here it is without the
graphic. I'm responding in a separate message.

On Sat, 08 Feb 2003 06:37:43 -0500
Eric <eglass1 at attbi.com> wrote:

> Michael B. Allen wrote:
> > On Fri, 07 Feb 2003 20:35:04 -0500
> > Eric <eglass1 at attbi.com> wrote:
> > 
> > 
> >>>Incidentally, here's another problem I just realized. If we accept
> >>>both unescaped URLs and URLs that have Unicode characters that have
> >>>been converted to UTF-8 and escaped, how will we know after unescaping
> >>>them that they are really a sequence of UTF-8 bytes encoding a Unicode
> >>>character? Ans: You don't. The only way to know is if you know the URL
> >>>will always escape such sequences. It cannot be "optional".
> >>>
> >>
> 
> If you unescape the %HH%HHs, you will have to store the URI as a 
> sequence of characters (i.e., a char[] or String in Java), not as a 
> byte[] (sequence of bytes).  If you plan on storing them as a sequence 
> of bytes, you have to escape everything outside of the ~60 characters 
> allowed in a "real" URI.
> 
> Assuming you have a Unicode String containing some URI with exotic 
> characters, and you want to get an ASCII representation, you would do 
> something like:
> 
> StringBuffer b = new StringBuffer();
> for (int i = 0; i < myUri.length(); i++) {
>      char c = myUri.charAt(i);
>      if (isValidASCIIURICharacter(c)) {
>          b.append(c);
>      } else {
>          // Escape every UTF-8 byte of the character as %HH.
>          byte[] bytes = Character.toString(c).getBytes("UTF-8");
>          for (int j = 0; j < bytes.length; j++) {
>              b.append("%");
>              b.append(Integer.toHexString((bytes[j] >> 4) & 0x0f));
>              b.append(Integer.toHexString(bytes[j] & 0x0f));
>          }
>      }
> }
> byte[] asciiUriBytes = b.toString().getBytes("US-ASCII");
> 
> Granted, the above is not very efficient.  But the point is you would 
> never carry around
> 
> myUri.getBytes("UTF-8");
> 
> Basically, after unescaping you know they are a sequence of bytes 
> encoding a Unicode character because you are storing them as a sequence 
> of Unicode characters.  If you need a byte representation, you need to 
> encode and escape the bytes.
> 
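For reference, the reverse direction described above (unescape, then keep the result as characters rather than bytes) might look roughly like the sketch below. The class and method names are made up for illustration and are not part of jcifs or java.net. It assumes every %HH run is UTF-8, and it does not validate malformed escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UriUnescape {
    // Hypothetical helper: collects %HH escapes as raw bytes, re-encodes any
    // literal characters as UTF-8, then decodes the whole buffer as UTF-8.
    // Assumption: escaped bytes are always UTF-8. A trailing "%" or "%H" is
    // copied through unchanged; non-hex digits after '%' are not handled.
    static String unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                // Two hex digits follow: emit the byte they name.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // Literal character (possibly a surrogate pair): emit its UTF-8 bytes.
                int cp = escaped.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The escaped i-acute (%C3%AD) becomes the single character U+00ED.
        String s = unescape("smb://svr/nezran%C3%AD/ma.zip");
        System.out.println(Integer.toHexString(s.charAt(16)));
    }
}
```

Because literal characters and escaped bytes end up in one UTF-8 buffer before decoding, a string that mixes both forms comes out the same as a fully escaped one, which is the property the paragraph above relies on.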
> >>I'm not sure I follow some of the above... attached is an example of how 
> >>the URI class interprets some of this.  The attached class takes 2 URIs 
> >>as input, then resolves the second relative to the first and outputs the 
> >>result.  I ran this with an absolute URI (containing a mixture of 
> >>unescaped Unicode characters and escaped, UTF-8 encoded characters) and 
> >>a relative URI (containing similar characters):
> >>
> >>java testuri smb://svr/slovak/m%C3%B4_em/ 
> >>./jesť/sklo/nezran%C3%AD/ma.zip > output.txt
> >>
> >>Attached is the output -- this seems to be a reasonable interpretation.
> >>Does it address any/all of the above?
> > 
> > 
> > Well how are you entering the strings? If you are entering them on
> > the console in a non-UTF-8 locale they will not be converted to
> > Unicode. Each byte in the UTF-8 sequence will just be treated as
> > an individual character.
> 
> Exactly -- any time you have a stream of BYTES representing a URI, it 
> needs to be escaped.  I'm assuming that if you are using Unicode 
> characters in your URI, you are using a character-based representation. 
>   If a user is going to enter a URI from a console which can't handle 
> Unicode characters, they would need to escape any non-ASCII chars.
> 
> > This is actually the point I'm trying to
> > make. The problem is evident if you use uri.getBytes("UTF-16") with the
> > jcifs.util.Log.printHexDump method like this:
> > 
> >   import jcifs.util.Log;
> >   ...
> >   byte[] buf;
> >   Log.addMask( Log.HEX_DUMPS );
> >   <your code>
> >   buf = uri.toString().getBytes("UTF-16");
> >   Log.printHexDump("Hexdump: ", buf);
> > 
> > to show what the UCS code of each character is:
> > 
> > Feb 7 22:44:24.282 - Hexdump: 
> > 00000: FE FF 00 6A 00 65 00 73 00 C5 00 A5 00 2F 00 73  |þÿ.j.e.s.Å.¥./.s|
> > 00010: 00 6B 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72  |.k.l.o./.n.e.z.r|
> > 00020: 00 61 00 6E 00 25 00 43 00 33 00 25 00 41 00 44  |.a.n.%.C.3.%.A.D|
> > 00030: 00 2F 00 6D 00 61 00 2E 00 7A 00 69 00 70        |./.m.a...z.i.p  |
> > 
> > As you can see the C5 A5 was not interpreted as UTF-8 but rather each
> > byte is considered a character in the ISO-8859-1 character set.
> > 
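The two hex dumps in this message can be reproduced without jcifs. The sketch below is only an illustration (the class and helper names are invented): it takes the raw bytes C5 A5, decodes them once as ISO-8859-1 and once as UTF-8, and prints the UTF-16 encoding of each result.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Java's "UTF-16" encoder writes a big-endian BOM (FE FF) first.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC5, (byte) 0xA5 }; // UTF-8 encoding of U+0165

        // Each byte becomes its own character: U+00C5 and U+00A5.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        // The two bytes together become one character: U+0165.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        System.out.println(utf16Hex(asLatin1)); // FE FF 00 C5 00 A5
        System.out.println(utf16Hex(asUtf8));   // FE FF 01 65
    }
}
```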
> > Run your test again but this time first create an input string like:
> > 
> >   uristring = new String( args[0].getBytes("ISO-8859-1"), "UTF-8" );
> >   uri = new URI( uristring );
> > 
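A self-contained sketch of that re-decoding step follows. The helper name is hypothetical, and the approach only works under the assumption stated above: the argument was decoded byte-for-byte as ISO-8859-1. With any other console encoding the original bytes may not be recoverable this way.

```java
import java.nio.charset.StandardCharsets;

public class Redecode {
    // Hypothetical helper. Assumes the input was produced by decoding UTF-8
    // bytes as ISO-8859-1, so each char holds exactly one original byte.
    static String latin1ToUtf8(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "jes" followed by U+00C5 U+00A5, as in the first hex dump.
        String fixed = latin1ToUtf8("jes\u00C5\u00A5");
        System.out.println(Integer.toHexString(fixed.charAt(3))); // 165
    }
}
```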
> > Also, if you're on Linux you'll have to run your test in a UTF-8 capable
> > xterm and UTF-8 locale. To get a good UTF-8 xterm you need to use the -u8
> > option and you need a good Unicode font. Try the following (also with
> > nice colors):
> > 
> >   $ xterm -u8 -bg grey20 -fg grey80 -fn -*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1
> > 
> > [you might also try the 20pt font instead of the 12pt]
> > 
> > Now cat the output.txt you sent me and you'll see the non-ASCII characters
> > the way they're supposed to look. Then run your program again but you have
> > to specify the UTF-8 locale like:
> > 
> > [miallen at nano miallen]$ LANG=en_US.UTF-8 java testuri jesť/sklo/nezran%C3%AD/ma.zip
> > Feb 7 23:01:41.453 - Hexdump: 
> > 00000: FE FF 00 6A 00 65 00 73 01 65 00 2F 00 73 00 6B  |þÿ.j.e.s.e./.s.k|
> > 00010: 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 00 61  |.l.o./.n.e.z.r.a|
> > 00020: 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 00 2F  |.n.%.C.3.%.A.D./|
> > 00030: 00 6D 00 61 00 2E 00 7A 00 69 00 70              |.m.a...z.i.p    |
> > 
> > So now you can see the C5 A5 was converted properly to U+0165, which is
> > a small t with a little apostrophe above and to the right. Actually mine
> > shows a little upside-down caret, but I think that's just a font problem.
> > 
> > Mike
> > 
> 
> I attached a screen shot of what I am seeing (not sure if it will make 
> it through to the list)... is this correct?
> 
> 
> Eric
> 

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.