[jcifs] Character Set discussions

Michael B. Allen miallen at eskimo.com
Sun Feb 9 07:37:54 EST 2003


This did get axed by the list software. Here it is without the
graphic. I'm responding in a separate message.


On Sat, 08 Feb 2003 06:37:43 -0500
Eric <eglass1 at attbi.com> wrote:

> Michael B. Allen wrote:
> > On Fri, 07 Feb 2003 20:35:04 -0500
> > Eric <eglass1 at attbi.com> wrote:
> > 
> > 
> >>>Incidentally, here's another problem I just realized. If we accept
> >>>both unescaped URLs and URLs that have Unicode characters that have
> >>>been converted to UTF-8 and escaped, how will we know after unescaping
> >>>them that they are really a sequence of UTF-8 bytes encoding a Unicode
> >>>character? Ans: You don't. The only way to know is if you know the URL
> >>>will always escape such sequences. It cannot be "optional".
> >>>
> >>
> 
> If you unescape the %HH%HHs, you will have to store the URI as a 
> sequence of characters (i.e., a char[] or String in Java), not as a 
> byte[] (sequence of bytes).  If you plan on storing them as a sequence 
> of bytes, you have to escape everything outside of the ~60 characters 
> allowed in a "real" URI.
> 
> Assuming you have a Unicode String containing some URI with exotic 
> characters, and you want to get an ASCII representation, you would do 
> something like:
> 
> StringBuffer b = new StringBuffer();
> for (int i = 0; i < myUri.length(); i++) {
>      char c = myUri.charAt(i);
>      if (isValidASCIIURICharacter(c)) {
>          b.append(c);
>      } else {
>          // Escape every UTF-8 byte of the character as its own %HH.
>          byte[] bytes = Character.toString(c).getBytes("UTF-8");
>          for (int j = 0; j < bytes.length; j++) {
>              b.append("%");
>              b.append(Integer.toHexString((bytes[j] >> 4) & 0x0f));
>              b.append(Integer.toHexString(bytes[j] & 0x0f));
>          }
>      }
> }
> byte[] asciiUriBytes = b.toString().getBytes("US-ASCII");
> 
> Granted, the above is not very efficient.  But the point is you would 
> never carry around
> 
> myUri.getBytes("UTF-8");
> 
> Basically, after unescaping you know they are a sequence of bytes 
> encoding a Unicode character because you are storing them as a sequence 
> of Unicode characters.  If you need a byte representation, you need to 
> encode and escape the bytes.
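
For the opposite direction, here is a minimal sketch of unescaping, assuming
every %HH run in the URI encodes UTF-8 bytes (the convention argued for above).
The class and method names are hypothetical and not part of jcifs, and it uses
java.nio.charset.StandardCharsets, which is newer than the JDK this thread was
written against. Consecutive escapes are collected as bytes and decoded
together, because one character may span several %HH triplets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UnescapeSketch {

    /** Decodes %HH escapes as UTF-8; a '%' not followed by two hex digits is kept as-is. */
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    pending.write((hi << 4) | lo); // collect the raw byte
                    i += 2;                        // skip the two hex digits
                    continue;
                }
            }
            flush(pending, out); // a literal character ends the current byte run
            out.append(c);
        }
        flush(pending, out);
        return out.toString();
    }

    /** Decodes the collected bytes as one UTF-8 run and appends the result. */
    private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        String s = unescape("nezran%C3%AD/ma.zip");
        // The two escaped bytes C3 AD collapse into the single character U+00ED.
        System.out.println(s.length());
    }
}
```

The result is a String, that is, a sequence of characters, which matches the
point above: the unescaped form is only safe to hold as characters, never as
raw bytes.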
> 
> >>I'm not sure I follow some of the above... attached is an example of how 
> >>the URI class interprets some of this.  The attached class takes 2 URIs 
> >>as input, then resolves the second relative to the first and outputs the 
> >>result.  I ran this with an absolute URI (containing a mixture of 
> >>unescaped Unicode characters and escaped, UTF-8 encoded characters) and 
> >>a relative URI (containing similar characters):
> >>
> >>java testuri smb://svr/slovak/m%C3%B4žem/ 
> >>./jesť/sklo/nezran%C3%AD/ma.zip > output.txt
> >>
> >>Attached is the output -- this seems to be a reasonable interpretation.
> >>Does it address any/all of the above?
> > 
> > 
> > Well how are you entering the strings? If you are entering them on
> > the console in a non-UTF-8 locale they will not be converted to
> > Unicode. Each byte in the UTF-8 sequence will just be treated as
> > an individual character.
> 
> Exactly -- any time you have a stream of BYTES representing a URI, it 
> needs to be escaped.  I'm assuming that if you are using Unicode 
> characters in your URI, you are using a character-based representation. 
>   If a user is going to enter a URI from a console which can't handle 
> Unicode characters, they would need to escape any non-ASCII chars.
> 
> > This is actually the point I'm trying to
> > make. The problem is evident if you use uri.getBytes("UTF-16") with the
> > jcifs.util.Log.printHexDump method like this:
> > 
> >   import jcifs.util.Log;
> >   ...
> >   byte[] buf;
> >   Log.addMask( Log.HEX_DUMPS );
> >   <your code>
> >   buf = uri.toString().getBytes("UTF-16");
> >   Log.printHexDump("Hexdump: ", buf);
> > 
> > to show what the UCS code of each character is:
> > 
> > Feb 7 22:44:24.282 - Hexdump: 
> > 00000: FE FF 00 6A 00 65 00 73 00 C5 00 A5 00 2F 00 73  |þÿ.j.e.s.Å.¥./.s|
> > 00010: 00 6B 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72  |.k.l.o./.n.e.z.r|
> > 00020: 00 61 00 6E 00 25 00 43 00 33 00 25 00 41 00 44  |.a.n.%.C.3.%.A.D|
> > 00030: 00 2F 00 6D 00 61 00 2E 00 7A 00 69 00 70        |./.m.a...z.i.p  |
> > 
> > As you can see the C5 A5 was not interpreted as UTF-8 but rather each
> > byte is considered a character in the ISO-8859-1 character set.
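
The two readings of those bytes can be compared directly. This is a small
sketch using only the standard library (the class and helper names are
hypothetical, and StandardCharsets is newer than the JDK used in this thread):
the same two bytes, C5 A5, are decoded once as ISO-8859-1 and once as UTF-8,
and the resulting code units are printed.

```java
import java.nio.charset.StandardCharsets;

class DecodeComparison {

    /** Formats each UTF-16 code unit of s as U+XXXX, separated by spaces. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC5, (byte) 0xA5 };

        // Each byte becomes its own character.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        // Both bytes together form one character.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        System.out.println(codeUnits(asLatin1)); // U+00C5 U+00A5
        System.out.println(codeUnits(asUtf8));   // U+0165
    }
}
```

Nothing in the bytes themselves says which reading is intended, which is why
the decoding has to be fixed by convention.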
> > 
> > Run your test again but this time first create an input string like:
> > 
> >   uristring = new String( args[0].getBytes("ISO-8859-1"), "UTF8" );
> >   uri = new URI( uristring );
> > 
> > Also, if you're on Linux you'll have to run your test in a UTF-8 capable
> > xterm and UTF-8 locale. To get a good UTF-8 xterm you need to use the -u8
> > option and you need a good Unicode font. Try the following (also with
> > nice colors):
> > 
> >   $ xterm -u8 -bg grey20 -fg grey80 -fn -*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1
> > 
> > [you might also try the 20pt font instead of the 12pt]
> > 
> > Now cat the output.txt you sent me and you'll see the non-ascii characters
> > the way they're supposed to look. Then run your program again but you have
> > to specify the UTF-8 locale like:
> > 
> > [miallen at nano miallen]$ LANG=en_US.UTF-8 java testuri jesť/sklo/nezran%C3%AD/ma.zip
> > Feb 7 23:01:41.453 - Hexdump: 
> > 00000: FE FF 00 6A 00 65 00 73 01 65 00 2F 00 73 00 6B  |þÿ.j.e.s.e./.s.k|
> > 00010: 00 6C 00 6F 00 2F 00 6E 00 65 00 7A 00 72 00 61  |.l.o./.n.e.z.r.a|
> > 00020: 00 6E 00 25 00 43 00 33 00 25 00 41 00 44 00 2F  |.n.%.C.3.%.A.D./|
> > 00030: 00 6D 00 61 00 2E 00 7A 00 69 00 70              |.m.a...z.i.p    |
> > 
> > So now you can see the C5 A5 was converted properly to U+0165, which is
> > a small t with a little apostrophe above and to the right. Actually mine
> > is a little upside-down caret, but I think that's just a font problem.
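
For anyone who wants to check the code units without jcifs on the classpath,
here is a rough stand-in for the hex dump using only the standard library. The
class and method names are hypothetical and it prints only the hex column, not
the offsets or the ASCII gutter that jcifs.util.Log.printHexDump produces.
Encoding to Java's "UTF-16" charset writes a big-endian byte order mark first,
which is where the leading FE FF in the dumps comes from.

```java
import java.nio.charset.StandardCharsets;

class Utf16Dump {

    /** Renders each byte as two uppercase hex digits, separated by spaces. */
    static String hex(byte[] buf) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < buf.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", buf[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "jes" followed by U+0165 and a slash, as in the correctly decoded URI.
        String s = "jes\u0165/";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));
        // FE FF 00 6A 00 65 00 73 01 65 00 2F
    }
}
```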
> > 
> > Mike
> > 
> 
> I attached a screen shot of what I am seeing (not sure if it will make 
> it through to the list)... is this correct?
> 
> 
> Eric
> 


-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 

