[jcifs] Character Encioding Problems on OS/390

Fri Oct 25 11:39:19 EST 2002

-----Original Message-----
From: Allen, Michael B (RSCH)
To: 'Eric Glass'
Cc: 'Bigham, Ted'; 'jcifs at samba.org'
Sent: 10/24/2002 8:13 PM
Subject: RE: [jcifs] Character Encioding Problems on OS/390

> -----Original Message-----
> From:	Eric Glass [SMTP:eglass1 at attbi.com]
> Sent:	Thursday, October 24, 2002 8:04 PM
> To:	Allen, Michael B (RSCH)
> Cc:	'Bigham, Ted'; 'jcifs at samba.org'
> Subject:	RE: [jcifs] Character Encioding Problems on OS/390
> 
> On Thu, 2002-10-24 at 18:01, Allen, Michael B (RSCH) wrote:
> > 
> >
> > > 
> > 	Or you mean use char[] for all array work and then at the last minute
> > 	create a String from it and do getBytes( "ISO-8859-1" ). I still don't
> > 	understand where the UTF-8 comes in though. Also you sound like you know
> > 	of all of the locations in the code where these changes would need to
> > 	occur but almost all operations are byte oriented. Can you give me a few
> > 	example locations?
> > 

Using char arrays would have the same effect as using Strings.  Somewhere
down the line those chars have to be converted into bytes, and that is
where the right encoding has to be specified.  An example is
jcifs.smb.SmbComNegotiate.  It has the following declaration...

    static final byte[] dialects = new String(
        '\2' + "NT LM 0.12" + '\0' ).getBytes();

As you can imagine, the bytes generated from this on an EBCDIC machine are
going to be completely different from those on an ASCII machine, because
getBytes() with no argument uses the platform default encoding.  This
class is actually the first one in my path of failure, as you can tell from
the difference in the hex dumps between the two messages:

This one fails...
Oct 23 12:06:17.969 - datagram packet sent to: 148.162.36.216
00000: 00 01 00 00 00 01 00 00 00 00 00 00 20 43 4B 41  |............ CKA|
00010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
00020: 41 41 41 41 41 41 41 41 41 41 41 41 41 00 00 21  |AAAAAAAAAAAAA..!|
00030: 00 01                                            |..              |

This one works and comes from the exact same program, just with a different
file.encoding property setting...
Oct 23 12:10:27.996 - smb sent
00000: FF 53 4D 42 72 00 00 00 00 18 01 80 00 00 00 00  |.SMBr...........|
00010: 00 00 00 00 00 00 00 00 00 00 7F 89 00 00 01 00  |................|
00020: 00 0C 00 02 4E 54 20 4C 4D 20 30 2E 31 32 00     |....NT LM 0.12. |
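The last twelve bytes of the working dump (02 4E 54 ... 32 00) are what you
get when the dialect string is encoded with an explicit ISO-8859-1 charset
instead of the platform default.  A minimal sketch of that, not the actual
jCIFS code: the class and method names (DialectBytes, encode, hex) are made
up for illustration, and the EBCDIC comparison only runs if the JVM happens
to ship the optional IBM1047 charset.  StandardCharsets needs Java 7 or
later; on older JVMs the string form getBytes("ISO-8859-1") is the
equivalent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DialectBytes {
    // Same dialect string as in the thread: 0x02 marker, name, NUL terminator.
    static final String DIALECT = '\2' + "NT LM 0.12" + '\0';

    // Encode with an explicit charset so the result does not depend on file.encoding.
    static byte[] encode(Charset cs) {
        return DIALECT.getBytes(cs);
    }

    // Format bytes as space-separated upper-case hex, like the dumps above.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + hex(encode(StandardCharsets.ISO_8859_1)));
        // Assumption: IBM1047 (an EBCDIC code page) is an optional charset,
        // so check for it before using it.
        if (Charset.isSupported("IBM1047")) {
            System.out.println("IBM1047:    " + hex(encode(Charset.forName("IBM1047"))));
        }
    }
}
```

On a JVM that has the EBCDIC charset, the second line shows the kind of
bytes a default getBytes() would put on the wire on OS/390.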

> 
> The UTFs (Unicode Transformation Formats) are means of representing
> Unicode characters; in UTF-8's case, 0-127 are represented as a single
> byte (same as ASCII and the lower half of ISO-8859-1).  For characters
> above 127, UTF-8 is a multibyte representation.  I believe ISO-8859-1 is
> incapable of representing 256+.
> 
>	I'm familiar with the encoding. I just am not clear on what it has to do
>	with Ted's problem.

I was just saying UTF-8 where I should have been saying ISO-8859-1.

> A good explanation of the various character sets can be found here:
> 
> http://www.czyborra.com/utf/
> http://czyborra.com/charsets/iso8859.html
> 
> as well as a brief discussion of EBCDIC, the issue at hand:
> 
> http://czyborra.com/charsets/iso646.html
> 
> As far as jCIFS is concerned, it probably doesn't matter which encoding
> you use; a String containing characters over 255 would be encoded as
> multiple bytes using UTF-8, which (I'm guessing) would be meaningless to
> jCIFS.
> 
>	Java's internal encoding is UCS-2 not UTF-8 but the encoding doesn't
>	really matter because the String class abstracts this fact.
>

Right, we never care about the internal encoding, just what's coming in and
going out.

>   Characters over 255 can't be represented using ISO-8859-1, and
> the behavior in this case is unspecified (according to the String
> Javadocs).  So either way, you'll probably get garbage with any input
> characters over 255, which isn't really an issue unless the underlying
> network protocol has specified a means of handling it (in which case you
> would use the specified encoding).
> 
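For what it's worth, with the Charset-based API in later JDKs the lenient
path is defined: getBytes(Charset) substitutes the charset's replacement
byte for anything it cannot map.  If you would rather fail loudly than send
a substituted byte, a CharsetEncoder can be told to report the problem.  A
rough sketch under those assumptions; Latin1Check, lenient and strict are
hypothetical names, not anything in jCIFS.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Latin1Check {
    // Lenient: unmappable characters are silently replaced by the charset's
    // default replacement byte.
    static byte[] lenient(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Strict: throws CharacterCodingException instead of substituting.
    static byte[] strict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String name = "share\u20AC"; // U+20AC is above 255, so not in ISO-8859-1
        byte[] bytes = lenient(name);
        System.out.println("lenient last byte: 0x"
                + Integer.toHexString(bytes[bytes.length - 1] & 0xFF));
        try {
            strict(name);
            System.out.println("strict encoding accepted the string");
        } catch (CharacterCodingException e) {
            System.out.println("strict encoding rejected the string: " + e);
        }
    }
}
```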
> As far as the actual code changes required for jCIFS, the only places
> that they should need to be applied would be at the point of conversion
> between a String and a byte[].  The most common would be something like:
> 
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes();
> 
> which would just need to be changed to:
> 
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes("ISO-8859-1");
> 
> Another instance would be:
> 
> byte[] myBytes;
> ...
> String myString = new String(myBytes);
> 
> which would be changed to:
> 
> byte[] myBytes;
> ...
> String myString = new String(myBytes, "ISO-8859-1");
> 
> 
> Eric
> 
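One reason ISO-8859-1 is a convenient choice for both directions is that it
maps every byte value 0-255 to the Unicode character with the same number,
so decoding and re-encoding does not lose anything.  A small sketch to
check that on any JVM, whatever its file.encoding; the class and method
names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    // Every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] raw = new byte[256];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) i;
        }
        return raw;
    }

    // Decode and re-encode with an explicit charset, then compare with the input.
    static boolean roundTrips(byte[] raw) {
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        return Arrays.equals(raw, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round trip lossless: " + roundTrips(allByteValues()));
    }
}
```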
>	I think you're right. In fact I have completely done this for the
>	netbios, util, and http packages. The alternative is to use char[]
>	which are 16 bit UCS codes and therefore do not need further
>	manipulation. There is a little bit of code that does this. The
>	jcifs.util.Log class and its extensions will need to be converted to
>	use char[] though. In most cases however it is indeed just a matter of
>	specifying an encoding. The encodings can differ however. In some cases
>	you know what the encoding is, like when I copy in the
>	jcifs/http/ne.css style sheet. This is ISO-8859-1 (actually ASCII). The
>	proper Java encoding identifier for this is ISO8859_1 but I have seen
>	8859_1 in Java source too. Too bad they do not use the standard
>	identifiers. In other cases, like the jcifs/util/Base64.java class,
>	base64 encoding is a binary to ASCII converter so I can use "ASCII".
>	Elsewhere however the encoding should be the CIFS 8 bit encoding (which
>	is referred to in the docs as ASCII but it could be any 8 bit codepage
>	really) which means we need to use a jcifs.encoding property. In most
>	cases this is ISO8859_1 but since this will deprecate the
>	jcifs.smb.client.codepage property it could be other encodings as well.

I wouldn't convert everything to char arrays; Strings will do just fine.
It's just that the code generating packets from those chars or Strings has
to be changed.  A lot of the jCIFS code takes this into consideration; some
of it doesn't.
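To make the jcifs.encoding idea concrete, here is one way the String/byte
conversion points could go through a single helper.  This is only a sketch
under assumptions: WireEncoding, toWire and fromWire are hypothetical names,
and System.getProperty stands in for however jCIFS ends up reading its
configuration.  The ISO8859_1 default follows the discussion above.

```java
import java.io.UnsupportedEncodingException;

final class WireEncoding {
    // Hypothetical lookup: use the jcifs.encoding property if set,
    // otherwise fall back to ISO8859_1.
    static String encodingName() {
        return System.getProperty("jcifs.encoding", "ISO8859_1");
    }

    // String to bytes for an outgoing packet, never using the platform default.
    static byte[] toWire(String s) throws UnsupportedEncodingException {
        return s.getBytes(encodingName());
    }

    // Bytes from an incoming packet back to a String with the same encoding.
    static String fromWire(byte[] buf, int off, int len)
            throws UnsupportedEncodingException {
        return new String(buf, off, len, encodingName());
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = toWire("NT LM 0.12");
        System.out.println(wire.length + " bytes using " + encodingName());
        System.out.println(fromWire(wire, 0, wire.length));
    }
}
```

With something like this, a run on OS/390 would produce the same packet
bytes as a run on an ASCII machine, since file.encoding never enters into
it.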