[jcifs] Character Encoding Problems on OS/390

Fri Oct 25 13:04:35 EST 2002

> > > 	Or you mean use char[] for all array work and then at the last minute
> > > 	create a String from it and do getBytes( "ISO-8859-1" ). I still don't
> > > 	understand where the UTF-8 comes in though. Also you sound like you know
> > > 	of all of the locations in the code where these changes would need to
> > > 	occur but almost all operations are byte oriented. Can you give me a few
> > > 	example locations?
> > > 
> 
> Using char arrays would have the same effect as using Strings.  Somewhere
> down the line those chars have to be converted into bytes.  And yes, that's
> where the right encoding has to be specified.  An example is
> jcifs.smb.SmbComNegotiate.  It has the following declaration...
> 
>     static final byte[] dialects = new String(
>         '\2' + "NT LM 0.12" + '\0' ).getBytes();
> 
> As you can imagine, the bytes generated from this on an EBCDIC machine are
> going to be completely different than those from an ASCII machine.  This
> class is actually the first one in my path of failure, as you can tell from
> the difference in the hex dumps between the two messages:
> 
> This one fails...
> Oct 23 12:06:17.969 - datagram packet sent to: 148.162.36.216
> 00000: 00 01 00 00 00 01 00 00 00 00 00 00 20 43 4B 41  |............ CKA|
> 00010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
> 00020: 41 41 41 41 41 41 41 41 41 41 41 41 41 00 00 21  |AAAAAAAAAAAAA..!|
> 00030: 00 01                                            |..              |
> 
> This one works and comes from the exact same program, just a different
> file.encoding property setting...
> Oct 23 12:10:27.996 - smb sent
> 00000: FF 53 4D 42 72 00 00 00 00 18 01 80 00 00 00 00  |.SMBr...........|
> 00010: 00 00 00 00 00 00 00 00 00 00 7F 89 00 00 01 00  |................|
> 00020: 00 0C 00 02 4E 54 20 4C 4D 20 30 2E 31 32 00     |....NT LM 0.12. |
> 
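The platform dependence described above can be reproduced on an ASCII machine by asking for an EBCDIC charset explicitly. The following is only a sketch: the class name DialectBytesDemo is made up, IBM037 is just one EBCDIC code page picked for illustration (the OS/390 default may differ and the charset may be missing from a trimmed-down JDK), and StandardCharsets is from Java 7, which postdates this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DialectBytesDemo {
    // Same text as the dialect string quoted above.
    static final String DIALECT = "\u0002NT LM 0.12\u0000";

    // Render bytes as space-separated hex, like the dumps above.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Explicit charset: the result is the same on every platform.
        System.out.println("US-ASCII: "
                + hex(DIALECT.getBytes(StandardCharsets.US_ASCII)));

        // IBM037 is an assumption for illustration; skip it if unavailable.
        if (Charset.isSupported("IBM037")) {
            System.out.println("IBM037:   "
                    + hex(DIALECT.getBytes(Charset.forName("IBM037"))));
        }

        // No charset argument: depends on the platform default (file.encoding).
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + hex(DIALECT.getBytes()));
    }
}
```

The US-ASCII line should match the tail of the working smb dump (02 4E 54 20 4C 4D 20 30 2E 31 32 00); the other lines depend on which charsets the JVM provides.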
	What I did with this is to make it a full-fledged String:

	    static final String DIALECTS = "\u0002NT LM 0.12\u0000";

	and then encode it like:

	    int writeBytesWireFormat( byte[] dst, int dstIndex ) {
	        byte[] dialects;
	        try {
	            dialects = DIALECTS.getBytes( "ASCII" );
	        } catch( UnsupportedEncodingException uee ) { 
	            return 0;
	        }
	        System.arraycopy( dialects, 0, dst, dstIndex, dialects.length );
	        return dialects.length;
	    }

	I can use ASCII here because I know this character sequence is *always* the
	same. That UnsupportedEncodingException is a PITA though. I've got try {}
	catch all over the place now. I think maybe my WireFormat methods should
	throw UnsupportedEncodingException or a derivative of it.
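For what it is worth, one way to avoid the checked exception is to encode the constant once with a Charset object instead of a charset name. This is a sketch only, not the jCIFS code: NegotiateSketch and DIALECT_BYTES are hypothetical names, and getBytes(Charset) with StandardCharsets needs Java 7 or later, so it was not an option when this was written.

```java
import java.nio.charset.StandardCharsets;

final class NegotiateSketch {
    static final String DIALECTS = "\u0002NT LM 0.12\u0000";

    // Encoded once at class load. getBytes(Charset) declares no checked
    // exception, so no try/catch is needed at the call sites.
    static final byte[] DIALECT_BYTES =
            DIALECTS.getBytes(StandardCharsets.US_ASCII);

    // Copies the pre-encoded dialect bytes into dst and returns the count.
    static int writeBytesWireFormat(byte[] dst, int dstIndex) {
        System.arraycopy(DIALECT_BYTES, 0, dst, dstIndex, DIALECT_BYTES.length);
        return DIALECT_BYTES.length;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[32];
        int n = writeBytesWireFormat(buf, 0);
        System.out.println("wrote " + n + " bytes");
    }
}
```

On older JVMs a similar effect could be had by doing the getBytes( "ASCII" ) call in a static initializer, so the try/catch lives in one place.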

> > 
> > The UTFs (Unicode Transformation Formats) are means of representing
> > Unicode characters; in UTF-8's case, 0-127 are represented as a single
> > byte (same as ASCII).  For characters above 127, UTF-8 is a
> > multibyte representation.  I believe ISO-8859-1 is incapable of
> > representing 256+.
> > 
> >	I'm familiar with the encoding. I just am not clear on what it has to
> >	do with Ted's problem.
> 
> I was just saying UTF-8 where I should have been saying ISO-8859-1.
> 
	Good, because multibyte sequences of UTF-8 will absolutely not work with CIFS.
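To make the multibyte point concrete, here is a small sketch comparing how many bytes the same String occupies in an 8 bit encoding versus UTF-8. The class and method names are made up for the example, and it uses the Java 7+ StandardCharsets constants for brevity.

```java
import java.nio.charset.StandardCharsets;

final class ByteLengthDemo {
    // Number of bytes the string occupies in ISO-8859-1 (one per char here).
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String name = "caf\u00E9"; // 'e' with acute accent, U+00E9
        System.out.println("ISO-8859-1: " + latin1Length(name) + " bytes");
        System.out.println("UTF-8:      " + utf8Length(name) + " bytes");
    }
}
```

The accented character takes one byte in ISO-8859-1 but two in UTF-8, so any code that assumes one byte per character miscounts lengths and offsets as soon as a non-ASCII character shows up.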

> > A good explanation of the various character sets can be found here:
> > 
> > http://www.czyborra.com/utf/
> > http://czyborra.com/charsets/iso8859.html
> > 
> > as well as a brief discussion of EBCDIC, the issue at hand:
> > 
> > http://czyborra.com/charsets/iso646.html
> > 
> > As far as jCIFS is concerned, it probably doesn't matter which encoding
> > you use; a String containing characters over 255 would be encoded as
> > multiple bytes using UTF-8, which (I'm guessing) would be meaningless to
> > jCIFS.
> > 
> >	Java's internal encoding is UCS-2 not UTF-8 but the encoding doesn't
> >	really matter because the String class abstracts this fact.
> >
> 
> Right, we never care about the internal encoding, just what's coming in and
> out.
> 
> >   Characters over 255 can't be represented using ISO-8859-1, and
> > the behavior in this case is unspecified (according to the String
> > Javadocs).  So either way, you'll probably get garbage with any input
> > characters over 255, which isn't really an issue unless the underlying
> > network protocol has specified a means of handling it (in which case
> > you would use the specified encoding).
> > 
> > As far as the actual code changes required for jCIFS, the only places
> > that they should need to be applied would be at the point of conversion
> > between a String and a byte[].  The most common would be something like:
> > 
> > String myString = "hello there.";
> > byte[] myBytes = myString.getBytes();
> > 
> > which would just need to be changed to:
> > 
> > String myString = "hello there.";
> > byte[] myBytes = myString.getBytes("ISO-8859-1");
> > 
> > Another instance would be:
> > 
> > byte[] myBytes;
> > ...
> > String myString = new String(myBytes);
> > 
> > which would be changed to:
> > 
> > byte[] myBytes;
> > ...
> > String myString = new String(myBytes, "ISO-8859-1");
> > 
> > 
> > Eric
> > 
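The two conversions suggested above can be checked with a round trip. This is a sketch with made-up names (Latin1RoundTrip, encode, decode); it uses the Charset overloads from later Java versions, for which the handling of unmappable characters is documented (they are replaced with the charset's replacement byte), unlike the charset-name overloads discussed above.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // String -> bytes with an explicit 8 bit encoding.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // bytes -> String with the same explicit encoding.
    static String decode(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ok = "hello there.";
        System.out.println("round trip intact: " + decode(encode(ok)).equals(ok));

        // U+20AC (euro sign) is above 255 and has no ISO-8859-1 mapping.
        byte[] lossy = encode("\u20AC");
        System.out.println("euro sign became 0x"
                + Integer.toHexString(lossy[0] & 0xFF));
    }
}
```

Every byte value 0-255 decodes to exactly one char and back in ISO-8859-1, which is why it is convenient for byte-oriented protocol data; anything above 255 is lost on the way out.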
> >	I think you're right. In fact I have completely done this for the
> >	netbios, util, and http packages. The alternative is to use char[],
> >	which are 16 bit UCS codes and therefore do not need further
> >	manipulation. There is a little bit of code that does this. The
> >	jcifs.util.Log class and its extensions will need to be converted to
> >	use char[] though. In most cases however it is indeed just a matter of
> >	specifying an encoding. The encodings can differ however. In some cases
> >	you know what the encoding is, like when I copy in the
> >	jcifs/http/ne.css style sheet. This is ISO-8859-1 (actually ASCII). The
> >	proper Java encoding identifier for this is ISO8859_1 but I have seen
> >	8859_1 in Java source too. Too bad they do not use the standard
> >	identifiers. In other cases, like the jcifs/util/Base64.java class,
> >	base64 encoding is a binary to ASCII converter so I can use "ASCII".
> >	Elsewhere however the encoding should be the CIFS 8 bit encoding (which
> >	is referred to in the docs as ASCII but it could be any 8 bit codepage
> >	really) which means we need to use a jcifs.encoding property. In most
> >	cases this is ISO8859_1 but since this will deprecate the
> >	jcifs.smb.client.codepage property it could be other encodings as well.
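A configurable encoding along those lines might be resolved roughly as below. This is a guess at the shape, not the jCIFS implementation: only the property name jcifs.encoding comes from the message above, while EncodingConfig, resolve, the use of System.getProperty, and the ISO8859_1 fallback are assumptions for the sketch. It also shows that the JDK treats ISO8859_1 and 8859_1 as aliases of the standard name.

```java
import java.nio.charset.Charset;

final class EncodingConfig {
    // Fallback when nothing is configured (assumed default for this sketch).
    static final String DEFAULT_ENCODING = "ISO8859_1";

    // Resolve a configured encoding name, falling back to the default.
    // Charset.forName throws an unchecked exception for unknown names.
    static Charset resolve(String configured) {
        String name = (configured == null) ? DEFAULT_ENCODING : configured;
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        // Reading the property from system properties is an assumption.
        Charset cs = resolve(System.getProperty("jcifs.encoding"));
        System.out.println("using " + cs.name());

        // The historical Java identifiers resolve to the same charset.
        System.out.println(Charset.forName("8859_1").name());
        System.out.println(Charset.forName("ISO8859_1").name());
    }
}
```

Resolving the name once and passing the Charset around keeps the lookup (and any failure for a bad name) in a single place instead of at every String to byte[] conversion.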
> 
> 
> I wouldn't convert everything to char arrays, Strings will do just fine.
> It's just that the code generating packets from those chars or Strings has
> to be changed.  A lot of the jCIFS code takes this into consideration, some
> of it doesn't.
> 
	Char arrays are only used where character for character string manipulation
	is necessary. I believe I have successfully converted all the code. After
	looking at the Log.printHexDump code it looks like it might actually be ok
	because it uses char[]. However it is difficult to tell how successfully this will
	work without testing it on a platform like yours. Can you test? Not just the
	NtlmHttpFilter but can you run examples/Torture1 and examples/FileOps?