[jcifs] Character Encoding Problems on OS/390
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Oct 25 13:04:35 EST 2002
> > > Or you mean use char[] for all array work and then at the last
> minute create a
> > > String from it and do getBytes( "ISO-8859-1" ). I still don't
> understand where the
> > > UTF-8 comes in though. Also you sound like you know of all of
> the locations in
> > > the code where these changes would need to occur but almost all
> operations
> > > are byte oriented. Can you give me a few example locations?
> > >
>
> Using char arrays would have the same effect as using Strings. Somewhere
> down the line those chars have to be converted into bytes. And yes, that's
> where the right encoding has to be specified. An example is
> jcifs.smb.SmbComNegotiate. It has the following declaration...
>
> static final byte[] dialects = new String(
> '\2' + "NT LM 0.12" + '\0' ).getBytes();
>
> As you can imagine, the bytes generated from this on an EBCDIC machine are
> going to be completely different than those from an ASCII machine. This
> class is actually the first one in my path of failure, as you can tell from
> the difference in the hex dumps between the two messages:
>
> This one fails...
> Oct 23 12:06:17.969 - datagram packet sent to: 148.162.36.216
> 00000: 00 01 00 00 00 01 00 00 00 00 00 00 20 43 4B 41 |............ CKA|
> 00010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
> 00020: 41 41 41 41 41 41 41 41 41 41 41 41 41 00 00 21 |AAAAAAAAAAAAA..!|
> 00030: 00 01 |.. |
>
> This one works and comes from the exact same program, just a different
> file.encoding property setting...
> Oct 23 12:10:27.996 - smb sent
> 00000: FF 53 4D 42 72 00 00 00 00 18 01 80 00 00 00 00 |.SMBr...........|
> 00010: 00 00 00 00 00 00 00 00 00 00 7F 89 00 00 01 00 |................|
> 00020: 00 0C 00 02 4E 54 20 4C 4D 20 30 2E 31 32 00 |....NT LM 0.12. |
>
What I did with this is to make it a full-fledged String:
static final String DIALECTS = "\u0002NT LM 0.12\u0000";
and then encode it like:
int writeBytesWireFormat( byte[] dst, int dstIndex ) {
    byte[] dialects;
    try {
        dialects = DIALECTS.getBytes( "ASCII" );
    } catch( UnsupportedEncodingException uee ) {
        return 0;
    }
    System.arraycopy( dialects, 0, dst, dstIndex, dialects.length );
    return dialects.length;
}
I can use ASCII here because I know this character sequence is *always* the same. That UnsupportedEncodingException is a PITA though. I've got try {} catch all over the place now.
I think maybe my WireFormat methods should throw UnsupportedEncodingException or a
derivative of it.
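
One way to keep the try/catch blocks from spreading is to do the conversion in a single helper that rethrows the checked exception as an unchecked one. This is only a sketch: the WireEncoding class and its encode method are hypothetical names, not part of jCIFS, and treating a missing encoding as a fatal IllegalStateException is an assumption about what the caller wants.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, not part of jCIFS.
final class WireEncoding {

    static final String DIALECTS = "\u0002NT LM 0.12\u0000";

    // Encode a String with a named charset, turning the checked
    // UnsupportedEncodingException into an unchecked exception so that
    // callers do not each need their own try/catch.
    static byte[] encode(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalStateException(
                    "encoding not supported: " + charsetName, uee);
        }
    }

    // Same shape as the method above, minus the local try/catch.
    static int writeBytesWireFormat(byte[] dst, int dstIndex) {
        byte[] dialects = encode(DIALECTS, "US-ASCII");
        System.arraycopy(dialects, 0, dst, dstIndex, dialects.length);
        return dialects.length;
    }
}
```

On Java 7 and later, getBytes( StandardCharsets.US_ASCII ) takes a Charset object and declares no checked exception at all, which removes the need for the wrapper.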
> >
> > The UTFs (Unicode Transformation Formats) are means of representing
> > Unicode characters; in UTF-8's case, 0-127 are represented as a single
> > byte (same as ASCII and ISO-8859-1). For characters above 127, UTF-8 is a
> > multibyte representation. I believe ISO-8859-1 is incapable of
> > representing 256+.
> >
> > I'm familiar with the encoding. I just am not clear on what it
> >has to do with Ted's
> > problem.
>
> I was just saying UTF-8 where I should have been saying ISO-8859-1.
>
Good, because multibyte sequences of UTF-8 will absolutely not work with CIFS.
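
To make the difference concrete, here is a small sketch comparing how many bytes each encoding produces for the same String. The EncodingWidths class is a hypothetical name used only for illustration, and it relies on java.nio.charset.StandardCharsets (Java 7 and later). Note that UTF-8 already needs two bytes for any character above 127, while ISO-8859-1 stays at one byte per character up to 255.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of jCIFS.
final class EncodingWidths {

    // Number of bytes the String occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the String occupies when encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String plain = "A";          // U+0041, in the ASCII range
        String accented = "\u00E9";  // U+00E9, e with acute accent
        System.out.println("A:      UTF-8=" + utf8Length(plain)
                + " ISO-8859-1=" + latin1Length(plain));
        System.out.println("U+00E9: UTF-8=" + utf8Length(accented)
                + " ISO-8859-1=" + latin1Length(accented));
    }
}
```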
> > A good explanation of the various character sets can be found here:
> >
> > http://www.czyborra.com/utf/
> > http://czyborra.com/charsets/iso8859.html
> >
> > as well as a brief discussion of EBCDIC, the issue at hand:
> >
> > http://czyborra.com/charsets/iso646.html
> >
> > As far as jCIFS is concerned, it probably doesn't matter which
> encoding
> > you use; a String containing characters over 255 would be encoded as
> > multiple bytes using UTF-8, which (I'm guessing) would be meaningless
> to
> > jCIFS.
> >
> > Java's internal encoding is UCS-2 not UTF-8 but the encoding
> >doesn't
> > really matter because the String class abstracts this fact.
> >
>
> Right, we never care about the internal encoding, just what's coming in and
> out.
>
> > Characters over 255 can't be represented using ISO-8859-1, and
> > the behavior in this case is unspecified (according to the String
> > Javadocs). So either way, you'll probably get garbage with any input
> > characters over 255, which isn't really an issue unless the underlying
> > network protocol has specified a means of handling it (in which case
> you
> > would use the specified encoding).
> >
> > As far as the actual code changes required for jCIFS, the only places
> > that they should need to be applied would be at the point of
> conversion
> > between a String and a byte[]. The most common would be something
> like:
> >
> > String myString = "hello there.";
> > byte[] myBytes = myString.getBytes();
> >
> > which would just need to be changed to:
> >
> > String myString = "hello there.";
> > byte[] myBytes = myString.getBytes("ISO-8859-1");
> >
> > Another instance would be:
> >
> > byte[] myBytes;
> > ...
> > String myString = new String(myBytes);
> >
> > which would be changed to:
> >
> > byte[] myBytes;
> > ...
> > String myString = new String(myBytes, "ISO-8859-1");
> >
> >
> > Eric
> >
> > I think you're right. In fact I have completely done this for the
> >netbios, util,
> > and http packages. The alternative is to use char[] which are 16
> >bit UCS
> > codes and therefore do not need further manipulation. There is a
> >little bit
> > of code that does this. The jcifs.util.Log class and its
> >extensions will
> > need to be converted to use char[] though. In most cases however
> >it is
> > indeed just a matter of specifying an encoding. The encodings
> >can differ
> > however. In some cases you know what the encoding is like when I
> > copy in the jcifs/http/ne.css style sheet. This is ISO-8859-1
> >(actually
> > ASCII). The proper Java encoding identifier for this is
> >ISO8859_1 but I
> > have seen 8859_1 in Java source too. Too bad they do not use the
> > standard identifiers. In other cases like the
> >jcifs/util/Base64.java class
> > base64 encoding is a binary to ASCII converter so I can use
> >"ASCII".
> > Elsewhere however the encoding should be the CIFS 8 bit encoding
> > (which is referred to in the docs as ASCII but it could be any 8 bit
> > codepage really) which means we need to use a jcifs.encoding
> >property.
> > In most cases this is ISO8859_1 but since this will deprecate
> >the
> > jcifs.smb.client.codepage property it could be other encodings
> >as well.
>
>
> I wouldn't convert everything to char arrays, Strings will do just fine.
> It's just the code generating packets from those chars or Strings that has to be
> changed. A lot of the jCIFS code takes this into consideration, some of it
> doesn't.
>
Char arrays are only used where character-by-character string manipulation
is necessary. I believe I have successfully converted all the code. After
looking at the Log.printHexDump code it looks like it might actually be ok
because it uses char[]. However it is difficult to tell how successfully this will
work without testing it on a platform like yours. Can you test? Not just the
NtlmHttpFilter but can you run examples/Torture1 and examples/FileOps?