[jcifs] Character Encoding Problems on OS/390
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Oct 25 10:13:59 EST 2002
> -----Original Message-----
> From: Eric Glass [SMTP:eglass1 at attbi.com]
> Sent: Thursday, October 24, 2002 8:04 PM
> To: Allen, Michael B (RSCH)
> Cc: 'Bigham, Ted'; 'jcifs at samba.org'
> Subject: RE: [jcifs] Character Encoding Problems on OS/390
>
> On Thu, 2002-10-24 at 18:01, Allen, Michael B (RSCH) wrote:
> >
> > Or do you mean use char[] for all array work and then at the last minute create a
> > String from it and call getBytes( "ISO-8859-1" )? I still don't understand where the
> > UTF-8 comes in though. Also, you sound like you know all of the locations in
> > the code where these changes would need to occur, but almost all operations
> > are byte oriented. Can you give me a few example locations?
> >
>
> The UTFs (Unicode Transformation Formats) are means of representing
> Unicode characters; in UTF-8's case, 0-127 are represented as a single
> byte (same as ASCII and ISO-8859-1). For characters above 127, UTF-8 is a
> multibyte representation. I believe ISO-8859-1 is incapable of
> representing 256+.
>
I'm familiar with the encoding. I'm just not clear on what it has to do with Ted's
problem.
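
For reference, here is a rough sketch of how many bytes each encoding
produces for a few sample characters. It assumes a modern JDK with
java.nio.charset.StandardCharsets; the class and method names are made up
for illustration and are not part of jCIFS.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jCIFS: compares encoded lengths.
class EncodingWidths {

    // Number of bytes needed to encode s as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes produced when encoding s as ISO-8859-1.
    // Characters the charset cannot map are replaced by a single '?' byte.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String ascii = "A";        // U+0041, below 128
        String eAcute = "\u00E9";  // U+00E9, between 128 and 255
        String euro = "\u20AC";    // U+20AC, above 255

        System.out.println(utf8Length(ascii) + " " + latin1Length(ascii));   // 1 1
        System.out.println(utf8Length(eAcute) + " " + latin1Length(eAcute)); // 2 1
        System.out.println(utf8Length(euro) + " " + latin1Length(euro));     // 3 1
    }
}
```

Note that the last ISO-8859-1 result is one byte only because the euro sign
is silently replaced, not because it can be represented.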
> A good explanation of the various character sets can be found here:
>
> http://www.czyborra.com/utf/
> http://czyborra.com/charsets/iso8859.html
>
> as well as a brief discussion of EBCDIC, the issue at hand:
>
> http://czyborra.com/charsets/iso646.html
>
> As far as jCIFS is concerned, it probably doesn't matter which encoding
> you use; a String containing characters over 255 would be encoded as
> multiple bytes using UTF-8, which (I'm guessing) would be meaningless to
> jCIFS.
>
Java's internal encoding is UCS-2, not UTF-8, but the internal encoding doesn't
really matter because the String class abstracts this fact.
> Characters over 255 can't be represented using ISO-8859-1, and
> the behavior in this case is unspecified (according to the String
> Javadocs). So either way, you'll probably get garbage with any input
> characters over 255, which isn't really an issue unless the underlying
> network protocol has specified a means of handling it (in which case you
> would use the specified encoding).
>
> As far as the actual code changes required for jCIFS, the only places
> that they should need to be applied would be at the point of conversion
> between a String and a byte[]. The most common would be something like:
>
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes();
>
> which would just need to be changed to:
>
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes("ISO-8859-1");
>
> Another instance would be:
>
> byte[] myBytes;
> ...
> String myString = new String(myBytes);
>
> which would be changed to:
>
> byte[] myBytes;
> ...
> String myString = new String(myBytes, "ISO-8859-1");
>
>
> Eric
>
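
A small self-contained sketch of the same idea, for anyone who wants to
see the effect. It assumes a modern JDK, and it assumes that the platform
default on OS/390 is an EBCDIC charset such as IBM1047 (I have not
verified that on Ted's machine). The class and method names are
hypothetical, not jCIFS code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of jCIFS.
class ExplicitEncoding {

    // String -> bytes with a fixed encoding, independent of the platform default.
    static byte[] toWire(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // bytes -> String with the same fixed encoding.
    static String fromWire(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "hello there.";
        byte[] wire = toWire(s);

        // 'h' is 0x68 in ISO-8859-1 on every platform.
        System.out.println(Integer.toHexString(wire[0] & 0xFF));
        System.out.println(fromWire(wire).equals(s));

        // Assumption: IBM1047 stands in for an EBCDIC platform default.
        // It is only present if the JRE ships the extended charsets.
        if (Charset.isSupported("IBM1047")) {
            byte[] ebcdic = s.getBytes(Charset.forName("IBM1047"));
            // The first byte differs from the ISO-8859-1 value above.
            System.out.println(Integer.toHexString(ebcdic[0] & 0xFF));
        }
    }
}
```

The no-argument getBytes() and new String(byte[]) behave like the EBCDIC
branch whenever the platform default is an EBCDIC charset, which would be
consistent with what Ted is seeing.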
I think you're right. In fact I have completely done this for the netbios, util,
and http packages. The alternative is to use char[], which holds 16 bit UCS
codes and therefore does not need further manipulation. There is a little bit
of code that does this. The jcifs.util.Log class and its extensions will
need to be converted to use char[] though. In most cases, however, it is
indeed just a matter of specifying an encoding. The encodings can differ
though. In some cases you know what the encoding is, like when I
copy in the jcifs/http/ne.css style sheet. This is ISO-8859-1 (actually
ASCII). The proper Java encoding identifier for this is ISO8859_1 but I
have seen 8859_1 in Java source too. Too bad they do not use the
standard identifiers. In other cases, like the jcifs/util/Base64.java class,
base64 encoding is a binary to ASCII converter so I can use "ASCII".
Elsewhere, however, the encoding should be the CIFS 8 bit encoding
(which is referred to in the docs as ASCII but it could be any 8 bit
codepage really), which means we need to use a jcifs.encoding property.
In most cases this is ISO8859_1, but since this will deprecate the
jcifs.smb.client.codepage property it could be other encodings as well.
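
Roughly what I have in mind for the property lookup is sketched below.
Only the jcifs.encoding property name and the ISO8859_1 default come from
the discussion above; the class, the method names, and the use of
java.nio.charset.Charset (which needs a reasonably modern JDK) are
assumptions for illustration and not actual jCIFS code.

```java
import java.nio.charset.Charset;

// Hypothetical sketch, not actual jCIFS code.
class CifsEncoding {

    // Assumed default when the jcifs.encoding property is not set.
    static final String DEFAULT_ENCODING = "ISO8859_1";

    // Looks up the encoding name from the system property and resolves it.
    // Charset.forName throws an unchecked exception for an unknown name.
    static Charset resolve() {
        String name = System.getProperty("jcifs.encoding", DEFAULT_ENCODING);
        return Charset.forName(name);
    }

    // Encodes a String for the wire using the configured 8 bit encoding.
    static byte[] encode(String s) {
        return s.getBytes(resolve());
    }

    public static void main(String[] args) {
        // Canonical name of whatever encoding is configured.
        System.out.println(resolve().name());

        // Both historical Java identifiers are aliases of the same charset.
        Charset a = Charset.forName("ISO8859_1");
        Charset b = Charset.forName("8859_1");
        System.out.println(a.equals(b));  // true
        System.out.println(a.name());     // ISO-8859-1
    }
}
```

Running with -Djcifs.encoding=Cp850, for example, would switch every
encode() call to that codepage, provided the JRE supports it.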