[jcifs] Character Encoding Problems on OS/390

Allen, Michael B (RSCH) Michael_B_Allen at ml.com
Fri Oct 25 10:13:59 EST 2002


> -----Original Message-----
> From:	Eric Glass [SMTP:eglass1 at attbi.com]
> Sent:	Thursday, October 24, 2002 8:04 PM
> To:	Allen, Michael B (RSCH)
> Cc:	'Bigham, Ted'; 'jcifs at samba.org'
> Subject:	RE: [jcifs] Character Encoding Problems on OS/390
> 
> On Thu, 2002-10-24 at 18:01, Allen, Michael B (RSCH) wrote:
> > 
> >
> > > 
> > 	Or do you mean use char[] for all array work and then, at the last minute, create a
> > 	String from it and call getBytes( "ISO-8859-1" )? I still don't understand where
> > 	UTF-8 comes in, though. Also, you sound like you know all of the locations in
> > 	the code where these changes would need to occur, but almost all operations
> > 	are byte oriented. Can you give me a few example locations?
> > 
> 
> The UTFs (Unicode Transformation Formats) are means of representing
> Unicode characters; in UTF-8's case, 0-127 are represented as a single
> byte (same as ASCII and ISO-8859-1).  For characters above 127, UTF-8 is a
> multibyte representation.  ISO-8859-1 uses one byte per character and is
> incapable of representing 256+.
> 
	I'm familiar with the encoding. I'm just not clear on what it has to do with Ted's
	problem.
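
For reference, the byte counts involved can be checked directly. This is a
minimal sketch, assuming a JDK recent enough to provide
java.nio.charset.StandardCharsets; the class and method names are made up
for illustration and are not part of jCIFS.

```java
import java.nio.charset.StandardCharsets;

final class Utf8VersusLatin1 {
    // Number of bytes needed to encode s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode s in ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String ascii = "A";        // U+0041, inside 0-127
        String eAcute = "\u00E9";  // U+00E9, inside 128-255
        String euro = "\u20AC";    // U+20AC, above 255

        System.out.println(utf8Length(ascii) + " / " + latin1Length(ascii));   // 1 / 1
        System.out.println(utf8Length(eAcute) + " / " + latin1Length(eAcute)); // 2 / 1
        System.out.println(utf8Length(euro));                                  // 3
    }
}
```

The two encodings only agree byte for byte on 0-127; from 128 upward UTF-8
needs more than one byte per character.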

> A good explanation of the various character sets can be found here:
> 
> http://www.czyborra.com/utf/
> http://czyborra.com/charsets/iso8859.html
> 
> as well as a brief discussion of EBCDIC, the issue at hand:
> 
> http://czyborra.com/charsets/iso646.html
> 
> As far as jCIFS is concerned, it probably doesn't matter which encoding
> you use; a String containing characters over 255 would be encoded as
> multiple bytes using UTF-8, which (I'm guessing) would be meaningless to
> jCIFS.
> 
	Java's internal encoding is UCS-2, not UTF-8, but the internal encoding doesn't
	really matter because the String class hides it.

>   Characters over 255 can't be represented using ISO-8859-1, and
> the behavior in this case is unspecified (according to the String
> Javadocs).  So either way, you'll probably get garbage with any input
> characters over 255, which isn't really an issue unless the underlying
> network protocol has specified a means of handling it (in which case you
> would use the specified encoding).
> 
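
Since the result for characters above 255 is unspecified, one option is to
test a String before converting it. The following is a rough sketch only,
assuming the java.nio.charset API is available; fitsLatin1 and encodeStrict
are hypothetical helper names, not jCIFS methods.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1Check {
    // True if every character of s has an ISO-8859-1 representation,
    // i.e. no character is above U+00FF.
    static boolean fitsLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    // Encode s as ISO-8859-1, refusing input that cannot be represented
    // rather than silently producing garbage bytes.
    static byte[] encodeStrict(String s) {
        if (!fitsLatin1(s)) {
            throw new IllegalArgumentException("not representable in ISO-8859-1");
        }
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("hello there.")); // true
        System.out.println(fitsLatin1("\u00FF"));       // true, 255 is the last code
        System.out.println(fitsLatin1("\u0100"));       // false, 256 is out of range
    }
}
```

Whether rejecting such input is the right policy depends on the caller; the
point is only that the problem can be detected before any bytes are produced.
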
> As far as the actual code changes required for jCIFS, the only places
> that they should need to be applied would be at the point of conversion
> between a String and a byte[].  The most common would be something like:
> 
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes();
> 
> which would just need to be changed to:
> 
> String myString = "hello there.";
> byte[] myBytes = myString.getBytes("ISO-8859-1");
> 
> Another instance would be:
> 
> byte[] myBytes;
> ...
> String myString = new String(myBytes);
> 
> which would be changed to:
> 
> byte[] myBytes;
> ...
> String myString = new String(myBytes, "ISO-8859-1");
> 
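
The reason the no-argument forms matter on OS/390 is that they use the
platform default encoding, which there is an EBCDIC code page. The sketch
below compares the bytes for the same String under ISO-8859-1 and under
IBM1047. IBM1047 is an assumption (the thread does not name the code page
in use) and it may not be installed in every JRE, so the code checks for it
first. The class and helper names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Decode and re-encode with ISO-8859-1. Each byte value 0-255 maps to the
    // char with the same numeric value, so the round trip loses nothing.
    static byte[] roundTrip(byte[] raw) {
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "hello";
        System.out.println("default:    " + Charset.defaultCharset());
        System.out.println("ISO-8859-1: " + hex(text.getBytes(StandardCharsets.ISO_8859_1)));
        // Only attempt the EBCDIC comparison if this JRE ships the code page.
        if (Charset.isSupported("IBM1047")) {
            System.out.println("IBM1047:    " + hex(text.getBytes(Charset.forName("IBM1047"))));
        }
    }
}
```

With an explicit encoding the bytes are the same on every platform, which is
what a byte-oriented wire protocol needs.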
> 
> Eric
> 
	I think you're right. In fact I have already done this throughout the netbios, util,
	and http packages. The alternative is to use char[], whose elements are 16 bit UCS
	codes and therefore need no further manipulation. There is a little bit
	of code that does this. The jcifs.util.Log class and its extensions will
	need to be converted to use char[], though. In most cases, however, it is
	indeed just a matter of specifying an encoding. The encodings can differ,
	however. In some cases you know what the encoding is, such as when I
	copy in the jcifs/http/ne.css style sheet. This is ISO-8859-1 (actually
	ASCII). The proper Java encoding identifier for this is ISO8859_1 but I
	have seen 8859_1 in Java source too. Too bad they do not use the
	standard identifiers. In other cases, like the jcifs/util/Base64.java class,
	base64 encoding is a binary to ASCII converter so I can use "ASCII".
	Elsewhere, however, the encoding should be the CIFS 8 bit encoding
	(which is referred to in the docs as ASCII but could be any 8 bit
	codepage really), which means we need to use a jcifs.encoding property.
	In most cases this is ISO8859_1, but since this will deprecate the
	jcifs.smb.client.codepage property it could be other encodings as well.
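
	Roughly, the property lookup could look like the sketch below. This is only
	an illustration: it reads from a plain java.util.Properties object rather
	than whatever configuration mechanism jCIFS ends up using, the class and
	method names are made up, and it assumes the java.nio.charset API is
	available. It also shows that ISO8859_1, 8859_1 and ISO-8859-1 resolve to
	the same charset there.

```java
import java.nio.charset.Charset;
import java.util.Properties;

final class EncodingProperty {
    static final String KEY = "jcifs.encoding";
    static final String DEFAULT = "ISO8859_1"; // assumed default, per the text above

    // Resolve the configured 8 bit encoding, falling back to ISO8859_1.
    // Charset.forName throws an unchecked exception for unknown names.
    static Charset configuredCharset(Properties props) {
        return Charset.forName(props.getProperty(KEY, DEFAULT));
    }

    public static void main(String[] args) {
        Charset cs = configuredCharset(System.getProperties());
        System.out.println("using " + cs.name());

        // The historical Java identifiers are aliases of the standard name.
        Charset standard = Charset.forName("ISO-8859-1");
        System.out.println(Charset.forName("ISO8859_1").equals(standard));
        System.out.println(Charset.forName("8859_1").equals(standard));
    }
}
```

	Running with -Djcifs.encoding=US-ASCII (or any other installed encoding
	name) would switch the charset without touching the conversion code.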




