[jcifs] problem encoding

Fri Jan 17 13:27:55 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> Sent:	Thursday, January 16, 2003 4:45 PM
> To:	Michael B. Allen
> Cc:	andrea.lanza at frameweb.it; jcifs at samba.org
> Subject:	Re: [jcifs] problem encoding
> 
> Okay, spent a few minutes on #samba-technical and got some answers on 
> this.
> 
> I was probably incorrect in connecting the earlier discussion to the
> problem that Andrea reported.
> 
> That aside, there is still a problem.
> - If the client and server negotiate Unicode then everything works just 
>   fine.
> 
	Andrea's problem has nothing to do with SMB, jCIFS, or SMB URLs. He is
	trying to make an *HTTP URL* that encodes an SMB path in the
	PATH_INFO area like:

	  http://httpserver/servlet/JcifsServlet/this/is/path/info.pdf

	Inside the servlet you do request.getPathInfo() and you get:

	  /this/is/path/info.pdf

	This technique has been around since the early CGI days and has to do with
	how browsers (MSIE) handle MIME types. If you try to pass a path as a
	QUERY_STRING parameter, IE will ignore the .pdf. He is talking about using
	this getPathInfo() string with jCIFS to get a file on an SMB file server. The
	problem he is having is that extended characters in the path are, for whatever
	reason, not being escaped/unescaped correctly. I don't know the details,
	but the important thing is that this not be confused with the ongoing SMB
	URL escaping discussion.
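
To illustrate how extended characters in a path can go wrong, here is a minimal sketch of percent-decoding with an explicit charset. The class and method names (PathInfoDecode, percentDecode) are hypothetical, and it assumes the path still carries raw %XX escapes; whether a given servlet container has already decoded the path info, and with which charset, is not something this thread establishes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PathInfoDecode {
    // Hypothetical helper: turn %XX escapes into bytes, then interpret
    // the whole byte sequence in the given charset.
    static String percentDecode(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
            }
            // Literal character: copy it through as bytes in the same charset.
            int cp = s.codePointAt(i);
            byte[] raw = new String(Character.toChars(cp)).getBytes(cs);
            out.write(raw, 0, raw.length);
            i += Character.charCount(cp);
        }
        return new String(out.toByteArray(), cs);
    }

    public static void main(String[] args) {
        String escaped = "/share/caf%C3%A9.pdf"; // e-acute escaped as UTF-8
        System.out.println(percentDecode(escaped, StandardCharsets.UTF_8));
        // Same escapes read as ISO-8859-1 yield two wrong characters instead of one.
        System.out.println(percentDecode(escaped, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the sketch is only that the escaping side and the unescaping side have to agree on the charset; if they do not, each extended character comes out as something else.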

> - If either side is unable to handle Unicode, then they both must be using
>   the same 8-bit encoding (same DOS OEM codepage) or anything above ASCII
>   127 is at risk for being mapped incorrectly.
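
The codepage mismatch described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes the JVM ships the IBM437 and IBM850 charsets, which Java does not guarantee (the code checks for them first).

```java
import java.nio.charset.Charset;

public class CodepageMismatch {
    // Decode one byte value under the named charset.
    static String decodeByte(int b, String charsetName) {
        return new String(new byte[] { (byte) b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] pages = { "IBM437", "IBM850" };
        for (String p : pages) {
            if (!Charset.isSupported(p)) {
                System.out.println(p + " is not available in this JVM");
                return;
            }
        }
        // 0x41 is plain ASCII; 0x9B lies above 127, where the two pages differ.
        int[] samples = { 0x41, 0x9B };
        for (int b : samples) {
            System.out.printf("0x%02X -> %s (IBM437) vs %s (IBM850)%n",
                    b, decodeByte(b, "IBM437"), decodeByte(b, "IBM850"));
        }
    }
}
```

The ASCII byte should decode identically under both pages, while the byte above 127 should not, which is the risk when client and server silently assume different OEM codepages.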
> 
> This is, in fact, a problem for Samba 2.2.x.  Full Unicode support is in
> 3.x, but not in 2.2.x.
> 
> Note that this is also an SMB protocol bug, not a client or server bug.  
> There is nothing in SMB that allows negotiation of the codepage, which is 
> a major oversight.  I guess they never figured that people from different 
> nationalities might want to communicate.
> 
> One more interesting note:  UTF-8 is a multi-byte encoding.  ASCII values 
> (codes 127 and below) are stored in one byte.  Anything above is stored in 
> two bytes, with the high-order bit set.  I am told that UTF-8 is *not* 
> used in SMB at all.
> 
	UTF-8 can actually occupy up to 6 bytes per character as originally defined
	(4 bytes are enough for any code up to 0x10FFFF). Also, to clarify, UTF-8
	is Unicode, as are UTF-16, UTF-16LE, UTF-16BE, UCS-2, UCS-2LE, UCS-2BE,
	UCS-4, UCS-4LE, and UCS-4BE. They are just different encodings of UCS
	codes, which are integers; Unicode character codes run up to 0x10FFFF.
	For example, UTF-16BE uses two bytes in big-endian byte order, but if
	codes above 0xFFFF need to be represented, another two bytes are added
	(a "surrogate pair"). The extension principle is very similar to that of
	UTF-8 in that if you run out of range you flip on the high bit and use
	another unit of space. The UCS-X encodings do not do this. UCS-4LE, for
	example, uses 4 bytes in little-endian byte order to encode each character.
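
As a rough check of the variable-width behaviour described above, the following sketch prints how many bytes a few code points occupy in UTF-8 and UTF-16BE. The class and method names are hypothetical; only standard java.nio.charset calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Number of bytes one code point occupies in the given encoding.
    static int encodedLength(int codePoint, Charset cs) {
        return new String(Character.toChars(codePoint)).getBytes(cs).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin-1 letter, euro sign, and a code point above 0xFFFF.
        int[] codePoints = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int cp : codePoints) {
            System.out.printf("U+%04X: UTF-8=%d bytes, UTF-16BE=%d bytes%n",
                    cp,
                    encodedLength(cp, StandardCharsets.UTF_8),
                    encodedLength(cp, StandardCharsets.UTF_16BE));
        }
    }
}
```

UTF-8 grows one byte at a time as the code point gets larger, while UTF-16BE stays at two bytes until the code point exceeds 0xFFFF and a surrogate pair is needed.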

	SMB uses UCS-2LE and possibly UTF-16LE, which are identical except for the
	surrogate range that UTF-16 uses to represent characters that fall outside
	the range a single 16-bit unit can hold.
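
A small sketch of the point above: for strings without surrogates, the UTF-16LE bytes are exactly what UCS-2LE would produce, and only characters above 0xFFFF make the two diverge. The class and method names are hypothetical, and this says nothing about how jCIFS or any SMB server actually builds its buffers.

```java
import java.nio.charset.StandardCharsets;

public class Ucs2VsUtf16 {
    // True when the string has no surrogates, i.e. every character fits
    // in one 16-bit unit and UCS-2LE and UTF-16LE give the same bytes.
    static boolean sameInUcs2(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Hex dump of the string's UTF-16LE bytes, low byte of each unit first.
    static String hexUtf16le(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16LE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "a\u00e9";
        String beyond = new String(Character.toChars(0x1F600));
        System.out.println(hexUtf16le(name) + "  sameInUcs2=" + sameInUcs2(name));
        System.out.println(hexUtf16le(beyond) + "  sameInUcs2=" + sameInUcs2(beyond));
    }
}
```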

	See: http://www.cl.cam.ac.uk/~mgk25/unicode.html

	Incidentally, I wrote a library that can encode and decode any of these
	encodings and various codepages, among other things. I basically rewrote
	the loop_ routines of libiconv to support my preferred interface.

	http://www.eskimo.com/~miallen/encdec/