[jcifs] problem encoding
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Jan 17 13:27:55 EST 2003
> -----Original Message-----
> From: Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> Sent: Thursday, January 16, 2003 4:45 PM
> To: Michael B. Allen
> Cc: andrea.lanza at frameweb.it; jcifs at samba.org
> Subject: Re: [jcifs] problem encoding
>
> Okay, spent a few minutes on #samba-technical and got some answers on
> this.
>
> I was probably incorrect in connecting the earlier discussion to the
> problem that Andrea reported.
>
> That aside, there is still a problem.
> - If the client and server negotiate Unicode then everything works just
> fine.
>
Andrea's problem has nothing to do with SMB, jCIFS, or SMB URLs. He is
trying to make an *HTTP URL* that encodes an SMB path in the
PATH_INFO area like:
http://httpserver/servlet/JcifsServlet/this/is/path/info.pdf
Inside the servlet you call request.getPathInfo() and you get:
/this/is/path/info.pdf
This technique has been around since the early CGI days and has to do with
how browsers (MSIE) handle MIME types. If you try to pass a path as a
QUERY_STRING parameter, IE will ignore the .pdf. He is talking about using
this getPathInfo() string with jCIFS to get a file on an SMB file server. The
problem he is having is that extended characters in the path are, for whatever
reason, not being escaped/unescaped correctly. I don't know the details,
but the important thing is that this not be confused with the ongoing SMB
URL escaping discussion.
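
To illustrate the kind of escaping involved, here is a minimal sketch in Java
of percent-encoding a path for the PATH_INFO part of an HTTP URL and decoding
it again. It is not Andrea's code: the class and method names are
hypothetical, and it assumes both ends agree on UTF-8 for the escaped bytes.
A servlet container normally does the decoding itself before
getPathInfo() returns, using whatever encoding it is configured with, so a
mismatch there would mangle extended characters in just this way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
public class PathInfoEscaping {

    // Percent-encode each path segment as UTF-8, leaving the '/' separators intact.
    public static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        String[] segments = path.split("/", -1);
        try {
            for (int i = 0; i < segments.length; i++) {
                if (i > 0) {
                    out.append('/');
                }
                // URLEncoder targets form data and writes a space as '+';
                // in a URL path a space must be %20 instead.
                out.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return out.toString();
    }

    // Reverse of encodePath, again assuming the escaped bytes are UTF-8.
    public static String decodePath(String escaped) {
        try {
            // URLDecoder would turn a literal '+' into a space, so protect it first.
            return URLDecoder.decode(escaped.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String path = "docs/r\u00e9sum\u00e9 final.pdf"; // contains U+00E9 and a space
        String escaped = encodePath(path);
        System.out.println(escaped);
        System.out.println(decodePath(escaped).equals(path));
    }
}
```

If the decoding side assumed ISO-8859-1 instead, each escaped two-byte
sequence would come back as two wrong characters rather than one.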
> - If either side is unable to handle Unicode, then they both must be using
> the same 8-bit encoding (same DOS OEM codepage) or anything above ASCII
> 127 is at risk for being mapped incorrectly.
>
> This is, in fact, a problem for Samba 2.2.x. Full Unicode support is in
> 3.x, but not in 2.2.x.
>
> Note that this is also an SMB protocol bug, not a client or server bug.
> There is nothing in SMB that allows negotiation of the codepage, which is
> a major oversight. I guess they never figured that people from different
> nationalities might want to communicate.
>
> One more interesting note: UTF-8 is a multi-byte encoding. ASCII values
> (codes 127 and below) are stored in one byte. Anything above is stored in
> two bytes, with the high-order bit set. I am told that UTF-8 is *not*
> used in SMB at all.
>
UTF-8 can actually occupy more than two bytes per character: up to 4 for
codes up to 0x10FFFF, and up to 6 under the original definition, which
covered a larger range. Also, to clarify, UTF-8 is Unicode, as are UTF-16,
UTF-16LE, UTF-16BE, UCS-2, UCS-2LE, UCS-2BE, UCS-4, UCS-4LE, and UCS-4BE.
They are just different encodings of UCS codes, which are integers that
identify characters; Unicode uses codes up to 0x10FFFF.
For example, UTF-16BE uses two bytes in big-endian byte order, but codes
above 0xFFFF are represented with another two bytes (a "surrogate pair").
The extension principle is similar to that of UTF-8 in that if you run out
of range you use another unit of space: UTF-8 signals this with the high bit
of each byte, UTF-16 with a reserved range of 16-bit values (0xD800-0xDFFF).
The UCS-X encodings do not do this. UCS-4LE, for example, always uses 4 bytes
in little-endian byte order to encode each character, and UCS-2 always uses
2 bytes, so it cannot represent codes above 0xFFFF.
SMB uses UCS-2LE and possibly UTF-16LE, which are identical except for the
surrogate range that UTF-16 uses to encode characters that fall outside the
16-bit range.
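
As a quick illustration of the sizes involved, here is a small Java sketch
that encodes a few sample characters with the JDK's built-in charsets and
reports how many bytes each takes. The class and method names are made up
for this example; it says nothing about how jCIFS or any SMB server
handles strings internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class EncodingSizes {

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16LE.
    public static int utf16leLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // Render bytes as upper-case hex, two digits per byte.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                      // U+0041, plain ASCII
            "\u00e9",                                 // U+00E9, e with acute accent
            "\u20ac",                                 // U+20AC, euro sign
            new String(Character.toChars(0x1D11E))    // U+1D11E, above 0xFFFF
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16LE: %d bytes  (UTF-16BE %s)%n",
                    s.codePointAt(0), utf8Length(s), utf16leLength(s),
                    hex(s.getBytes(StandardCharsets.UTF_16BE)));
        }
    }
}
```

The last sample is the interesting one: it needs 4 bytes in UTF-8 and a
surrogate pair (D834 DD1E) in UTF-16, which is exactly the case plain UCS-2
cannot represent.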
See: http://www.cl.cam.ac.uk/~mgk25/unicode.html
Incidentally, I wrote a library that can encode and decode any of these
encodings and various codepages, among other things. I basically rewrote
the loop_ routines of libiconv to support my preferred interface.
http://www.eskimo.com/~miallen/encdec/