[jcifs] problem encoding
Christopher R. Hertel
crh at ubiqx.mn.org
Fri Jan 17 06:05:22 EST 2003
This looks as though it is an eight-bit character encoding issue.
[Mike: I'll just smile quietly to myself.] ;)
Microsoft uses different "DOS Code Pages" (also known as "OEM Character
Sets") to encode file and directory names. These go back to the old days
when IBM managed the SMB protocol, long before Unicode was available. My
guess, in this case, is that the filename on the SMB server side is
written using one of the DOS Code Pages.
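The DOS Code Page values do not match Unicode values. When you enter a
file name such as "Fübär" and write it using Unicode code point values
(which, for these characters, are the same as ISO 8859-1) you will get:
{ 'F', 0xFC, 'b', 0xE4, 'r', 0 }
In UTF-8 each of the two accented characters takes two bytes:
{ 'F', 0xC3, 0xBC, 'b', 0xC3, 0xA4, 'r', 0 }
[Note, in case that doesn't come through correctly, that the filename is
supposed to be 'F', u-umlaut, 'b', a-umlaut, 'r'.]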
Using DOS Code Page 437, however, the same string would be encoded as:
{ 'F', 0x81, 'b', 0x84, 'r', 0 }
So, even though your URL is encoded "correctly", it gets to the other end
and the server interprets it using the wrong set of byte-to-character
mapping values. The UTF-8 string doesn't match the DOS Codepage 437
string.
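For anyone who wants to see the mismatch for themselves, here is a minimal
Java sketch. It is only an illustration: the class name CodePageDemo is made
up, nothing in it is part of jCIFS, and it assumes your JRE ships the IBM437
charset (a full JDK normally does, a stripped-down runtime may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePageDemo {
    // Render bytes as space-separated hex, e.g. "46 81 62 84 72".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the IBM437 (DOS Code Page 437) charset is installed.
        Charset cp437 = Charset.forName("IBM437");
        String name = "F\u00FCb\u00E4r"; // F, u-umlaut, b, a-umlaut, r

        // ISO-8859-1: 46 FC 62 E4 72
        System.out.println("ISO-8859-1: " + hex(name.getBytes(StandardCharsets.ISO_8859_1)));
        // UTF-8:      46 C3 BC 62 C3 A4 72
        System.out.println("UTF-8:      " + hex(name.getBytes(StandardCharsets.UTF_8)));
        // CP437:      46 81 62 84 72
        System.out.println("CP437:      " + hex(name.getBytes(cp437)));

        // What a Code Page 437 reader makes of the UTF-8 bytes.
        String misread = new String(name.getBytes(StandardCharsets.UTF_8), cp437);
        System.out.println("UTF-8 read as CP437 matches original? " + misread.equals(name));
    }
}
```

The last line prints false: the bytes arrive intact, but the reader maps them
to different characters, so the name no longer matches anything on disk.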
I don't have a good solution off-hand. The best thing to do is to ensure
that both the client and the server are using the same codeset. Unicode
would be a good choice. Conversion from UTF-8 to Unicode (the two-byte,
little-endian form that SMB uses on the wire) is straightforward. The
only other option (and this is really ugly) is to include DOS Code Page
definitions with jCIFS and force the user to select the correct code page
for the particular server.
That latter one is a bad idea.
Better to negotiate Unicode, where possible.
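
As a rough illustration of how little work the UTF-8 to Unicode step is,
here is a sketch in plain Java. The class and method names are hypothetical,
this is not how jCIFS does it internally, and it assumes the wire format is
UTF-16LE, which is what SMB uses once Unicode has been negotiated.

```java
import java.nio.charset.StandardCharsets;

class Utf8ToUnicode {
    // Decode UTF-8 bytes to a Java String, then re-encode as UTF-16LE.
    static byte[] utf8ToUtf16le(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "Fübär" as UTF-8: 46 C3 BC 62 C3 A4 72
        byte[] utf8 = { 0x46, (byte) 0xC3, (byte) 0xBC, 0x62,
                        (byte) 0xC3, (byte) 0xA4, 0x72 };
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8ToUtf16le(utf8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        // Prints: 46 00 FC 00 62 00 E4 00 72 00
        System.out.println(sb.toString().trim());
    }
}
```

Note that the Unicode values 0xFC and 0xE4 reappear here, each padded to two
bytes, so no code page table is needed on either side.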
Chris -)-----
PS. Any chance you'll be at the Samba/XP conference in Goettingen,
Germany, next April? www.sambaxp.org
On Thu, Jan 16, 2003 at 04:08:53PM +0100, andrea.lanza at frameweb.it wrote:
> My problem is the following.
> Because I am writing a servlet using jcifs, I have an argument passed to
> the servlet's URL containing the SMB file to get and work on.
>
> I encode this argument using:
>
> java.net.URLEncoder.encode("Name of my SMB File","UTF-8");
>
> Everything is OK for a lot of characters like () [] {} and so on...
>
> But with some characters (accented characters, the degree symbol and
> others) the encoding fails.
>
> Which is the best encoding I can use?
>
> thanks in advance,
>
> Andrea
>
>
--
Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org