[jcifs] problem encoding

Christopher R. Hertel crh at ubiqx.mn.org
Fri Jan 17 06:05:22 EST 2003

This looks as though it is an eight-bit character encoding issue.

  [Mike:  I'll just smile quietly to myself.]  ;)

Microsoft uses different "DOS Code Pages" (also known as "OEM Character
Sets") to encode file and directory names.  These go back to the old days
when IBM managed the SMB protocol, long before Unicode was available. My
guess, in this case, is that the filename on the SMB server side is
written using one of the DOS Code Pages.

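The DOS Code Page values do not match Unicode values.  When you enter a
file name such as "Fübär.txt", the "Fübär" part has these Unicode code
points (which are also its ISO-8859-1 bytes):
{ 'F', 0xFC, 'b', 0xE4, 'r', 0 }.  Converted to UTF-8, each umlaut
becomes two bytes:  { 'F', 0xC3, 0xBC, 'b', 0xC3, 0xA4, 'r', 0 }.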

  [Note, in case that doesn't come through correctly, that the filename is
  supposed to be 'F', u-umlaut, 'b', a-umlaut, 'r'.]

Using DOS Code Page 437, however, the same string would be encoded as:
{ 'F', 0x81, 'b', 0x84, 'r', 0 }
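
If you want to check the byte values yourself, here is a minimal Java sketch
that prints the same name under three encodings.  The class name
EncodingBytes and the hex() helper are made up for illustration.  "IBM437"
is the JDK's name for Code Page 437; it lives in the extended charsets, so
I'm assuming it may be missing on a stripped-down JVM and the sketch checks
first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: render a byte array as space-separated hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Fübär", written with escapes so the source file encoding
        // cannot interfere with the demonstration.
        String name = "F\u00FCb\u00E4r";

        System.out.println("ISO-8859-1: "
                + hex(name.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      "
                + hex(name.getBytes(StandardCharsets.UTF_8)));

        // Code Page 437 is optional in some runtimes, so check before use.
        if (Charset.isSupported("IBM437")) {
            System.out.println("CP437:      "
                    + hex(name.getBytes(Charset.forName("IBM437"))));
        } else {
            System.out.println("CP437:      (charset not available)");
        }
    }
}
```

The umlauts come out as one byte each in ISO-8859-1 and in Code Page 437
(with different values), and as two bytes each in UTF-8.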

So, even though your URL is encoded "correctly", it gets to the other end
and the server interprets it using the wrong set of byte-to-character
mapping values.  The UTF-8 string doesn't match the DOS Codepage 437
string.
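
Here is a rough Java sketch of that mismatch.  It is not what jCIFS or any
server actually does internally; MismatchDemo and misread() are names I
made up, and the "server" is simulated by decoding the bytes with a
different charset than the one that wrote them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class MismatchDemo {
    // Hypothetical helper: encode with one charset, decode with another,
    // which is what happens when the two ends disagree.
    static String misread(String name, String writer, String reader)
            throws UnsupportedEncodingException {
        return new String(name.getBytes(writer), reader);
    }

    public static void main(String[] args)
            throws UnsupportedEncodingException {
        String name = "F\u00FCb\u00E4r";   // "Fübär"

        // The URL layer is fine: the name survives a UTF-8 round trip.
        String encoded = URLEncoder.encode(name, "UTF-8");
        System.out.println(encoded);
        System.out.println(URLDecoder.decode(encoded, "UTF-8").equals(name));

        // The trouble starts when the UTF-8 bytes are read as a code page.
        String reader = Charset.isSupported("IBM437") ? "IBM437" : "ISO-8859-1";
        String seen = misread(name, "UTF-8", reader);
        System.out.println(reader + " reader sees " + seen.length()
                + " characters; matches original: " + seen.equals(name));
    }
}
```

The percent-encoded form is F%C3%BCb%C3%A4r, and the round trip through
URLDecoder gives the original name back.  The mismatch only appears in the
last step, where five characters turn into seven.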

I don't have a good solution off-hand.  The best thing to do is to ensure 
that both the client and the server are using the same codeset.  Unicode 
would be a good choice.  Conversion from UTF-8 to 16-bit Unicode is
straightforward.  The only other option (and this is really ugly) is to
include DOS Codepage definitions with jCIFS and force the user to select 
the correct codepage for the particular server.

That latter one is a bad idea.

Better to negotiate Unicode, where possible.
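
To show how little is involved in the UTF-8 to Unicode step, here is a
small Java sketch.  I'm assuming the 16-bit little-endian form that SMB
uses once Unicode has been negotiated; Utf8ToUnicode and utf8ToUtf16le()
are illustrative names, not part of jCIFS.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUnicode {
    // Decode UTF-8 bytes into a Java String (UTF-16 internally), then
    // write the characters out as little-endian 16-bit units.
    static byte[] utf8ToUtf16le(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // UTF-8 bytes for "Fübär".
        byte[] utf8 = { 'F', (byte) 0xC3, (byte) 0xBC,
                        'b', (byte) 0xC3, (byte) 0xA4, 'r' };

        for (byte b : utf8ToUtf16le(utf8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Each of the five characters becomes two bytes, low byte first, so the
umlauts show up as FC 00 and E4 00.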

Chris -)-----

PS.  Any chance you'll be at the Samba/XP conference in Goettingen,
     Germany, next April?  www.sambaxp.org

On Thu, Jan 16, 2003 at 04:08:53PM +0100, andrea.lanza at frameweb.it wrote:
> My problem is the following.
> Because I am writing a servlet using jcifs, I have an argument passed to
> the servlet's URL containing the SMB file to get and work on.
> I encode this argument using:
> java.net.URLEncoder.encode("Name of my SMB File","UTF-8");
> Everything is OK for a lot of characters like () [] {} and so on...
> But with some characters (accented characters, the degree symbol and
> others) the encoding fails.
> Which is the best encoding I can use ?
> thanks in advance,
> Andrea

Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org
