[linux-cifs-client] Re: Unicode filenames on cifs mounted share

Tue Aug 24 18:57:03 GMT 2004

On 2004-08-24 at 10:22 -0500 Steve French sent off:
> > there is a problem with Unicode filenames on cifs mounted shares. 
> > It works okay with "simple" umlaut-containing filenames but it is
> > broken for example with Japanese filenames. Here is an example ...
> 
> I can't yet prove that in your example the problem is caused by a cifs
> bug (rather than in the conversion from Unicode to UTF-8 done in the
> Linux NLS module)

Well, I think this is the proof that the Linux UTF-8 NLS module is not the
cause:

pell:/home/bjacke # ls /srv/samba/word/|while read a ; do touch /mnt/1/dir-on-jfs-utf8/"$a" ; done
pell:/home/bjacke # ls /mnt/1/dir-on-jfs-utf8/
.         bla1  fd             first      mkdir.exe                 ordner    test2         VMware-workstation-4.5.2-8848.exe
..        eu€   ffdようこそ    indesign2  Neu Textdokument (2).txt  sh_histo  töten
ようこそ  f1    file_name.tar  ksh.exe    Neu Textdokument.txt      täter     UNDELETE.EXE

I created exactly the same files on a JFS partition mounted with
iocharset=utf8, so the NLS module has to do the same job as in the case of the
cifs mount ... and the files are listed correctly.

> related and was pointed out to me in the cifs code a few weeks ago at
> the cifs conference - It is possible apparently when using UTF8 codepage
> to have strings end up longer than the same string would be encoded when
> in Unicode since UTF8 characters can take more than two bytes (while the

UTF-8 is not a codepage; it is just one representation of Unicode, just
like UTF-16 is another representation of Unicode.

> Unicode strings on the wire are always 2 bytes per character).

Yes, UTF-8 may be up to 6 bytes per character in its original definition
(RFC 2279); RFC 3629 restricts it to at most 4 bytes per character.

> lead to corruption in search buffers - when such (hopefully unusual)

Not unusual at all, as East Asian and Thai code points all take more than 2
bytes in UTF-8 representation.
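
To make the size difference concrete, here is a minimal Python sketch that
compares the encoded length of a few of the file names from the listing above
in UTF-16LE (2 bytes per character for these names) and in UTF-8. The helper
name encoded_sizes is only an illustration, not anything from the cifs code.

```python
# Sample names taken from the directory listing; escapes keep the code points explicit.
names = [
    "bla1",                      # plain ASCII
    "t\u00f6ten",                # "töten", one umlaut
    "eu\u20ac",                  # "eu€"
    "\u3088\u3046\u3053\u305d",  # "ようこそ"
]

def encoded_sizes(name):
    """Return (utf16le_bytes, utf8_bytes) for one file name."""
    return len(name.encode("utf-16-le")), len(name.encode("utf-8"))

for name in names:
    utf16_len, utf8_len = encoded_sizes(name)
    grows = "grows" if utf8_len > utf16_len else "fits"
    print(f"{name!r}: UTF-16LE {utf16_len} bytes, UTF-8 {utf8_len} bytes ({grows})")
```

In this sample the umlaut and euro names still fit into the space of their
UTF-16 form, while the Japanese name does not, which would match the symptom
reported at the start of this thread.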

> strings are converted locally on the client in place in the network
> buffer because this might lead to the string conversion overwriting part
> of the following string.  That behavior (doing Unicode->local codepage)
> conversions in place is a bad idea based on an incorrect assumption -

good to hear that you already seem to have found where the problem is hidden
:-)
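
The effect described in the quoted text can be modelled with a toy buffer.
This is only a sketch under assumptions: two names packed back to back as
UTF-16LE stand in for a search response, and build_buffer and
convert_in_place are hypothetical helpers, not the actual cifs routines or
the real wire layout.

```python
def build_buffer(names):
    """Pack names back to back as UTF-16LE; return the buffer and (offset, length) per entry."""
    buf = bytearray()
    entries = []
    for name in names:
        raw = name.encode("utf-16-le")
        entries.append((len(buf), len(raw)))
        buf += raw
    return buf, entries

def convert_in_place(buf, offset, length):
    """Overwrite the UTF-16LE name at buf[offset:offset+length] with its UTF-8 form.

    Returns the number of bytes written, which may be larger than length.
    """
    utf8 = bytes(buf[offset:offset + length]).decode("utf-16-le").encode("utf-8")
    buf[offset:offset + len(utf8)] = utf8  # may run past the end of this entry
    return len(utf8)

# First entry is "ようこそ", second entry is "f1".
buf, entries = build_buffer(["\u3088\u3046\u3053\u305d", "f1"])
first, second = entries

written = convert_in_place(buf, *first)
overrun = written - first[1]
second_after = bytes(buf[second[0]:second[0] + second[1]]).decode("utf-16-le", errors="replace")

print(f"wrote {written} bytes into a {first[1]}-byte slot, overrun {overrun} bytes")
print(f"second entry now reads {second_after!r} instead of 'f1'")
```

Converting into a separate output buffer sized for the worst case would avoid
the overrun in this model.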

Bjoern
-- 
Björn Jacke,  SerNet Service Network GmbH
Phone: +49-(0)551-370000-0,  Fax: +49-(0)551-370000-9