[linux-cifs-client] [PATCH 1/3] cifs: Introduce helper to compute length of nls string in bytes

Sat Apr 25 03:12:25 GMT 2009

On Friday, 24 April 2009, Jeff Layton wrote:
> On Fri, 24 Apr 2009 11:59:54 -0500
> Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
> 
> > On Fri, Apr 24, 2009 at 11:57 AM, Shirish Pargaonkar
> > <shirishpargaonkar at gmail.com> wrote:
> > > On Thu, Apr 23, 2009 at 12:56 AM, Jeff Layton <jlayton at redhat.com> wrote:
> > >> On Thu, 23 Apr 2009 02:49:21 +0200
> > >> Günter Kukkukk <linux at kukkukk.com> wrote:
> > >>
> > >>> just some further notes.
> > >>> With "it's heavily used" i didn't mean the number of callers using this
> > >>> function (only 1 in readdir.c) - i meant "the number of times" cifs_convertUCSpath()
> > >>> is called in daily usage.... (readdir results)
> > >>>
> > >>> The current focus was mostly on cifs_strfromUCS_le() - but the _same_ applies
> > >>> to cifs_convertUCSpath()!
> > >>>
> > >>> See the following code snippet:
> > >>>
> > >>> readdir.c --> static int cifs_get_name_from_search_buf()
> > >>> ....
> > >>>
> > >>>       if (unicode) {
> > >>>               /* BB fixme - test with long names */
> > >>>               /* Note converted filename can be longer than in unicode */
> > >>>               if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
> > >>>                       pqst->len = cifs_convertUCSpath((char *)pqst->name,
> > >>>                                       (__le16 *)filename, len/2, nlt);
> > >>>               else
> > >>>                       pqst->len = cifs_strfromUCS_le((char *)pqst->name,
> > >>>                                       (__le16 *)filename, len/2, nlt);
> > >>>
> > >>> ....
> > >>
> > >> I see what you mean. Good catch. That function also has broken buffer
> > >> length checking logic too.
> > >>
> > >> This patch is only compile-tested, but it should fix those problems. In
> > >> the long run, we probably need to make all of these functions take an
> > >> argument with the length of the destination buffer.
> > >>
> > >> Let's plan that overhaul after Suresh's latest set goes in though.
> > >>
> > >> --
> > >> Jeff Layton <jlayton at redhat.com>
> > >>
> > >
> > > A general question: functions such as cifs_strtoUCS call uni2char,
> > > which assumes a UTF-8 translation format.
> > > If one of the characters being encoded happens to be 6 bytes long,
> > > will an SMB/CIFS server be able to handle that? That is, if it is
> > > expecting UCS-2LE encoding, and thus a two-byte encoded value, how
> > > (if at all) would it handle a 6-byte encoded value?
> > >
> > 
> > Sorry, I meant to say
> >  'char2uni which assumes UTF-8 translation format'
> > and not
> >  'uni2char which assumes UTF-8 translation format'
> 
> My understanding is that the unicode spec allows for a character to
> translate to a wide char of up to 6 bytes. According to Suresh's
> earlier email though, the unicode standard specifies no characters
> above 0x10ffff. So Unicode characters can only be up to four bytes long
> in UTF-8 (and maybe even only 3 bytes unless I'm missing something).
> 
> The question of course is, what if the client is using some other
> non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
> bytes in that case?
> 

I've spent days now on the "Unicode" (and ISO 10646) material and will
write up a conclusion later this week.
The current Unicode upper limit of 0x10ffff results in at most 4 bytes of UTF-8.

   "It is important to note that both the Unicode consortium and ISO pledge
    to never extend the encoding-space past this range." (0x10ffff)

Jeff: "...what if the client is using some other non-UTF8 multibyte charset?"

Some Unixes use UTF-32 or UCS-4 to represent one character, but even
those consume only 4 bytes per char, always leaving 11 bits of the
32-bit range unused (0x10ffff fits in 21 bits).

The initial UCS-2 (2 byte) encoding used by Microsoft for their NTFS
filesystem would result in at most 3 bytes of UTF-8 (more later this week).
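
To make the byte counts above concrete, here is a minimal sketch in plain
userspace C. The helper name utf8_encoded_len() is hypothetical and is not
part of the cifs or nls code; it only illustrates the standard UTF-8 length
ranges: anything that fits in 16 bits (UCS-2) needs at most 3 bytes, and
anything up to 0x10ffff needs at most 4.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): return the number of
 * bytes needed to encode code point cp in UTF-8, or 0 if cp is not a
 * valid Unicode scalar value.
 */
static int utf8_encoded_len(uint32_t cp)
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate halves are not characters */
	if (cp <= 0x7F)
		return 1;	/* ASCII */
	if (cp <= 0x7FF)
		return 2;
	if (cp <= 0xFFFF)
		return 3;	/* everything representable in UCS-2 */
	if (cp <= 0x10FFFF)
		return 4;	/* current Unicode upper limit */
	return 0;		/* beyond the Unicode range */
}
```

Under these assumptions, a destination buffer of 3 bytes per 16-bit source
unit would be enough for UTF-8 output from UCS-2 input.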

Any valid 3-byte UTF-8 sequence should be easily converted back to "UCS-2"
using a proper nls 'char2uni'.
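
As an illustration of that reverse direction, the following is a rough
sketch only, not the kernel's char2uni implementation. The function name
utf8_to_ucs2_sketch() and its return convention are assumptions made for
this example. It decodes one UTF-8 sequence of up to 3 bytes into a 16-bit
value and rejects anything that cannot be represented in UCS-2.

```c
#include <stddef.h>

/*
 * Hypothetical sketch (not the nls char2uni): decode the UTF-8 sequence
 * at the start of buf into a single 16-bit code unit.  Returns the
 * decoded value, or -1 if the input is truncated, malformed, overlong,
 * a surrogate, or a 4-byte sequence that does not fit in UCS-2.
 */
static long utf8_to_ucs2_sketch(const char *buf, size_t len)
{
	const unsigned char *s = (const unsigned char *)buf;
	long v;

	if (len >= 1 && s[0] < 0x80)
		return s[0];			/* 1 byte: ASCII */

	if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
		v = ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return v >= 0x80 ? v : -1;	/* reject overlong forms */
	}

	if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
	    (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		v = ((long)(s[0] & 0x0F) << 12) |
		    ((long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if (v < 0x800 || (v >= 0xD800 && v <= 0xDFFF))
			return -1;		/* overlong or surrogate */
		return v;
	}

	return -1;	/* 4-byte sequences and garbage end up here */
}
```

For the wire format, the resulting value would still have to be stored
little-endian (UCS-2LE); that step is left out of the sketch.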

More later ...
Cheers, Günter