[linux-cifs-client] [PATCH 1/3] cifs: Introduce helper to compute length of nls string in bytes

Sat Apr 25 03:28:48 GMT 2009

2009/4/24 Günter Kukkukk <linux at kukkukk.com>:
> Am Freitag, 24. April 2009 schrieb Jeff Layton:
>> On Fri, 24 Apr 2009 11:59:54 -0500
>> Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
>>
>> > On Fri, Apr 24, 2009 at 11:57 AM, Shirish Pargaonkar
>> > <shirishpargaonkar at gmail.com> wrote:
>> > > On Thu, Apr 23, 2009 at 12:56 AM, Jeff Layton <jlayton at redhat.com> wrote:
>> > >> On Thu, 23 Apr 2009 02:49:21 +0200
>> > >> Günter Kukkukk <linux at kukkukk.com> wrote:
>> > >>
>> > >>> Just some further notes.
>> > >>> With "it's heavily used" I didn't mean the number of callers using this
>> > >>> function (only 1 in readdir.c). I meant the number of times cifs_convertUCSpath()
>> > >>> is called in daily usage (readdir results).
>> > >>>
>> > >>> The current focus was mostly on cifs_strfromUCS_le() - but the _same_ applies
>> > >>> to cifs_convertUCSpath()!
>> > >>>
>> > >>> See the following code snippet:
>> > >>>
>> > >>> readdir.c --> static int cifs_get_name_from_search_buf()
>> > >>> ....
>> > >>>
>> > >>>       if (unicode) {
>> > >>>               /* BB fixme - test with long names */
>> > >>>               /* Note converted filename can be longer than in unicode */
>> > >>>               if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
>> > >>>                       pqst->len = cifs_convertUCSpath((char *)pqst->name,
>> > >>>                                       (__le16 *)filename, len/2, nlt);
>> > >>>               else
>> > >>>                       pqst->len = cifs_strfromUCS_le((char *)pqst->name,
>> > >>>                                       (__le16 *)filename, len/2, nlt);
>> > >>>
>> > >>> ....
>> > >>
>> > >> I see what you mean. Good catch. That function also has broken
>> > >> buffer-length checking logic.
>> > >>
>> > >> This patch is only compile-tested, but it should fix those problems. In
>> > >> the long run, we probably need to make all of these functions take an
>> > >> argument with the length of the destination buffer.
>> > >>
>> > >> Let's plan that overhaul after Suresh's latest set goes in though.
>> > >>
>> > >> --
>> > >> Jeff Layton <jlayton at redhat.com>
>> > >>
>> > >
>> > > A general question: functions such as cifs_strtoUCS call uni2char,
>> > > which assumes the UTF-8 translation format.
>> > > If one of the characters being encoded happens to be 6 bytes long,
>> > > will an SMB/CIFS server be able to handle that? That is, if it is
>> > > expecting UCS-2LE encoding, and thus a two-byte encoded value,
>> > > how (if at all) would it handle a 6-byte encoded value?
>> > >
>> >
>> > Sorry, I meant to say
>> >  'char2uni which assumes UTF-8 translation format'
>> > and not
>> >  'uni2char which assumes UTF-8 translation format'
>>
>> My understanding is that the unicode spec allows for a character to
>> translate to a wide char of up to 6 bytes. According to Suresh's
>> earlier email though, the unicode standard specifies no characters
>> above 0x10ffff. So Unicode characters can only be up to four bytes long
>> in UTF-8 (and maybe even only 3 bytes unless I'm missing something).
>>
>> The question of course is, what if the client is using some other
>> non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
>> bytes in that case?
>>
>
> I've spent days now on the "Unicode" (and ISO 10646) material and will write
> up a conclusion later this week.
> The current Unicode upper limit of 0x10ffff results in at most 4 bytes of UTF-8.
>
>   "It is important to note that both the Unicode consortium and ISO pledge
>    to never extend the encoding-space past this range." (0x10ffff)
>

Günter,

The range 0 - 0x10ffff is the range of the Unicode/UCS character set.
But when any of these Unicode/UCS characters is encoded using UTF-8,
the encoded value can span up to 6 bytes. Would that be correct?
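
To make the byte counts in this question concrete, here is a minimal
sketch in plain userspace C (not kernel code; the helper name
utf8_encoded_len is hypothetical) of how the UTF-8 length follows from
the code point value. The original UTF-8 definition did allow 5- and
6-byte forms for values up to 0x7fffffff, but RFC 3629 restricts UTF-8
to the range 0 - 0x10ffff, and within that range the longest sequence is
4 bytes. Anything in the UCS-2 range (0 - 0xffff) needs at most 3 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns the number of bytes needed to encode code point 'cp' in
 * UTF-8 as restricted by RFC 3629, or 0 if 'cp' is not encodable
 * (a surrogate, or a value above 0x10ffff).
 */
static size_t utf8_encoded_len(uint32_t cp)
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogates are not characters */
	if (cp > 0x10FFFF)
		return 0;	/* outside the Unicode range */
	if (cp < 0x80)
		return 1;	/* 7 bits of payload */
	if (cp < 0x800)
		return 2;	/* 11 bits of payload */
	if (cp < 0x10000)
		return 3;	/* 16 bits: the whole UCS-2 range */
	return 4;		/* 21 bits: up to 0x10ffff */
}
```

So under this reading, a destination buffer sized at 3 bytes per UCS-2
code unit would be sufficient for UTF-8 output; other multibyte charsets
would have to be checked separately.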

> Jeff: "...what if the client is using some other non-UTF8 multibyte charset?"
>
> Some unixes use UTF-32 and UCS-4 to represent one character, but even
> those would only consume 4 bytes per char, always wasting 11 bits of
> the 32-bit range.
>
> The initial UCS-2 (2-byte) encoding used by Microsoft for their NTFS
> filesystem would only result in at most 3 bytes of UTF-8 (more later this week).
>
> Any valid 3-byte UTF-8 byte sequence should be easily converted back to "UCS-2"
> using a proper nls 'char2uni'.
>
> More later ...
> Cheers, Günter
>
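
As an illustration of the 3-byte round trip mentioned above, the
following is a rough userspace sketch, not the kernel's nls char2uni
implementation. The function name utf8_to_ucs2 and its signature are
made up for this example. It decodes one UTF-8 sequence of 1 to 3 bytes
into a single 16-bit code unit and rejects anything that cannot be
represented in UCS-2 (4-byte sequences, surrogates, overlong forms).

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Decodes one UTF-8 sequence from 's' (at most 'len' bytes available)
 * into '*out'. Returns the number of bytes consumed (1..3), or -1 if
 * the input is truncated, malformed, overlong, a surrogate, or would
 * need more than 16 bits.
 */
static int utf8_to_ucs2(const unsigned char *s, int len, uint16_t *out)
{
	uint32_t cp;

	if (len < 1)
		return -1;

	if (s[0] < 0x80) {			/* 0xxxxxxx */
		*out = s[0];
		return 1;
	}

	if ((s[0] & 0xE0) == 0xC0) {		/* 110xxxxx 10xxxxxx */
		if (len < 2 || (s[1] & 0xC0) != 0x80)
			return -1;
		cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		if (cp < 0x80)
			return -1;		/* overlong encoding */
		*out = (uint16_t)cp;
		return 2;
	}

	if ((s[0] & 0xF0) == 0xE0) {		/* 1110xxxx 10xxxxxx 10xxxxxx */
		if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
			return -1;
		cp = ((uint32_t)(s[0] & 0x0F) << 12) |
		     ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if (cp < 0x800)
			return -1;		/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return -1;		/* surrogates are not valid here */
		*out = (uint16_t)cp;
		return 3;
	}

	return -1;	/* 4-byte lead byte or stray continuation byte */
}
```

Returning the consumed byte count lets a caller walk a whole string one
sequence at a time and stop at the first sequence that does not fit in
UCS-2.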