[linux-cifs-client] [PATCH 1/3] cifs: Introduce helper to compute length of nls string in bytes

Sat Apr 25 03:46:27 GMT 2009

On Fri, Apr 24, 2009 at 10:28 PM, Shirish Pargaonkar
<shirishpargaonkar at gmail.com> wrote:
> 2009/4/24 Günter Kukkukk <linux at kukkukk.com>:
>> On Friday, 24 April 2009, Jeff Layton wrote:
>>> On Fri, 24 Apr 2009 11:59:54 -0500
>>> Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
>>>
>>> > On Fri, Apr 24, 2009 at 11:57 AM, Shirish Pargaonkar
>>> > <shirishpargaonkar at gmail.com> wrote:
>>> > > On Thu, Apr 23, 2009 at 12:56 AM, Jeff Layton <jlayton at redhat.com> wrote:
>>> > >> On Thu, 23 Apr 2009 02:49:21 +0200
>>> > >> Günter Kukkukk <linux at kukkukk.com> wrote:
>>> > >>
>>> > >>> just some further notes.
>>> > >>> With "it's heavily used" I didn't mean the number of callers of this
>>> > >>> function (only 1, in readdir.c); I meant the number of times
>>> > >>> cifs_convertUCSpath() is called in daily usage (readdir results).
>>> > >>>
>>> > >>> The current focus was mostly on cifs_strfromUCS_le() - but the _same_ applies
>>> > >>> to cifs_convertUCSpath()!
>>> > >>>
>>> > >>> See the following code snippet:
>>> > >>>
>>> > >>> readdir.c --> static int cifs_get_name_from_search_buf()
>>> > >>> ....
>>> > >>>
>>> > >>>       if (unicode) {
>>> > >>>               /* BB fixme - test with long names */
>>> > >>>               /* Note converted filename can be longer than in unicode */
>>> > >>>               if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
>>> > >>>                       pqst->len = cifs_convertUCSpath((char *)pqst->name,
>>> > >>>                                       (__le16 *)filename, len/2, nlt);
>>> > >>>               else
>>> > >>>                       pqst->len = cifs_strfromUCS_le((char *)pqst->name,
>>> > >>>                                       (__le16 *)filename, len/2, nlt);
>>> > >>>
>>> > >>> ....
>>> > >>
>>> > >> I see what you mean. Good catch. That function's buffer length
>>> > >> checking logic is broken too.
>>> > >>
>>> > >> This patch is only compile-tested, but it should fix those problems. In
>>> > >> the long run, we probably need to make all of these functions take an
>>> > >> argument with the length of the destination buffer.
>>> > >>
>>> > >> Let's plan that overhaul after Suresh's latest set goes in though.
>>> > >>
>>> > >> --
>>> > >> Jeff Layton <jlayton at redhat.com>
>>> > >>
>>> > >
>>> > > A general question: functions such as cifs_strtoUCS call uni2char,
>>> > > which assumes the UTF-8 translation format.
>>> > > If one of the characters being encoded happens to be 6 bytes long,
>>> > > will an SMB/CIFS server be able to handle that? That is, if it is
>>> > > expecting a UCS-2LE encoding, and thus a two-byte encoded value,
>>> > > how (if at all) would it handle a 6-byte encoded value?
>>> > >
>>> >
>>> > Sorry, I meant to say
>>> >  'char2uni which assumes UTF-8 translation format'
>>> > and not
>>> >  'uni2char which assumes UTF-8 translation format'
>>>
>>> My understanding is that the unicode spec allows for a character to
>>> translate to a wide char of up to 6 bytes. According to Suresh's
>>> earlier email though, the unicode standard specifies no characters
>>> above 0x10ffff. So Unicode characters can only be up to four bytes long
>>> in UTF-8 (and maybe even only 3 bytes unless I'm missing something).
>>>
>>> The question of course is, what if the client is using some other
>>> non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
>>> bytes in that case?
>>>
>>
>> I've spent days now on the "unicode" (and ISO-10646) stuff and will write
>> up a conclusion later this week.
>> The current Unicode upper limit of 0x10ffff results in at most 4 bytes
>> of UTF-8.
>>
>>   "It is important to note that both the Unicode consortium and ISO pledge
>>    to never extend the encoding-space past this range." (0x10ffff)
>>
>
> Gunter,
>
>
> The range 0 - 0x10ffff is the range of the Unicode/UCS character set.
> But when any of these Unicode/UCS characters is encoded using UTF-8,
> the encoded value can span up to 6 bytes. Would that be correct?
>

I should not say 'any of these' but rather 'some of these'.
So a Unicode/UCS character itself would not take more than 4 bytes,
but some of their encoded values may take up to six bytes, and the encoded
value is what is sent over the wire to the server.

>> Jeff: "...what if the client is using some other non-UTF8 multibyte charset?"
>>
>> Some unixes use UTF-32 and UCS4 to represent one character - but even
>> those would only consume 4 bytes per char - always wasting 11 bits of
>> the 32-bit range.
>>
>> The initial UCS-2 (2 byte) encoding used by Microsoft for their NTFS
>> filesystem would only result in at most 3 bytes of UTF-8 (more later
>> this week).
>>
>> Any valid 3 byte UTF-8 byte sequence should be easily converted back to "UCS-2"
>> using proper nls 'char2uni'.
>>
>> More later ...
>> Cheers, Günter
>>
>
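
For reference, the byte counts discussed in this thread can be checked
with a small sketch. This is not code from the patch or from the kernel
nls layer; utf8_encoded_len() is a hypothetical helper written only to
illustrate the arithmetic. It assumes the current UTF-8 definition
(RFC 3629), under which code points stop at 0x10ffff and surrogates
(0xd800 - 0xdfff) are not encodable. The 5- and 6-byte forms belong to
the older definition (RFC 2279), which covered values up to 0x7fffffff.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not a kernel function).
 * Returns the number of bytes UTF-8 needs to encode code point cp,
 * or 0 if cp is a surrogate or lies above the Unicode limit 0x10ffff.
 */
int utf8_encoded_len(uint32_t cp)
{
	if (cp <= 0x7F)
		return 1;	/* 7 payload bits:  0xxxxxxx */
	if (cp <= 0x7FF)
		return 2;	/* 11 payload bits: 110xxxxx 10xxxxxx */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogates are not valid code points */
	if (cp <= 0xFFFF)
		return 3;	/* 16 payload bits: everything UCS-2 can hold */
	if (cp <= 0x10FFFF)
		return 4;	/* 21 payload bits: rest of the Unicode range */
	return 0;		/* beyond Unicode; RFC 2279 used 5-6 bytes here */
}
```

Under these assumptions, any single UCS-2 value (up to 0xffff) maps to at
most 3 bytes, and nothing in the Unicode range needs more than 4, which
matches the figures Günter quotes above.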