Doubts about Samba's unicode translation tables

Fri Apr 19 18:29:38 UTC 2024

Hi Ralph,

On Fri, Apr 19, 2024 at 7:16 PM Ralph Boehme <slow at samba.org> wrote:

> On 4/19/24 11:04, Xavi Hernandez via samba-technical wrote:
> > The first question is why Samba uses two tables while Windows only
> > requires one ?
>
> cf 3ed9c903671e795964ce3da9d0080444ef3eb5e9 and
>
> https://bugzilla.samba.org/show_bug.cgi?id=13018

This seems to be related to the POSIX strcasecmp() function but, if I'm not
mistaken, Samba shouldn't compare filenames based on the POSIX specification.

I've found some very old archived posts from blog.msdn.com that seem to
describe how filesystem name comparisons are made (and it doesn't look like
any "standard" case-insensitive function), though I'm not sure whether they
are still relevant:

https://archives.miloush.net/michkap/archive/2005/01/16/353873.html
https://archives.miloush.net/michkap/archive/2005/10/17/481600.html

>
> > For what purpose is the lowercase translation table in Samba used ?
> > Is the Samba's case-insensitive comparison method actually equal to
> > Windows ?
>
> Hopefully. :)
>
> > I've also extracted the $UpCase file from a Windows 11 machine and I've
> > found that the Samba's uppercase table is very similar but not identical
> > (there are 339 different values). Is this expected ?
>
> I guess not. Can you share the differences?
>

Please, take a look at my previous email. I've attached a text file with
the differences.
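
For anyone who wants to reproduce a count like this, a comparison of two
upcase tables can be sketched as below. This is only an illustration, not
the exact tool used to produce the attached file. It assumes both tables
have already been loaded as arrays of 16-bit values indexed by code unit
(the $UpCase file holds 65536 little-endian 16-bit entries); the function
name count_table_diffs() is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the code units whose mapping differs between two tables.
 * Both tables must hold at least n entries, indexed by code unit.
 * (Hypothetical helper, not part of Samba.)
 */
static size_t count_table_diffs(const uint16_t *a, const uint16_t *b,
				size_t n)
{
	size_t diffs = 0;

	for (size_t i = 0; i < n; i++) {
		if (a[i] != b[i]) {
			diffs++;
		}
	}
	return diffs;
}
```

Called with n = 65536 on the two full tables, this returns the number of
differing entries.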

> > I'm new to Samba, so I will be very grateful for any insights you might
> > give me about how the unicode tables work in Samba and any other
> > important details related to the case-insensitive accesses.
>
> The higher level processing is from get_real_filename_at() try the VFS
> via SMB_VFS_GET_REAL_FILENAME_AT() and if the VFS doesn't implement this
> (vfs_default returns NT_STATUS_NOT_SUPPORTED), go via
> get_real_filename_full_scan_at() which ends up calling
>
> fname_equal() -> strequal() -> strcasecmp_m() -> strcasecmp_m_handle()
> which contains the core logic.
>

I had already followed more or less that path, but if that's the whole
story, then I don't understand why tolower_m() is used. Apparently NTFS
doesn't use any lowercase conversion to compare file names, only the
uppercase table, and Samba's uppercase table is different from it. So I
don't see how the result of the comparison could always be the same. Of
course, the differences will only show up in very rare corner cases (or
with specially crafted names), so I'm not sure whether this matters.
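
To make the concern concrete, here is a minimal sketch of the two
approaches. It is not the actual Samba or NTFS code: toy_upper() and
toy_lower() are made-up stand-ins for the real tables, and the second
function only assumes a comparison that accepts a match under either
mapping. The single non-ASCII entry uses the Unicode fact that U+212A
(KELVIN SIGN) lowercases to 'k'; I'm not claiming that either real table
contains exactly this entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy uppercase mapping (ASCII only), NOT the real $UpCase table. */
static uint16_t toy_upper(uint16_t c)
{
	if (c >= 'a' && c <= 'z') {
		return (uint16_t)(c - 'a' + 'A');
	}
	return c;
}

/* Toy lowercase mapping, NOT Samba's real table. */
static uint16_t toy_lower(uint16_t c)
{
	if (c >= 'A' && c <= 'Z') {
		return (uint16_t)(c - 'A' + 'a');
	}
	if (c == 0x212A) {
		return 'k';	/* KELVIN SIGN -> 'k' */
	}
	return c;
}

/* Upcase-only comparison: equal only if the upcased units match. */
static bool equal_upcase_only(const uint16_t *a, const uint16_t *b,
			      size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (toy_upper(a[i]) != toy_upper(b[i])) {
			return false;
		}
	}
	return true;
}

/* Two-table comparison: a unit matches if either mapping agrees. */
static bool equal_upper_or_lower(const uint16_t *a, const uint16_t *b,
				 size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (a[i] == b[i]) {
			continue;
		}
		if (toy_upper(a[i]) == toy_upper(b[i])) {
			continue;
		}
		if (toy_lower(a[i]) == toy_lower(b[i])) {
			continue;
		}
		return false;
	}
	return true;
}
```

With these toy tables, the one-unit names { 'K' } and { 0x212A } are
different for equal_upcase_only() but equal for equal_upper_or_lower(),
which is the kind of corner case I have in mind.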

Also, if I want to implement something similar on the CephFS side, is it
safe to use the NTFS table or should I use the Samba version, which seems a
bit more complex ?

Xavi

> -slow
>