Doubts about Samba's unicode translation tables

Mon Apr 22 09:01:41 UTC 2024

Hi Douglas,

On Mon, Apr 22, 2024 at 7:26 AM Douglas Bagnall <
douglas.bagnall at catalyst.net.nz> wrote:

> On 19/04/24 21:04, Xavi Hernandez via samba-technical wrote:
> > The first question is why Samba uses two tables while Windows only
> > requires one ?
> > For what purpose is the lowercase translation table in Samba used ?
> > Is the Samba's case-insensitive comparison method actually equal to
> > Windows ?
>
> I don't have real answers, but I think the current mappings date back to
> this 2001 commit:
>
>
> https://gitlab.com/samba-team/samba/-/commit/9bcd133e9e7b0cfe974f273fb23409d660af8358
>
> The Windows sorting weight tables change often.
> On https://www.microsoft.com/en-us/download/details.aspx?id=10921 we see:
>
>    Windows Vista Sorting Weight Table.txt
>    Windows 8 and Windows Server 2012 Sorting Weight Table.txt
>    Windows Server 2008 Sorting Weight Table.txt
>    Windows 7 and Windows server 2008 R2 Sorting Weight Table.txt
>    Windows 8 Upper Case Mapping Table.txt
>    Windows NT 4.0 through Windows Server 2003 Sorting Weight Table.txt
>    Windows 10 Sorting Weight Table.txt
>
> That is not exactly the same thing as case mapping (apart perhaps from
> the one called "Windows 8 Upper Case Mapping Table"). It seems likely that
> a lot of the changes are for new Unicode characters beyond the 16 bit
> plane.
>
> "Windows 8 Upper Case Mapping Table.txt" has at least some of the changes
> in your differences.txt.
>
> This Gitlab thread is related:
>
>
> https://gitlab.com/samba-team/samba/-/merge_requests/3258#note_1576341163
>
> I have never got to the bottom of why we do what we do and how it differs
> from Windows, but I suspect the answer is it works well enough most of
> the time. That's worrying, but not enough to make it a priority.
>

Thanks for taking a look and for the valuable links.

I think we are dealing with two different things here. On one side we have
locale-based case-insensitive comparisons. This is the most common case for
applications: they need to compare two strings using the rules of the
user's locale, so that the result is what the user would expect. The same
two strings may compare differently for another user in another location
(with another locale).

On the other side we have case-insensitive NTFS file accesses. In this case
the rules need to be a bit different. I see 2 major things to consider:

1. The comparison cannot be locale-related

When a file is saved to the filesystem, whether its name is "equivalent" to
another one cannot depend on the locale of the user (or even of the
server). Otherwise, changing the locale could make duplicated files appear
in a directory.

2. Only comparison for equality is required

To find a file by name in a directory we just need to compare
case-insensitively for equality (normally a hash is used to find the bucket
where the file resides, and then a case-insensitive comparison for equality
is enough). We don't care about the relative order of the existing name and
the name we are looking for. Sorting the directory entries by name after
listing them, to show them to the user, is a very different thing. This
latter comparison depends on the locale and is made on the client side.
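
To make both points concrete, here is a minimal Python sketch of a directory
that finds names through a hash of the case-folded name plus an equality
check. Everything in it is an assumption made for illustration: the
`Directory` class, `fold_fixed` (Python's built-in `str.upper()` standing in
for a fixed table) and `fold_turkish` (a hand-written imitation of Turkish
casing, where `i` uppercases to `İ`). None of this is real NTFS or Samba
code.

```python
# Illustrative sketch only; the fold functions are stand-ins, not real
# Windows, NTFS or Samba case tables.

def fold_fixed(name: str) -> str:
    """Locale-independent fold: the same result for every user and server."""
    return name.upper()


def fold_turkish(name: str) -> str:
    """Imitation of Turkish casing: 'i' -> U+0130, dotless U+0131 -> 'I'."""
    return name.replace("i", "\u0130").replace("\u0131", "I").upper()


class Directory:
    """Hypothetical directory keyed by the folded name (hash + equality)."""

    def __init__(self, fold):
        self._fold = fold
        self._entries = {}  # folded name -> name as stored on disk

    def create(self, name: str) -> None:
        key = self._fold(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name: str):
        # The dict hashes the folded key and then compares it for equality;
        # no "less than / greater than" comparison is involved.
        return self._entries.get(self._fold(name))

    def names(self):
        return list(self._entries.values())
```

With `Directory(fold_fixed)`, creating `file.txt` and then looking up
`FILE.TXT` finds the same entry. With `Directory(fold_turkish)` the lookup
misses and a second `FILE.TXT` can be created, which is the kind of
duplicate that would show up once the rules change.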

I think that NTFS implements the $UpCase table just for this purpose: It's
locale-independent and it's used just for equality, and this is independent
of the generic NLS-aware functions that Windows provides.
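
A rough sketch of what such a table could look like, assuming one 16-bit
entry per UTF-16 code unit. The table contents below come from Python's own
Unicode data, not from a real $UpCase file, and `build_upcase_table`,
`fold_units` and `names_equal` are hypothetical names used only here.

```python
import struct


def build_upcase_table():
    """Build a 65536-entry table: UTF-16 code unit -> upper-case code unit."""
    table = list(range(0x10000))  # default: every unit maps to itself
    for unit in range(0x10000):
        upper = chr(unit).upper()
        # Keep only one-to-one mappings that still fit in 16 bits.
        if len(upper) == 1 and ord(upper) < 0x10000:
            table[unit] = ord(upper)
    return table


UPCASE = build_upcase_table()


def fold_units(name: str):
    """Encode as UTF-16 and map every 16-bit unit through the table."""
    raw = name.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    return tuple(UPCASE[u] for u in units)


def names_equal(a: str, b: str) -> bool:
    """Equality only: no locale and no ordering involved."""
    return fold_units(a) == fold_units(b)
```

In this sketch `names_equal("caf\u00e9.txt", "CAF\u00c9.TXT")` is true,
while `stra\u00dfe` and `STRASSE` stay different because a one-to-one table
cannot expand one character into two. Characters outside the 16-bit plane
are left untouched, since their surrogate halves map to themselves.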

From what I understand (though I may be wrong), it seems like Samba is
using a mix of both things: it uses fixed tables to convert the string
case, which is locale-independent, but then it does relative comparisons
(i.e. greater/less than, instead of just equality). I don't know how NTFS
works exactly, and most of the information I've found is quite old, so
maybe I'm completely wrong here, but I think it makes sense to do
case-insensitive comparisons for a filesystem in the way I've explained,
and it would also explain why NTFS still has the $UpCase file.
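
A small sketch of why relative comparisons are more delicate than equality
when fixed tables are involved. `cmp_upper` and `cmp_lower` are hypothetical
helpers, not Samba functions, and they use Python's built-in case conversion
as a stand-in for the translation tables.

```python
def cmp_upper(a: str, b: str) -> int:
    """Three-way comparison after upper-casing both strings."""
    ua, ub = a.upper(), b.upper()
    return (ua > ub) - (ua < ub)


def cmp_lower(a: str, b: str) -> int:
    """Three-way comparison after lower-casing both strings."""
    la, lb = a.lower(), b.lower()
    return (la > lb) - (la < lb)
```

Both helpers agree that `Readme` and `README` are equal, but they disagree
on whether `a` sorts before or after `_`, because `_` (0x5F) lies between
`Z` (0x5A) and `a` (0x61). For pure equality the choice of table matters
much less than it does for ordering.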

Does this make any sense ?

Best regards,

Xavi