Doubts about Samba's unicode translation tables
Douglas Bagnall
douglas.bagnall at catalyst.net.nz
Tue Apr 23 01:56:41 UTC 2024
On 22/04/24 21:01, Xavi Hernandez wrote:
> I think we are dealing with two different things here. On one side we have
> locale-based case-insensitive comparisons. This is the most common situation for
> applications where they need to be able to compare two strings based on the
> specific rules for the user location, so that the result of the comparison
> yields what the user would expect. The same exact strings for another user in
> another location (with another locale) may have different rules for comparison
> and return a different result.
>
> On the other side we have case-insensitive NTFS file accesses. In this case the
> rules need to be a bit different. I see 2 major things to consider:
>
> 1. The comparison cannot be locale-related
>
> When a file is saved to the filesystem, it cannot depend on the locale of the
> user (or even the server) whether a file name is "equivalent" to another or not,
> because changing the locale can cause the appearance of duplicated files in a
> directory.
>
> 2. Only comparison for equality is required
>
> To find a file by name in a directory we just need to compare case-insensitively
> for equality (normally a hash is used to find the bucket where the file resides
> and then a case-insensitive comparison for equality is enough). We don't care
> about the relative order of the existing name and the name we are looking for.
> Another very different thing is, after having listed all directory entries, to
> sort them by name to show them to the user. This latter comparison depends on the
> locale and is made on the client side.
>
> I think that NTFS implements the $UpCase table just for this purpose: It's
> locale-independent and it's used just for equality, and this is independent of
> the generic NLS-aware functions that Windows provides.
>
> From what I understand (though I may be wrong), it seems like Samba is using a
> mix of both things: it uses fixed tables to convert the string case, which is
> locale-independent, but then it does relative comparisons (i.e. greater/less
> than, instead of just equality). I don't know how NTFS works exactly, and most
> of the information I've found is quite old, so maybe I'm completely wrong here,
> but I think it makes sense to do case-insensitive comparisons for a filesystem
> in the way I've explained, and it would also explain why NTFS still has the
> $UpCase file.
>
> Does this make any sense?
Yes. A sorting compare will give you equality (in a given locale), but it won't
give you a canonical version for hashing.
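
To make that distinction concrete, here is a minimal sketch in Python. It assumes a toy table standing in for a fixed, locale-independent table like $UpCase; the names UPCASE, canonical, compare_only, add and lookup are made up for illustration and are not how Samba or NTFS actually do it.

```python
# Hypothetical stand-in for a fixed upcase table. A real table such as
# $UpCase has an entry per 16-bit code unit; this one covers a-z and one
# accented letter, which is enough to show the idea.
UPCASE = {ord(c): ord(c.upper()) for c in "abcdefghijklmnopqrstuvwxyz"}
UPCASE[0xE9] = 0xC9  # e-acute -> E-acute

def canonical(name):
    """Map every code point through the table; unmapped ones pass through."""
    return name.translate(UPCASE)

def compare_only(a, b):
    """Three-way compare: answers equal/less/greater, but hands back no key."""
    ka, kb = canonical(a), canonical(b)
    return (ka > kb) - (ka < kb)

# A directory as a hash table keyed on the canonical form. Lookup needs
# only hashing plus equality, never ordering.
directory = {}

def add(name):
    directory[canonical(name)] = name

def lookup(name):
    return directory.get(canonical(name))

add("Readme.TXT")
add("caf\u00e9.txt")
```

The point is that lookup() needs canonical(), something that produces a value to hash. A function shaped like compare_only() can tell you two names are equal, but on its own it gives you nothing to pick a bucket with.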
In Samba we may conflate things because we are not just a remote NTFS, we are
also Active Directory and RPCs.
I am curious whether "Windows 8 Upper Case Mapping Table.txt" from
https://www.microsoft.com/en-us/download/details.aspx?id=10921
matches the $UpCase table you find, and whether that means we just have an old
one from win2k days. I don't see a change in Linux's fs/ntfs/upcase.c though, so
I suspect not.
Douglas