Doubts about Samba's unicode translation tables

Douglas Bagnall douglas.bagnall at catalyst.net.nz
Tue Apr 23 01:56:41 UTC 2024


On 22/04/24 21:01, Xavi Hernandez wrote:

> I think we are dealing with two different things here. On one side we have 
> locale-based case-insensitive comparisons. This is the most common situation for 
> applications where they need to be able to compare two strings based on the 
> specific rules for the user location, so that the result of the comparison 
> yields what the user would expect. The same exact strings for another user in 
> another location (with another locale) may have different rules for comparison 
> and return a different result.
> 
> On the other side we have case-insensitive NTFS file accesses. In this case the 
> rules need to be a bit different. I see 2 major things to consider:
> 
> 1. The comparison cannot be locale-related
> 
> When a file is saved to the filesystem, it cannot depend on the locale of the 
> user (or even the server) whether a file name is "equivalent" to another or not, 
> because changing the locale can cause the appearance of duplicated files in a 
> directory.
> 
> 2. Only comparison for equality is required
> 
> To find a file by name in a directory we just need to compare case-insensitively 
> for equality (normally a hash is used to find the bucket where the file resides 
> and then a case-insensitive comparison for equality is enough). We don't care 
> about the relative order of the existing name and the name we are looking for. 
> Another very different thing is, after having listed all directory entries, to 
> sort them by name to show them to the user. This latter comparison depends on the 
> locale and is made on the client side.
> 
> I think that NTFS implements the $UpCase table just for this purpose: It's 
> locale-independent and it's used just for equality, and this is independent of 
> the generic NLS-aware functions that Windows provides.
> 
> From what I understand (though I may be wrong), it seems like Samba is using a 
> mix of both things: it uses fixed tables to convert the string case, which is 
> locale-independent, but then it does relative comparisons (i.e. greater/less 
> than, instead of just equality). I don't know how NTFS works exactly, and most 
> of the information I've found is quite old, so maybe I'm completely wrong here, 
> but I think it makes sense to do case-insensitive comparisons for a filesystem 
> in the way I've explained, and it would also explain why NTFS still has the 
> $UpCase file.
> 
> Does this make any sense ?

Yes. A sorting compare will give you equality (in a given locale), but it won't
give you a canonical version for hashing.
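
To make the distinction concrete, here is a minimal sketch in Python (not Samba
or NTFS code). It assumes a tiny, made-up upcase table standing in for $UpCase;
the names UPCASE, canonical_key, names_equal and bucket are hypothetical and
chosen only for illustration. The point is that a fixed one-to-one table gives
a canonical form that can be both hashed and compared for equality, with no
locale involved.

```python
import zlib

# Toy stand-in for a fixed upcase table (assumption: the real $UpCase
# covers every 16-bit code unit; this one maps only a-z and U+00E9).
UPCASE = {ord(c): ord(c.upper()) for c in "abcdefghijklmnopqrstuvwxyz"}
UPCASE[0x00E9] = 0x00C9  # e-acute -> E-acute


def canonical_key(name: str) -> str:
    """Map each character through the fixed table; unmapped ones pass through."""
    return name.translate(UPCASE)


def names_equal(a: str, b: str) -> bool:
    """Locale-independent equality: compare canonical forms only."""
    return canonical_key(a) == canonical_key(b)


def bucket(name: str, nbuckets: int = 64) -> int:
    """Hash the canonical form so names equal under the table share a bucket."""
    return zlib.crc32(canonical_key(name).encode("utf-16-le")) % nbuckets
```

With this, names_equal("Readme.TXT", "README.txt") is true and both names land
in the same bucket. A locale-aware sort comparison could also tell you the two
are equal, but it offers no canonical string to feed to the hash.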

In Samba we may conflate things because we are not just a remote NTFS, we are
also Active Directory and RPCs.

I am curious whether "Windows 8 Upper Case Mapping Table.txt" from

>>     On https://www.microsoft.com/en-us/download/details.aspx?id=10921

matches the $UpCase table you find, and whether that means we just have an old
one from win2k days. I don't see a change in Linux's fs/ntfs/upcase.c though, so
I suspect not.

Douglas




More information about the samba-technical mailing list