Doubts about Samba's unicode translation tables
Xavi Hernandez
xhernandez at gmail.com
Tue Apr 23 08:39:55 UTC 2024
Hi Douglas,
On Tue, Apr 23, 2024 at 3:56 AM Douglas Bagnall <douglas.bagnall at catalyst.net.nz> wrote:
> On 22/04/24 21:01, Xavi Hernandez wrote:
>
> > I think we are dealing with two different things here. On one side we have
> > locale-based case-insensitive comparisons. This is the most common situation
> > for applications, which need to compare two strings based on the specific
> > rules for the user's location, so that the result of the comparison is what
> > the user would expect. The exact same strings, for another user in another
> > location (with another locale), may be subject to different comparison rules
> > and return a different result.
> >
> > On the other side we have case-insensitive NTFS file accesses. In this case
> > the rules need to be a bit different. I see 2 major things to consider:
> >
> > 1. The comparison cannot be locale-related
> >
> > When a file is saved to the filesystem, whether its name is "equivalent" to
> > another one cannot depend on the locale of the user (or even the server),
> > because changing the locale could then make duplicated files appear in a
> > directory.
> >
> > 2. Only comparison for equality is required
> >
> > To find a file by name in a directory we just need to compare
> > case-insensitively for equality (normally a hash is used to find the bucket
> > where the file resides, and then a case-insensitive comparison for equality
> > is enough). We don't care about the relative order of the existing name and
> > the name we are looking for. A very different thing is sorting the directory
> > entries by name, after having listed them all, to show them to the user.
> > This latter comparison depends on the locale and is made on the client side.
> >
> > I think that NTFS implements the $UpCase table just for this purpose: it's
> > locale-independent and it's used just for equality, and this is independent
> > of the generic NLS-aware functions that Windows provides.
> >
> > From what I understand (though I may be wrong), it seems like Samba is using
> > a mix of both things: it uses fixed tables to convert the string case, which
> > is locale-independent, but then it does relative comparisons (i.e.
> > greater/less than, instead of just equality). I don't know how NTFS works
> > exactly, and most of the information I've found is quite old, so maybe I'm
> > completely wrong here, but I think it makes sense to do case-insensitive
> > comparisons for a filesystem in the way I've explained, and it would also
> > explain why NTFS still has the $UpCase file.
> >
> > Does this make any sense ?
>
> Yes. A sorting compare will give you equality (in a given locale), but it
> won't give you a canonical version for hashing.
>
> In Samba we may conflate things because we are not just a remote NTFS, we are
> also Active Directory and RPCs.
>
Yes. I understand that some string comparisons need to return relative
order, but I'm wondering whether we should use a specific, simpler
comparison function just for file names.
> I am curious whether "Windows 8 Upper Case Mapping Table.txt" from
>
> >> On https://www.microsoft.com/en-us/download/details.aspx?id=10921
>
> matches the $UpCase table you find, and whether that means we just have an old
> one from win2k days. I don't see a change in Linux's fs/ntfs/upcase.c though,
> so I suspect not.
>
I've done a bit more research. Actually, the kernel ntfs driver doesn't
generate the upcase table; it just loads it from the $UpCase file in the
NTFS filesystem and uses it for filename comparisons. The comparison
function uses the table to convert both strings to uppercase (maybe not
strictly uppercase, but a canonical value) and compares them. Nothing else.
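The table-driven comparison described above can be sketched roughly as follows. This is only an illustration in Python, not the kernel's or Samba's code: `build_demo_upcase` is a hypothetical stand-in that derives a 65536-entry table from Python's own `str.upper()`, so its contents will not match any real $UpCase file, and all function names are invented for the example.

```python
import struct

def build_demo_upcase():
    """Hypothetical stand-in for a real $UpCase table (NOT the NTFS data).

    One entry per UTF-16 code unit. An entry differs from the identity
    only where str.upper() maps a single BMP character to a single BMP
    character; everything else maps to itself.
    """
    table = list(range(0x10000))
    for cu in range(0x10000):
        if 0xD800 <= cu <= 0xDFFF:
            continue  # surrogate code units stay as they are
        up = chr(cu).upper()
        if len(up) == 1 and ord(up) < 0x10000:
            table[cu] = ord(up)
    return table

def utf16_units(name):
    """Split a name into its UTF-16 code units."""
    data = name.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def canonical(name, upcase):
    """Map every code unit through the table: a locale-independent key."""
    return tuple(upcase[u] for u in utf16_units(name))

def names_equal(a, b, upcase):
    """Equality only: no notion of greater/less than is needed."""
    return canonical(a, upcase) == canonical(b, upcase)

def name_hash(name, upcase):
    """Names that compare equal land in the same hash bucket."""
    return hash(canonical(name, upcase))
```

With such a table, `names_equal("README.txt", "readme.TXT", table)` holds, while "straße" and "STRASSE" stay distinct because the mapping is strictly one code unit to one code unit. The same canonical key serves both the hash lookup and the equality check.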
I've looked at the code that creates NTFS filesystems (mkfs.ntfs in the
ntfsprogs package) and I've seen that it supports 3 different upcase tables
for 3 different Windows versions. I've extracted all 3 tables from
ntfsprogs (winxp, vista, win7), the table from the "Windows 8 Upper Case
Mapping Table.txt" file (win8), the table from the Samba code (samba), and
the table from a Windows 11 machine (win11).
What I've seen is that win7, win8 and win11 are identical, vista is
different from all the others, and winxp and samba are equal.
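A comparison of this kind can be reproduced with a few lines of Python. The sketch below assumes each table has already been loaded as a sequence of integers indexed by UTF-16 code unit; the function names are invented for the example and no real table data is included.

```python
def diff_tables(a, b):
    """Return (code_unit, a_value, b_value) for every entry that differs."""
    if len(a) != len(b):
        raise ValueError("tables have different sizes")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

def group_identical(tables):
    """Group table names whose contents are exactly the same.

    `tables` maps a label (for example a Windows version) to its table.
    """
    groups = {}
    for name, table in tables.items():
        groups.setdefault(tuple(table), []).append(name)
    return sorted(sorted(names) for names in groups.values())
```

Calling `group_identical` on a dictionary of the six tables would return one list of labels per distinct table, and `diff_tables` on any two of them lists the code units where they disagree.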
The ntfsprogs package also has code to generate a lowcase table. I
generated the lowcase table for winxp and compared it to the lowcase_table
from Samba. They are equal.
So it seems that Samba is using Windows XP tables.
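For reference, one plausible way to derive a lowcase table from an upcase table is to invert the mapping. The sketch below is an assumption about the general idea and is not copied from the ntfsprogs code; `ascii_upcase` is a toy ASCII-only table used purely for illustration.

```python
def build_lowcase(upcase):
    """Invert an upcase table (illustrative sketch, not the ntfsprogs code).

    Start from the identity, then for every code unit whose uppercase
    form differs from itself, record it as the lowercase form of that
    uppercase value. If several code units share the same uppercase
    form, the highest one wins in this sketch.
    """
    lowcase = list(range(len(upcase)))
    for cu, up in enumerate(upcase):
        if up != cu and up < len(upcase):
            lowcase[up] = cu
    return lowcase

def ascii_upcase():
    """Toy 128-entry table: only a-z are mapped to A-Z."""
    table = list(range(128))
    for cu in range(ord("a"), ord("z") + 1):
        table[cu] = cu - 32
    return table
```

With the toy table, `build_lowcase(ascii_upcase())` maps "A" to "a" and leaves digits, punctuation and lowercase letters unchanged.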
Some questions:
1. Should we update the table to the latest one (win8)?
2. Should we support different tables and make it configurable?
3. Should we dynamically load the table from the shared filesystem itself
   (similar to accessing an existing NTFS)?
4. Should we differentiate regular case-insensitive comparison from filename
   comparison?
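To make the dynamic-loading option more concrete, here is a minimal sketch of reading a table at run time. It assumes the on-disk format is 65536 little-endian 16-bit entries (128 KiB), one per UTF-16 code unit. The function names and the idea of passing a file path are hypothetical and do not correspond to any existing Samba interface.

```python
import struct

UPCASE_ENTRIES = 0x10000  # one entry per UTF-16 code unit

def parse_upcase(data):
    """Decode raw table bytes into a tuple of 65536 integers."""
    if len(data) != UPCASE_ENTRIES * 2:
        raise ValueError("unexpected upcase table size: %d bytes" % len(data))
    return struct.unpack("<%dH" % UPCASE_ENTRIES, data)

def load_upcase(path):
    """Read a table from a file (the path is supplied by the caller)."""
    with open(path, "rb") as f:
        return parse_upcase(f.read())
```

Rejecting any file of the wrong size keeps a truncated or corrupt table from silently changing which names compare equal.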
Thanks,
Xavi