smbtorture4 base.charset test broken by confusion about the meaning of UCS-2 Surrogate code pairs

Richard Sharpe realrichardsharpe at gmail.com
Tue Nov 6 10:29:30 MST 2012


Hi folks,

The Unicode FAQ (http://www.unicode.org/faq/basic_q.html#13) asks:

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to
distinguish surrogate code points (in the range U+D800..U+DFFF) from
supplementary code points (in the completely different range,
U+10000..U+10FFFF). Surrogate code points are reserved for use, in
pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters
represented with a single supplementary code point), but there are not
and will never be surrogate characters (i.e. encoded characters
represented with a single surrogate code point).
------------------------------
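
To make the FAQ's distinction concrete, here is a minimal sketch of how
UTF-16 maps one supplementary code point onto a pair of surrogate code
units. The function names are mine, for illustration only; none of this
is Samba code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
static bool is_high_surrogate(uint16_t u)
{
        return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
static bool is_low_surrogate(uint16_t u)
{
        return u >= 0xDC00 && u <= 0xDFFF;
}

/*
  Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16
  surrogate pair. Returns false, leaving *hi and *lo untouched, if cp
  is outside that range.
*/
static bool encode_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
        if (cp < 0x10000 || cp > 0x10FFFF) {
                return false;
        }
        cp -= 0x10000;                          /* 20 bits remain */
        *hi = (uint16_t)(0xD800 | (cp >> 10));  /* top 10 bits */
        *lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* bottom 10 bits */
        return true;
}

/* Inverse mapping; the caller must pass a valid high/low pair. */
static uint32_t decode_surrogate_pair(uint16_t hi, uint16_t lo)
{
        return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                          (uint32_t)(lo - 0xDC00));
}
```

With these definitions, U+10000 encodes to the pair 0xD800 0xDC00, which
is only meaningful as a pair. Neither half is a character on its own.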

OK, so when you see code like this in source4/torture/basic/charset.c:

/*
  see if the server recognises a partial surrogate pair
*/
static bool test_surrogate(struct torture_context *tctx,
                           struct smbcli_state *cli)
{
        const uint32_t name1[] = {0xd800};
        const uint32_t name2[] = {0xdc00};
        const uint32_t name3[] = {0xd800, 0xdc00};
        NTSTATUS status;

        torture_assert(tctx, torture_setup_dir(cli, BASEDIR),
                       "setting up basedir");

        status = unicode_open(tctx, cli->tree, tctx,
                              NTCREATEX_DISP_CREATE, name1, 1);

you know that the intent of the test is to address the above confusion.

However, unicode_open tries to convert the incoming UCS-2 name into
CH_UNIX (probably UTF-8), and that conversion fails because name1 and
name2 each consist of a single unpaired surrogate, which is not a
legal UCS-2 name.
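
As a rough illustration of why a strict converter rejects those names,
the sketch below checks a UTF-16 sequence for unpaired surrogates.
utf16_is_well_formed is a hypothetical helper of my own, not a Samba
function, and it takes 16-bit code units rather than the uint32_t
arrays the test uses. I am assuming the real conversion applies an
equivalent rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
  Hypothetical helper (not part of Samba): returns true if the n
  UTF-16 code units in s are well formed, i.e. every high surrogate
  is immediately followed by a low surrogate and no low surrogate
  appears on its own.
*/
static bool utf16_is_well_formed(const uint16_t *s, size_t n)
{
        size_t i = 0;

        while (i < n) {
                uint16_t u = s[i];

                if (u >= 0xD800 && u <= 0xDBFF) {
                        /* high surrogate: needs a low one right after */
                        if (i + 1 >= n ||
                            s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                                return false;
                        }
                        i += 2;
                } else if (u >= 0xDC00 && u <= 0xDFFF) {
                        /* low surrogate with no preceding high one */
                        return false;
                } else {
                        i++;
                }
        }
        return true;
}
```

Fed the three names from the test, this accepts only the {0xd800,
0xdc00} pair and rejects both single-surrogate names.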

Thus the test is broken, it seems to me. It should either not use
unicode_open, or unicode_open should take a flag signalling that a
deliberately ill-formed Unicode fragment is being passed. However,
there seems to be nothing in the code that lets us convey, all the
way down, that the string is already Unicode (UCS-2) encoded.

A second part of the test also fails, but for a different reason:
Samba 3.6.x does not seem to understand that the UCS-2 character
wide-a is the lower-case version of the UCS-2 character wide-A.

-- 
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)


More information about the samba-technical mailing list