Help Please

Andrew Tridgell tridge at samba.org
Mon May 21 14:55:23 GMT 2001


> As long as you quit putting Reply-to:, I don't care how.

tough

> "Full" you say. Sounds good, but did you check what Unicode v3 says?

The SMB protocol only does UCS2. We need to support 3 string formats:

1) client codepage (what we now call ascii)
2) UCS2
3) OS (filesystem) format

> Aren't you simply thinking that "With UTF8, we will have ASCII code
> ASCII, so nothing is need to be worried"?!

no, I am not just thinking that. 

Heck, why don't you just keep assuming I'm a complete idiot, then we
will get on really well.

> If you really know about Unicode, you must not have selected UTF-8
> as internal code. Not UCS2 itself too. You should be using 32bit
> fixed variable which each word points one character, to treat
> multi-word characters.

so, tell me how you make SMB do anything but UCS2 or client
codepage. Or maybe you're thinking of some other network protocol.

> There's no such thing as "free software development model" that you
> think there is.

gee, I must have been dreaming the last 10 years of my life.

As I've said before, patches are welcome, specific design docs are
welcome, bug reports are welcome, prototypes are welcome. Random
ramblings on "the design is no good, let's stop for 6 months" aren't.


Now to try to get things back to civil conversation, here is a rough
outline of what Iverg and I are implementing for
internationalisation. Criticisms are welcome, unless they take the
form "you don't know what you are doing, leave it to the experts".

Basics: We will have 4 string formats. They are called "wire",
"internal", "unicode (ucs2)" and "os". Wire format is determined on a
per-packet basis, so the parser needs to know what it is. Internal format
will initially be utf8, but read on for the long-term plan. ucs2 is as
used by MS in the main SMB protocol (i.e. Intel byte order). "os" format is
whatever the OS uses for the filesystem.
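To make the four formats concrete, here is a rough sketch in C. The enum and function names are made up for illustration and are not taken from the Samba tree. It assumes the wire format of a packet is chosen by the Unicode-strings bit (0x8000) in the Flags2 field of the SMB header, and it ignores the special cases where a string is ASCII regardless of that bit.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; "wire" resolves to one of the first two. */
enum str_format {
    STR_CODEPAGE,   /* wire: client codepage (what we now call ascii) */
    STR_UCS2,       /* wire or unicode: little-endian UCS-2 */
    STR_INTERNAL,   /* internal: UTF-8 for now, UCS-2 in the long term */
    STR_OS          /* whatever the OS uses for the filesystem */
};

/* Unicode-strings bit in the Flags2 field of the SMB header. */
#define FLAGS2_UNICODE_STRINGS 0x8000

/* Decide the wire format once per packet, in the parser. */
static enum str_format wire_format_for_packet(uint16_t flags2)
{
    return (flags2 & FLAGS2_UNICODE_STRINGS) ? STR_UCS2 : STR_CODEPAGE;
}
```

The point of the sketch is only that the decision is made in one place, from the header, before any string is handed to the rest of the code.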

Step 1: change all parsing code to use a single set of functions to
convert from wire string format to internal string format.

Step 2: change all vfs functions to accept internal format and convert to
"os" format before passing to the OS.

Step 3: use iconv for string conversion in wire converter and vfs

Step 4: Remove all unix_to_dos() and dos_to_unix() calls.

Step 5: Write utf8<->ucs2 conversion functions (initially using iconv)

Step 6: In functions that depend on character size (mostly wildcard
code) convert to ucs2 on entry to the function and convert back to
utf8 on exit. This will be *slow*.

Step 7: Incrementally convert more functions to use ucs2 internally,
with ucs2<->utf8 conversion on entry and exit of function

Step 8: When two functions that have been converted call each other
they can pass ucs2 direct, bypassing the conversion

Step 9: Initially "islands" of ucs2 appear in the code (first in
wildcard code) then these islands spread. When they cover most of the
code, we change internal format to being ucs2, and instead do
ucs2<->utf8 conversion only on those functions that are not yet
converted

Step 10: We are now completely ucs2 converted. Party.
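Steps 5 and 6 might look roughly like the following. This is a sketch, not the planned Samba code: utf8_to_ucs2(), match_ucs2() and wildcard_match_utf8() are hypothetical names, the buffers are fixed at 256 units, and the matcher only knows '*' and '?'. It assumes the platform iconv accepts the encoding names "UTF-8" and "UCS-2LE". Because the matcher returns a yes/no answer there is nothing to convert back on exit; a function returning a string would need the reverse conversion.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Step 5: UTF-8 -> UCS-2 using iconv. Returns the number of 16-bit
 * units written (no terminator), or -1 on any conversion error. */
static int utf8_to_ucs2(const char *src, uint16_t *dst, size_t dst_units)
{
    assert(src != NULL && dst != NULL);

    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;
    size_t in_left = strlen(src);
    char *out = (char *)dst;
    size_t out_left = dst_units * sizeof(uint16_t);

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    size_t units = dst_units - out_left / sizeof(uint16_t);
    /* iconv wrote little-endian bytes; turn them into host-order units. */
    const unsigned char *bytes = (const unsigned char *)dst;
    for (size_t i = 0; i < units; i++)
        dst[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
    return (int)units;
}

/* Simplified matcher on fixed-width units: '*' matches any run of
 * characters, '?' matches exactly one character. */
static int match_ucs2(const uint16_t *pat, size_t plen,
                      const uint16_t *name, size_t nlen)
{
    if (plen == 0)
        return nlen == 0;
    if (pat[0] == '*')
        return match_ucs2(pat + 1, plen - 1, name, nlen) ||
               (nlen > 0 && match_ucs2(pat, plen, name + 1, nlen - 1));
    if (nlen > 0 && (pat[0] == '?' || pat[0] == name[0]))
        return match_ucs2(pat + 1, plen - 1, name + 1, nlen - 1);
    return 0;
}

/* Step 6: internal (UTF-8) strings in, convert on entry, then work on
 * one-unit-per-character data. The conversion is the slow part. */
static int wildcard_match_utf8(const char *pattern, const char *name)
{
    uint16_t p[256], n[256];
    int plen = utf8_to_ucs2(pattern, p, 256);
    int nlen = utf8_to_ucs2(name, n, 256);
    if (plen < 0 || nlen < 0)
        return 0;
    return match_ucs2(p, (size_t)plen, n, (size_t)nlen);
}
```

With this shape, a '?' in the pattern matches one accented character even though that character is two bytes in utf8, which is exactly what a byte-at-a-time matcher gets wrong.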

So what has been done? I have done most of Step 1 in head already (see
srvstr_*() functions). These functions need to be made to use
iconv and I need to change lots of strcpy() calls in the pure ascii
code to use srvstr_push_ascii() and srvstr_pull_ascii().

Iverg has been looking at steps 2 and 3. The rest need doing when the
time comes.

The big disadvantage of this plan is that until step 9 we will be
making smbd *much* slower. Luckily this can be offset by doing some of
step 8 on timing critical paths. We gain back all our speed on step 9.

Step 4 is also nasty, but necessary. It will initially break
everything but English. To make this work I plan on inventing a new
"nasty" string format which will be deliberately broken for English,
and add that to iconv. Then for testing we can use that format
internally. That way we can test multi-byte functionality without
learning Japanese.
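One possible shape for such a "nasty" format, purely as an illustration: the encoding below is invented for this sketch (the lead byte 0x81 and the function names are arbitrary) and it is a standalone codec, not an iconv module. Every ASCII letter becomes two bytes, so any code that assumes one byte per character breaks on plain English test data.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NASTY_LEAD 0x81  /* arbitrary lead byte chosen for this sketch */

/* Encode NUL-terminated 7-bit ASCII: each letter becomes the lead byte
 * followed by the letter with its high bit set; everything else is
 * copied unchanged. Returns encoded length in bytes, or -1 on error. */
static int nasty_encode(const char *src, unsigned char *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL);
    size_t o = 0;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c >= 0x80)
            return -1;              /* input must be plain ASCII */
        if (isalpha(c)) {
            if (o + 2 > dst_len)
                return -1;
            dst[o++] = NASTY_LEAD;
            dst[o++] = (unsigned char)(c | 0x80);
        } else {
            if (o + 1 > dst_len)
                return -1;
            dst[o++] = c;
        }
    }
    return (int)o;
}

/* Decode back to NUL-terminated ASCII. Returns the decoded length
 * (excluding the NUL), or -1 on malformed input or a short buffer. */
static int nasty_decode(const unsigned char *src, size_t src_len,
                        char *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL);
    if (dst_len == 0)
        return -1;
    size_t o = 0;
    for (size_t i = 0; i < src_len; i++) {
        unsigned char c = src[i];
        if (c == NASTY_LEAD) {
            if (i + 1 >= src_len || !(src[i + 1] & 0x80))
                return -1;          /* truncated or bad trail byte */
            c = src[++i] & 0x7F;
        } else if (c >= 0x80) {
            return -1;              /* stray high byte */
        }
        if (o + 1 >= dst_len)
            return -1;              /* keep room for the NUL */
        dst[o++] = (char)c;
    }
    dst[o] = '\0';
    return (int)o;
}
```

For example, the 5-character name "a.TXT" encodes to 9 bytes, so length and offset bugs show up immediately.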

The big advantage of this plan is that the only code that needs to
know about client string formats and the absolute mess that SMB makes
of them is the parsing code, and that can just call srvstr_*() with
the right flags to get the work done. The "logic" code in smbd (which
is most of the code) only has to deal with the internal formats. We no
longer have the problem of "I wonder what string format this char* is
in?" that we have now.
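As a sketch of that division of labour, the parsing code could call one pull function and pass a flag saying what the wire format is. The function and flag names below are hypothetical and are not the actual srvstr_*() interface. To stay self-contained it converts ucs2 to utf8 by hand (rejecting surrogates) and, as a loud simplification, accepts only 7-bit ASCII on the codepage path where real code would go through a codepage table or iconv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flags, not the real srvstr_*() ones. */
#define PULL_CODEPAGE 0x1  /* source is in the client codepage */
#define PULL_UCS2     0x2  /* source is little-endian UCS-2 */

/* Pull a wire string into internal (UTF-8) format. src_len is in bytes.
 * Stops at a terminator or at src_len. Returns bytes written excluding
 * the NUL, or -1 on bad input or a short destination buffer. */
static int demo_pull_string(char *dst, size_t dst_len,
                            const uint8_t *src, size_t src_len, int flags)
{
    assert(dst != NULL && src != NULL);
    size_t o = 0;
    if (dst_len == 0)
        return -1;

    if (flags & PULL_UCS2) {
        for (size_t i = 0; i + 1 < src_len; i += 2) {
            uint16_t u = (uint16_t)(src[i] | (src[i + 1] << 8));
            if (u == 0)
                break;                          /* terminator */
            if (u >= 0xD800 && u <= 0xDFFF)
                return -1;                      /* not valid UCS-2 */
            if (u < 0x80) {
                if (o + 1 >= dst_len) return -1;
                dst[o++] = (char)u;
            } else if (u < 0x800) {
                if (o + 2 >= dst_len) return -1;
                dst[o++] = (char)(0xC0 | (u >> 6));
                dst[o++] = (char)(0x80 | (u & 0x3F));
            } else {
                if (o + 3 >= dst_len) return -1;
                dst[o++] = (char)(0xE0 | (u >> 12));
                dst[o++] = (char)(0x80 | ((u >> 6) & 0x3F));
                dst[o++] = (char)(0x80 | (u & 0x3F));
            }
        }
    } else {
        /* Assumption: 7-bit ASCII only; a real codepage needs a table. */
        for (size_t i = 0; i < src_len && src[i] != 0; i++) {
            if (src[i] >= 0x80) return -1;
            if (o + 1 >= dst_len) return -1;
            dst[o++] = (char)src[i];
        }
    }
    dst[o] = '\0';
    return (int)o;
}
```

Everything after this call sees only the internal format, so the logic code never has to ask what a given char* contains.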

Up to now we have handled internationalisation by adding dos_to_unix()
and unix_to_dos() calls wherever we found that something was
broken. That is what led to our major string handling problem as we
mixed the string format handling into the main smb logic. You can
also never get it right that way, because with SMB the format a string
is in is determined by bits in the packet header, which you don't
have access to when you are deep in some utility function. That's why
the decision must be moved to the parsing code.

Cheers, Tridge



