i18n question.

Andrew Bartlett abartlet at samba.org
Sat Mar 6 03:18:42 GMT 2004


On Sat, 2004-03-06 at 13:52, Kenichi Okuyama wrote:
> Dear Michael,
> 
> >>>>> "Michael" == Andrew Bartlett <abartlet at samba.org> writes:
> Michael> The problem is, this isn't java - so UCS2/UTF16 is out.  We have to
> Michael> operate in an environment of multibyte 'C' strings.  We can't do a UTF16
> Michael> -> UTF8 conversion every time we call stat().  That happens a *lot*...
> 
> I'd like to point out one thing, then ask questions.
> 
> Point: UTF16 is not UCS2. What we really need is not UTF16->UTF8,
>        but UCS2->UTF8 ( and vice versa, of course ).

MS has redefined their use of UCS2 to be UTF16.  I'm not sure how well
they have done that, but their current interface documentation claims to
be using UTF16.
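
To make the UCS2/UTF16 difference concrete, here is a minimal sketch in C.
The helper name utf16_count_codepoints() is hypothetical and is not a Samba
function. It only shows that UTF16 may use two 16-bit units (a surrogate
pair) for one character, so "one unit per character" holds for UCS2 but not
for UTF16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count Unicode code points in a buffer of UTF-16 code units.
 * Hypothetical helper for illustration only.
 *
 * A high surrogate (0xD800-0xDBFF) followed by a low surrogate
 * (0xDC00-0xDFFF) encodes a single code point in two units.
 * Unpaired surrogates are simply counted as one code point each.
 */
size_t utf16_count_codepoints(const uint16_t *units, size_t n_units)
{
    size_t count = 0;
    size_t i = 0;

    assert(units != NULL || n_units == 0);

    while (i < n_units) {
        uint16_t u = units[i];

        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n_units &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i += 2;  /* surrogate pair: two units, one character */
        } else {
            i += 1;  /* BMP character: one unit, one character */
        }
        count++;
    }
    return count;
}
```

For example, the three units 0x0061 0xD834 0xDD1E ('a' followed by U+1D11E)
are two characters, so code that assumes a fixed 16 bits per character
miscounts them.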

> Questions:
> Q1) Doesn't that just mean we need a conversion cache?
>     Conversion between UTF8<->UCS2 will not take time if we know
>     what to use. I thought in old 2.2.8 or somewhere, we used to
>     have this conversion cache table which worked quite fast.
> 
>     We do call stat() many times, but we call stat() against "same
>     string" many times.

I think this is inventing a solution for a non-problem.  Why make every
single system call need to consult a character conversion cache, when we
can instead convert to UTF8 at the wire interfaces?  The issues that
result inside samba from the fact that we are using UTF8 are minor in
comparison.
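
As a rough sketch of what "convert at the wire interface" could look like:
the name is converted from UTF-16LE to UTF8 once, when it arrives, and
everything after that (including stat()) sees an ordinary C string.  The
function names, the 4096-byte buffer and the error convention below are
assumptions made for this example, not Samba's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Convert UTF-16LE wire bytes to a NUL-terminated UTF-8 C string.
 * Conversion stops at a U+0000 terminator if one is present.
 * Returns the number of UTF-8 bytes written (excluding the NUL),
 * or -1 on malformed input or if the output buffer is too small.
 */
long wire_utf16le_to_utf8(const unsigned char *in, size_t in_len,
                          char *out, size_t out_size)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (in_len % 2 != 0)
        return -1;

    while (i < in_len) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        size_t need;

        i += 2;
        if (cp == 0)
            break;                        /* wire terminator */
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            uint32_t lo;

            if (i + 1 >= in_len)
                return -1;                /* truncated surrogate pair */
            lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;                /* high surrogate without low */
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                    /* stray low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= out_size)
            return -1;                    /* keep room for the NUL */

        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (o >= out_size)
        return -1;
    out[o] = '\0';
    return (long)o;
}

/* Convert once at the boundary, then hand a plain C string to stat(). */
int stat_wire_name(const unsigned char *name, size_t name_len,
                   struct stat *st)
{
    char path[4096];  /* assumed limit for this sketch */

    if (wire_utf16le_to_utf8(name, name_len, path, sizeof(path)) < 0)
        return -1;
    return stat(path, st);
}
```

With this shape, the conversion cost is paid once per request, and any later
stat() on the already-converted path needs no conversion or cache lookup.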

> Q2) I don't see what you mean by "skip UCS2 because this isn't
>     java".

Java's 'system call' interface is natively Unicode, which means you
don't need to think about this nearly as much.

>     UCS2 is, for Windows, 16bit ushort per word, 1 word per
>     character encoding. We do not need to worry about Multi-Byte
>     ( which means you will not know where THE NEXT character is
>       until you really scan the string ).
>     Once any string is converted to UCS2, we can treat them just
>     like ascii, except that we do need to care for 16bit length.

Except that all the basic system functions, from getpwnam() to printf(),
expect byte-oriented strings, not UCS2 input.

> Q3) Wasn't UCS2 part of 'C' string from ANSI-C?
>     Or are you saying " 'C' string " in meaning of old K&R ?

A 'C' string is a string that is NUL-terminated, with no intermediate
NUL bytes, such as you can pass to printf().  UCS2 is not compatible
with that, as every second byte will often be NUL (it always is for
ASCII-range characters), terminating the string early.
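
A small sketch of the problem, assuming little-endian UCS2 as used on the
wire.  The buffer and the helper names are made up for this example.  Any
byte-oriented function such as strlen() or printf("%s") stops at the first
zero byte, which for ASCII text is the second byte of the first character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "abc" encoded as little-endian UCS2, with a 16-bit terminator. */
const unsigned char abc_ucs2le[] = {
    'a', 0x00, 'b', 0x00, 'c', 0x00, 0x00, 0x00
};

/* Length as seen by byte-oriented C string functions. */
size_t c_string_view_length(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}

/* Length in 16-bit characters, scanning for the 16-bit terminator. */
size_t ucs2le_length(const unsigned char *bytes)
{
    size_t n = 0;

    assert(bytes != NULL);
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)
        n++;
    return n;
}
```

Here c_string_view_length(abc_ucs2le) is 1 while ucs2le_length(abc_ucs2le)
is 3: the C library sees only "a".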

Andrew Bartlett

-- 
Andrew Bartlett                                 abartlet at pcug.org.au
Manager, Authentication Subsystems, Samba Team  abartlet at samba.org
Student Network Administrator, Hawker College   abartlet at hawkerc.net
http://samba.org     http://build.samba.org     http://hawkerc.net


More information about the samba-technical mailing list