i18n question.

Mon Mar 8 04:29:22 GMT 2004

Hi list,

Wow, while I've been away over the weekend, there seems to have been a
lot of discussion about internationalisation here. It took me a while
just to read through it all, and I hope I haven't missed too
much :)

Let me clarify a few points as I understand them.
`unix charset' has to be ASCII-compatible, because C string handling
treats a single `\0' byte as the only terminating code. Also, if we fix
`unix charset' to UCS2 or UTF-16, then we have to change the values of
all existing variables to terminate in the Unicode null equivalent,
and that transition would cost us a lot. We would also have to ensure
that all future code follows this newly-set rule.
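
To make the terminator point concrete, here is a small sketch in Python
(chosen only for brevity, it is not Samba code). The helper c_strlen is
hypothetical and merely mimics how a C strlen() would see a buffer.
UTF-16LE stands in for UCS2 here, since the two agree for these
characters.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): count bytes up to the first 0x00 byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

name = "abc"
utf8 = name.encode("utf-8")       # b'abc'
utf16 = name.encode("utf-16-le")  # b'a\x00b\x00c\x00'

print(c_strlen(utf8))   # 3: no embedded zero bytes
print(c_strlen(utf16))  # 1: the high byte of 'a' looks like a terminator
```

Any byte-oriented routine would therefore cut a UCS2 string short after
the first ASCII character unless it were rewritten.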

But if we have a variable unix charset, then we face the problem of
manipulating multibyte characters with code written for ASCII. Together
with the legacy and interoperability issues we have accumulated over the
years, it is not realistic to use UTF-8 in our file systems.

> 
> I'd like to propose a compromise instead. In Samba4 we have a much
> cleaner separation between frontend and backend than we have in
> Samba3. This separation is achieved via the NTVFS layer. I would like
> to propose that we do this:
> 
>  *) assume "internal charset" == "unix charset", like we do now
> 
>  *) build a "charset translation" NTVFS module that can be used in
>   those less common cases where you wish to use a different charset
>   for some shares.
> 
>  *) the "charset translation" module would be very small, and would
>   take one parametric parameter per instance. That parameter would say
>   what charset to translate to. 
> 
>  *) the module would be a pass-thru module, so you would configure it
>   along with any other modules you define for the share, and it would
>   filter requests on the way through (just like a anti-virus or audit
>   module).
> 
>  *) the module would have a performance penalty, but that penalty
>   would only be paid by shares that use a charset that is not the same
>   as the global unix charset for the server. You can use whatever
>   fancy cache schemes you like to try to reduce this performance
>   penalty if you think it is worthwhile.
> 
> So we would have a share something like this:
> 
> [legacy]
> 	ntvfs module = charset-translate
> 	translate:charset = EUC
> 	path = /legacy
> 
> Does that sound OK?
> 
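
If I read the proposal correctly, the translation step amounts to
re-encoding each name as it passes through the module. A minimal sketch
in Python, not the NTVFS interface itself: the function names are
hypothetical, and I am assuming that the global unix charset is UTF-8
and that "EUC" in the example share means EUC-JP.

```python
UNIX_CHARSET = "utf-8"     # assumption: global unix charset
SHARE_CHARSET = "euc_jp"   # assumption: "translate:charset = EUC" means EUC-JP

def to_share(name: bytes) -> bytes:
    """Re-encode a name from the unix charset to the share's charset."""
    return name.decode(UNIX_CHARSET).encode(SHARE_CHARSET)

def from_share(name: bytes) -> bytes:
    """Re-encode a name from the share's charset back to the unix charset."""
    return name.decode(SHARE_CHARSET).encode(UNIX_CHARSET)

internal = "日本語.txt".encode(UNIX_CHARSET)
on_disk = to_share(internal)

# A usable module must round-trip every name without loss.
print(from_share(on_disk) == internal)
```

Each request touching a name pays for one such conversion in each
direction, which is the penalty the proposal mentions.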

Provided that iconv correctly supports conversion between UTF-8 and the
unix charset encodings, this compromise could be our common ground. But
I can agree with it only if it addresses the following problem.

Under the current implementation, multibyte string manipulation is
done in UCS2. Whenever string standardisation, comparison or
substitution is necessary, the function first assumes the string is
ASCII. When it meets a non-ASCII code along the way, it throws away
the work done so far, converts the entire string into UCS2, performs
the string operation in UCS2, and converts back to unix charset.
I haven't done any performance testing yet, but it is certainly a
slow operation. This is one of the biggest reasons we want to fix the
internal codeset to UCS2, as it can manipulate strings in a
consistent way, regardless of whether the character set is multibyte
or not.

At the end of the day, we want as few conversions as possible. If
your compromise assumes that the current two-step string manipulation,
with its fast path and (very) slow path, stays as is, then it adds
one extra conversion (plus the overhead of calling the VFS module), and
I don't believe that solves the root of this problem.

Regards,
--
Shiro Yamada
shiro at miraclelinux.com