i18n question.

Tue Mar 9 12:56:04 GMT 2004

Hi Simo,

Sorry about the long post. 

> On Mon, 2004-03-08 at 17:16, Benjamin Riefenstahl wrote:
>> The problem is that some of the FE encodings and the variant of
>> UTF-8 mandated on Mac OS X don't conform to all of the rules stated
>> before on this thread.  So things get complicated with exceptional
>> handling, work-arounds, #ifdefs and even add-on modules.

Simo Sorce <simo.sorce at xsec.it> writes:
> Which rules does it not conform to exactly, please? (Real question, I'm
> simply not an expert on Mac OS X)

Michael B Allen <mba2000 at ioplex.com> stated one of the rules as:

> - all multibyte characters start with the high bit set.

Mac OS X uses de-composed UTF-8 for the file system.  This is a fixed,
non-changeable constant (which otherwise is a good thing IMO).  The
de-composition is enforced by the Mac OS X kernel.

De-composed Unicode means that characters like adieresis (ä) are
represented not as <U+00E4> ("pre-composed"), but as the sequence
<U+0061,U+0308> ("de-composed") where the character U+0061 is just
ASCII 'a'.  In UTF-8 that's pre-composed {0xC3,0xA4} and de-composed
{0x61,0xCC,0x88}.
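
To make the two byte sequences concrete, here is a minimal sketch in
Python.  It assumes that the standard NFD normalization from the
unicodedata module is a fair stand-in for what the Mac OS X kernel does
(the Mac variant differs from plain NFD in some ranges, but not for
this character).

```python
import unicodedata

# U+00E4, the pre-composed form of adieresis.
precomposed = "\u00e4"

# Canonical decomposition. Assumption: NFD stands in for the Mac OS X
# file system variant, which is close to NFD but not identical.
decomposed = unicodedata.normalize("NFD", precomposed)

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Show the code points and the UTF-8 bytes of both forms.
for label, text, raw in (
    ("pre-composed", precomposed, precomposed_bytes),
    ("de-composed", decomposed, decomposed_bytes),
):
    code_points = ",".join(f"U+{ord(c):04X}" for c in text)
    utf8 = ",".join(f"0x{b:02X}" for b in raw)
    print(f"{label}: <{code_points}> {{{utf8}}}")
```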

Windows fonts don't support the de-composed scheme, Windows programs
like Windows Explorer get confused when the Mac kernel does this
translation automatically, and de-composed strings cannot easily be
input by users.  So the encoding conversion from and to the wire format
has to include a composition/de-composition step.  In that translation
<U+0061,U+0308> is translated to <U+00E4>.  As far as Samba is
concerned this makes <U+0061,U+0308> a sequence that violates the
third rule.
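
As a rough sketch of what such a conversion step looks like, and of why
the de-composed sequence breaks the rule quoted above, consider the
following Python.  This is not the Samba code.  The function names are
made up for illustration, UTF-16LE as the wire encoding is an
assumption, and NFC/NFD again stand in for the Mac OS X variant.

```python
import unicodedata

def fs_to_wire(name_bytes: bytes) -> bytes:
    """Hypothetical helper: de-composed UTF-8 from the file system
    to composed UTF-16LE (assumed wire encoding)."""
    text = name_bytes.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-16-le")

def wire_to_fs(wire_bytes: bytes) -> bytes:
    """Hypothetical helper: composed UTF-16LE from the wire
    to de-composed UTF-8 for the file system."""
    text = wire_bytes.decode("utf-16-le")
    return unicodedata.normalize("NFD", text).encode("utf-8")

def starts_with_high_bit(raw: bytes) -> bool:
    """The quoted rule: a multibyte character starts with the high bit set."""
    return bool(raw[0] & 0x80)

fs_name = b"a\xcc\x88"            # <U+0061,U+0308> as stored on disk
wire_name = fs_to_wire(fs_name)   # a single composed character on the wire

# Three bytes translate as one unit, yet the first byte is plain ASCII 'a'.
print(starts_with_high_bit(fs_name))
# The pre-composed UTF-8 form does satisfy the rule.
print(starts_with_high_bit("\u00e4".encode("utf-8")))
# The conversion round-trips back to the on-disk form.
print(wire_to_fs(wire_name) == fs_name)
```

The point is that code which scans the unix charset byte by byte and
treats every byte below 0x80 as a complete character will split this
sequence in the wrong place.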

Search for BROKEN_UNICODE_COMPOSE_CHARACTERS in the Samba 3 sources
for the #ifdefs that were installed to cope with this.

>> This is, after all, the reason why Unicode and UTF-8 were invented
>> in the first place.  Because being really encoding agnostic is hard
>> in practice.
>
> Yes, but we have to live with filesystems not able to support utf8

Sure.  That is why other systems convert on I/O, so they can rely on
Unicode internally, and don't have to expend the effort of being
encoding agnostic themselves.

> have you read my post that explains why it is not good for us to have
> a unix charset AND an internal charset that are not the same charset
> ?  I sent it on Mon, 08 Mar 2004 17:16:05 +0100

I think you mean the one from "2004-03-07 17:21 +0100".

I understand that the major problem is that conversion on I/O would
mean that strings would have to be converted several times for a
single SMB request.  I also assume that this has been tested and
measured.  Otherwise I would have thought that calling the OS API,
which causes a kernel transition and hits the disk or even just
interprets the bytes of cached disk blocks, is by far the dominant
factor.

I think it was Tridge who said that a scheme of using
multi-representation string objects to cache the information (his
description used different words) was considered but rejected because
of the complexity and the amount of change required to Samba code.
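
For readers who did not follow that part of the thread, the general
idea of such an object can be sketched as below.  This is only my own
illustration of the concept, not the design that was discussed.  The
class name, the method and the conversion counter are all hypothetical.

```python
class MultiRepString:
    """Hypothetical string object that caches each encoding on first use."""

    def __init__(self, text: str):
        self._text = text
        self._cache = {}        # charset name -> encoded bytes
        self.conversions = 0    # counts real encode operations

    def encoded(self, charset: str) -> bytes:
        """Return the string in the given charset, converting at most once."""
        if charset not in self._cache:
            self._cache[charset] = self._text.encode(charset)
            self.conversions += 1
        return self._cache[charset]

name = MultiRepString("M\u00e4rz.txt")
name.encoded("utf-8")       # converted and cached
name.encoded("utf-16-le")   # converted and cached
name.encoded("utf-8")       # served from the cache, no new conversion
print(name.conversions)
```

Each representation is computed once per string, however often a single
request asks for it.  The price is that every place that handles a plain
char buffer today would have to handle such an object instead, which is
the amount of change mentioned above.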

benny