i18n question.

Benjamin Riefenstahl Benjamin.Riefenstahl at epost.de
Tue Mar 9 12:56:04 GMT 2004


Hi Simo,


Sorry about the long post. 

> On Mon, 2004-03-08 at 17:16, Benjamin Riefenstahl wrote:
>> The problem is that some of the FE encodings and the variant of
>> UTF-8 mandated on Mac OS X don't conform to all of the rules stated
>> before on this thread.  So things get complicated with exceptional
>> handling, work-arounds, #ifdefs and even add-on modules.

Simo Sorce <simo.sorce at xsec.it> writes:
> Which rules does not conform exactly please? (real question I'm
> simply not an expert of Mac OS X)

Michael B Allen <mba2000 at ioplex.com> stated one of the rules as:

> - all multibyte characters start with the high bit set.

Mac OS X uses de-composed UTF-8 for the file system.  This is a fixed,
non-changeable constant (which otherwise is a good thing IMO).  The
de-composition is enforced by the Mac OS X kernel.

De-composed Unicode means that characters like adieresis (ä) are
represented not as <U+00E4> ("pre-composed"), but as the sequence
<U+0061,U+0308> ("de-composed") where the character U+0061 is just
ASCII 'a'.  In UTF-8 that's pre-composed {0xC3,0xA4} and de-composed
{0x61,0xCC,0x88}.
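
The byte sequences above can be reproduced with a short Python sketch.
Python and its standard unicodedata module are used here purely for
illustration; nothing in it is Samba code, and the standard NFD form is
only an approximation of the de-composition variant the Mac OS X file
system applies.

```python
import unicodedata

precomposed = "\u00e4"  # adieresis as a single code point
# Standard canonical de-composition (assumed close to what Mac OS X does).
decomposed = unicodedata.normalize("NFD", precomposed)

def hex_bytes(s):
    """Return the UTF-8 bytes of s as a list of hex strings."""
    return [f"0x{b:02X}" for b in s.encode("utf-8")]

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0061', 'U+0308']
print(hex_bytes(precomposed))                   # ['0xC3', '0xA4']
print(hex_bytes(decomposed))                    # ['0x61', '0xCC', '0x88']
```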

Windows fonts don't support the de-composed scheme, Windows programs
like Windows Explorer get confused when the Mac kernel does this
translation automatically, and users can not easily input de-composed
strings.  So the encoding conversion from and to the wire format
has to include a composition/de-composition step.  In that translation
<U+0061,U+0308> is translated to <U+00E4>.  As far as Samba is
concerned this makes <U+0061,U+0308> a sequence that violates the
third rule.
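
A minimal sketch of such a conversion pair follows, in Python for
illustration only.  The function names are hypothetical and are not
Samba's, the wire format is assumed to be UTF-16LE, and standard NFC/NFD
stand in for whatever exact variant the Mac kernel uses.  It also shows
why the de-composed form breaks the quoted rule: the multi-byte unit
starts with a plain ASCII byte.

```python
import unicodedata

def unix_to_wire(name_bytes):
    """De-composed UTF-8 from the file system -> composed UTF-16LE (assumed wire format)."""
    text = name_bytes.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-16-le")

def wire_to_unix(wire_bytes):
    """Composed UTF-16LE from the wire -> de-composed UTF-8 for the file system."""
    text = wire_bytes.decode("utf-16-le")
    return unicodedata.normalize("NFD", text).encode("utf-8")

local = b"a\xcc\x88.txt"        # <U+0061,U+0308> followed by ".txt"
wire = unix_to_wire(local)      # <U+00E4> followed by ".txt"

# The first byte of the de-composed character is ASCII 'a': high bit clear.
starts_with_high_bit = bool(local[0] & 0x80)
round_trip_ok = wire_to_unix(wire) == local
```

A byte-wise scanner that treats every byte without the high bit as a
complete character would split the 'a' from its combining mark here.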

Search for BROKEN_UNICODE_COMPOSE_CHARACTERS in the Samba 3 sources
for the #ifdefs that were installed to cope with this.

>> This is, after all, the reason why Unicode and UTF-8 were invented
>> in the first place.  Because being really encoding agnostic is hard
>> in practice.
>
> Yes but have to live with filesystems not able to support utf8

Sure.  That is why other systems convert on I/O, so they can rely on
Unicode internally, and don't have to expend the effort of being
encoding agnostic themselves.

> have you read my post that explain why it is not good for us to have
> a unix charset AND an internal charset that are not the same charset
> ?  I sent it on Mon, 08 Mar 2004 17:16:05 +0100

I think you mean the one from "2004-03-07 17:21 +0100".

I understand that the major problem is that conversion on I/O would
mean that strings would have to be converted several times for a
single SMB request.  I also assume that this has been tested and
measured.  Otherwise I would have thought that calling the OS API,
which causes a kernel transition and then hits the disk or even just
interprets the bytes of cached disk blocks, is by far the dominant
factor.

I think it was Tridge who said that a scheme of using
multi-representation string objects to cache the information (his
description used different words) was considered but rejected because
of the complexity and the amount of change required to Samba code.
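
As a rough illustration of the general idea only (this is a hypothetical
Python sketch, not the design that was discussed for Samba, and the
class and method names are invented), such an object could keep one
canonical string and cache each encoded form the first time it is
requested:

```python
class MultiRepString:
    """Hypothetical string object that caches each encoding on first use."""

    def __init__(self, text):
        self._text = text
        self._cache = {}        # charset name -> encoded bytes
        self.conversions = 0    # number of real conversions performed

    def encoded(self, charset):
        """Return the bytes for charset, converting at most once per charset."""
        if charset not in self._cache:
            self._cache[charset] = self._text.encode(charset)
            self.conversions += 1
        return self._cache[charset]

name = MultiRepString("r\u00e9sum\u00e9.txt")
for _ in range(3):              # e.g. several uses within one request
    name.encoded("utf-8")
    name.encoded("utf-16-le")
print(name.conversions)         # 2
```

The cost is that every function passing strings around has to take this
object instead of a plain char buffer, which is the kind of code change
mentioned above.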


benny


