i18n question.

Mon Mar 8 16:16:05 GMT 2004

Hi Simo,

Simo Sorce <simo.sorce at xsec.it> writes:
> What's the problem in being charset agnostic (with some rules of
> course) ?

The problem is that some of the Far Eastern (FE) encodings and the
variant of UTF-8 mandated on Mac OS X don't conform to all of the
rules stated before on this thread.  So things get complicated with
special-case handling, work-arounds, #ifdefs and even add-on modules.
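
To make one such case concrete, here is a small sketch.  It assumes
the issue is the decomposed form that Mac OS X uses for file names;
the variable and function names are made up for this example.  The
same visible character has two valid UTF-8 spellings, and code that
only compares bytes treats them as different names.

```c
#include <assert.h>
#include <string.h>

/* "e with acute" as one precomposed code point, U+00E9. */
const char e_acute_precomposed[] = "\xC3\xA9";

/* The same character as "e" plus combining acute accent, U+0301. */
const char e_acute_decomposed[] = "e\xCC\x81";

/* Hypothetical byte-wise comparison, as a fast path might do it. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Both strings are valid UTF-8, but they differ in length and content,
so same_bytes() reports a mismatch unless one side is normalized
first.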

This is, after all, the reason why Unicode and UTF-8 were invented in
the first place: being really encoding agnostic is hard in practice.

> I would really like to know where really is the problem we are
> trying to fix with this proposal :-)

I am not yet proposing anything tangible.  Just trying to find out why
some of the strategies that I know about were not, or cannot be,
applied.

I came to the list some time ago because I used to have problems
porting Samba 3 to Mac OS X 10.2 because of a variant of the issues
discussed now.  (Note that when Apple ported Samba 3 to 10.3, they
actually used an early release candidate, maybe in part because of the
fast-path optimizations in later versions which broke Samba 3 on Mac
OS X.)

Some of the things said in this discussion sounded more like gut
reactions than the results of investigation, so I tried to add input.
When people say "this-or-that doesn't work" I propose ideas based on
my own experience.  I expect that ideas have to be investigated in
context and even tried out to be really useful, and that they can't
be implemented right away.  But OTOH I don't like to accept a blanket
"doesn't work" without good reason.  Even if there is a good reason
against some strategy, it's good to know that reason.

> Truth is: the code is vast.  Changing the internals of such code is
> a _big_ operation, you can't do that in a few days of hacking ...

No, of course not.  But I expect that adding optimizations based on
assumptions that do *not* actually cover all cases doesn't make it
any easier either.

> and all people dealing with the code have to switch to a totally
> different way to deal with the code they do.

That is partially true.  But in this specific case the C language and
compiler could help through static typing and compiler warnings.  In
that respect, UTF-16 is actually better than UTF-8.
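
A minimal sketch of what I mean by help from the compiler.  The type
and function names here are hypothetical, not existing Samba code.
The idea is that UTF-16 data needs an element type of its own, so
mixing it up with a plain char string becomes a type error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for UTF-16 code units. */
typedef uint16_t utf16_unit;

/* Counts code units up to the terminating 0 (hypothetical helper). */
size_t utf16_len(const utf16_unit *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* "smb" spelled out as UTF-16 code units. */
const utf16_unit sample_name[] = { 's', 'm', 'b', 0 };

void utf16_demo(void)
{
    assert(utf16_len(sample_name) == 3);

    /* utf16_len("smb");
     * Passing a char string here is a pointer type mismatch, which
     * the compiler diagnoses.  With UTF-8, both the old and the new
     * encoding are plain char *, so the same mistake compiles
     * silently. */
}
```

The commented-out call is the interesting part: with a distinct type
the compiler points at every place that still passes the old kind of
string.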

>> [using named global constants instead of direct string literals]
>> Some people would think this would *enhance* the quality of the
>> code immediately and could be an incentive to improve it even more.
>> [...]
>
> I do not see where or why this should *enhance* the code, can you
> explain this point?

In the short run you get to name the string constants.  I value that
for its documentation value (if the names are chosen wisely).  In the
medium run it can pay to look at the ways those constants are used
and to make sure that all the places that use them do so consistently
and correctly.

You can achieve the same with thorough commenting and code review, but
just having to name the strings can make it much more obvious.
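
As a small illustration (the constant, its value and the function are
made up for this example), naming a string once might look like this:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name and value, for illustration only.  The name
 * documents what the string means, and its encoding is decided in
 * this one place. */
const char DEFAULT_WORKGROUP[] = "WORKGROUP";

/* Every caller goes through the named constant, so a search for
 * DEFAULT_WORKGROUP finds all uses, and a misspelled name is a
 * compile error instead of a silently different string. */
int is_default_workgroup(const char *name)
{
    return strcmp(name, DEFAULT_WORKGROUP) == 0;
}
```

With the literal repeated at each call site, a reviewer has to check
every copy by hand; with the name, the uses are easy to list and
compare.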

benny