IDL [string] attribute (was svn commit: samba r11105 ...)

Mon Oct 17 20:07:26 GMT 2005

On Mon, 17 Oct 2005 20:31:36 +0200
Jelmer Vernooij <jelmer at samba.org> wrote:

> >>> Is it impossible to support a simple null terminated utf8
> >>> string with the new [string] approach?
> >>
> >> The problem is not so much in that pidl can't support "simple"
> >> null-terminated utf8 strings, but more in the fact that it is not
> >> possible to use it in other IDL compilers (such as MIDL or
> >> WIDL). Please let me know which of the two you prefer, and I'll
> >> fix it.
> >
> >
> > How about providing an optional wire encoding in parentheses like
> > [string(UTF-8)] where the encoding is the standard identifier used
> > by iconv? This way you can keep unistr AND support [string]. This
> > is what I think I'm going to do with midlc but I haven't really
> > looked into it yet so I don't know if it works out in practice.
> 
> We used to have a data type 'string' (as you might have seen) from
> which unistr was derived. We are now migrating to using [string]
> everywhere, but optionally also specifying the character set (for
> automatic conversion, very similar to what you propose), with the
> "charset" attribute. For example:
> 
> [charset(UTF16),string] uint16 *foo_bar;
> 
> The allowed arguments for charset() are currently UTF8, DOS, UTF16,
> UNIX and UTF16_BE, but I'd be happy to change that if there is good
> reason to.

This confuses me. The term 'charset' and the identifier UTF16 don't
imply an encoding, which makes it sound like the attribute might control
the internal charset. That is of course not the case, since I would
think all strings are represented internally in the locale-dependent or
at least a predefined codeset (e.g. UTF-8 on Linux, UTF-16LE on Windows,
etc).

So I would much prefer to see the term 'encoding' with the standard
codeset identifiers (UTF-8, CP850, UTF-16LE, etc) to be consistent
with the POSIX locale system, iconv, etc. Or to be really consistent
with POSIX, the term 'codeset' is also good but I think it's a little
overloaded. The point is you know these identifiers can represent
*encodings* since they can be used with iconv_open(3).
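To illustrate that last point, here is a minimal sketch of checking
whether an identifier names a usable encoding by handing it to
iconv_open(3). The helper name is_usable_encoding is hypothetical and is
not part of pidl or midlc; which identifiers are accepted depends on the
local iconv implementation.

```c
#include <iconv.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if the local iconv implementation
 * accepts `name` as a source encoding (converting to UTF-8). */
static bool is_usable_encoding(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return false;                         /* unknown identifier */
    iconv_close(cd);
    return true;
}
```

An IDL compiler could run a check like this on the argument of an
encoding attribute and reject anything the local iconv does not know.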

Also, about the type 'uint16' - for midlc I was going to represent
[string] wchar_t *str with the locale-dependent codeset internally
(e.g. UTF-8) and UTF-16LE as the wire encoding. This allows existing
IDL written for Windows to be used elsewhere (e.g. UNIX) as-is. Note that
this means that in the generated headers and stubs, the 'wchar_t' type
must be converted to a generic character type to be defined by the user
(e.g. if the locale codeset is UTF-8 then typedef unsigned char char_t)
because 'wchar_t' is a standard C library type. Then, as an optional
extension to existing MIDL syntax, a codeset identifier can be specified
like [codeset(UTF-8),string] wchar_t *str to generate string marshalling
routines that use the specified encoding for the wire encoding.
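As a rough sketch of what such a generated marshalling routine might
boil down to, assuming the internal codeset is UTF-8 and the wire
codeset is passed as an iconv identifier: the function name
encode_wire_string and its signature are made up for illustration and
are not actual midlc or pidl output.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Generic character type standing in for 'wchar_t' in generated code.
 * Assumption: the locale codeset is UTF-8. */
typedef unsigned char char_t;

/* Hypothetical routine: convert a NUL-terminated internal string to the
 * wire codeset, terminator included. Returns the number of bytes written
 * to dst, or -1 on failure (unknown codeset, bad input, dst too small). */
static long encode_wire_string(const char_t *src, const char *wire_codeset,
                               unsigned char *dst, size_t dst_len)
{
    iconv_t cd = iconv_open(wire_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;                 /* iconv takes non-const char ** */
    size_t in_left = strlen((const char *)src) + 1;
    char *out = (char *)dst;
    size_t out_left = dst_len;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    return (long)(dst_len - out_left);
}
```

With "UTF-16LE" as the wire codeset this gives the default behaviour
described above; passing "UTF-8" instead would correspond to
[codeset(UTF-8),string].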

Mike