IDL [string] attribute (was svn commit: samba r11105 ...)

Mon Oct 17 20:36:36 GMT 2005

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael B Allen wrote:

> On Mon, 17 Oct 2005 20:31:36 +0200 Jelmer Vernooij
> <jelmer at samba.org> wrote:
>
>>> How about providing an optional wire encoding in parentheses
>>> like [string(UTF-8)] where the encoding is the standard
>>> identifier used by iconv? This way you can keep unistr AND
>>> support [string]. This is what I think I'm going to do with
>>> midlc but I haven't really looked into it yet so I don't know
>>> if it works out in practice.
>>
>> We used to have a data type 'string' (as you might have seen)
>> from which unistr was derived. We are now migrating to using
>> [string] everywhere, but optionally also specifying the character
>> set (for automatic conversion, very similar to what you propose),
>> with the "charset" attribute. For example:
>>
>> [charset(UTF16),string] uint16 *foo_bar;
>>
>> The allowed arguments for charset() are currently UTF8, DOS,
>> UTF16, UNIX and UTF16_BE, but I'd be happy to change that if
>> there is good reason to.
>
>
> This confuses me. The terms charset and UTF16 don't imply an
> encoding which makes it sound like maybe it controls the internal
> charset which is of course not the case since I would think that
> all strings would be represented internally with the locale
> dependent or at least predefined codeset (e.g. UTF-8 on Linux,
> UTF-16LE on Windows, etc).
>
> So I would much prefer to see the term 'encoding' with the standard
> codeset identifiers (UTF-8, CP850, UTF-16LE, etc) to be consistent
> with the POSIX locale system, iconv, etc. Or to be really
> consistent with POSIX, the term 'codeset' is also good but I think
> it's a little overloaded. The point is you know these identifiers
> can represent *encodings* since they can be used with
> iconv_open(3).

I'd be happy to make those changes (rename some of the codesets and
rename charset -> codeset). However, I'd like to keep "DOS" as a
codeset identifier, since the actual DOS codepage usually depends on
the "dos charset" setting in Samba.
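
To illustrate how the current names could map onto iconv-style
identifiers while "DOS" stays configurable, here is a rough sketch in C.
The function name, its parameters and the exact iconv names are
assumptions made for this example (in particular that UTF16 means
little-endian on the wire); it is not taken from the Samba source.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: translate an IDL codeset name into an identifier
 * that could be passed to iconv_open(3).
 *
 * dos_charset and unix_charset stand in for configured values (e.g. the
 * "dos charset" setting); they are plain parameters here.
 * Returns NULL for a name that is not in the list.
 */
const char *idl_codeset_to_iconv(const char *idl_name,
                                 const char *dos_charset,
                                 const char *unix_charset)
{
    if (strcmp(idl_name, "UTF8") == 0)
        return "UTF-8";
    if (strcmp(idl_name, "UTF16") == 0)
        return "UTF-16LE";   /* assumption: little-endian wire format */
    if (strcmp(idl_name, "UTF16_BE") == 0)
        return "UTF-16BE";
    if (strcmp(idl_name, "DOS") == 0)
        return dos_charset;  /* resolved from configuration, not fixed */
    if (strcmp(idl_name, "UNIX") == 0)
        return unix_charset; /* likewise configuration dependent */
    return NULL;
}
```

With a configured DOS codepage of CP850, a lookup of "DOS" would yield
"CP850", while "UTF16" always yields the same fixed identifier. That is
the reason a symbolic name is still useful for the DOS case.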

>
> Also, about the type 'uint16' - for midlc I was going to represent
> [string] wchar_t *str with the locale dependent codeset internally
> (e.g. UTF-8) and UTF-16LE as the wire encoding. This allows
> existing IDL written for Windows to be used elsewhere (e.g. UNIX)
> as-is. Note that this means that in the generated headers and
> stubs, the 'wchar_t' type must be converted to a generic character
> type to be defined by the user (e.g. if locale charset is UTF-8
> then typedef unsigned char char_t) because 'wchar_t' is a standard
> C library type.

I'm not a big fan of assuming codesets depending on the data type. For
example, having a separate codeset() attribute allows us to do
conversion on text data that doesn't have a [string] attribute (such
as the string structure used by the lsa and netlogon interfaces). It
also forces the user to think about the format data will have on the
wire.
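
As a rough sketch of what such a conversion amounts to, the following C
function uses iconv(3) to turn a string in one codeset into its wire
form in another. The function name and the fixed-size output buffer are
assumptions for the example; generated marshalling code would look
different.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert the NUL-terminated string `in` from
 * codeset `from` to codeset `to`, writing at most out_cap bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any error
 * (unknown codeset, invalid input, or output buffer too small).
 * The terminating NUL is not converted.
 */
size_t convert_codeset(const char *from, const char *to,
                       const char *in, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open(to, from); /* note the order: to, then from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;            /* iconv() takes a non-const pointer */
    size_t in_left = strlen(in);
    char *outp = out;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

Calling convert_codeset("UTF-8", "UTF-16LE", "foo", buf, sizeof buf)
fills buf with two bytes per character. The point of an explicit
attribute is that the second argument is stated in the IDL rather than
guessed from the data type.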

Doing "automatic" conversion is still possible if the user wants it,
simply by defining something like this at the top of your IDL file:

#define wchar_t [charset(UTF-16)] wchar_t
#define char [charset(DOS)] char

I can be convinced otherwise though, if you can come up with a good
reason :-)

> Then, as an optional extension to existing MIDL syntax, a codeset
> identifier can be specified like [codeset(UTF-8),string] wchar_t
> *str to generate string marshalling routines that use the specified
> encoding for the wire encoding.

wchar_t would work in Samba as well - it's an alias for uint16, but
the latter is used more often at the moment.

Cheers,

Jelmer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQFDVAtTPa9Uoh7vUnYRAhLWAKCLzmv47F4es5uU2P8/gnXDF0efdACePwPb
Ss6nrShkTNyREpNWwHhQmnU=
=ykBp
-----END PGP SIGNATURE-----