Porting Samba's CPython extensions to Python 3

Mon Sep 7 11:43:19 UTC 2015

On 09/04/2015 05:04 AM, Andrew Bartlett wrote:
> On Fri, 2015-08-28 at 12:57 +0200, Petr Viktorin wrote:
>> Hello,
>> Sorry for this long mail: a lot has happened since the last 
>> discussions,
>> and I need to refresh some points buried in the e-mail thread here:
>> https://lists.samba.org/archive/samba-technical/2015-March/106177.html
>>
>>
>> In previous discussions, we agreed on a strategy for porting Samba to
>> Python 3. The stand-alone libraries would get a supported Python 3 port.
>> Patches for the rest of Samba would be tolerated if they do not
>> inconvenience other developers, and they would be unsupported (if it
>> breaks, it's on whoever cares about Python 3 to fix it).
>>
>> With the patches for the last stand-alone library reviewed, I think 
>> it's
>> time to revive that discussion, to get a better idea of how porting
>> Samba to Python 3 should work.
>> Specifically, I'd like to come to understand what would least
>> inconvenience you, while allowing some kind of progress on this 
>> front.
>>
>> In the mentioned thread, there is an idea that there is no rush to 
>> port
>> – Python 2 will be around for another five years.
>> But, while five years is a lot of time, if we spend time waiting 
>> there
>> *will* be a rush later. I'm trying to avoid that. If five years is an
>> absolute deadline for porting to py3, testing, and removing support 
>> for
>> py2, I think it does make sense to start.
>> In particular, waiting until enterprise Linux distributions switch to
>> Python 3 creates a Catch-22 that would most likely result in everyone
>> waiting till the last possible moment, and then rushing wildly. Like
>> Samba, a distribution wants to switch all at once; but to do that the
>> code must be ready.
> 
> While I'm not as much convinced about the timing from that perspective,
> I am convinced that I would rather keep working with you than start
> this again in a few years.
>
> I really do appreciate your patience and dedication to handling this
> difficult area.

Thanks. I appreciate your willingness to merge patches that may not
bring results in the short term.

>> Moving from the "when" to the "how":
>>
>> Generally, there is opposition to a bespoke compatibility layer,
>> which could not be tested well and would not get much use beyond 
>> Samba.
>> As with any code written by one external developer, if I got hit by a
>> bus, the compatibility layer could bitrot.
>>
>> However, *some* kind of a compatibility layer is needed.
>> The string type in Python 2 was split to "bytes" and "unicode", and
>> there is a need to either differentiate these two, or use unicode
>> everywhere in Python 2 (which would change the semantics of the 
>> Python 2
>> version, which is not practical for a project of Samba's size).
>> So, my approach is to differentiate between three kinds of strings:
>> - bytes (PyBytes; called "str" in py2, "bytes" in py3)
>> - native ("PyStr"; UTF-8 encoded "str" in py2; "str" in py3)
>> - text ("PyUnicode"; called "unicode" in py2, "str" in py3)
>> This string split is *the* difficult part of porting C extensions.
>> Compared to this, other decisions are fairly trivial: either use the 
>> py2
>> spelling or the py3 spelling of the same thing, and choose a point on
>> the spectrum between shared macros or inline #ifdefs.
>> Correspondingly, aside from the bytes/text split, the rest of the
>> porting process is largely mechanical.
> 
> I don't understand why we need the PyStr_FromString macros however,
> given we didn't need them for Ldb?

But we did – they're added at the top of pyldb.c:

https://git.samba.org/?p=metze/samba/wip.git;a=blobdiff;f=lib/ldb/pyldb.c;h=308ecb7e9bbf527d5477fd22bae397241c12cee6;hp=e279f9777b17e3261fca7c7de481560531c9155a;hb=d93936ab20be0def5b10edf02a062ff5c60a648f;hpb=df42cacc273b0d24464c2fcf5524a3f6bfab37dd

Some PyString_* calls are ported to PyBytes_*, others to PyStr_*,
depending on what the value should be in Python 3.
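
As a rough illustration of why each call has to be classified (this is
not code from the patches, and the helper names to_native and to_bytes
are hypothetical), here is how the bytes/native split looks from the
Python 3 side:

```python
# Illustration only: the helper names are hypothetical, not Samba or py3c API.

def to_native(data, encoding="utf-8"):
    """Decode bytes to the native str type; pass str through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(text, encoding="utf-8"):
    """Encode str to bytes; pass bytes through unchanged."""
    if isinstance(text, str):
        return text.encode(encoding)
    return text


binary_blob = b"\x01\x05\x00\x00"  # opaque data: belongs in bytes
label = "cn=example"               # human-readable name: belongs in str


def mixing_fails():
    """Return True if Python 3 refuses to concatenate bytes and str."""
    try:
        binary_blob + label
    except TypeError:
        return True
    return False
```

On Python 2 both values would be plain "str" and the concatenation
would go through, which is why the distinction has to be made by hand
when porting.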

>> The ideal solution for Samba would be if a compatibility layer was
>> distributed with Python itself. Unfortunately, this can't really 
>> work:
>> no features are added to Python 2.7 any more, and even if they were,
>> they couldn't be present in older 2.7 releases.
>>
>> Realistically, I see three options for Samba, if it decides to start
>> porting:
>>
>> 1) Include relevant macros in the files that need them. This is used 
>> in
>> the stand-alone libraries (which typically have one Python module 
>> each).
>> This makes the code clear to anyone who knows the C-API for Python 2 or 3;
>> but when adding new macros it requires some care to have consistency.
>>
>> 2) Put all compatibility macros in a shared header. This obscures the
>> code somewhat, with an additional header to know about, but ensures 
>> that
>> the set of macros is the same throughout the project, and allows
>> documenting them fairly easily.
> 
> I think we have to do this, or 3).  We have a strong preference against
> duplicated functions and macros. 
> 
>> 3) Use a third-party library for the compatibility macros. This way, 
>> the
>> compat layer can be shared with other projects; it also makes it 
>> easier
>> to keep it tested and documented.
> 
> We would prefer that, and we can import that as a third_party codebase.

OK. I will work on integrating py3c into Samba, along with continuing to
promote it across the Python ecosystem.

>> Regardless of which option is chosen, I have a pretty good idea about
>> what a compatibility layer would look like.
>> I have written a tested, documented library called py3c [0] that
>> contains all the necessary macros. To ensure consistency, this is where
>> I've been pulling macros from when porting the stand-alone libraries.
>> The library is not officially recognized by Python upstream (their 
>> first
>> suggestion nowadays would probably be to port to Cython or CFFI). 
> 
> Yes, that would be a big change.  Thanks for not suggesting that :-)
> 
>> But, I
>> am in the process of absorbing parts of Python's C Porting Howto [1].
>>
>> A superset of the macros I'd need for Samba are at:
>> https://github.com/encukou/py3c/blob/master/include/py3c/compat.h
>>
>> The first part is specific to the porting strategy I use for Samba; 
>> it
>> boils down to "use PyStr for native strings":
>>
>> * PyStr_* maps to PyString_* or PyUnicode_*
>> * Python 2: PyBytes_* maps to PyString_*
>> (You can ignore the static function PyStr_Concat, this wart is not
>> needed for Samba.)
>>
>> The rest emulates py2 or py3 API in the other Python.
>> (Unfortunately I can't use a single Python's API for both.)
>>
>> * Python 3: PyInt_* maps to PyLong_*
>> * Module initialization uses the py3 syntax (except the function
>> declaration – "MODULE_INIT_FUNC(name)" instead of "static PyObject
>> *PyInit_name(void)").
> 
> So, the reason for the PyStr_ stuff is to avoid having an accidental
> PyBytes -> PyString -> PyUnicode, either in the compiler or in
> someone's head?

I'm not sure if I understand the question correctly, but it looks like
that's one of the reasons.

A simple hard reason is that __str__/__repr__ functions need to return
the native string type, and Python itself has no universal spelling for
that.
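
The same rule can be seen from pure Python. This is only a sketch with
made-up class names, run under Python 3, where the native string type
is "str" and a bytes result is rejected:

```python
# Illustration only: class and function names are invented for this example.

class NativeRepr:
    def __repr__(self):
        # Native string: accepted by the interpreter.
        return "<NativeRepr>"


class BytesRepr:
    def __repr__(self):
        # Bytes is not the native string type on Python 3.
        return b"<BytesRepr>"


def repr_is_accepted(obj):
    """Return True if repr(obj) succeeds, False if it raises TypeError."""
    try:
        repr(obj)
    except TypeError:
        return False
    return True
```

A C extension's tp_repr/tp_str slot is held to the same requirement,
so the C code needs one spelling that yields the native type on
whichever Python it is built against.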

The alternatives to introducing a "new type" for native strings are:
- use unicode on both versions (which would change the semantics on
Python 2)
- use bytes on both versions (which would require using b'' everywhere
in Python 3).
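
To give an idea of what the second alternative would mean for callers,
here is a small Python 3 sketch. The attribute dictionary is invented
for the example and does not reflect any real Samba API:

```python
# Illustration only: a made-up mapping as it might look if an extension
# returned bytes for everything on Python 3.
attrs = {b"cn": b"example"}

# A natural-looking lookup with a str key silently finds nothing.
missing = attrs.get("cn")

# bytes and str never compare equal on Python 3.
equal = (b"cn" == "cn")

# Formatting a bytes value into text leaks the b'' wrapper.
rendered = "name=%s" % attrs[b"cn"]

# The caller has to spell out b'' and decode explicitly to get text.
correct = "name=%s" % attrs[b"cn"].decode("utf-8")
```

None of these cases raises an error by default, so such mistakes would
tend to surface far away from where they were made.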

[...]
>>
>> I'm attaching draft patches that port "samba.netbios" using options 1
>> (inline macros) and 2 (shared header). (For the shared header,
>> additional buildsystem integration would be needed, and possibly a
>> better location for the header.)
> 
> The PyStr stuff grates with me a bit, but I guess that's OK.  Others
> may have stronger views however.  It looks harmless enough. 
> 
>> Let me know if you have any thoughts on this matter. And, thank you 
>> for
>> your continued patience.
> 
> Thanks for your continued work on this.
> 
> The key will be finding all the right places to deal with PyString_* in
> the generated headers, but PIDL knows what things are Unicode because
> it has a charset annotation. 

Well, *finding* them is not hard, since PyString causes a compile-time
error on py3. The key is figuring out what to do with them.
I'm slowly progressing on a proof of concept.

-- 
Petr Viktorin