[PATCHES] Port tdb & ldb to Python 3

Petr Viktorin pviktori at redhat.com
Thu Jun 11 06:55:26 MDT 2015


Hello,

Here are initial patches porting tdb & ldb to Python 3. These libraries
deal a lot with text-like data, so I expect some discussion around them.
I'm not as familiar as others on the list with the use of these
libraries, and I didn't find relevant documentation, so I'll write out
some of my assumptions here; please correct me if I'm wrong.

tdb is routinely used to store binary data, and in my understanding
that's its primary use case, so it should primarily have a "bytes"-based
interface. That's what these patches add.
If a text-based interface is needed in more than a few cases (i.e. if
manual encode/decode is expected to be a big pain), it can be added (see
below).
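
To give a feel for what "manual encode/decode" means for callers of a bytes-only interface, here is a minimal sketch. It does not use the real tdb bindings: a plain dict stands in for the database, and put_text/get_text are hypothetical helper names, with UTF-8 assumed as the encoding.

```python
# A plain dict stands in for a bytes-only key/value store here;
# this is NOT the tdb API, just an illustration of the calling pattern.
store = {}


def put_text(db, key, value, encoding="utf-8"):
    """Encode a text key/value pair before handing it to a bytes-only store."""
    db[key.encode(encoding)] = value.encode(encoding)


def get_text(db, key, encoding="utf-8"):
    """Look up a text key and decode the stored bytes back to text."""
    return db[key.encode(encoding)].decode(encoding)


put_text(store, "name", "Zoë")
raw = store[b"name"]            # what the bytes-only store actually holds
text = get_text(store, "name")  # what a text-oriented caller wants back
```

If most callers end up wrapping every access like this, that would be an argument for adding a text-based interface on top.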

ldb, on the other hand, stores text, and I remember someone on this list
mentioned that LDAP is a text-only protocol. Is that really the case?
Unlike with a "bytes" default, you can't do manual encode/decode to get
binary data, since the data might not be valid UTF-8. So if binary data
is allowed in ldb, general code will need to use a bytes-based
interface, or expect decode errors. (LDAP DNs and attribute names are
text-only, but some values could be binary.)
The patches add a text-only interface (serializing the text as UTF-8),
but again an additional bytes-based interface can be added.
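
The asymmetry can be seen with plain Python 3 and no ldb involved: any text can be encoded to UTF-8 and decoded back, but arbitrary bytes cannot always be decoded. The two-byte payload below is just an example value.

```python
# Text always survives a round trip through UTF-8 ...
roundtrip = "blablabla".encode("utf-8").decode("utf-8")

# ... but arbitrary binary data does not necessarily decode.
# 0xf0 starts a four-byte UTF-8 sequence and 0x0d is not a
# continuation byte, so this payload is not valid UTF-8.
payload = b"\xf0\x0d"
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

So a text-only interface over data that may be binary has to either raise on such values or offer a bytes-based way to reach them.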

If both text and bytes are needed, I can see two ways of doing it:
1. Add a special dict-like attribute with bytes-only data, for example:
    msg['text'] = 'blablabla'
    assert msg['text'] == 'blablabla'
    msg.raw[b'data'] = b'\xf0\x0d'
    assert msg.raw[b'data'] == b'\xf0\x0d'
    # error: msg[b'text'] = 'blablabla'
    # error: msg.raw['data'] = b'\xf0\x0d'
2. Do as "os.listdir" does: choose the type of the value based on the
type of the key:
    msg['text'] = 'text'
    assert msg['text'] == 'text'
    msg[b'data'] = b'\xf0\x0d'
    assert msg[b'data'] == b'\xf0\x0d'
    # error: msg[b'text'] = 'blablabla'
    # error: msg['data'] = b'\xf0\x0d'

The first is a bit more explicit and discoverable. The second is easier
to type, but more confusing. So, I'd prefer the former, but not very
strongly.
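
For comparison, both options can be sketched in pure Python. This is only an illustration of the semantics, not how the C bindings would implement it: the class names are hypothetical, a dict of bytes stands in for the message storage, and UTF-8 is assumed for the text side.

```python
class _RawView:
    """Bytes-only view over a message's storage (hypothetical helper)."""

    def __init__(self, data):
        self._data = data

    def __setitem__(self, key, value):
        if not (isinstance(key, bytes) and isinstance(value, bytes)):
            raise TypeError("raw access needs bytes keys and bytes values")
        self._data[key] = value

    def __getitem__(self, key):
        if not isinstance(key, bytes):
            raise TypeError("raw access needs bytes keys")
        return self._data[key]


class RawAttrMessage:
    """Option 1: text by default, bytes through the .raw attribute."""

    def __init__(self):
        self._data = {}                  # bytes -> bytes
        self.raw = _RawView(self._data)  # shares the same storage

    def __setitem__(self, key, value):
        if not (isinstance(key, str) and isinstance(value, str)):
            raise TypeError("text access needs str keys and str values")
        self._data[key.encode("utf-8")] = value.encode("utf-8")

    def __getitem__(self, key):
        if not isinstance(key, str):
            raise TypeError("text access needs str keys")
        # Raises UnicodeDecodeError if the stored value is not valid UTF-8.
        return self._data[key.encode("utf-8")].decode("utf-8")


class KeyTypedMessage:
    """Option 2: the type of the key selects the type of the value."""

    def __init__(self):
        self._data = {}  # bytes -> bytes

    def __setitem__(self, key, value):
        if isinstance(key, str) and isinstance(value, str):
            self._data[key.encode("utf-8")] = value.encode("utf-8")
        elif isinstance(key, bytes) and isinstance(value, bytes):
            self._data[key] = value
        else:
            raise TypeError("key and value must both be str or both be bytes")

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._data[key.encode("utf-8")].decode("utf-8")
        if isinstance(key, bytes):
            return self._data[key]
        raise TypeError("key must be str or bytes")


m1 = RawAttrMessage()
m1['text'] = 'blablabla'
m1.raw[b'data'] = b'\xf0\x0d'

m2 = KeyTypedMessage()
m2['text'] = 'text'
m2[b'data'] = b'\xf0\x0d'
```

In both sketches the return type depends only on how the value is accessed, never on what happens to be stored, so callers always know which type they will get back.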

There are also options in the style of "Allow storing either text or
bytes; then return text for valid UTF-8 and bytes otherwise", which I do
not recommend, as it would be easier to write incorrect code (that passes
naïve tests) than correct code: correct code would basically need to
type-check every time it got a value.
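
A small sketch of why a value-dependent return type is fragile. The function names are hypothetical; guess_type mimics an interface that hands back text when the stored bytes happen to be valid UTF-8 and raw bytes otherwise.

```python
def guess_type(raw):
    """Return str if raw is valid UTF-8, else the bytes unchanged."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


def shout(raw):
    """Incorrect caller: silently assumes it always gets str back."""
    return guess_type(raw).upper() + "!"


def shout_checked(raw):
    """Correct caller: has to type-check the value on every access."""
    value = guess_type(raw)
    if isinstance(value, bytes):
        raise ValueError("value is not text")
    return value.upper() + "!"


# A naive test with ASCII data passes ...
naive_result = shout(b"hello")

# ... but the same code breaks as soon as binary data shows up.
try:
    shout(b"\xf0\x0d")
    broke = False
except TypeError:
    broke = True
```

The incorrect version is shorter and looks fine under simple tests, which is exactly the problem.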


The patches also contain some fixes; specifically the last three fix
bugs that might result in memory leaks and segfaults.

-- 
Petr Viktorin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: py3-tdb-ldb.patch
Type: text/x-patch
Size: 99333 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20150611/5c10e8c3/attachment-0001.bin>

