Potential data inconsistency after a failed CTDB transaction commit

Fri Feb 22 02:44:59 MST 2013

Hi Samba Team,

I think I have found a potential problem with ctdb transactions in samba's
dbwrap_ctdb code.

The scenario is the following:

We have 2 clients - one (c1) is in the middle of a transaction and another
one (c2) is trying to acquire a g_lock to do a transaction as well. c1
sends CTDB_CONTROL_TRANS3_COMMIT message to the daemon, changes are
propagated to some of the nodes (not all; node with c2 did not receive the
changes), but suddenly c1 dies (which causes a recovery to start). However,
there is an open time window now for c2, when it can acquire the g_lock and
do all stores/fetches it needs while the recovery isn't finished yet. When
c2 later sends CTDB_CONTROL_TRANS3_COMMIT, it gets an error, because the
recovery is in process.

If fetches/stores of c1 and c2 overlap, we are in trouble. Both of the
transaction change sets have the same DB sequence number. The code in
db_ctdb_transaction_commit (dbwrap_ctdb.c) does the following check after
the commit failure:

        if (new_seqnum == old_seqnum) {
            /* Recovery prevented all our changes: retry. */
            goto again;
        }
        if (new_seqnum != (old_seqnum + 1)) {
            DEBUG(0, (__location__ " ERROR: new_seqnum[%lu] != "
                  "old_seqnum[%lu] + (0 or 1) after failed "
                  "TRANS3_COMMIT - this should not happen!\n",
                  (unsigned long)new_seqnum,
                  (unsigned long)old_seqnum));
            ret = -1;
            goto done;
        }

        /*
         * Recovery propagated our changes to all nodes, completing
         * our commit for us - succeed.
         */

If the sequence number is by 1 larger than it was before the transaction,
we are thinking that the local TDB has our set of changes propagated by the
recovery, but it might be not true, if the case described above happens. As
a result, we end up with the changes from the previous half-commited
transaction and our current transaction changes discarded, while the
overall commit status is OK. If we try to recommit the changes, they will
be rejected, since the RSNs of the records of the set we are trying to
commit is the same as the latest one. I think, the simplest and the most
reasonable reaction in this case would be to return an error and let
clients redo the transaction.

Originally, I have found this problem in the samba 3.6 code. Both of the
clients were on the same node. When the client disconnect was discovered on
the CTDB daemon side, a notification was sent to other clients sleeping in
sys_poll, which allowed them to react very quickly on the lock holder's
death.

As I see, the g_lock mechanism has been significantly improved in the
latest samba code. It seems not to rely on CONTROL_REGISTER_NOTIFY to
propagate notifications about client disconnects and tries to acquire the
locks with intervals of 5-10 seconds. But even now I don't see a way the
scenario above can be prevented. A timer on one of the clients can expire
at the right moment for this to happen (though it's probability is very
small).

Does it look like the real problem in the latest code?

Thanks and regards,
Maxim