Potential data inconsistency after a failed CTDB transaction commit

Thu Feb 28 07:22:30 MST 2013

Hi Michael,

Please see my answers inline.

2013/2/22 Michael Adam <obnox at samba.org>
>
> > The scenario is the following:
> >
> > We have 2 clients - one (c1) is in the middle of a transaction and another
> > one (c2) is trying to acquire a g_lock to do a transaction as well. c1
> > sends CTDB_CONTROL_TRANS3_COMMIT message to the daemon, changes are
> > propagated to some of the nodes (not all; node with c2 did not receive the
> > changes),
>
> A client trying to start a transaction should not
> prevent a transaction commit (trans3 commit control) from
> writing the changes locally, since the trans3 commit sends
> the update record control to the remote nodes, which only
> do a transaction on the local db copy. And on the other
> hand the transaction start logic makes an entry in the
> g_lock for the persistent db in question and waits for
> the in-flight transaction to finish. So I don't see the
> potential block here. (Maybe I am missing something subtle.)

Right, client c2 patiently waits until c1 releases the lock, so only
one client can be performing a transaction - no problem here. But if a
c1 unexpectedly dies, c2 can get a g_lock, while the daemon was in the
middle of pushing changes to other nodes.

> If the trans3_commit really fails to push the changes to the
> node where c2 is working, then it must be due to another
> reason.

In my tests I kill a client process with SIGKILL while it's performing
transactions.

If the daemon detects that the client is no longer connected (in
queue_io_read), it calls ctdb_daemon_read_cb with 0 bytes read which
subsequently calls client's destructor. If we see in the destructor,
that the transaction commit is in process, the recovery mode is
triggered. This can happen in the middle of commit.

server/ctdb_daemon.c: ctdb_client_destructor

    if (client->num_persistent_updates != 0) {
        DEBUG(DEBUG_ERR,(__location__ " Client disconnecting with %u
persistent updates in flight. Starting recovery\n",
client->num_persistent_updates));
        client->ctdb->recovery_mode = CTDB_RECOVERY_ACTIVE;
    }

> Could you tell me what versions of Samba and CTDB you
> are using, and also especially, what version of CTDB
> Samba was compiled against?

I'm using ctdb 2.0.

> Is the c1 client a samba process? Or what kind of client?

I have written previously about the problem with pstore command hang
in the email below.

https://lists.samba.org/archive/samba-technical/2013-February/090288.html

Trying to fix that, I have ported the ideas from samba into the
command line client code dealing with transactions. I used samba 3.6
branch as a reference but peered into the latest code from time to
time. The code uses a g_lock (like in samba version 3.6), no local TDB
transactions (only writes to marshalling buffer) and
CTDB_CONTROL_TRANS3_COMMIT for pushing changes to all nodes in the
cluster.

The clients in this case are instances of tests/src/ctdb_transaction.c
test (slightly modified to do several stores) on multiple nodes.

> I think the important thing here is that a recovery starts
> while the transaction commit was not quite done.
>
> > and do all stores/fetches it needs while the
> > recovery isn't finished yet.
>
> You can not do any stores while the recovery is active!

> What you can do is to fetch contents from the databases (which is
> part of the problem you found) and with that do operations in memory,
> which is what the transaction code does (before the transaction
> commit): The transaction code maintains all the changes
> in memory in a buffer (the so called marshall buffer), and only
> the trans3-commit code rolls these changes out to the local databases.

This is true. But as soon as the client dies a g_lock might be
acquired by some other client. As you have pointed out above, if the
recovery starts after the lock has been acquired, we can safely do
stores (that use marshalling buffer only) and fetches (that read from
local TDB).

> Here is my summary of what happened:
>
> - c2's transaction start happend in the race between c1 dying and the
>   recovery having locked g_lock.
> - c2's transaction changes also happened before the recovery
>   started commiting the collected changes to the local databases.
>   so c2's transaction marshall buffer is based on the old state
>   of before c1's transaction changes.
> - c2's transaction commit only started when the recovery
>   was finished, and so c2's trans3_commit failed to roll out
>   the changes to any node, since c1's transactional changes
>   were already applied. But since both transactions of
>   c1 and c2 did the same sequence numbner incrementation,
>   the code above assumes that c2's transaction succeeds.
>
> So the bottom line is that the sequence number is too coarse
> a check to see if the change of the transaction has really
> been applied.

Yes, that's exactly what I have observed.

Btw, in the earlier samba code (that is currently in ctdb command line
client) when CTDB_CONTROL_TRANS2_COMMIT and local TDB transactions
were used, this problem did not exist. After the commit failure we
would simply replay a local TDB transaction on client redoing both
stores and fetches. If any data fetched didn't match what we had in
the marshalling buffer (i.e. after commit failure the fetch returned a
different data than before), the error was returned. The solution with
storing a hash of change set is a very good way for achieving the same
goal in the current code.

The problem is very hard to reproduce. I have observed it a couple of
times, but for testing purposes it might make sense to temporarily add
delays in the code on the daemon side, that sends update record
messages to other nodes. Not really sure, whether it's possible to
test it in the realistic setting, since the probability of
encountering this case is extremely small.

Regards,
Maxim