Potential data inconsistency after a failed CTDB transaction commit

Fri Feb 22 07:21:11 MST 2013

Hi Maxim,

On 2013-02-22 at 09:44 +0000, Maxim Skurydin wrote:
> 
> I think I have found a potential problem with ctdb transactions in samba's
> dbwrap_ctdb code.

First of all, thank you for mail and your excellent analysis!

I think the initial analysis of how the problem occurred
might not be 100% correct, but the later analysis of the problem
regarding the g_lock race and the sequence number is a very good
catch indeed!

I'll comment in detail, now:

> The scenario is the following:
> 
> We have 2 clients - one (c1) is in the middle of a transaction and another
> one (c2) is trying to acquire a g_lock to do a transaction as well. c1
> sends CTDB_CONTROL_TRANS3_COMMIT message to the daemon, changes are
> propagated to some of the nodes (not all; node with c2 did not receive the
> changes),

A client trying to start a transaction should not
prevent a transaction commit (trans3 commit control) from
writing the changes locally, since the trans3 commit sends
the update record control to the remote nodes, which only
do a transaction on the local db copy. And on the other
hand the transaction start logic makes an entry in the
g_lock for the persistent db in question and waits for
the in-flight transaction to finish. So I don't see the
potential block here. (Maybe I am missing something subtle.)

If the trans3_commit really fails to push the changes to the
node where c2 is working, then it must be due to another
reason.

Could you tell me what versions of Samba and CTDB you
are using, and also especially, what version of CTDB
Samba was compiled against?

> but suddenly c1 dies (which causes a recovery to start).

Is the c1 client a samba process? Or what kind of client?

I think the important thing here is that a recovery starts
while the transaction commit was not quite done.

> However, there is an open time window now for c2, when it can
> acquire the g_lock

That sounds correct. The recovery and c2 race for
the lock on the g_lock.tdb. So c2 can win, and
hence start a transaction.

> and do all stores/fetches it needs while the
> recovery isn't finished yet.

You can not do any stores while the recovery is active!

What you can do is to fetch contents from the databases (which is
part of the problem you found) and with that do operations in memory,
which is what the transaction code does (before the transaction
commit): The transaction code maintains all the changes
in memory in a buffer (the so called marshall buffer), and only
the trans3-commit code rolls these changes out to the local databases.

> When c2 later sends CTDB_CONTROL_TRANS3_COMMIT, it gets an
> error, because the recovery is in process.

We have made changes to the transaction and recovery code to be
more robust against such cases:
Currently. When a recovery kicks in, the transaction commit is
abandoned, recovery finishes the changes that the transaction
started to roll out and the recovery-end code also sends the
transaction replies to the client.

So I thought everything was in good shape.
I was mistaken as you point out next... :-(

> If fetches/stores of c1 and c2 overlap, we are in trouble. Both of the
> transaction change sets have the same DB sequence number. The code in
> db_ctdb_transaction_commit (dbwrap_ctdb.c) does the following check after
> the commit failure:
> 
>         if (new_seqnum == old_seqnum) {
>             /* Recovery prevented all our changes: retry. */
>             goto again;
>         }
>         if (new_seqnum != (old_seqnum + 1)) {
>             DEBUG(0, (__location__ " ERROR: new_seqnum[%lu] != "
>                   "old_seqnum[%lu] + (0 or 1) after failed "
>                   "TRANS3_COMMIT - this should not happen!\n",
>                   (unsigned long)new_seqnum,
>                   (unsigned long)old_seqnum));
>             ret = -1;
>             goto done;
>         }
> 
>         /*
>          * Recovery propagated our changes to all nodes, completing
>          * our commit for us - succeed.
>          */
> 
> If the sequence number is by 1 larger than it was before the transaction,
> we are thinking that the local TDB has our set of changes propagated by the
> recovery, but it might be not true, if the case described above happens. As
> a result, we end up with the changes from the previous half-commited
> transaction and our current transaction changes discarded, while the
> overall commit status is OK.

This is excellently observed.

Here is my summary of what happened:

- c2's transaction start happend in the race between c1 dying and the
  recovery having locked g_lock.
- c2's transaction changes also happened before the recovery
  started commiting the collected changes to the local databases.
  so c2's transaction marshall buffer is based on the old state
  of before c1's transaction changes.
- c2's transaction commit only started when the recovery
  was finished, and so c2's trans3_commit failed to roll out
  the changes to any node, since c1's transactional changes
  were already applied. But since both transactions of
  c1 and c2 did the same sequence numbner incrementation,
  the code above assumes that c2's transaction succeeds.

So the bottom line is that the sequence number is too coarse
a check to see if the change of the transaction has really
been applied.

The fix I will try to write as soon as possible is to
add db record that stores a checksum of the transaction's
marshall buffer (along with the serverid of the changing
process or so), so that we have realistic chance to
tell it was _our_ diff that was applied.

This will allow us to correctly fail in the race condition
you found. I do currently not see a robust means to
prevent the transactions start or the changes inside
the transaction while a recovery is running. So for now,
this will be as good as it gets.

I hope you can test the patch once I have it.

Cheers - Michael

> If we try to recommit the changes, they will
> be rejected, since the RSNs of the records of the set we are trying to
> commit is the same as the latest one. I think, the simplest and the most
> reasonable reaction in this case would be to return an error and let
> clients redo the transaction.

> Originally, I have found this problem in the samba 3.6 code. Both of the
> clients were on the same node. When the client disconnect was discovered on
> the CTDB daemon side, a notification was sent to other clients sleeping in
> sys_poll, which allowed them to react very quickly on the lock holder's
> death.
> 
> As I see, the g_lock mechanism has been significantly improved in the
> latest samba code. It seems not to rely on CONTROL_REGISTER_NOTIFY to
> propagate notifications about client disconnects and tries to acquire the
> locks with intervals of 5-10 seconds. But even now I don't see a way the
> scenario above can be prevented. A timer on one of the clients can expire
> at the right moment for this to happen (though it's probability is very
> small).
> 
> 
> Does it look like the real problem in the latest code?
> 
> 
> Thanks and regards,
> Maxim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 206 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20130222/3f042432/attachment.pgp>