ctdb autobuild problems

Martin Schwenke martin at meltin.net
Sat Dec 31 04:30:19 UTC 2016


On Fri, 30 Dec 2016 16:53:44 +1100, Martin Schwenke <martin at meltin.net>
wrote:

> On Thu, 29 Dec 2016 10:59:44 +0100, Stefan Metzmacher <metze at samba.org>
> wrote:
> 
> > I noticed more ctdb-related flakey tests recently.
> > I guess it's related to the push on December 18th or 19th.
> > 
> > For a long time we had failures in
> > tests/simple/54_transaction_loop_recovery.sh, but it seems the push
> > on December 2nd fixed it.
> > Now it's back together with a few others:
> > 
> > 2016-11-11-0425/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 60s)
> > 2016-11-11-0425/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-17-1226/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-11-17-1226/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-19-1222/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-11-19-1222/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-22-0819/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 59s)
> > 2016-11-22-0819/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-23-0423/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-11-23-0423/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-27-0418/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-11-27-0418/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-29-0422/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 60s)
> > 2016-11-29-0422/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-29-0822/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 59s)
> > 2016-11-29-0822/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-30-0422/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 60s)
> > 2016-11-30-0422/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-11-30-0827/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-11-30-0827/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-12-02-0826/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 60s)
> > 2016-12-02-0826/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-12-21-0031/ctdb.stdout:TEST FAILED: tests/simple/76_ctdb_pdb_recovery.sh (status 1) (duration: 64s)
> > 2016-12-21-0031/ctdb.stdout:*FAILED* tests/simple/76_ctdb_pdb_recovery.sh
> > 2016-12-21-0415/ctdb.stdout:FAILED
> > 2016-12-21-0415/ctdb.stdout:TEST FAILED: tests/eventd/eventd_022.sh (status 1) (duration: 11s)
> > 2016-12-21-0415/ctdb.stdout:*FAILED* tests/eventd/eventd_022.sh
> > 2016-12-25-0826/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)
> > 2016-12-25-0826/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 2016-12-25-1625/ctdb.stdout:TEST FAILED: tests/simple/70_recoverpdbbyseqnum.sh (status 1) (duration: 34s)
> > 2016-12-25-1625/ctdb.stdout:*FAILED* tests/simple/70_recoverpdbbyseqnum.sh
> > 2016-12-28-0815/ctdb.stdout:TEST FAILED: tests/simple/18_ctdb_reloadips.sh (status 1) (duration: 33s)
> > 2016-12-28-0815/ctdb.stdout:*FAILED* tests/simple/18_ctdb_reloadips.sh
> > 2016-12-28-1221/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 63s)
> > 2016-12-28-1221/ctdb.stdout:*FAILED* tests/simple/54_transaction_loop_recovery.sh
> > 
> > Can you have a look at it?  
> 
> Yeah, for some reason we seem to have a bit more flakiness.  I'm
> running some tests to try to recreate it...

OK, I have a theory for the "simple" tests but haven't been able to
drill all the way down to understand what is happening.  I'll do a
brain dump because I need to go and do other things...  :-)

It isn't really that individual tests are flakey.  Instead, one node is
already in a bad state (banned) at the beginning of a test, so that test
fails immediately without trying to do anything useful.  The problem
actually occurs in the clean-up at the end of the previous test.
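
One way to see this in the quoted failure list is to tally the
"TEST FAILED" lines per test.  The following is only a rough sketch: it
assumes each entry sits on a single line (wrapped entries rejoined
first), the function names are made up for illustration, and SAMPLE is
just a subset of the quoted lines.  Note that it also counts the eventd
failure, which the theory here does not cover.

```python
import re
from collections import defaultdict

# Matches an unwrapped autobuild grep line such as:
#   2016-12-28-0815/ctdb.stdout:TEST FAILED: tests/simple/18_ctdb_reloadips.sh (status 1) (duration: 33s)
FAILED_RE = re.compile(
    r"^(?P<run>\d{4}-\d{2}-\d{2}-\d{4})/ctdb\.stdout:TEST FAILED:\s+"
    r"(?P<test>\S+)\s+\(status \d+\)"
)


def tally_failures(lines):
    """Return {test path: [run id, ...]} for every TEST FAILED line."""
    failures = defaultdict(list)
    for line in lines:
        m = FAILED_RE.match(line.strip())
        if m:
            failures[m.group("test")].append(m.group("run"))
    return dict(failures)


def distinct_tests_since(failures, first_run):
    """Tests with at least one failing run whose id sorts at or after first_run.

    Run ids are zero-padded (YYYY-MM-DD-HHMM), so plain string comparison
    orders them chronologically.
    """
    return sorted(
        test for test, runs in failures.items()
        if any(run >= first_run for run in runs)
    )


# A subset of the quoted list, rejoined to one entry per line.
SAMPLE = [
    "2016-12-02-0826/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 60s)",
    "2016-12-21-0031/ctdb.stdout:TEST FAILED: tests/simple/76_ctdb_pdb_recovery.sh (status 1) (duration: 64s)",
    "2016-12-21-0415/ctdb.stdout:TEST FAILED: tests/eventd/eventd_022.sh (status 1) (duration: 11s)",
    "2016-12-25-0826/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 61s)",
    "2016-12-25-1625/ctdb.stdout:TEST FAILED: tests/simple/70_recoverpdbbyseqnum.sh (status 1) (duration: 34s)",
    "2016-12-28-0815/ctdb.stdout:TEST FAILED: tests/simple/18_ctdb_reloadips.sh (status 1) (duration: 33s)",
    "2016-12-28-1221/ctdb.stdout:TEST FAILED: tests/simple/54_transaction_loop_recovery.sh (status 1) (duration: 63s)",
]

if __name__ == "__main__":
    failures = tally_failures(SAMPLE)
    for test in distinct_tests_since(failures, "2016-12-18"):
        print(test, failures[test])
```

Failures spread across several unrelated tests after a given push, rather
than concentrated in one, would be consistent with a problem carried over
from the previous test's clean-up.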

The reason is that ctdb_takeover_helper is timing out on a
TAKEOVER_IP.  I can see eventd run the "takeip" event and can see the
event complete, so I think that part is solid.  However, ctdbd thinks
that the TAKEOVER_IP is still in flight, so it appears that
ctdb_do_takeip_callback() isn't being called.  The lack of a "sending
TAKE_IP for" message in the logs seems to confirm this.

I'll try to look at this some more in the next couple of days.  It
shouldn't be too hard to track down...  (famous last words)

peace & happiness,
martin