[PATCH] Re: autobuild[master] failure on sn-devel-144 for task ctdb during test (bug 12170)

Martin Schwenke martin at meltin.net
Tue Aug 23 06:14:14 UTC 2016


On Mon, 22 Aug 2016 21:11:21 +1000, Martin Schwenke <martin at meltin.net>
wrote:

> On Mon, 22 Aug 2016 11:10:43 +0200, Stefan Metzmacher
> <metze at samba.org> wrote:
> 
> > Hi Amitay and Martin,
> > 
> > I got the following failure on master (which is just a WHATNEW.txt change)
> > 
> > Can you have a look?
> > 
> > TEST PASSED: tests/simple/77_ctdb_db_recovery.sh (duration: 38s)
> > ==========================================================================
> > --==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--
> > Running test tests/simple/78_ctdb_large_db_recovery.sh (15:14:11)
> > --==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--
> > Cluster is HEALTHY
> > create persistent test database large_persistent_db.tdb
> > wipe test database large_persistent_db.tdb
> > creating dummy record data
> > 1+0 records in
> > 1+0 records out
> > 10240 bytes (10 kB) copied, 0.00140118 s, 7.3 MB/s
> > Adding 345 records
> > Failed to execute "ctdb pstore large_persistent_db.tdb record185
> > /tmp/tmp.kai1PmB3jR" on node(s) "0"
> > connect() failed, errno=111
> > Failed to connect to CTDB daemon
> > (/memdisk/metze/a/b538487/ctdb/ctdb/tests/var/sock.0)
> > *** TEST COMPLETED (RC=1) AT 2016-08-21 15:14:15, CLEANING UP...
> > Restarting CTDB (scheduled)...
> > Attempting to politely shutdown daemons...
> > connect() failed, errno=111
> > Failed to connect to CTDB daemon
> > (/memdisk/metze/a/b538487/ctdb/ctdb/tests/var/sock.0)
> > connect() failed, errno=111
> > Failed to connect to CTDB daemon
> > (/memdisk/metze/a/b538487/ctdb/ctdb/tests/var/sock.1)
> > connect() failed, errno=111
> > Failed to connect to CTDB daemon
> > (/memdisk/metze/a/b538487/ctdb/ctdb/tests/var/sock.2)
> > Sleeping for a while...
> > =1|.|
> > Killing remaining daemons...
> > Starting 3 ctdb daemons...
> > Node 2 will have no public IPs.
> > Waiting for cluster to become ready...
> > <120|...........|11|
> > OK
> > Setting RerecoveryTimeout to 1
> > Forcing a recovery...
> > =2|..|
> > Doing a sync...
> > ctdb is ready
> > ==========================================================================
> > TEST FAILED: tests/simple/78_ctdb_large_db_recovery.sh (status 1)
> > (duration: 29s)
> > ==========================================================================  
> 
> It looks like the ctdbd's all died unexpectedly.  Without the contents
> of /memdisk/metze/a/b538487/ctdb/ctdb/tests/var/daemon.*.log it will be
> impossible to know why.  :-(
> 
> I see a lot of cases like this in
> https://git.samba.org/metze/samba-autobuild/ctdb.stdout but most are in
> restarts after a test result has been decided.
> 
> We're not seeing this in our local overnight tests... I've done a quick
> grep through recent results.
> 
> Were you running another autobuild (private, some other branch?) at
> the same time?  If so, it could be due to
> ctdb/tests/simple/scripts/local_daemons.bash:daemons_stop() killing
> daemons from a parallel test run.  This isn't new and shouldn't
> really come into play, since the daemons should respond to "ctdb
> shutdown".  However, I should obviously fix it, now that I've noticed
> it!  Will try to do that tomorrow... too tired now.

It seems pretty clear that this is the issue.  Each time I see it in
the log, all ctdbd processes have gone away.  The timeout after
"ctdb shutdown" and before killing was only 1 second, so with 2 separate
autobuilds (e.g. master, 4.5) running in parallel, it is quite likely
that one run kills the other run's daemons.  The change has been in
master since March, but the failure was unlikely before 4.5 branched
because parallel autobuilds on the same machine were less common.

The fix is to use ctdbd_wrapper to shut down daemons.  This uses the
PID file rather than pkill with a weak pattern.  The code is now much
better!
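
To illustrate the difference, here is a rough sketch of stopping a
daemon via its PID file.  This is not the code from the patch and not
what ctdbd_wrapper does internally: the function name stop_by_pidfile,
its arguments and the default grace period are all made up for this
example.  The point is only that the signals go to the one PID recorded
in the file, whereas a pkill pattern can match any process on the
machine with a similar command line, including daemons that belong to
another test run.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): stop exactly one daemon,
# identified by the PID recorded in its PID file.
stop_by_pidfile ()
{
	pidfile="$1"
	grace="${2:-10}"	# seconds to wait before SIGKILL (assumed default)

	# Without a readable PID file there is nothing we can safely signal.
	[ -r "$pidfile" ] || return 1
	pid=$(cat "$pidfile")
	case "$pid" in
	''|*[!0-9]*) return 1 ;;	# refuse anything that is not a plain PID
	esac

	# Ask politely first.  If the process is already gone, we are done.
	kill -TERM "$pid" 2>/dev/null || return 0

	# Poll once per second until the process exits or the grace
	# period runs out.
	while [ "$grace" -gt 0 ] ; do
		kill -0 "$pid" 2>/dev/null || return 0
		sleep 1
		grace=$((grace - 1))
	done

	# Still running after the grace period: force it, but only this PID.
	kill -KILL "$pid" 2>/dev/null
	return 0
}
```

With something like this, each test run would call
stop_by_pidfile "$its_own_var_dir/ctdbd.pid" (the path is also just an
assumption), so two runs on the same machine never signal each other's
daemons, however short the grace period is.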

Please review and push...

peace & happiness,
martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb.patch
Type: text/x-patch
Size: 11969 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160823/d6c3b007/ctdb.bin>