Hitting Could not claim version: NT_STATUS_IO_TIMEOUT

Martin Schwenke martin at meltin.net
Thu May 23 06:34:23 UTC 2019


On Thu, 23 May 2019 16:27:22 +1000, Martin Schwenke <martin at meltin.net>
wrote:

> On Thu, 16 May 2019 12:14:39 -0700, Richard Sharpe via samba-technical
> <samba-technical at lists.samba.org> wrote:
> 
> > I am hitting the following (in 4.7.1) in a CTDB-setup, and am looking
> > for hints on how to debug it.
> > 
> > --------------------
> > [2019/05/16 18:49:24.648838,  5, pid=21261, effective(0, 0), real(0,
> > 0)] ../lib/dbwrap/dbwrap.c:160(dbwrap_check_lock_order)
> >   check lock order 3 for g_lock.tdb
> > [2019/05/16 18:49:24.648880, 10, pid=21261, effective(0, 0), real(0,
> > 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
> >   lock order:  1:<none> 2:<none> 3:g_lock.tdb
> > [2019/05/16 18:49:24.648908, 10, pid=21261, effective(0, 0), real(0,
> > 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:1112(fetch_locked_internal)
> >   Locking db 1294615339 key 73616D62615F76657273
> > [2019/05/16 18:49:24.648935,  5, pid=21261, effective(0, 0), real(0,
> > 0)] ../lib/dbwrap/dbwrap.c:128(dbwrap_lock_order_state_destructor)
> >   release lock order 3 for g_lock.tdb
> > [2019/05/16 18:49:24.648958, 10, pid=21261, effective(0, 0), real(0,
> > 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
> >   lock order:  1:<none> 2:<none> 3:<none>
> > [2019/05/16 18:49:24.648963, 10, pid=21261, effective(0, 0), real(0,
> > 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:980(db_ctdb_record_destr)
> >   Unlocking db 1294615339 key 73616D62615F76657273
> > [2019/05/16 18:49:24.648977,  1, pid=21261, effective(0, 0), real(0,
> > 0)] ../source3/smbd/server.c:1497(smbd_claim_version)
> >   smbd_claim_version: g_lock_lock(G_LOCK_READ) failed: NT_STATUS_IO_TIMEOUT
> > [2019/05/16 18:49:24.649000,  1, pid=21261, effective(0, 0), real(0,
> > 0)] ../source3/smbd/server.c:2024(main)
> >   main: Could not claim version: NT_STATUS_IO_TIMEOUT
> > ---------------------
> > 
> > This is a four-node ctdb setup which was upgraded recently, but as far
> > as I can see all nodes are at the same version and the problem is
> > grabbing the lock.
> > 
> > This piece of code took the error path:
> > 
> >         status = g_lock_lock(ctx, name, G_LOCK_READ,
> >                              (struct timeval) { .tv_sec = 60 });
> >         if (!NT_STATUS_IS_OK(status)) {
> >                 DBG_WARNING("g_lock_lock(G_LOCK_READ) failed: %s\n",
> >                             nt_errstr(status));
> >                 TALLOC_FREE(ctx);
> >                 return status;
> >         }
> > 
> > Anyone have any idea what that is about or how to debug it further?  
> 
> Just so you have a reply, we have seen a variant of this
> (G_LOCK_WRITE failed) in our nightly testing but haven't been able
> to understand it.  We are still looking into it.
> 
> The logs usually show smbd_claim_version() timing out on all 4 nodes
> about 60 seconds after starting, with at least 2 nodes having the
> problem at the same time.  Occasionally this is seen on only a single
> node.
> 
> Before commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca fixed a bug in
> the CTDB test scripts we could see this continuing for a long time as
> ctdbd restarted smbd on nodes independently when the startup event
> timed out... but there may have been 1 smbd alive at all times,
> meaning that (perhaps) g_lock.tdb didn't get wiped.
> 
> However, it didn't propagate from test to test, so when ctdbd was shut
> down on all nodes and then started again, the problem went away,
> presumably due to g_lock.tdb being wiped.
> 
> Commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca will probably hide this
> now in our testing, because ctdbd will be restarted simultaneously on
> all nodes if the cluster does not become healthy.
> 
> That's about all we know right now...  :-(

Oh!  One more thing.  If ctdbd were trying to migrate a record and was
unable to take a record lock for more than 10 seconds then it should
log a message saying "Unable to get RECORD lock on database ...".
However, we don't see that.
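
If anyone hits this again, it would be worth inspecting the lock
database before restarting anything, since the symptoms above suggest
stale records in g_lock.tdb.  A rough sketch of the checks I'd try,
assuming a standard CTDB install (these are real ctdb subcommands, but
output and record contents will vary per cluster):

```shell
# List the volatile databases CTDB is tracking; g_lock.tdb should appear.
ctdb getdbmap

# Dump the current contents of g_lock.tdb to see which PIDs hold locks.
ctdb catdb g_lock.tdb

# Look for signs of stalled record migration (hop counts, lock waits).
ctdb dbstatistics g_lock.tdb

# Last resort: wipe g_lock.tdb cluster-wide to clear stale lock records.
# This matches the "went away after a full restart" behaviour above.
# Only do this for volatile databases like g_lock.tdb.
ctdb wipedb g_lock.tdb
```

Note that wipedb discards all records in the database on every node, so
it is only a reasonable workaround here because g_lock.tdb is transient
state that is rebuilt as processes take locks.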

peace & happiness,
martin
