Hitting Could not claim version: NT_STATUS_IO_TIMEOUT
Martin Schwenke
martin at meltin.net
Thu May 23 06:34:23 UTC 2019
On Thu, 23 May 2019 16:27:22 +1000, Martin Schwenke <martin at meltin.net>
wrote:
> On Thu, 16 May 2019 12:14:39 -0700, Richard Sharpe via samba-technical
> <samba-technical at lists.samba.org> wrote:
>
> > I am hitting the following (in 4.7.1) in a CTDB-setup, and am looking
> > for hints on how to debug it.
> >
> > --------------------
> > [2019/05/16 18:49:24.648838, 5, pid=21261, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:160(dbwrap_check_lock_order)
> > check lock order 3 for g_lock.tdb
> > [2019/05/16 18:49:24.648880, 10, pid=21261, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
> > lock order: 1:<none> 2:<none> 3:g_lock.tdb
> > [2019/05/16 18:49:24.648908, 10, pid=21261, effective(0, 0), real(0, 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:1112(fetch_locked_internal)
> > Locking db 1294615339 key 73616D62615F76657273
> > [2019/05/16 18:49:24.648935, 5, pid=21261, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:128(dbwrap_lock_order_state_destructor)
> > release lock order 3 for g_lock.tdb
> > [2019/05/16 18:49:24.648958, 10, pid=21261, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
> > lock order: 1:<none> 2:<none> 3:<none>
> > [2019/05/16 18:49:24.648963, 10, pid=21261, effective(0, 0), real(0, 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:980(db_ctdb_record_destr)
> > Unlocking db 1294615339 key 73616D62615F76657273
> > [2019/05/16 18:49:24.648977, 1, pid=21261, effective(0, 0), real(0, 0)] ../source3/smbd/server.c:1497(smbd_claim_version)
> > smbd_claim_version: g_lock_lock(G_LOCK_READ) failed: NT_STATUS_IO_TIMEOUT
> > [2019/05/16 18:49:24.649000, 1, pid=21261, effective(0, 0), real(0, 0)] ../source3/smbd/server.c:2024(main)
> > main: Could not claim version: NT_STATUS_IO_TIMEOUT
> > ---------------------
> >
> > This is a four-node ctdb setup which was upgraded recently, but as far
> > as I can see all nodes are at the same version and the problem is
> > grabbing the lock.
> >
> > This piece of code took the error path:
> >
> > status = g_lock_lock(ctx, name, G_LOCK_READ,
> >                      (struct timeval) { .tv_sec = 60 });
> > if (!NT_STATUS_IS_OK(status)) {
> >         DBG_WARNING("g_lock_lock(G_LOCK_READ) failed: %s\n",
> >                     nt_errstr(status));
> >         TALLOC_FREE(ctx);
> >         return status;
> > }
> >
> > Anyone have any idea what that is about or how to debug it further?
>
> Just so you have a reply: we have seen a variant of this
> (G_LOCK_WRITE failing) in our nightly testing but haven't been able
> to understand it. We are looking into it.
>
> The logs usually show smbd_claim_version() timing out on all 4 nodes
> about 60 seconds after starting, with at least 2 nodes having the
> problem at the same time. Occasionally this is seen on only a single
> node.
>
> Before commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca fixed a bug in
> the CTDB test scripts we could see this continuing for a long time as
> ctdbd restarted smbd on nodes independently when the startup event
> timed out... but there may have been 1 smbd alive at all times, meaning
> that (perhaps) g_lock.tdb didn't get wiped.
>
> However, it didn't propagate from test to test, so when ctdbd was shut
> down on all nodes and then started again, the problem went away,
> presumably due to g_lock.tdb being wiped.
>
> Commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca will probably hide this
> now in our testing, because ctdbd will be restarted simultaneously on
> all nodes if the cluster does not become healthy.
>
> That's about all we know right now... :-(
Oh! One more thing. If ctdbd were trying to migrate a record and was
unable to take a record lock for more than 10 seconds then it should
log a message saying "Unable to get RECORD lock on database ...".
However, we don't see that.
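
If anyone wants to check the stale-g_lock-records theory directly on a
node, a sketch along these lines might help (ctdb subcommand names as
documented in ctdb(1); behaviour can differ between versions, so
double-check before wiping anything):

```shell
#!/bin/sh
# List attached databases and confirm g_lock.tdb is present
# (it is a volatile database, so it should be empty after a
# clean cluster-wide restart).
ctdb getdbmap

# Dump the current g_lock records on this node; entries left
# behind by a dead smbd would show up here.
ctdb catdb g_lock.tdb

# Last resort: wipe the database cluster-wide. This drops all
# g_lock state, so only do this while smbd is stopped on every
# node.
# ctdb wipedb g_lock.tdb
```

The wipedb line is commented out deliberately, since stopping smbd
everywhere first is the safe order of operations.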
peace & happiness,
martin