Hitting Could not claim version: NT_STATUS_IO_TIMEOUT

Martin Schwenke martin at meltin.net
Thu May 23 06:27:22 UTC 2019


Hi Richard,

On Thu, 16 May 2019 12:14:39 -0700, Richard Sharpe via samba-technical
<samba-technical at lists.samba.org> wrote:

> I am hitting the following (in 4.7.1) in a CTDB-setup, and am looking
> for hints on how to debug it.
> 
> --------------------
> [2019/05/16 18:49:24.648838,  5, pid=21261, effective(0, 0), real(0,
> 0)] ../lib/dbwrap/dbwrap.c:160(dbwrap_check_lock_order)
>   check lock order 3 for g_lock.tdb
> [2019/05/16 18:49:24.648880, 10, pid=21261, effective(0, 0), real(0,
> 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
>   lock order:  1:<none> 2:<none> 3:g_lock.tdb
> [2019/05/16 18:49:24.648908, 10, pid=21261, effective(0, 0), real(0,
> 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:1112(fetch_locked_internal)
>   Locking db 1294615339 key 73616D62615F76657273
> [2019/05/16 18:49:24.648935,  5, pid=21261, effective(0, 0), real(0,
> 0)] ../lib/dbwrap/dbwrap.c:128(dbwrap_lock_order_state_destructor)
>   release lock order 3 for g_lock.tdb
> [2019/05/16 18:49:24.648958, 10, pid=21261, effective(0, 0), real(0,
> 0)] ../lib/dbwrap/dbwrap.c:115(debug_lock_order)
>   lock order:  1:<none> 2:<none> 3:<none>
> [2019/05/16 18:49:24.648963, 10, pid=21261, effective(0, 0), real(0,
> 0)] ../source3/lib/dbwrap/dbwrap_ctdb.c:980(db_ctdb_record_destr)
>   Unlocking db 1294615339 key 73616D62615F76657273
> [2019/05/16 18:49:24.648977,  1, pid=21261, effective(0, 0), real(0,
> 0)] ../source3/smbd/server.c:1497(smbd_claim_version)
>   smbd_claim_version: g_lock_lock(G_LOCK_READ) failed: NT_STATUS_IO_TIMEOUT
> [2019/05/16 18:49:24.649000,  1, pid=21261, effective(0, 0), real(0,
> 0)] ../source3/smbd/server.c:2024(main)
>   main: Could not claim version: NT_STATUS_IO_TIMEOUT
> ---------------------
> 
> This is a four-node ctdb setup which was upgraded recently, but as far
> as I can see all nodes are at the same version and the problem is
> grabbing the lock.
> 
> This piece of code took the error path:
> 
>         status = g_lock_lock(ctx, name, G_LOCK_READ,
>                              (struct timeval) { .tv_sec = 60 });
>         if (!NT_STATUS_IS_OK(status)) {
>                 DBG_WARNING("g_lock_lock(G_LOCK_READ) failed: %s\n",
>                             nt_errstr(status));
>                 TALLOC_FREE(ctx);
>                 return status;
>         }
> 
> Anyone have any idea what that is about or how to debug it further?

Just so you have a reply: we have seen a variant of this
(G_LOCK_WRITE failing) in our nightly testing but haven't been able
to understand it.  We are looking into it.

The logs usually show smbd_claim_version() timing out about 60
seconds after smbd starts, on up to all 4 nodes, with at least 2
nodes hitting the problem at the same time.  Occasionally it is seen
on only a single node.
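If it helps with debugging, a minimal sketch (untested, against the
4.7 code quoted above) would be to time the g_lock_lock() call in
smbd_claim_version(), to confirm the full 60 second timeout is really
expiring rather than an earlier CTDB error being mapped to
NT_STATUS_IO_TIMEOUT:

        struct timeval start = timeval_current();

        status = g_lock_lock(ctx, name, G_LOCK_READ,
                             (struct timeval) { .tv_sec = 60 });
        if (!NT_STATUS_IS_OK(status)) {
                /* Log how long we actually waited before failing. */
                DBG_WARNING("g_lock_lock(G_LOCK_READ) failed after "
                            "%.1fs: %s\n",
                            timeval_elapsed(&start),
                            nt_errstr(status));
                TALLOC_FREE(ctx);
                return status;
        }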

Before commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca fixed a bug in
the CTDB test scripts, we could see this continue for a long time
because ctdbd restarted smbd independently on each node when the
startup event timed out... but there may have been at least 1 smbd
alive at all times, meaning that (perhaps) g_lock.tdb never got
wiped.

However, it didn't propagate from test to test, so when ctdbd was shut
down on all nodes and then started again, the problem went away,
presumably due to g_lock.tdb being wiped.

Commit 97ad353a67ce0232d7ca5637f1bf8886e2df1aca will probably hide this
now in our testing, because ctdbd will be restarted simultaneously on
all nodes if the cluster does not become healthy.

That's about all we know right now...  :-(

peace & happiness,
martin


