[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)

Isaac Stone isaac.stone at som.com
Fri Oct 15 18:55:10 UTC 2021


Hi Martin

Thanks for your analysis; it seems my earlier diagnosis was wrong.

When I said "problem starts here" I meant the problem of inaccessible
folders, which is the ultimate problem we are trying to solve.

The folders remained inaccessible until we wiped all of the volatile
databases and restarted the cluster (sketched below). Something keeps
breaking and then failing to recover.
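
For context, the wipe is nothing clever: stop ctdb, delete the per-node
volatile TDBs, and start ctdb again. Roughly what that amounts to on each
node (a sketch only, assuming the volatile directory shown in the smbd logs
below and the stock ctdb systemd unit):

    # rough sketch: stop ctdb, drop the volatile TDBs, start ctdb again
    import glob
    import os
    import subprocess

    VOLATILE_DIR = "/var/cache/dbdir/volatile"   # path from the smbd logs

    subprocess.run(["systemctl", "stop", "ctdb"], check=True)   # assumes ctdb.service
    for tdb in glob.glob(os.path.join(VOLATILE_DIR, "*.tdb.*")):
        os.remove(tdb)   # locking.tdb.N, brlock.tdb.N, smbXsrv_*.tdb.N, ...
    subprocess.run(["systemctl", "start", "ctdb"], check=True)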

Digging deeper into the logs I found some problems starting earlier.

It looks like the initial cause is a snafu in the distributed filesystem we
are using. I will take that up with them now.

If you want to dig into why ctdb was unable to recover on its own, I have
put my analysis below and will attach the log files. Beware: they are long
and repetitive.

---------------------------------------------------------------------------------------------------------------------------------

The first indication is on node 1, at 1am GMT:

   Oct 14 01:04:13 samba-ape1-1 mount.objectivefs[2384091]: Diskcache reads
slower than 247 ms, temporarily reducing disk reads
   Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/5: page allocation
failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
   Oct 14 01:04:13 samba-ape1-1 kernel: swapper/1: page allocation failure:
order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
   Oct 14 01:04:13 samba-ape1-1 kernel: ,cpuset=/,mems_allowed=0
   Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/4: page allocation
failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP),
nodemask=(null),cpuset=/,mems_allowed=0
   Oct 14 01:04:13 samba-ape1-1 kernel: CPU: 4 PID: 36 Comm: ksoftirqd/4
Kdump: loaded Tainted: G        W    L   --------- -  -
4.18.0-240.15.1.el8_3.x86_64 #1
   Oct 14 01:04:13 samba-ape1-1 kernel: Hardware name: Amazon EC2
i3en.2xlarge/, BIOS 1.0 10/16/2017
   Oct 14 01:04:13 samba-ape1-1 kernel: Call Trace:
   Oct 14 01:04:13 samba-ape1-1 kernel: swapper/6: page allocation failure:
order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=
   Oct 14 01:04:13 samba-ape1-1 kernel: dump_stack+0x5c/0x80
   Oct 14 01:04:13 samba-ape1-1 kernel: mount.objective: page allocation
failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)

`mount.objectivefs` is the distributed filesystem we are using. I will
raise this bug with them.

Things run smoothly for about an hour, then we hit another page allocation
failure:

    Oct 14 02:22:42 samba-ape1-1 kernel: warn_alloc: 147 callbacks
suppressed
    Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/2: page allocation
failure: order:0, mode:0x480020(GFP_ATOMIC),
nodemask=(null),cpuset=/,mems_allowed=0
    Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/7: page allocation
failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
    Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/0: page allocation
failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
    Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/5: page allocation
failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
    Oct 14 02:22:42 samba-ape1-1 kernel: ,cpuset=77
    Oct 14 02:22:42 samba-ape1-1 kernel: ksoftirqd/4: page allocation
failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
    Oct 14 02:22:42 samba-ape1-1 kernel: /
    Oct 14 02:22:42 samba-ape1-1 kernel: ,cpuset=
    Oct 14 02:22:42 samba-ape1-1 kernel: ,mems_allowed=0
    Oct 14 02:22:42 samba-ape1-1 kernel: /
    Oct 14 02:22:42 samba-ape1-1 kernel: /,mems_allowed=0
    Oct 14 02:22:42 samba-ape1-1 kernel: CPU: 7 PID: 54 Comm: ksoftirqd/7
Kdump: loaded Tainted: G        W    L   --------- -  -
4.18.0-240.15.1.el8_3.x86_64 #1
    Oct 14 02:22:42 samba-ape1-1 kernel: ,mems_allowed=0
    Oct 14 02:22:42 samba-ape1-1 kernel: Hardware name: Amazon EC2
i3en.2xlarge/, BIOS 1.0 10/16/2017

The call trace is different this time.

After a few hundred lines of page allocation failures and suppressed
callbacks, we get these:

  Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to
alloc buffer for rx queue 6
  Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx
qid 6 with only 106 buffers (from 185)
  Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to
alloc buffer for rx queue 1
  Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx
qid 1 with only 55 buffers (from 129)

The log then alternates between eth0 errors and page allocation failures for
a while.

Finally ctdb starts to complain:

  Oct 14 02:24:38 samba-ape1-1 kernel: ret_from_fork+0x35/0x40
  Oct 14 02:24:38 samba-ape1-1 chronyd[6557]: Forward time jump detected!
  Oct 14 02:24:38 samba-ape1-1 chronyd[6557]: Can't synchronise: no
selectable sources
  Oct 14 02:24:38 samba-ape1-1 ctdbd[4128234]: No event for 15 seconds!

  Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: Tearing down
connection to dead node :4
  Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node
10.1.3.19:4379 is dead: 3 connected
  Oct 14 02:24:40 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected
to 10.1.3.19:4379 - 4 connected
  Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming
queue active, rejecting connection from 10.5.21.119
  Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming
queue active, rejecting connection from 10.5.21.119
  Oct 14 02:24:42 samba-ape1-1 ctdbd[4128234]: ctdb_listen_event: Incoming
queue active, rejecting connection from 10.5.21.119
  Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_client.c:680 control timed out. reqid:118226989
opcode:91 dstnode:2
  Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_client.c:793 ctdb_control_recv failed
  Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_client.c:1047 ctdb_control for getnodes failed
ret:-1 res:-1
  Oct 14 02:24:43 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_recoverd.c:2377 Unable to get nodemap from recovery
master 2
  Oct 14 02:24:44 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: Found a Tx
that wasn't completed on time, qid 6, index 19.

Meanwhile on node 2 (recovery master at this time):

  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: dead count reached for node 1
  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: 10.5.21.119:4379: node
10.13.3.99:4379 is dead: 3 connected
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: Async operation failed
with ret=-1 res=-1 opcode=80
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]: Async wait failed -
fail_count=1
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]:
../../ctdb/server/ctdb_client.c:1622 Failed to read node capabilities.
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]:
../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
  Oct 14 02:24:42 samba-usw2-2 ctdb-recoverd[30454]:
../../ctdb/server/ctdb_recoverd.c:2528 Unable to update node capabilities.
  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:42 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:45 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:45 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:47 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:48 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:48 samba-usw2-2 ctdbd[30352]: Tearing down connection to
dead node :1
  Oct 14 02:24:51 samba-usw2-2 ctdbd[30352]: 10.5.21.119:4379: connected to
10.13.3.99:4379 - 4 connected
  Oct 14 02:24:52 samba-usw2-2 ctdb-recoverd[30454]: Node 1 has changed
flags - now 0x0  was 0x2
  Oct 14 02:24:52 samba-usw2-2 ctdb-recoverd[30454]: Remote node 1 had
flags 0x0, local had 0x2 - updating local

About this time, "high hopcount" messages start to appear on nodes 1 and 2:

  Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_client.c:625 reqid 118226989 not found
  Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: Tearing down connection to
dead node :2
  Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node
10.5.21.119:4379 is dead: 3 connected
  Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:51 samba-ape1-1 ctdb-recoverd[4128336]: Async operation
failed with ret=-1 res=-1 opcode=80
  Oct 14 02:24:51 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected
to 10.5.21.119:4379 - 4 connected
  Oct 14 02:24:52 samba-ape1-1 ctdb-recoverd[4128336]: Node 1 has changed
flags - now 0x0  was 0x2
  Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: dead count reached for node 3
  Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: Tearing down connection to
dead node :3
  Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: node
10.9.11.95:4379 is dead: 3 connected
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: ctdb_control error:
'node is disconnected'
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: Async operation
failed with ret=-1 res=-1 opcode=80
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]: Async wait failed -
fail_count=2
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_client.c:1622 Failed to read node capabilities.
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_recoverd.c:370 Failed to get node capabilities
  Oct 14 02:24:53 samba-ape1-1 ctdb-recoverd[4128336]:
../../ctdb/server/ctdb_recoverd.c:2528 Unable to update node capabilities.
  Oct 14 02:24:53 samba-ape1-1 ctdbd[4128234]: 10.13.3.99:4379: connected
to 10.9.11.95:4379 - 4 connected
  Oct 14 02:25:00 samba-ape1-1 ctdbd[4128234]: High hopcount 97
dbid:locking.tdb key:0x92424339 reqid=088ff395 pnn:1 src:1 lmaster:2
header->dmaster:0 dst:2
  Oct 14 02:25:00 samba-ape1-1 ctdbd[4128234]: High hopcount 99
dbid:locking.tdb key:0x92424339 reqid=088ff395 pnn:1 src:1 lmaster:2
header->dmaster:0 dst:2

Hopcounts hit 1000 in about a minute, 10000 in 12 minutes, and 100000 in 2
hours. They continue to increase until 12:54, which is about when node 2 is
restarted (boot logs on node 2 start at 12:52).
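
Back-of-the-envelope, those milestones work out to a roughly constant hop
rate, which looks like the same requests bouncing between nodes continuously
rather than a one-off burst (a quick sketch; the elapsed times are my own
estimates from the log timestamps):

    # hop rates implied by the milestones above
    milestones = [
        (1_000, 60),                      # ~1000 after about a minute
        (10_000, 12 * 60),                # ~10000 after about 12 minutes
        (100_000, 2 * 3600),              # ~100000 after about 2 hours
        (377_097, 10 * 3600 + 23 * 60),   # peak at 12:47:56, ~10h23m after 02:25
    ]
    for hops, seconds in milestones:
        print(f"{hops:>7} hops in {seconds:>6} s -> {hops / seconds:4.1f} hops/s")
    # prints roughly 17, 14, 14 and 10 hops/s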

  Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 377097
dbid:locking.tdb key:0x807ae9d7 reqid=287ce281 pnn:1 src:0 lmaster:2
header->dmaster:0 dst:2
  Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 13899
dbid:locking.tdb key:0x807ae9d7 reqid=089a85ec pnn:1 src:1 lmaster:2
header->dmaster:0 dst:2
  Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 84098
dbid:locking.tdb key:0x92424339 reqid=0b7fffab pnn:1 src:3 lmaster:2
header->dmaster:0 dst:2
  Oct 14 12:54:10 samba-ape1-1 ctdbd[4128234]: High hopcount 97
dbid:locking.tdb key:0x92424339 reqid=28b5077a pnn:1 src:0 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:54:10 samba-ape1-1 rsyslogd[3489957]: imjournal: 25996 messages
lost due to rate-limiting (20000 allowed within 600 seconds)
  Oct 14 12:54:10 samba-ape1-1 ctdbd[4128234]: High hopcount 99
dbid:locking.tdb key:0x92424339 reqid=0de2611b pnn:1 src:4 lmaster:2
header->dmaster:4 dst:2
  Oct 14 12:54:10 samba-ape1-1 smbd[2710455]: [2021/10/14 12:54:10.899136,
 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
  Oct 14 12:54:10 samba-ape1-1 smbd[2710455]:  db_ctdb_fetch_locked for
/var/cache/dbdir/volatile/locking.tdb.1 key
7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1
attempts, 31800705 milliseconds, chainlock: 0.027000 ms, CTDB 31800705.
  Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 397
dbid:locking.tdb key:0x92424339 reqid=0de25a99 pnn:1 src:4 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 399
dbid:locking.tdb key:0x92424339 reqid=0de25b5d pnn:1 src:4 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:54:13 samba-ape1-1 ctdbd[4128234]: High hopcount 399
dbid:locking.tdb key:0x92424339 reqid=0de25abb pnn:1 src:4 lmaster:2
header->dmaster:3 dst:2

Both high-hopcount and db_ctdb_fetch_locked errors appear in the logs until
12:55, after which only db_ctdb_fetch_locked errors remain:

  Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 699
dbid:locking.tdb key:0x92424339 reqid=089adfea pnn:1 src:1 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 697
dbid:locking.tdb key:0x92424339 reqid=089adfa7 pnn:1 src:1 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:54:58 samba-ape1-1 ctdbd[4128234]: High hopcount 697
dbid:locking.tdb key:0x92424339 reqid=089adfa0 pnn:1 src:1 lmaster:2
header->dmaster:3 dst:2
  Oct 14 12:55:00 samba-ape1-1 smbd[2824067]: [2021/10/14 12:55:00.854317,
 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
  Oct 14 12:55:00 samba-ape1-1 smbd[2824067]:  db_ctdb_fetch_locked for
/var/cache/dbdir/volatile/locking.tdb.1 key
7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1
attempts, 15885080 milliseconds, chainlock: 0.074000 ms, CTDB 15885079.
  Oct 14 12:55:02 samba-ape1-1 smbd[2684088]: [2021/10/14 12:55:02.763524,
 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
  Oct 14 12:55:02 samba-ape1-1 smbd[2684088]:  db_ctdb_fetch_locked for
/var/cache/dbdir/volatile/locking.tdb.1 key
7A70496E4466F9ADBE3DF538430000000000000000000000, chain 91584 needed 1
attempts, 34119752 milliseconds, chainlock: 0.038000 ms, CTDB 34119751.
  Oct 14 12:55:03 samba-ape1-1 smbd[2838843]: [2021/10/14 12:55:03.073739,
 0] ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)

Errors for the exact same key continue for a couple of hours, until the
volatile db directory is wiped and ctdb is restarted.
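
For scale, converting the fetch_locked wait times above into wall-clock terms
(my own back-of-the-envelope; the millisecond values come from the log lines
above):

    from datetime import datetime, timedelta

    # db_ctdb_fetch_locked wait times logged above, keyed by log timestamp
    waits_ms = {
        "12:54:10": 31_800_705,
        "12:55:00": 15_885_080,
        "12:55:02": 34_119_752,
    }
    for logged_at, ms in waits_ms.items():
        wait = timedelta(milliseconds=ms)
        ts = datetime.strptime(f"2021-10-14 {logged_at}", "%Y-%m-%d %H:%M:%S")
        print(f"{ms:>10} ms = {wait}  (lock requested around {(ts - wait).time()})")
    # waits of ~4.4 to ~9.5 hours, i.e. requests that had been stuck since
    # roughly 03:26-08:30, well after the 02:24 network trouble began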


On Thu, Oct 14, 2021 at 11:48 PM Martin Schwenke <martin at meltin.net> wrote:

> Hi Isaac,
>
> Comments inline below...
>
> On Thu, 14 Oct 2021 14:20:38 -0700, Isaac Stone <isaac.stone at som.com>
> wrote:
>
> > Hi, the same problem happened again this morning; I have some more
> > context from the logs.
> >
> > There was an election triggered by a dead node
> >
> > In our configuration only nodes 0 and 2 have `lmaster capability=true`
> > and `recmaster capability=true` in /etc/ctdb/ctdb.conf.
> > On other nodes these are set to false.
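> >
> > i.e. something like this in /etc/ctdb/ctdb.conf (a sketch from memory; I
> > believe the capability options live in the [legacy] section):
> >
> >   [legacy]
> >     # true only on nodes 0 and 2, false everywhere else
> >     lmaster capability = true
> >     recmaster capability = true
> >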
> >
> > Looking at this
> >
> > Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Tearing down connection to dead node :2
> > Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > 10.1.3.19:4379: node 10.5.21.119:4379 is dead: 3 connected
> > Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01
> > ctdb-recoverd[32407]: Recmaster node 2 is disconnected/deleted. Force
> > election
> > Oct 14 12:52:26 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Recovery mode set to ACTIVE
> > Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: This
> > node (4) is now the recovery master
> > Oct 14 12:52:27 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Remote
> > node (3) is now the recovery master
> > Oct 14 12:52:32 samba-server-s02-usw2-use1-follower-01
> > ctdb-recoverd[32407]: Election period ended, master=0
> >
> > It seems there was a network interruption in the connection to node 2,
> > which was the recmaster/lmaster.
> >
> > Does it mean nodes 3 and 4 were both elected? Or does `master=0` mean 0
> > is master?
> >
> > node 0 _should_ be the master.
>
> No, it means that nodes 4 and then 3 won rounds of the election.  If
> you look in the log on node 3 you'll probably find that it noticed the
> dead node at about the same time, so also initiated an election.  Both
> nodes will have let the other win.  I now have a patch to demote those
> election progress messages to INFO level.
>
> The final message indicates who won (i.e. 0) - the election ends via a
> timeout.
>
> So, this is a bug (and those messages were useful).  A node that can't
> win an election shouldn't start an election with a standard vote for
> itself. A node that can't win an election should probably go into
> recovery, set the master to the unknown value and then wait for another
> node to win an election.  The biggest problem is if no node can be
> master then the cluster sits in recovery until a potential master
> reappears. We could treat the recmaster capability as a preference and
> allow 2nd tier nodes to be recmaster but that is probably confusing...
> and much harder to implement.
>
> I won't try to fix this now because that would touch code that I'm
> modifying for another change.  These other changes should also avoid
> the above confusion if you're using the recovery lock.  I'm planning to
> convert the recovery lock to a cluster lock.  If this is in use then
> the first node to take the lock is leader/recovery-master and no other
> form of election is run.  This avoids a class of problems where nodes
> elect themselves before they are connected to other nodes and then are
> unable to take the lock, so may get banned... or where there are other
> election bugs...  ;-)
>
> After I've done that, I'll trivially fix this problem.
>
> Note that lmaster is a per-record concept:
>
>
> https://wiki.samba.org/index.php/CTDB_database_design#Distributed_database_design
>
> The capability indicates a node's ability to be lmaster for records.
>
> > This seems to be the start of the problem
> >
> > Oct 14 12:52:38 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Recovery has started
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: smbXsrv_open_global.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: leases.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: locking.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: brlock.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: smbXsrv_tcon_global.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: smbXsrv_session_global.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: smbXsrv_client_global.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: smbXsrv_version_global.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: netlogon_creds_cli.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: g_lock.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: printer_list.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: share_info.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: secrets.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: account_policy.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: registry.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: group_mapping.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: passdb.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: winbindd_idmap.tdb
> > Oct 14 12:52:39 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> Freeze
> > db: ctdb.tdb
> > Oct 14 12:52:41 samba-server-s02-usw2-use1-follower-01
> > process-agent[575510]: 2021-10-14 12:52:41 UTC | PROCESS | INFO |
> > (collector.go:160 in runCheck) | Finish container check #846640 in
> 19.823µs
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: share_info.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db share_info.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: secrets.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db secrets.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: account_policy.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db account_policy.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: passdb.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db passdb.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: winbindd_idmap.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db winbindd_idmap.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: group_mapping.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db group_mapping.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: registry.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db registry.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: ctdb.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db ctdb.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: smbXsrv_client_global.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db smbXsrv_client_global.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: smbXsrv_tcon_global.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db smbXsrv_tcon_global.tdb
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: smbXsrv_version_global.tdb generation 1805205230
> > Oct 14 12:52:50 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db smbXsrv_version_global.tdb
> > Oct 14 12:52:51 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Clearing the tracking database for dbid 0x2d608c16
> > Oct 14 12:52:53 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > 10.1.3.19:4379: connected to 10.5.21.119:4379 - 4 connected
> >
> > Reconnected to the former master here
> >
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: netlogon_creds_cli.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db netlogon_creds_cli.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: g_lock.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db g_lock.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: brlock.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db brlock.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: smbXsrv_open_global.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db smbXsrv_open_global.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db smbXsrv_open_global.tdb reqid 232936286 generation
> > 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db smbXsrv_open_global.tdb reqid 232936287 generation
> > 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db smbXsrv_open_global.tdb reqid 232936288 generation
> > 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db smbXsrv_open_global.tdb reqid 232936289 generation
> > 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: locking.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db locking.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936290 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936291 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936292 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936293 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936294 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936295 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936296 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936297 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936298 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936299 generation 1805205230
> >
> > ... 290 total reqids resent
> >
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936572 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936573 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936574 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936575 generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> resent
> > ctdb_call for db locking.tdb reqid 232936576 generation 1805205230
>
> I also now have a patch to log per-database summaries of resent calls
> at NOTICE level and demote the per-call messages to INFO level. ;-)
>
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: printer_list.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db printer_list.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: smbXsrv_session_global.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db smbXsrv_session_global.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]:
> > [2021/10/14 12:52:56.088975,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]:
> > [2021/10/14 12:52:56.089017,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]:
> > [2021/10/14 12:52:56.088993,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]:
> > [2021/10/14 12:52:56.089028,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[3530045]:
> >  db_ctdb_fetch_locked for
> > /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key E9E66C7E, chain
> > 48608 needed 1 attempts, 25787 milliseconds, chainlock: 0.034000 ms, CTDB
> > 25787.146000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[173413]:
> >  db_ctdb_fetch_locked for
> > /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 45E18236, chain
> > 61197 needed 1 attempts, 29600 milliseconds, chainlock: 0.018000 ms, CTDB
> > 29600.219000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[2065835]:
> >  db_ctdb_fetch_locked for
> > /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key C8D42267, chain
> > 20590 needed 1 attempts, 26734 milliseconds, chainlock: 0.029000 ms, CTDB
> > 26734.891000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[188298]:
> >  db_ctdb_fetch_locked for
> > /var/cache/dbdir/volatile/smbXsrv_open_global.tdb.4 key 06EED6D3, chain
> > 11119 needed 1 attempts, 27704 milliseconds, chainlock: 0.026000 ms, CTDB
> > 27704.504000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]: Thaw
> > db: leases.tdb generation 1805205230
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 ctdbd[32305]:
> > Release freeze handle for db leases.tdb
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]:
> > [2021/10/14 12:52:56.155650,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475003]:
> >  db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key
> > 7A70496E4466F9AD0000CE3D080000000000000000000000, chain 28289 needed 1
> > attempts, 5792 milliseconds, chainlock: 5645.945000 ms, CTDB 146.268000
> ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]:
> > [2021/10/14 12:52:56.155717,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482694]:
> >  db_ctdb_fetch_locked for
> > /var/cache/dbdir/volatile/smbXsrv_session_global.tdb.4 key 861E88DE,
> chain
> > 99225 needed 1 attempts, 5454 milliseconds, chainlock: 5309.574000 ms,
> CTDB
> > 144.484000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]:
> > [2021/10/14 12:52:56.235506,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1482030]:
> >  db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key
> > 7A70496E4466F9AD0F3DF538430000000000000000000000, chain 57341 needed 1
> > attempts, 103316 milliseconds, chainlock: 0.064000 ms, CTDB
> 103316.804000 ms
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]:
> > [2021/10/14 12:52:56.244963,  0]
> > ../../source3/lib/dbwrap/dbwrap_ctdb.c:1222(fetch_locked_internal)
> > Oct 14 12:52:56 samba-server-s02-usw2-use1-follower-01 smbd[1475823]:
> >  db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.4 key
> > 7A70496E4466F9ADEF084B9B2C0000000000000000000000, chain 17348 needed 1
> > attempts, 849678 milliseconds, chainlock: 0.024000 ms, CTDB 849678.676000
> > ms
> >
> > etc etc until we restart the whole cluster
> >
> > (high hopcount logs happening at the same time, though not as high as
> > before)
>
> So, recovery started just after 12:52:32 and the node went away ~25
> seconds before 12:52:26, so ~12:52:01.  Note that quite a few of the
> warnings are explained by either the node going away or the recovery -
> anything less than ~55 seconds.
>
> There are some outliers here at ~103s and ~850s, so I don't think this
> is where the problem started.  I think this is where it ended!  My gut
> tells me to go back ~850s and see what was happening on node 2.  The
> problem was occurring then on some node and the recovery seems to have
> solved it with node 2 excluded.
>
> Still no explanation for the high hop counts.  If they are fairly small
> (100s, not 1000s) then they may just indicate contention for records,
> and given your geographically distributed nodes, this may be a problem
> that you can't work around without read-only or sticky records.
>
> If you continue to see huge hop counts then you need to understand the
> source of the corruption.
>
> In future, please consider attaching the log file and just quoting the
> relevant entries.  Lines seem to have wrapped at ~80 characters, which
> made this quite difficult to browse...  :-)
>
> Thanks...
>
> peace & happiness,
> martin
>
>

