[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)

Mon Oct 18 04:44:41 UTC 2021

Hi Isaac,

Comments inline.  This is quite long but I encourage you to read all the way through, understand and comment...  :-)

On Fri, 15 Oct 2021 11:55:10 -0700, Isaac Stone <isaac.stone at som.com>
wrote:

> Thanks for your analysis, seems I was wrong before in my diagnosis.
> 
> When I said "problem starts here" I meant the problem of inaccessible
> folders, which is the ultimate problem we are trying to solve.
> 
> The folders remained inaccessible until we did a wipe of all volatile
> databases and restarted the cluster. Something keeps breaking and then not
> recovering.
> 
> Digging deeper into the logs I found some problems starting earlier.
> 
> Looks like the initial cause is a snafu by the distributed filesystem we
> are using. I will get onto questioning them now.
> 
> If you want to dig into why ctdb was unable to recover on its own, I have
> put my analysis below and will attach the log files. beware they are long
> and repetitive

Well, you do seem to have found the problem.  I'll cut other stuff out
and just address that.  Full logs don't seem to be attached... could
have got stripped by the mailing list.  Probably doesn't matter, except
for some finer details.

> First indication is on node 1, at 1am GMT
> 
>    Oct 14 01:04:13 samba-ape1-1 mount.objectivefs[2384091]: Diskcache reads
> slower than 247 ms, temporarily reducing disk reads
>    Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/5: page allocation
> failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
>    Oct 14 01:04:13 samba-ape1-1 kernel: swapper/1: page allocation failure:
> order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP), nodemask=(null)
>    Oct 14 01:04:13 samba-ape1-1 kernel: ,cpuset=/,mems_allowed=0
>    Oct 14 01:04:13 samba-ape1-1 kernel: ksoftirqd/4: page allocation
> failure: order:0, mode:0x484020(GFP_ATOMIC|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0
>    Oct 14 01:04:13 samba-ape1-1 kernel: CPU: 4 PID: 36 Comm: ksoftirqd/4
> Kdump: loaded Tainted: G        W    L   --------- -  -
> 4.18.0-240.15.1.el8_3.x86_64 #1
>    Oct 14 01:04:13 samba-ape1-1 kernel: Hardware name: Amazon EC2
> i3en.2xlarge/, BIOS 1.0 10/16/2017
>    Oct 14 01:04:13 samba-ape1-1 kernel: Call Trace:
>    Oct 14 01:04:13 samba-ape1-1 kernel: swapper/6: page allocation failure:
> order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null),cpuset=
>    Oct 14 01:04:13 samba-ape1-1 kernel: dump_stack+0x5c/0x80
>    Oct 14 01:04:13 samba-ape1-1 kernel: mount.objective: page allocation
> failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> 
> `mount.objectivefs` is the distributed filesystem we are using. I will
> raise this bug with them.

Page allocation failure is your problem, not theirs.  ;-)

It means that you are running out of memory.

> After a few hundred lines of page allocation failures and callbacks we get
> these
> 
>   Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to
> alloc buffer for rx queue 6
>   Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx
> qid 6 with only 106 buffers (from 185)
>   Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: failed to
> alloc buffer for rx queue 1
>   Oct 14 02:23:34 samba-ape1-1 kernel: ena 0000:00:05.0 eth0: refilled rx
> qid 1 with only 55 buffers (from 129)

Now the kernel is failing to allocate receive buffers for network
packets.  I'm not sure what level this is but you would expect it to be
fixed by a re-transmit.  I guess the sender may get a RST for this -
not sure.

If this problem makes its way through to CTDB (instead of the kernel
being unable to allocate memory) then you'd expect to see:

  D_ERR("read error alloc failed for %u\n", pkt_size);

or

  DBG_ERR("read error realloc failed for %u\n", pkt_size);

The log snippets you pasted might have missed one of these, or the rx
queue alloc failures are enough to cause the problem.

So, let's assume that we've hit one of these problems... and it means
what I think it means.  That is, a record has been sent from one node to
another, the sender has marked the record as having a new owner but the
recipient did not receive it.

> [...]

> hopcounts hit 1000 in about a minute, 10000 in 12 minutes, and 100000 in 2
> hours.
> hopcounts continue to increase until 12:54, which is about the same time
> node 2 is restarted (boot logs on node 2 start at 12:52)
> 
>   Oct 14 12:47:56 samba-ape1-1 ctdbd[4128234]: High hopcount 377097
> dbid:locking.tdb key:0x807ae9d7 reqid=287ce281 pnn:1 src:0 lmaster:2
> header->dmaster:0 dst:2

To understand the problem, the first thing is to understand how CTDB
migrates records. There is an overview here:

  https://wiki.samba.org/index.php/CTDB_database_design#Distributed_database_design

The important packets here are REQ_DMASTER and REPLY_DMASTER.  These
are sent *without acknowledgement*.  This is a fairly reasonable
assumption because CTDB uses TCP connections between nodes and these
connections should be reliable.  CTDB is also trying to be fast.  The
usual time a packet won't be received is when a node disappears and this
is handled by doing a database recovery.

For these packets, CTDB will construct whatever packet it needs to send,
update the "dmaster" (to point to the recipient) in the local copy of
the record and then queue the packet. If the packet is never received
then the 2 nodes may have a conflicting idea of which is "dmaster" for a
record.  This means that attempts to migrate the record will result in
an endless loop with an escalating hop count.

How to avoid this?  Options...

* Never run out of memory

  Seriously!  This is the choice we made last time we saw this problem.
  You're out of memory, so you're in bad shape and might not be able to
  recover on the node that is out of memory.

  Software contains bugs, especially memory handling bugs.  For this
  issue we have carefully and repeatedly audited this code without
  finding any logic errors.  However, there may be numerous places in
  ctdbd, smbd and other software on the system where it will behave
  badly when memory allocation fails.  The software may not notice,
  possibly crashing or behaving randomly.

  CTDB ships with the 05.system event script.  This implements
  quite cheap resource monitoring with options documented in the
  ctdb-script.options(5) manual page.  If memory usage on a node exceeds
  the "unhealthy threshold" (not configured by default) then the node is
  marked unhealthy, so public IPs will be failed over and smbd clients
  will be disconnected, so memory usage will drop.  We used to actually
  shut down CTDB in this case but decided this was overkill.  05.system
  also causes a *lot* of memory usage information to be logged when the
  "unhealthy threshold" is exceeded, so that an administrator can try to
  understand why the node ran out of memory. I suggest using this
  script with an (non-default) "unhealthy threshold" for memory usage -
  a setting of 99% may save you.

  Of course, if memory usage drops and the node becomes healthy again
  then if too many clients attach a memory full condition may be hit
  again (rinse and repeat), so the node may flap between unhealthy and
  healthy.  This isn't really a CTDB problem... it is more of a
  provisioning problem.

  If you have better resource usage monitoring implemented on nodes
  then considering running "ctdb disable" when resources become
  dangerously constrained and "ctdb enable" if they've returned to a
  usable level.  Then you don't need 05.system.

* Implement a "hopcount limit" in CTDB

  Monitoring memory usage isn't completely foolproof.  You can have
  sudden spikes where *lots* of clients suddenly connect and consume a
  lot of memory.  This can happen between monitoring events, so you may
  not be able to avoid running out of memory.

  So, we could implement a hopcount limit.  When this is exceeded we
  would do a database recovery.  This would stop the endless loop.

  There is a potential problem here.  Database recovery imposes
  additional load on cluster nodes, particularly the recovery master.
  If nodes are already low on available memory (and potentially
  swapping heavily) then they may not have headroom to complete
  recovery.  You can then end up in a completely different type of mess.

* Implement ACKs for REQ_DMASTER and REPLY_DMASTER in CTDB

  When a node sends a record to another node, it would flag the record
  with a (new) flag UNACKED_DMASTER.  The recipient would send a (new)
  ACK_DMASTER response and the sender clears the UNACKED_DMASTER flag.

  The corner cases are now when an ACK isn't sent or an ACK isn't
  received.  This is fairly easily handled.  If a node receives a
  REQ_DMASTER or REPLY_DMASTER from node that is marked as the
  UNACKED_DMASTER then it simply resends the correct response. This
  safely fixes the problem in the cases I've looked at.

  However, while ACK_DMASTER packets would be quite small (since they
  would contain "only" the key), they (and the resulting extra
  database lock + write) would still impose a performance overhead on
  *every* record migration... and the problem they're addressing should
  not exist... so, while possibly a foolproof option, perhaps only one
  to be taken by overly-cautious fools...  ;-)

My votes go to the 1st 2 options.  However, I may be missing important
details, so I'd love to hear some comments if anything I've said
doesn't make sense...

peace & happiness,
martin