CTDB_MANAGES_SAMBA and CTDB_MANAGES_WINBIND - handles restart after crash

Mon Dec 18 06:34:59 UTC 2017

On Mon, 18 Dec 2017 00:08:37 -0600, Steve French via samba-technical
<samba-technical at lists.samba.org> wrote:

> If winbind (or Samba) crashes (something that the ctdb event scripts
> can detect by their polling), I noticed that unlike the systemd
> service configuration - our ctdb event scripts are not setup to
> auto-restart services after crash.
> 
> For example,
> if CTDB_MANAGES_WINBIND=yes (see
> https://wiki.samba.org/index.php/Configuring_clustered_Samba where it
> recommends this)
> 
> and winbind ever crashes then it won't be restarted, on the other hand
> with 'normal' configuration
> 
> of systemd the winbind.service file would have something like
> 
> [Service]
> Restart=on-failure
> RestartSec=4
> ...
> [Install]
> WantedBy=multi-user.target
> 
> in it so systemd would automatically restart winbind 4 seconds after failure.
> 
> Should ctdb events scripts for winbind (and similarly samba) be set if
> the monitor ("wbinfo -p") fails - to do a service restart of winbind?

The historical answer here is "no".  Rather than just making the
service unavailable, as in the un-clustered case, doing an automatic
restart when clustered might cause failovers back and forth if the
service state keeps flapping.  This would encourage the client to keep
reconnecting to different nodes, and I suppose this might result in
data corruption if the client keeps taking locks and doing partial file
updates.  I'd be interested in seeing if others confirm this idea.
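
To make the flapping concern concrete, here is a toy model in Python.
It is not CTDB code.  It assumes that each change in a node's health
may trigger a failover of that node's public addresses, and it only
counts those changes.  The function name and the sample sequences are
made up for illustration.

```python
from typing import List


def count_health_transitions(monitor_results: List[bool]) -> int:
    """Count how often node health changes across monitor results.

    Toy model: True means the monitor check passed, False means it
    failed.  The node is assumed to start out healthy.
    """
    transitions = 0
    previous = True  # assumption: node starts healthy
    for ok in monitor_results:
        if ok != previous:
            transitions += 1
            previous = ok
    return transitions


# Service crashes once and is left down: one change of health.
no_restart = [True, False, False, False, False, False]

# Service keeps crashing and is auto-restarted each time: health
# changes on every monitor cycle.
auto_restart = [True, False, True, False, True, False]

print(count_health_transitions(no_restart))
print(count_health_transitions(auto_restart))
```

In this model the service that is left down changes health once, while
the auto-restarted one changes health on every cycle, and each of those
changes is an opportunity for clients to be moved to another node.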

> In addition, the "ctdb scriptstatus" output is strange if there is an
> error (like winbind is crashed so the "wbinfo -p" fails in ctdb's
> winbind monitoring script) - if the winbind event script (or any
> script) fails, the following ones in the list are not executed - rather than
> reporting an error and continuing to report the status of the other
> services

CTDB only has a single binary state for healthy/unhealthy.  If a monitor
event fails in a particular script then there's no point continuing to
try to monitor other services because monitoring has already failed and
the node will be marked as unhealthy.  This also allows scripts to
implicitly depend on each other - if an early script fails then it
might not make sense to run the rest of the scripts.
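
The stop-at-first-failure behaviour described above can be sketched
roughly as follows.  This is a simplified Python model, not how CTDB
is implemented: the real event scripts are separate executables, and
the function name, status strings and check functions here are all
hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an event script: a name plus a check that
# returns an exit code (0 means success).
Script = Tuple[str, Callable[[], int]]


def run_monitor_event(scripts: List[Script]) -> Tuple[bool, List[Tuple[str, str]]]:
    """Run checks in order and stop running them at the first failure.

    Returns a single healthy/unhealthy flag for the node, plus a
    per-script status list for illustration.
    """
    healthy = True
    results: List[Tuple[str, str]] = []
    for name, check in scripts:
        if not healthy:
            # An earlier script failed, so this one is not executed.
            results.append((name, "NOT RUN"))
            continue
        if check() == 0:
            results.append((name, "OK"))
        else:
            results.append((name, "ERROR"))
            healthy = False
    return healthy, results


# Illustrative script list: the winbind check fails, as it would if
# "wbinfo -p" failed, so later checks are skipped.
scripts = [
    ("00.ctdb", lambda: 0),
    ("49.winbind", lambda: 1),
    ("50.samba", lambda: 0),
    ("60.nfs", lambda: 0),
]

healthy, results = run_monitor_event(scripts)
print("healthy:", healthy)
for name, status in results:
    print(name, status)
```

Because the node ends up with one healthy/unhealthy result either way,
running the later checks would not change the outcome.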

peace & happiness,
martin