ctdb nfs rquotad: inconsistency between startup and checks

Fri Mar 18 03:17:15 UTC 2022

Hi Andreas,

On Thu, 17 Mar 2022 17:23:14 -0300, Andreas Hasenack via
samba-technical <samba-technical at lists.samba.org> wrote:

> On Wed, Mar 16, 2022 at 9:22 PM Martin Schwenke <martin at meltin.net> wrote:
> > However, this doesn't obey the principle of least surprise, and there is
> > a design flaw here. I have overloaded the meaning of variables like
> > nfs_rquotad_service. There is an assumption that an unset variable
> > means it is started elsewhere, which is only really true for Sys-V init.  
> 
> When you say "started elsewhere", do you mean "elsewhere in ctdb", or
> in the installed system?

Oh, I mean that it is started by the old nfs, nfs-kernel-server or
similar init script.  It starts most of the NFS services.

> nfs is complicated, so many services, under so many conditions
> depending on what is used and what isn't.
> 
> Upstream tried to grow some smarts into the systemd service units, and
> many of those start somewhat automatically when needed (like when
> /etc/krb5.keytab is present - which in itself is not 100% correct all
> the time, but helps).
> 
> It must be a nightmare for usptream ctdb to try to handle all of this.

It is a bit challenging.  I thought I had it right...  :-)

> > So, choices are, in nfs-linux-kernel-callout:
> >
> > 1. Insert the following in service_stop() and service_start(), before
> >    each "# Default to stopping/starting by hand" comment:
> >
> >      case "$nfs_distro_style" in
> >      systemd-*) return
> >      esac  
> 
> Why does it matter if it's systemd or sysv in this case? I'm missing
> that bit of info. ctdb doesn't use systemctl, or call out to
> /etc/init.d/<name> directly as far as I can tell. It only uses the
> "service" call, which works for both systemd and sysv. Are you trying
> to support a deployment which has neither systemd or sysv?
> 
> Why not behave like basic_start() and basic_stop(), which just skip
> the service if the corresponding service variable isn't set? Sorry if
> I'm missing something, I know upstream has to consider many more
> deployment scenarios.

The "restart_every" count in the *.check files wants to be able
to trigger a restart of an individual service that isn't "healthy".
So, in the Sys-V init case, where a single init script starts many
services, that mechanism wants to be able to manually restart an
individual service.

So, the above suggestion by me is wrong because it would build more
implied behaviour into the code.

Instead, we need to go with "ugly".  Services that are automatically
started by some other init script should have a value of, say, "AUTO".
basic_stop() and basic_start() would ignore this, treating it like "".
However, service_stop() and service_start() would still do the
hand-stop/start as necessary.

I've implemented this at:

  https://git.samba.org/?p=martins/samba.git;a=shortlog;h=refs/heads/ctdb-nfs

This call-out should behave sanely if you set nfs_rquotad_service="".

I've also made the tests pass for Debian-style init/systemd, though they
are still not automatically run for all styles.  I also didn't add
tests for a modified call-out...  it works...

So the call-out will now be self-consistent.

I can also now see that I should have just had service_restart_cmd in
the *.check files and then I could have had a single service_restart
function for this purpose in the call-out, which could have just done
an OS level service restart for the case where the service is defined.
I won't do that now because it means I would have to maintain the old
code anyway for backward compatibility.  Live and learn...

After all of that, you should still make sure you remove
50.rquotad.check to avoid warnings like this being repeatedly logged:

  WARNING: rquotad failed RPC check:
  rpcinfo: RPC: Program not registered
  program rquotad version 1 is not available
  Trying to restart service "rquotad"...
  &Starting quotarpc: OK

> In the future maybe these variables should be factored our into a
> separate (or existing) config file, and sourced by all scripts, so
> that site local changes are made only there, and not in the actual
> script that is run, separating data and code. Otherwise in an upgrade
> which changes the script (code) there would be a config prompt (in
> debian/ubuntu), and I think in rpm systems the script would be backed
> up as .rpmsave, and either the old script would be run, or the new one
> but without the local site changes.

So far I have tried to make the NFS call-out script self-contained,
because loading configuration can cause extra complexity.  However, I
can see your point.  One problem is that the defaults are set depending
on $CTDB_NFS_DISTRO_STYLE and then the configuration should override
the defaults.  That's OK: we could just load the configuration file
twice - first to get CTDB_NFS_DISTRO_STYLE and a 2nd time to override
defaults.  It is better a repetitive check to see each value is set
before setting the default... or insane, unmaintainable shell eval magic
to avoid the repetition.

I might do that later.  It doesn't seem difficult and will make at
least 1 user's life easier...  :-)

I don't think we will treat any of this as a bug and backport it to
current release or the upcoming 4.16 release.  Removing
50.rquotad.check should be done if a service is disabled.  I should
probably make sure that is documented somewhere...

peace & happiness,
martin