ctdb event scripts
martin at meltin.net
Sat Nov 12 10:07:42 UTC 2016
On Fri, 11 Nov 2016 16:22:53 -0800, Steve French <smfrench at gmail.com>
wrote:
> On Fri, Nov 11, 2016 at 11:30 AM, Martin Schwenke <martin at meltin.net> wrote:
> > On Fri, 11 Nov 2016 10:43:32 -0600, Steve French <smfrench at gmail.com>
> > wrote:
> > > On Thu, Nov 10, 2016 at 7:39 PM, Martin Schwenke <martin at meltin.net>
> > wrote:
> > >
> > > That works and is useful - but dumps to a log so a little harder to read
> > > for quick checks.
> > Oh, I didn't mean "ctdb eventscript monitor". I meant something like:
> > /etc/ctdb/event.d/50.samba monitor
> > There is enough boilerplate in the scripts so that they can find the
> > functions file and the configuration.
> > CTDB won't notice that you're running it, unless it stores state and a
> > subsequent run by CTDB uses that state. For example, 60.nfs counts
> > RPC failures before failing the "monitor" event. If you run a script by
> > hand it will update the same state.
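As an illustration of the above, here is a small sketch of running an event script by hand and reporting the result the way CTDB judges it, by exit status. The `run_event` wrapper is illustrative only, not part of CTDB; the script path and event name are the ones from this thread.

```shell
# Illustrative helper (not part of CTDB): run an event script by hand
# and report whether the event succeeded, judged by its exit status.
run_event() {
    script="$1"
    event="$2"
    if "$script" "$event"; then
        echo "OK: $script $event"
    else
        # $? here is still the exit status of the script above
        echo "FAILED (rc=$?): $script $event"
        return 1
    fi
}

# On a CTDB node this would be, for example:
#   run_event /etc/ctdb/event.d/50.samba monitor
```

Note that, as described above, a hand-run script uses the same state files as CTDB's own runs, so repeated manual runs can affect counters such as 60.nfs's RPC failure count.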
> > Alternatively, if you just want to know why a monitor cycle failed
> > without grovelling through logs, you can always see the output of the
> > last monitor cycle by running:
> > ctdb scriptstatus monitor
> > Cool feature idea: implement an option to scriptstatus that shows the
> > output of the last failure. This would require daemon support, since
> > that's where the information comes from.
> Yes - that is a cool idea.
Let's see if Amitay agrees and wants to add it to his new event daemon
work.
> I have been mulling over ways to check other services too (e.g. the
> NFS kernel service stuck, or a systems management process stuck) -
> ctdb eventscripts are interesting
NFS is interesting. We have good configurable RPC service checks.
See ctdb/config/nfs-checks.d/README in the source and the default RPC
service configurations in the same directory.
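To give the flavour of such a check, here is a sketch of probing an RPC service with a timeout. It loosely mirrors what an rpcinfo-based check does, but the wrapper function, the 5 second limit and the example service are illustrative assumptions, not the shipped CTDB defaults; see nfs-checks.d/README for the real configuration format.

```shell
# Sketch only: probe an RPC service on the local host with a timeout,
# loosely mirroring an rpcinfo-based service check.  The function name,
# default limit and example arguments are illustrative, not CTDB's.
check_rpc() {
    prog="$1"       # RPC program name or number, e.g. "nfs" or 100003
    ver="$2"        # RPC program version, e.g. 3
    limit="${3:-5}" # seconds before we give up

    if timeout "$limit" rpcinfo -T tcp localhost "$prog" "$ver" \
        >/dev/null 2>&1; then
        echo "healthy: $prog v$ver"
    else
        echo "unhealthy or timed out: $prog v$ver"
        return 1
    fi
}

# Example: check_rpc nfs 3
```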
The other thing about the current RPC checks is that if the cluster
filesystem is so slow that the rpcinfo checks are timing out then it
makes no sense to mark a node unhealthy. You just end up marking all
nodes unhealthy and end up with a lot of churn. So, there must be a
better way of handling that case.
As far as services being stuck, I've heard at least one suggestion that
we use something like nfsstat to figure out if NFS is actually making
progress. I've never taken the time (and am not sure it is possible)
to figure out a way of recognising that NFS is not idle (i.e. requests
are coming in) and that requests are actually completing successfully.
I'd be happy to take advice on that! :-)
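One possible (untested) direction, sketched below: sample the server-side RPC call counter from `nfsstat -s` twice and treat an advancing counter as evidence of progress. This only shows that requests are completing, not that they are succeeding, and the parsing assumes the usual Linux nfsstat layout where a header line starting with "calls" is followed by a line of numbers; both the layout and the helper names are assumptions here.

```shell
# Sketch only: decide whether the NFS server looks like it is making
# progress by sampling the RPC "calls" counter twice.  Assumes the
# Linux "nfsstat -s" layout: a header line whose first word is "calls"
# followed by a line of counters.
rpc_calls() {
    # Extract the total call count from nfsstat-style text on stdin.
    awk '$1 == "calls" { getline; print $1; exit }'
}

nfs_making_progress() {
    interval="${1:-5}"  # seconds between samples
    before=$(nfsstat -s | rpc_calls)
    sleep "$interval"
    after=$(nfsstat -s | rpc_calls)
    # Progress means the counter advanced between the two samples.
    [ "$after" -gt "$before" ]
}
```

The obvious gap, as noted above, is distinguishing "idle" from "stuck": an unchanged counter could mean either, so such a check would also need to know whether requests are arriving at all.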
The current RPC checks do the job most of the time...
peace & happiness,
martin