[Samba] Maximum monitor timeout count 20 reached. Making node unhealthy

Thu Apr 8 04:55:39 UTC 2021

On Tue, Apr 06, 2021 at 08:31:33PM -0700, Isaac Stone via samba wrote:
>Running clustered samba + ctdb, pushing our new system from dev to prod and
>ran into this issue. Never saw in dev and staging in six months of testing,
>no idea what it means
>
>We are running a cluster of only one node while we transfer the production
>data to the new system, so the box complaining is the only box that exists
>as far as ctdb knows (the only entry in the nodes file is itself)
>
>Was down for an hour and a half today repeating every ~45 seconds
>
>"Maximum monitor timeout count 20 reached. Making node unhealthy"

The message comes from here, in ctdb/server/ctdb_monitor.c:

/*
   called when a health monitoring event script finishes
  */
static void ctdb_health_callback(struct ctdb_context *ctdb, int status, void *p)
{
         struct ctdb_node *node = ctdb->nodes[ctdb->pnn];
         TDB_DATA data;
         struct ctdb_node_flag_change c;
         uint32_t next_interval;
         int ret;
         TDB_DATA rddata;
         struct ctdb_srvid_message rd;
         const char *state_str = NULL;

         c.pnn = ctdb->pnn;
         c.old_flags = node->flags;

         ZERO_STRUCT(rd);
         rd.pnn   = ctdb->pnn;
         rd.srvid = 0;

         rddata.dptr = (uint8_t *)&rd;
         rddata.dsize = sizeof(rd);

         if (status == ECANCELED) {
                 DEBUG(DEBUG_ERR,("Monitoring event was cancelled\n"));
                 goto after_change_status;
         }

         if (status == ETIMEDOUT) {
                 ctdb->monitor->event_script_timeouts++;

                 if (ctdb->monitor->event_script_timeouts >=
                     ctdb->tunable.monitor_timeout_count) {
                         DEBUG(DEBUG_ERR,
                               ("Maximum monitor timeout count %u reached."
                                " Making node unhealthy\n",

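So ctdb has run a health monitoring script, and the script has
timed out (ETIMEDOUT) often enough for the timeout counter to reach
the monitor_timeout_count tunable, which is 20 in your log message.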
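
To make the counting logic easier to follow, here is a minimal standalone sketch of it. This is not the ctdb code: struct monitor_state and monitor_result_is_unhealthy() are hypothetical names, and the reset of the counter on a non-timeout result is my assumption, since the excerpt above stops before that part.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counter kept in ctdb->monitor. */
struct monitor_state {
        unsigned int event_script_timeouts;
};

/*
 * Returns true when the node should be treated as unhealthy.
 * timeout_limit plays the role of ctdb->tunable.monitor_timeout_count.
 */
static bool monitor_result_is_unhealthy(struct monitor_state *m,
                                        int status,
                                        unsigned int timeout_limit)
{
        if (status == ETIMEDOUT) {
                m->event_script_timeouts++;

                /* Same ">=" comparison as in the excerpt above. */
                if (m->event_script_timeouts >= timeout_limit) {
                        return true;
                }

                /* Below the limit: this sketch tolerates the timeout. */
                return false;
        }

        /* ASSUMPTION: any non-timeout result resets the counter. */
        m->event_script_timeouts = 0;

        /* Any other non-zero status counts as a failed health check. */
        return status != 0;
}
```

With a limit of 20, the first 19 timeouts return false and the 20th returns true, which is the point where the log message would appear.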

The script is invoked here:

/*
   see if the event scripts think we are healthy
  */
static void ctdb_check_health(struct tevent_context *ev,
                               struct tevent_timer *te,
                               struct timeval t, void *private_data)
....
         ret = ctdb_event_script_callback(ctdb,
                                          ctdb->monitor->monitor_context,
                                          ctdb_health_callback,
                                          ctdb, CTDB_EVENT_MONITOR, "%s", "");

That function is defined here, in ctdb/server/eventscript.c:

/*
   run the event script in the background, calling the callback when
   finished.  If mem_ctx is freed, callback will never be called.
  */
int ctdb_event_script_callback(struct ctdb_context *ctdb,
                                TALLOC_CTX *mem_ctx,
                                void (*callback)(struct ctdb_context *, int, void *),
                                void *private_data,
                                enum ctdb_event call,
                                const char *fmt, ...)
{
         va_list ap;
         int ret;

         va_start(ap, fmt);
         ret = ctdb_event_script_run(ctdb, mem_ctx, callback, private_data,
                                     call, fmt, ap);
         va_end(ap);

         return ret;
}
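
The wrapper above does nothing except collect its variadic arguments into a va_list and pass them on, so the timeout itself must come from deeper in the script runner. A minimal sketch of that forwarding pattern, with hypothetical names (event_done_fn, run_event_v, run_event, store_status) and a worker that calls the callback immediately instead of running a script in the background:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical callback type: receives a status and the caller's data. */
typedef void (*event_done_fn)(int status, void *private_data);

/* Hypothetical worker that takes an already-started va_list. */
static int run_event_v(event_done_fn done, void *private_data,
                       const char *fmt, va_list ap)
{
        char args[128];
        int len = vsnprintf(args, sizeof(args), fmt, ap);

        /* Reject formatting errors and truncated argument strings. */
        if (len < 0 || (size_t)len >= sizeof(args)) {
                return -1;
        }

        /*
         * A real runner would start the script asynchronously and call
         * the callback later; this sketch reports success straight away.
         */
        done(0, private_data);
        return 0;
}

/* Variadic front end: same shape as the wrapper quoted above. */
static int run_event(event_done_fn done, void *private_data,
                     const char *fmt, ...)
{
        va_list ap;
        int ret;

        va_start(ap, fmt);
        ret = run_event_v(done, private_data, fmt, ap);
        va_end(ap);

        return ret;
}

/* Example callback: stores the status in the int that private_data points to. */
static void store_status(int status, void *private_data)
{
        *(int *)private_data = status;
}
```

Calling run_event(store_status, &seen, "%s", "") mirrors the way the monitor event is started with an empty argument string.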

So I'd start by looking at the monitoring scripts to see which one is timing out. From the ctdb manpage:

scriptstatus
This command displays which event scripts were run in the previous monitoring cycle and the result of each script. If a script failed with an error, causing the node to become unhealthy, the output from that script is also shown.

This command is deprecated. It's provided for backward compatibility. In place of ctdb scriptstatus, use ctdb event status.

Example
# ctdb scriptstatus
00.ctdb              OK         0.011 Sat Dec 17 19:40:46 2016
01.reclock           OK         0.010 Sat Dec 17 19:40:46 2016
05.system            OK         0.030 Sat Dec 17 19:40:46 2016
06.nfs               OK         0.014 Sat Dec 17 19:40:46 2016
10.interface         OK         0.041 Sat Dec 17 19:40:46 2016
11.natgw             OK         0.008 Sat Dec 17 19:40:46 2016
11.routing           OK         0.007 Sat Dec 17 19:40:46 2016
13.per_ip_routing    OK         0.007 Sat Dec 17 19:40:46 2016
20.multipathd        OK         0.007 Sat Dec 17 19:40:46 2016
31.clamd             OK         0.007 Sat Dec 17 19:40:46 2016
40.vsftpd            OK         0.013 Sat Dec 17 19:40:46 2016
41.httpd             OK         0.015 Sat Dec 17 19:40:46 2016
49.winbind           OK         0.022 Sat Dec 17 19:40:46 2016
50.samba             ERROR      0.077 Sat Dec 17 19:40:46 2016
   OUTPUT: ERROR: samba tcp port 445 is not responding

I'm not a ctdb expert, but I hope this helps!