[Samba] Maximum monitor timeout count 20 reached. Making node unhealthy

Isaac Stone isaac.stone at som.com
Thu Apr 8 17:12:57 UTC 2021


Has not happened again, so it seems to be a one-time fluke. There is
evidence that the network was down when these errors were occurring.

My theory is that the network went down, the ctdb health check timed out,
and when the network came back ctdb recovered just fine.

There were other problems that I have identified separately; I think this
error is a red herring.
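
For anyone hitting the same message: if it recurs, the quickest way to see
which monitor script was timing out is to ask ctdb directly (a sketch,
assuming ctdb 4.9 or later, where "ctdb event status" replaces the older
"ctdb scriptstatus"):

```shell
# Show the result of the most recent "monitor" event run for the
# legacy event scripts, including per-script status and duration.
# (On ctdb < 4.9, "ctdb scriptstatus" gives equivalent output.)
ctdb event status legacy monitor
```

A script that shows TIMEDOUT here is the one driving the timeout counter.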

thanks

On Wed, Apr 7, 2021 at 9:55 PM Jeremy Allison <jra at samba.org> wrote:

> On Tue, Apr 06, 2021 at 08:31:33PM -0700, Isaac Stone via samba wrote:
> >Running clustered samba + ctdb, pushing our new system from dev to prod
> and
> >ran into this issue. Never saw in dev and staging in six months of
> testing,
> >no idea what it means
> >
> >We are running a cluster of only one node while we transfer the production
> >data to the new system, so the box complaining is the only box that exists
> >as far as ctdb knows (the only entry in the nodes file is itself)
> >
> >Was down for an hour and a half today repeating every ~45 seconds
> >
> >"Maximum monitor timeout count 20 reached. Making node unhealthy"
>
> The message comes from here ctdb/server/ctdb_monitor.c
>
> /*
>    called when a health monitoring event script finishes
>   */
> static void ctdb_health_callback(struct ctdb_context *ctdb, int status,
> void *p)
> {
>          struct ctdb_node *node = ctdb->nodes[ctdb->pnn];
>          TDB_DATA data;
>          struct ctdb_node_flag_change c;
>          uint32_t next_interval;
>          int ret;
>          TDB_DATA rddata;
>          struct ctdb_srvid_message rd;
>          const char *state_str = NULL;
>
>          c.pnn = ctdb->pnn;
>          c.old_flags = node->flags;
>
>          ZERO_STRUCT(rd);
>          rd.pnn   = ctdb->pnn;
>          rd.srvid = 0;
>
>          rddata.dptr = (uint8_t *)&rd;
>          rddata.dsize = sizeof(rd);
>
>          if (status == ECANCELED) {
>                  DEBUG(DEBUG_ERR,("Monitoring event was cancelled\n"));
>                  goto after_change_status;
>          }
>
>          if (status == ETIMEDOUT) {
>                  ctdb->monitor->event_script_timeouts++;
>
>                  if (ctdb->monitor->event_script_timeouts >=
>                      ctdb->tunable.monitor_timeout_count) {
>                          DEBUG(DEBUG_ERR,
>                                ("Maximum monitor timeout count %u reached."
>                                 " Making node unhealthy\n",
>
> So it has run a health monitoring script, and it has
> timed out (ETIMEDOUT) more than 20 times.
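
For what it's worth, that limit of 20 is the MonitorTimeoutCount tunable
(default 20), which can be inspected and adjusted on a running node (a
sketch, assuming a standard ctdb install with ctdbd running):

```shell
# Print the current value of the tunable that caps consecutive
# monitor-event timeouts before the node is marked unhealthy.
ctdb getvar MonitorTimeoutCount

# Raise it at runtime, e.g. to tolerate slow monitor scripts during a
# known outage window (the new value is not persisted across restarts).
ctdb setvar MonitorTimeoutCount 40
```

Raising the tunable only buys time, though; the underlying question is
still why the monitor scripts were timing out at all.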
>
> The script is invoked here:
>
> /*
>    see if the event scripts think we are healthy
>   */
> static void ctdb_check_health(struct tevent_context *ev,
>                                struct tevent_timer *te,
>                                struct timeval t, void *private_data)
> ....
>          ret = ctdb_event_script_callback(ctdb,
>                                           ctdb->monitor->monitor_context,
>                                           ctdb_health_callback,
>                                           ctdb, CTDB_EVENT_MONITOR, "%s",
> "");
>
> from here ctdb/server/eventscript.c:
>
> /*
>    run the event script in the background, calling the callback when
>    finished.  If mem_ctx is freed, callback will never be called.
>   */
> int ctdb_event_script_callback(struct ctdb_context *ctdb,
>                                 TALLOC_CTX *mem_ctx,
>                                 void (*callback)(struct ctdb_context *,
> int, void *),
>                                 void *private_data,
>                                 enum ctdb_event call,
>                                 const char *fmt, ...)
> {
>          va_list ap;
>          int ret;
>
>          va_start(ap, fmt);
>          ret = ctdb_event_script_run(ctdb, mem_ctx, callback, private_data,
>                                      call, fmt, ap);
>          va_end(ap);
>
>          return ret;
> }
>
> so I'd start looking at the monitoring scripts. From the ctdb manpage:
>
> scriptstatus
> This command displays which event scripts were run in the previous
> monitoring cycle and the result of each script. If a script failed with an
> error, causing the node to become unhealthy, the output from that script is
> also shown.
>
> This command is deprecated. It's provided for backward compatibility. In
> place of ctdb scriptstatus, use ctdb event status.
>
> Example
> # ctdb scriptstatus
> 00.ctdb              OK         0.011 Sat Dec 17 19:40:46 2016
> 01.reclock           OK         0.010 Sat Dec 17 19:40:46 2016
> 05.system            OK         0.030 Sat Dec 17 19:40:46 2016
> 06.nfs               OK         0.014 Sat Dec 17 19:40:46 2016
> 10.interface         OK         0.041 Sat Dec 17 19:40:46 2016
> 11.natgw             OK         0.008 Sat Dec 17 19:40:46 2016
> 11.routing           OK         0.007 Sat Dec 17 19:40:46 2016
> 13.per_ip_routing    OK         0.007 Sat Dec 17 19:40:46 2016
> 20.multipathd        OK         0.007 Sat Dec 17 19:40:46 2016
> 31.clamd             OK         0.007 Sat Dec 17 19:40:46 2016
> 40.vsftpd            OK         0.013 Sat Dec 17 19:40:46 2016
> 41.httpd             OK         0.015 Sat Dec 17 19:40:46 2016
> 49.winbind           OK         0.022 Sat Dec 17 19:40:46 2016
> 50.samba             ERROR      0.077 Sat Dec 17 19:40:46 2016
>    OUTPUT: ERROR: samba tcp port 445 is not responding
>
> I'm not a ctdb expert, but I hope this helps !
>
>
