[SCM] CTDB repository - branch 1.0.112 updated - ctdb-1.0.111-142-g16a5cad

Thu Sep 2 20:01:13 MDT 2010

The branch 1.0.112 has been updated
       via  16a5cad37fa9093beb3ab5e4c24bbd61056c89f8 (commit)
       via  35b719c8e2d97ec7014401a132937a01a1f2da7f (commit)
      from  d0c57b915d225bcf4c924ff57df7abb99b3ebfd1 (commit)

http://gitweb.samba.org/?p=sahlberg/ctdb.git;a=shortlog;h=1.0.112


- Log -----------------------------------------------------------------
commit 16a5cad37fa9093beb3ab5e4c24bbd61056c89f8
Author: Ronnie Sahlberg <ronniesahlberg at gmail.com>
Date:   Fri Sep 3 11:58:27 2010 +1000

    When a memory allocation for recovery fails, don't dereference a
    NULL pointer while trying to print the log message for the failure.
    
    Also shut down ctdb with ctdb_fatal().
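
The failure mode described above can be illustrated in isolation. The following is only a minimal sketch using plain realloc() rather than talloc; `struct blob` and `blob_append()` are hypothetical names invented for this example and are not part of CTDB. It keeps the result of the reallocation in a temporary, so the error path never touches the pointer that just came back as NULL and logs only the requested size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a marshalled record buffer (not CTDB's type). */
struct blob {
	uint32_t count;   /* number of records appended so far */
	uint8_t  data[];  /* records packed back to back */
};

/*
 * Append len bytes to *buf, which currently occupies *cur_len bytes
 * (header included). Returns 0 on success, -1 on allocation failure.
 */
static int blob_append(struct blob **buf, size_t *cur_len,
		       const void *rec, size_t len)
{
	struct blob *grown;

	assert(buf != NULL && *buf != NULL && cur_len != NULL);

	grown = realloc(*buf, *cur_len + len);
	if (grown == NULL) {
		/* Log only values that do not need the failed result.
		 * *buf is still valid; the caller decides whether to abort. */
		fprintf(stderr, "failed to expand buffer to %zu bytes\n",
			*cur_len + len);
		return -1;
	}

	/* Only now is it safe to write through the new pointer. */
	memcpy((uint8_t *)grown + *cur_len, rec, len);
	grown->count++;
	*buf = grown;
	*cur_len += len;
	return 0;
}
```

In this sketch the caller chooses how to react to -1; the commit itself treats the failure as fatal and shuts the daemon down through ctdb_fatal().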

commit 35b719c8e2d97ec7014401a132937a01a1f2da7f
Author: Rusty Russell <rusty at rustcorp.com.au>
Date:   Thu Sep 2 12:44:21 2010 +0930

    eventscript: make sure we die when we timeout.
    
    Volker noticed that system() can hang on a futex: we call it inside a
    signal handler simply to dump extra diagnostics when we time out, which
    is very questionable but usually works.
    
    Add a timeout of 90 seconds: after that, commit suicide.
    (This is a workaround for this branch: master does this correctly.)
    
    Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>
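
The watchdog idea in this message can be sketched on its own. This is an illustration under stated assumptions, not the code from the commit: `watchdog_expired()` and `arm_watchdog()` are hypothetical names, and sigaction() is used in place of signal(). The sketch installs its handler for SIGALRM, since that is the signal alarm() delivers; the hunk in this changeset passes SIGTERM to signal() instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Last resort: kill every process in our process group (ourselves
 * included), then exit without running atexit handlers. Only
 * async-signal-safe calls are used here.
 */
static void watchdog_expired(int sig)
{
	(void)sig;
	kill(-getpgrp(), SIGKILL);
	_exit(1);
}

/*
 * Arm a SIGALRM watchdog before starting best-effort diagnostic work
 * that might hang. Returns 0 on success, -1 if the handler could not
 * be installed.
 */
static int arm_watchdog(unsigned int seconds)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = watchdog_expired;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL) != 0) {
		return -1;
	}
	alarm(seconds);
	return 0;
}
```

A timeout handler would call arm_watchdog(90) as its first step and only then attempt the diagnostic dump; if the dump hangs, the alarm fires and the whole process group is killed.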

-----------------------------------------------------------------------

Summary of changes:
 server/ctdb_recover.c |    6 ++----
 server/eventscript.c  |   13 +++++++++++++
 2 files changed, 15 insertions(+), 4 deletions(-)


Changeset truncated at 500 lines:

diff --git a/server/ctdb_recover.c b/server/ctdb_recover.c
index f61b6e7..b48b4e7 100644
--- a/server/ctdb_recover.c
+++ b/server/ctdb_recover.c
@@ -340,10 +340,8 @@ static int traverse_pulldb(struct tdb_context *tdb, TDB_DATA key, TDB_DATA data,
 	}
 	params->pulldata = talloc_realloc_size(NULL, params->pulldata, rec->length + params->len);
 	if (params->pulldata == NULL) {
-		DEBUG(DEBUG_ERR,(__location__ " Failed to expand pulldb_data to %u (%u records)\n", 
-			 rec->length + params->len, params->pulldata->count));
-		params->failed = true;
-		return -1;
+		DEBUG(DEBUG_CRIT,(__location__ " Failed to expand pulldb_data to %u\n", rec->length + params->len));
+		ctdb_fatal(params->ctdb, "failed to allocate memory for recovery. shutting down\n");
 	}
 	params->pulldata->count++;
 	memcpy(params->len+(uint8_t *)params->pulldata, rec, rec->length);
diff --git a/server/eventscript.c b/server/eventscript.c
index c403772..37306db 100644
--- a/server/eventscript.c
+++ b/server/eventscript.c
@@ -34,6 +34,13 @@ static struct {
 
 static void ctdb_event_script_timeout(struct event_context *ev, struct timed_event *te, struct timeval t, void *p);
 
+static void sigalarm(int sig)
+{
+	/* all the child processes will be running in the same process group */
+	kill(-getpgrp(), SIGKILL);
+	_exit(1);
+}
+
 /*
   ctdbd sends us a SIGTERM when we should time out the current script
  */
@@ -42,6 +49,12 @@ static void sigterm(int sig)
 	char tbuf[100], buf[200];
 	time_t t;
 
+	/* Calling system() inside a signal handler can do strange things:
+	 * it usually works, and that's enough for us: it's only for debugging.
+	 * But make sure we terminate. */
+	signal(SIGTERM, sigalarm);
+	alarm(90);
+
 	DEBUG(DEBUG_ERR,("Timed out running script '%s' after %.1f seconds pid :%d\n", 
 		 child_state.script_running, timeval_elapsed(&child_state.start), getpid()));
 


-- 
CTDB repository