Rev 569: merge from ronnie in http://samba.org/~tridge/ctdb

Tue Jul 10 04:59:23 GMT 2007

------------------------------------------------------------
revno: 569
revision-id: tridge at samba.org-20070710045923-a102dnq558tb1c0v
parent: tridge at samba.org-20070708110909-b89ygenzjww0kazl
parent: sahlberg at ronnie-20070710030935-8vbusw2q37a1mm5g
committer: Andrew Tridgell <tridge at samba.org>
branch nick: tridge
timestamp: Tue 2007-07-10 14:59:23 +1000
message:
  merge from ronnie
modified:
  config/events.d/60.nfs         nfs-20070601141008-hy3h4qgbk1jd2jci-1
  server/ctdb_recoverd.c         recoverd.c-20070503213540-bvxuyd9jm1f7ig90-1
  tools/ctdb.c                   ctdb_control.c-20070426122705-9ehj1l5lu2gn9kuj-1
  web/nfs.html                   nfs.html-20070608234340-a8i1dxro7a7i6jz6-1
    ------------------------------------------------------------
    revno: 432.1.121
    merged: sahlberg at ronnie-20070710030935-8vbusw2q37a1mm5g
    parent: sahlberg at ronnie-20070710024346-wmbpi73nq4uc3h8v
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Tue 2007-07-10 13:09:35 +1000
    message:
      use the socketkiller to kill off all lock manager sessions as well
    ------------------------------------------------------------
    revno: 432.1.120
    merged: sahlberg at ronnie-20070710024346-wmbpi73nq4uc3h8v
    parent: sahlberg at ronnie-20070710002420-n8v37hequ5tj3rz7
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Tue 2007-07-10 12:43:46 +1000
    message:
      update the documentation for NFS to mention that the lock manager must 
      run on the same port on all nodes.
      
      remove the CTDB_MANAGES_NFSLOCK variable that is no longer used
    ------------------------------------------------------------
    revno: 432.1.119
    merged: sahlberg at ronnie-20070710002420-n8v37hequ5tj3rz7
    parent: sahlberg at ronnie-20070710000726-nssik88h3the46ea
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Tue 2007-07-10 10:24:20 +1000
    message:
      make it possible to specify how many times ctdb killtcp will try to RST 
      the tcp connection
      
      change the 60.nfs script to run ctdb killtcp in the foreground so we 
      don't get lots of these running in parallel when there are a lot of tcp 
      connections to RST
    ------------------------------------------------------------
    revno: 432.1.118
    merged: sahlberg at ronnie-20070710000726-nssik88h3the46ea
    parent: sahlberg at ronnie-20070709234514-5xij3ft3msqanw54
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Tue 2007-07-10 10:07:26 +1000
    message:
      run the ctdb killtcp in the background
    ------------------------------------------------------------
    revno: 432.1.117
    merged: sahlberg at ronnie-20070709234514-5xij3ft3msqanw54
    parent: sahlberg at ronnie-20070709074015-xqypz1b1mjdaisxx
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Tue 2007-07-10 09:45:14 +1000
    message:
      don't restart the tcp service after an ip takeover; it is more efficient 
      to just kill off the tcp connections
    ------------------------------------------------------------
    revno: 432.1.116
    merged: sahlberg at ronnie-20070709074015-xqypz1b1mjdaisxx
    parent: sahlberg at ronnie-20070709032117-zh4bv6m3teth2l8f
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Mon 2007-07-09 17:40:15 +1000
    message:
      nicer handling of DISCONNECTED flag  when we update the node flags from 
      a remote message
    ------------------------------------------------------------
    revno: 432.1.115
    merged: sahlberg at ronnie-20070709032117-zh4bv6m3teth2l8f
    parent: sahlberg at ronnie-20070709025515-xzh5dloctgx1aitl
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Mon 2007-07-09 13:21:17 +1000
    message:
      when a remote node has sent us a message to update the flags for a node, 
      don't let those messages modify the DISCONNECTED flag.
      
      the DISCONNECTED flag must be managed locally since it describes whether 
      the local node can communicate with the remote node or not
    ------------------------------------------------------------
    revno: 432.1.114
    merged: sahlberg at ronnie-20070709025515-xzh5dloctgx1aitl
    parent: sahlberg at ronnie-20070709023300-32933wlyy4sjmjpg
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Mon 2007-07-09 12:55:15 +1000
    message:
      a better way to fix the DISCONNECT|BANNED vs DISCONNECT bug
    ------------------------------------------------------------
    revno: 432.1.113
    merged: sahlberg at ronnie-20070709023300-32933wlyy4sjmjpg
    parent: sahlberg at ronnie-20070708223801-y484ulft770kzu0u
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Mon 2007-07-09 12:33:00 +1000
    message:
      when checking the nodemap flags for consistency while monitoring the 
      cluster, we can't check that the BANNED and the DISCONNECTED flags 
      are both set at the same time, since if a node becomes banned just 
      before it is DISCONNECTED there is no guarantee that all other nodes 
      will have seen the BANNED flag.
      
      So we must first check the DISCONNECTED flag only, and only if the 
      DISCONNECTED flag is not set should we check the BANNED flag.
      
      otherwise this can cause a recovery loop while some nodes think the 
      disconnected node is DISCONNECTED|BANNED and others think it is just 
      DISCONNECTED
    ------------------------------------------------------------
    revno: 432.1.112
    merged: sahlberg at ronnie-20070708223801-y484ulft770kzu0u
    parent: sahlberg at ronnie-20070706052903-ubnhrajew6z45y29
    parent: tridge at samba.org-20070708110909-b89ygenzjww0kazl
    committer: Ronnie Sahlberg <sahlberg at ronnie>
    branch nick: ctdb
    timestamp: Mon 2007-07-09 08:38:01 +1000
    message:
      merge from tridge
=== modified file 'config/events.d/60.nfs'

--- a/config/events.d/60.nfs	2007-07-06 00:54:42 +0000
+++ b/config/events.d/60.nfs	2007-07-10 03:09:35 +0000
@@ -61,12 +61,32 @@
 	;;
 
      recovered)
-        # restart NFS to ensure that all TCP connections to the released ip
-	# are closed
+	[ -f /etc/ctdb/state/nfs/restart ] && [ ! -z "$LOCKD_TCPPORT" ] && {
+	        # RST all tcp connections used for NLM to ensure that they do
+		# not survive in ESTABLISHED state across a failover/failback
+		# and create an ack storm
+		netstat -tn |egrep "^tcp.*\s+[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:${LOCKD_TCPPORT}\s+.*ESTABLISHED" | awk '{print $4" "$5}' | while read dest src; do
+			srcip=`echo $src | cut -d: -f1`
+			srcport=`echo $src | cut -d: -f2`
+			destip=`echo $dest | cut -d: -f1`
+			destport=`echo $dest | cut -d: -f2`
+			ctdb killtcp $srcip:$srcport $destip:$destport 1 >/dev/null 2>&1 
+#			ctdb killtcp $destip:$destport $srcip:$srcport 1 >/dev/null 2>&1
+		done
+	} > /dev/null 2>&1
+
 	[ -f /etc/ctdb/state/nfs/restart ] && {
-		( service nfs status > /dev/null 2>&1 && 
-                      service nfs restart > /dev/null 2>&1 &&
-		      service nfslock restart > /dev/null 2>&1 ) &
+	        # RST all tcp connections used for NFS to ensure that they do
+		# not survive in ESTABLISHED state across a failover/failback
+		# and create an ack storm
+		netstat -tn |egrep '^tcp.*\s+[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:2049\s+.*ESTABLISHED' | awk '{print $4" "$5}' | while read dest src; do
+			srcip=`echo $src | cut -d: -f1`
+			srcport=`echo $src | cut -d: -f2`
+			destip=`echo $dest | cut -d: -f1`
+			destport=`echo $dest | cut -d: -f2`
+			ctdb killtcp $srcip:$srcport $destip:$destport 1 >/dev/null 2>&1 
+			ctdb killtcp $destip:$destport $srcip:$srcport 1 >/dev/null 2>&1
+		done
 	} > /dev/null 2>&1
 	/bin/rm -f /etc/ctdb/state/nfs/restart
 

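The two loops added to the recovered event differ only in the port they match and in whether the reverse direction is also reset. As a rough sketch, the shared parsing could be factored into one function. The name established_pairs is hypothetical and not part of ctdb. It reads `netstat -tn` style lines on stdin so it can be tried on canned input, and it assumes the Linux column layout (local address in field 4, foreign address in field 5, state in field 6).

```shell
# Hypothetical helper, not part of ctdb. Reads `netstat -tn` style lines on
# stdin and prints "<foreign ip:port> <local ip:port>" for every ESTABLISHED
# IPv4 connection whose local port equals $1.
established_pairs() {
    awk -v port="$1" '
        $1 == "tcp" && $6 == "ESTABLISHED" {
            # Field 4 is the local address, field 5 the foreign address.
            n = split($4, laddr, ":")
            if (n == 2 && laddr[2] == port) print $5, $4
        }'
}
```

Something like `netstat -tn | established_pairs 2049 | while read src dest; do ctdb killtcp $src $dest 1; done` would then mirror the first killtcp call in the NFS loop above.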
=== modified file 'server/ctdb_recoverd.c'
--- a/server/ctdb_recoverd.c	2007-07-03 22:36:59 +0000
+++ b/server/ctdb_recoverd.c	2007-07-09 07:40:15 +0000
@@ -385,11 +385,6 @@
 	for (i=0;i<nodemap->num;i++) {
 		struct ctdb_node_flag_change c;
 		TDB_DATA data;
-		uint32_t flags = nodemap->nodes[i].flags;
-
-		if (flags & NODE_FLAGS_DISCONNECTED) {
-			continue;
-		}
 
 		c.vnn = nodemap->nodes[i].vnn;
 		c.flags = nodemap->nodes[i].flags;
@@ -1073,6 +1068,15 @@
 		return;
 	}
 
+	/* Dont let messages from remote nodes change the DISCONNECTED flag. 
+	   This flag is handled locally based on whether the local node
+	   can communicate with the node or not.
+	*/
+	c->flags &= ~NODE_FLAGS_DISCONNECTED;
+	if (nodemap->nodes[i].flags&NODE_FLAGS_DISCONNECTED) {
+		c->flags |= NODE_FLAGS_DISCONNECTED;
+	}
+
 	if (nodemap->nodes[i].flags != c->flags) {
 		DEBUG(0,("Node %u has changed flags - now 0x%x\n", c->vnn, c->flags));
 	}
@@ -1327,7 +1331,7 @@
 			}
 			if ((remote_nodemap->nodes[i].flags & NODE_FLAGS_INACTIVE) != 
 			    (nodemap->nodes[i].flags & NODE_FLAGS_INACTIVE)) {
-				DEBUG(0, (__location__ " Remote node:%u has different nodemap flags for %d (0x%x vs 0x%x)\n", 
+				DEBUG(0, (__location__ " Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)\n", 
 					  nodemap->nodes[j].vnn, i,
 					  remote_nodemap->nodes[i].flags, nodemap->nodes[i].flags));
 				do_recovery(rec, mem_ctx, vnn, num_active, nodemap, 

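The second hunk above keeps the locally observed DISCONNECTED bit and takes every other bit from the remote message. The same masking can be tried at a shell prompt with the sketch below. The function name merge_flags and the bit value 0x1 for DISCONNECTED are assumptions for illustration only; the real value is whatever NODE_FLAGS_DISCONNECTED is defined as in the ctdb headers.

```shell
# Assumed bit value, for illustration only; see the ctdb headers for the
# real definition of NODE_FLAGS_DISCONNECTED.
DISCONNECTED=0x1

# merge_flags <remote_flags> <local_flags>
# Prints the flags to store: every bit from the remote message except
# DISCONNECTED, which is copied from the local view of the node.
merge_flags() {
    echo $(( ($1 & ~$DISCONNECTED) | ($2 & $DISCONNECTED) ))
}
```

With these assumed values, `merge_flags 0x9 0x0` prints 8: the remote message's DISCONNECTED bit is dropped because the local node can still reach the peer, while the other bit is kept.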
=== modified file 'tools/ctdb.c'
--- a/tools/ctdb.c	2007-07-05 00:00:51 +0000
+++ b/tools/ctdb.c	2007-07-10 00:24:20 +0000
@@ -308,10 +308,10 @@
  */
 static int kill_tcp(struct ctdb_context *ctdb, int argc, const char **argv)
 {
-	int i, ret;
+	int i, ret, numrst;
 	struct sockaddr_in src, dst;
 
-	if (argc < 2) {
+	if (argc < 3) {
 		usage();
 	}
 
@@ -325,7 +325,9 @@
 		return -1;
 	}
 
-	for (i=0;i<5;i++) {
+	numrst = strtoul(argv[2], NULL, 0);
+
+	for (i=0;i<numrst;i++) {
 		ret = ctdb_sys_kill_tcp(ctdb->ev, &src, &dst);
 
 		printf("ret:%d\n", ret);
@@ -889,7 +891,7 @@
 	{ "recover",         control_recover,           true,  "force recovery" },
 	{ "freeze",          control_freeze,            true,  "freeze all databases" },
 	{ "thaw",            control_thaw,              true,  "thaw all databases" },
-	{ "killtcp",         kill_tcp,                  false, "kill a tcp connection", "<srcip:port> <dstip:port>" },
+	{ "killtcp",         kill_tcp,                  false, "kill a tcp connection. Try <num> times.", "<srcip:port> <dstip:port> <num>" },
 	{ "tickle",          tickle_tcp,                false, "send a tcp tickle ack", "<srcip:port> <dstip:port>" },
 };
 

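Because the new count is read with strtoul() and not checked, a non-numeric third argument parses as 0 and the loop sends no RST at all. A calling script may want to validate the count first. The sketch below is one way to do that; valid_count is a hypothetical helper, not a ctdb command.

```shell
# Hypothetical helper: succeeds only if $1 is a positive decimal integer.
valid_count() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}
```

A caller could then write `valid_count "$n" && ctdb killtcp $srcip:$srcport $destip:$destport "$n"`, so that a mistyped count fails loudly instead of silently skipping the reset.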
=== modified file 'web/nfs.html'
--- a/web/nfs.html	2007-06-12 04:43:26 +0000
+++ b/web/nfs.html	2007-07-10 02:43:46 +0000
@@ -47,16 +47,18 @@
 This file should look something like :
 <pre>
   CTDB_MANAGES_NFS=yes
-  CTDB_MANAGES_NFSLOCK=yes
+  LOCKD_TCPPORT=599
+  LOCKD_UDPPORT=599
   STATD_SHARED_DIRECTORY=/gpfs0/nfs-state
-  STATD_HOSTNAME=\"ctdb -P $STATD_SHARED_DIRECTORY/192.168.1.1 -H /etc/ctdb/statd-callout -p 97\"
+  STATD_HOSTNAME="ctdb -P $STATD_SHARED_DIRECTORY/192.168.1.1 -H /etc/ctdb/statd-callout -p 97"
 </pre>
 
 The CTDB_MANAGES_NFS line tells the events scripts that CTDB is to manage startup and shutdown of the NFS and NFSLOCK services.<br>
 
-The CTDB_MANAGES_NFSLOCK line tells the events scripts that CTDB is also to manage the nfs lock manager.<br>
+With this set to yes, CTDB will start/stop/restart these services as required.<br><br>
 
-With these set to yes, CTDB will start/stop/restart these services as required.<br><br>
+You need to make sure that the lock manager runs on the same port on all nodes in the cluster since some clients will have "issues" and take very long to recover if the port suddenly changes.<br>
+599 above is only an example. You can run the lock manager on any available port as long as you use the same port on all nodes.<br><br>
 
 STATD_SHARED_DIRECTORY is the shared directory where statd and the statd-callout script expects that the state variables and lists of clients to notify are found.<br>