[SCM] CTDB repository - branch master updated - ctdb-1.10-230-g518945e
Ronnie Sahlberg
sahlberg at samba.org
Thu Jul 28 17:06:55 MDT 2011
The branch, master has been updated
via 518945e59e2e48f07fcc0955f3aa81cd0d946aea (commit)
via 9d34be0233edf3bc022345c0494c4b2a4d7f8480 (commit)
via 61fc7fbd0235469df22deb6581c6bd47e30bc0be (commit)
via 0e60a738f9a6275ed45abc3d933f872d93132d92 (commit)
via 0a99e8742a261b1d3a2c8830f5c19ea6c2c47cad (commit)
via 6fcd867cc835ef1ffc1c50964f135c346503d40c (commit)
from c6bfba2bb66962b7b05d708f0747002700991472 (commit)
http://gitweb.samba.org/?p=ctdb.git;a=shortlog;h=master
- Log -----------------------------------------------------------------
commit 518945e59e2e48f07fcc0955f3aa81cd0d946aea
Merge: 0e60a738f9a6275ed45abc3d933f872d93132d92 9d34be0233edf3bc022345c0494c4b2a4d7f8480
Author: Ronnie Sahlberg <ronniesahlberg at gmail.com>
Date: Fri Jul 29 09:04:01 2011 +1000
Merge branch 'master' of 10.1.1.27:/shared/ctdb/ctdb-master
commit 9d34be0233edf3bc022345c0494c4b2a4d7f8480
Author: Martin Schwenke <martin at meltin.net>
Date: Thu Jul 28 15:22:42 2011 +1000
Tests: Initial test code for LCP2 IP allocation algorithm.
Move struct ctdb_public_ip_list to ctdb_private.h and put some
definitions for some functions from ctdb_takeover.c there. This
allows those functions to be called from unit tests.
Add ctdb_takeover_tests.c and the Makefile support to build it.
Signed-off-by: Martin Schwenke <martin at meltin.net>
commit 61fc7fbd0235469df22deb6581c6bd47e30bc0be
Author: Martin Schwenke <martin at meltin.net>
Date: Thu Jul 28 15:16:46 2011 +1000
IP allocation - add LCP2 algorithm.
The current non-deterministic IP allocation algorithm balances IPs
across the whole cluster. It does not consider different
interfaces/VLANs/subnets, so these different groups of IPs aren't
generally well balanced.
This adds the LCP2 algorithm for IP allocation and allows it to be
enabled by setting the "LCP2PublicIPs" tunable to 1.
The LCP2 algorithm calculates the imbalance of a node by totalling the
squares of the distances between each pair of IPs on the node. The IP
distance is defined as the length of the longest common prefix (LCP) of
bits found when comparing 2 IPs. The imbalance of a cluster is the
maximum imbalance for any node. At each step the algorithm selects the
IP/node combination whose allocation best reduces the imbalance of the
cluster.
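The metric described above can be sketched in a few lines of C. This is a simplified illustration only: it assumes plain 32-bit IPv4 addresses in host byte order rather than the 128-bit key used by the patch, and the names lcp_bits() and imbalance() are hypothetical, not functions from ctdb.

```c
#include <stdint.h>

/* Hypothetical helper: length in bits of the longest common prefix of
 * two 32-bit addresses, found by XOR-ing them and counting the
 * leading zero bits of the result. */
static uint32_t lcp_bits(uint32_t a, uint32_t b)
{
	uint32_t x = a ^ b;
	uint32_t n = 0;

	if (x == 0) {
		return 32;	/* identical addresses share all 32 bits */
	}
	while ((x & (UINT32_C(1) << 31)) == 0) {
		x <<= 1;
		n++;
	}
	return n;
}

/* Hypothetical helper: imbalance of one node, i.e. the sum of squared
 * distances over every unordered pair of addresses it hosts. */
static uint32_t imbalance(const uint32_t *ips, int count)
{
	uint32_t sum = 0;
	int i, j;

	for (i = 0; i < count; i++) {
		for (j = i + 1; j < count; j++) {
			uint32_t d = lcp_bits(ips[i], ips[j]);
			sum += d * d;
		}
	}
	return sum;
}
```

With these definitions, a node hosting 10.0.0.1 and 10.0.0.2 (30 common bits) has an imbalance of 900, while a node hosting 10.0.0.1 and 192.168.1.1 (no common bits) has an imbalance of 0. Lower values therefore mean the node's addresses are spread across different subnets.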
The implementation splits out the IP allocation part of
ctdb_takeover_run() into new function ctdb_takeover_run_core(), and
then extracts out the basic IP assignment code into new functions
basic_allocate_unassigned() and basic_failback(). 3 new functions
lcp2_init(), lcp2_allocate_unassigned() and lcp2_failback() implement
the LCP2 algorithm, and are hooked into ctdb_takeover_run_core().
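The allocation of unassigned addresses can be pictured as a greedy loop. The sketch below is an assumption-laden simplification, not the code in this commit: it uses 32-bit IPv4 addresses, treats every node as healthy and able to serve every IP, and the names ip_entry, placement_cost() and allocate_unassigned() are hypothetical.

```c
#include <stdint.h>

#define UNASSIGNED (-1)

/* Hypothetical record: one public IP and the node that hosts it. */
struct ip_entry {
	uint32_t addr;	/* IPv4 address, host byte order */
	int node;	/* owning node, or UNASSIGNED */
};

/* Longest common prefix, in bits, of two 32-bit addresses. */
static uint32_t lcp_bits(uint32_t a, uint32_t b)
{
	uint32_t x = a ^ b;
	uint32_t n = 0;

	if (x == 0) {
		return 32;
	}
	while ((x & (UINT32_C(1) << 31)) == 0) {
		x <<= 1;
		n++;
	}
	return n;
}

/* Cost of placing ips[idx] on node: the sum of squared distances to
 * the addresses already assigned to that node. */
static uint32_t placement_cost(const struct ip_entry *ips, int count,
			       int idx, int node)
{
	uint32_t sum = 0;
	int i;

	for (i = 0; i < count; i++) {
		uint32_t d;

		if (i == idx || ips[i].node != node) {
			continue;
		}
		d = lcp_bits(ips[idx].addr, ips[i].addr);
		sum += d * d;
	}
	return sum;
}

/* Repeatedly pick the (IP, node) pair with the smallest cost and
 * assign it, until no unassigned IP remains. */
static void allocate_unassigned(struct ip_entry *ips, int count,
				int num_nodes)
{
	for (;;) {
		int best_ip = -1, best_node = -1;
		uint32_t best_cost = 0;
		int i, n;

		for (i = 0; i < count; i++) {
			if (ips[i].node != UNASSIGNED) {
				continue;
			}
			for (n = 0; n < num_nodes; n++) {
				uint32_t cost = placement_cost(ips, count, i, n);
				if (best_ip == -1 || cost < best_cost) {
					best_ip = i;
					best_node = n;
					best_cost = cost;
				}
			}
		}
		if (best_ip == -1) {
			break;	/* nothing left to assign */
		}
		ips[best_ip].node = best_node;
	}
}
```

Given two addresses in 10.0.0.0/24 and two in 192.168.1.0/24 on a two-node cluster, this sketch places one address from each subnet on each node, which is the per-subnet balance the commit message is aiming for.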
Signed-off-by: Martin Schwenke <martin at meltin.net>
commit 0e60a738f9a6275ed45abc3d933f872d93132d92
Merge: c6bfba2bb66962b7b05d708f0747002700991472 0a99e8742a261b1d3a2c8830f5c19ea6c2c47cad
Author: Ronnie Sahlberg <ronniesahlberg at gmail.com>
Date: Fri Jul 29 08:53:43 2011 +1000
Merge branch 'master' of 10.1.1.27:/shared/ctdb/ctdb-master
commit 0a99e8742a261b1d3a2c8830f5c19ea6c2c47cad
Author: Ronnie Sahlberg <ronniesahlberg at gmail.com>
Date: Fri Jul 29 08:41:35 2011 +1000
Update the delip command
Don't talloc_free(vnn) immediately but postpone it until later when
the eventscript callback has completed.
CQ S1026664
commit 6fcd867cc835ef1ffc1c50964f135c346503d40c
Author: Rusty Russell <rusty at rustcorp.com.au>
Date: Mon Jul 25 17:56:06 2011 +0930
eventscript: fix callback after free
ctdb_event_script_callback() takes a mem_ctx arg which it doesn't use, but
the implication is pretty clear, that when that mem_ctx is freed, the callback
shouldn't happen. Indeed, Ronnie reproduced a case where that callback
refers to freed memory, in the ip reallocation code under stress.
So attach the callback to the mem_ctx they give us, and remove it from the
script state structure when that's freed. It's a bit weird, but it works.
CQ: S1026179
Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>
-----------------------------------------------------------------------
Summary of changes:
Makefile.in | 7 +
include/ctdb_private.h | 34 ++
server/ctdb_takeover.c | 656 ++++++++++++++++++++++++++++++++-------
server/ctdb_tunables.c | 1 +
server/eventscript.c | 50 +++-
tests/src/ctdb_takeover_tests.c | 378 ++++++++++++++++++++++
6 files changed, 999 insertions(+), 127 deletions(-)
create mode 100644 tests/src/ctdb_takeover_tests.c
Changeset truncated at 500 lines:
diff --git a/Makefile.in b/Makefile.in
index 8fb9ea7..d53d3db 100755
--- a/Makefile.in
+++ b/Makefile.in
@@ -70,6 +70,7 @@ TEST_BINS=tests/bin/ctdb_bench tests/bin/ctdb_fetch tests/bin/ctdb_fetch_one \
tests/bin/ctdb_fetch_lock_once tests/bin/ctdb_store \
tests/bin/ctdb_randrec tests/bin/ctdb_persistent \
tests/bin/ctdb_traverse tests/bin/rb_test tests/bin/ctdb_transaction \
+ tests/bin/ctdb_takeover_tests
@INFINIBAND_BINS@
BINS = bin/ctdb @CTDB_SCSI_IO@ bin/smnotify bin/ping_pong bin/ltdbtool
@@ -190,6 +191,12 @@ tests/bin/ctdb_transaction: $(CTDB_CLIENT_OBJ) tests/src/ctdb_transaction.o
@echo Linking $@
@$(CC) $(CFLAGS) -o $@ tests/src/ctdb_transaction.o $(CTDB_CLIENT_OBJ) $(LIB_FLAGS)
+CTDB_TAKEOVER_OBJ = $(CTDB_SERVER_OBJ:server/ctdbd.o=)
+
+tests/bin/ctdb_takeover_tests: $(CTDB_TAKEOVER_OBJ) tests/src/ctdb_takeover_tests.o
+ @echo Linking $@
+ @$(CC) $(CFLAGS) -o $@ tests/src/ctdb_takeover_tests.o $(CTDB_TAKEOVER_OBJ) $(LIB_FLAGS)
+
tests/bin/ibwrapper_test: $(CTDB_CLIENT_OBJ) ib/ibwrapper_test.o
@echo Linking $@
@$(CC) $(CFLAGS) -o $@ ib/ibwrapper_test.o $(CTDB_CLIENT_OBJ) $(LIB_FLAGS)
diff --git a/include/ctdb_private.h b/include/ctdb_private.h
index 396427b..37f8a73 100644
--- a/include/ctdb_private.h
+++ b/include/ctdb_private.h
@@ -120,6 +120,7 @@ struct ctdb_tunable {
uint32_t stat_history_interval;
uint32_t deferred_attach_timeout;
uint32_t vacuum_fast_path_count;
+ uint32_t lcp2_public_ip_assignment;
};
/*
@@ -1410,4 +1411,37 @@ int32_t ctdb_local_schedule_for_deletion(struct ctdb_db_context *ctdb_db,
struct ctdb_ltdb_header *ctdb_header_from_record_handle(struct ctdb_record_handle *h);
+/* For unit testing ctdb_takeover.c. */
+struct ctdb_public_ip_list {
+ struct ctdb_public_ip_list *next;
+ uint32_t pnn;
+ ctdb_sock_addr addr;
+};
+uint32_t ip_distance(ctdb_sock_addr *ip1, ctdb_sock_addr *ip2);
+uint32_t ip_distance_2_sum(ctdb_sock_addr *ip,
+ struct ctdb_public_ip_list *ips,
+ int pnn);
+uint32_t lcp2_imbalance(struct ctdb_public_ip_list * all_ips, int pnn);
+void lcp2_init(struct ctdb_context * tmp_ctx,
+ struct ctdb_node_map * nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ uint32_t **lcp2_imbalances,
+ bool **newly_healthy);
+void lcp2_allocate_unassigned(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ uint32_t *lcp2_imbalances);
+bool lcp2_failback(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ uint32_t *lcp2_imbalances,
+ bool *newly_healthy);
+void ctdb_takeover_run_core(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ struct ctdb_public_ip_list **all_ips_p);
+
+
#endif
diff --git a/server/ctdb_takeover.c b/server/ctdb_takeover.c
index ddbc77f..5512acc 100644
--- a/server/ctdb_takeover.c
+++ b/server/ctdb_takeover.c
@@ -3,6 +3,7 @@
Copyright (C) Ronnie Sahlberg 2007
Copyright (C) Andrew Tridgell 2007
+ Copyright (C) Martin Schwenke 2011
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -1058,13 +1059,6 @@ int ctdb_set_single_public_ip(struct ctdb_context *ctdb,
return 0;
}
-struct ctdb_public_ip_list {
- struct ctdb_public_ip_list *next;
- uint32_t pnn;
- ctdb_sock_addr addr;
-};
-
-
/* Given a physical node, return the number of
public addresses that is currently assigned to this node.
*/
@@ -1255,112 +1249,119 @@ create_merged_ip_list(struct ctdb_context *ctdb)
return ip_list;
}
-/*
- make any IP alias changes for public addresses that are necessary
+/*
+ * This is the length of the longest common prefix between the IPs.
+ * It is calculated by XOR-ing the 2 IPs together and counting the
+ * number of leading zeroes. The implementation means that all
+ * addresses end up being 128 bits long.
+ * Not static, so we can easily link it into a unit test.
+ *
+ * FIXME? Should we consider IPv4 and IPv6 separately given that the
+ * 12 bytes of 0 prefix padding will hurt the algorithm if there are
+ * lots of nodes and IP addresses?
*/
-int ctdb_takeover_run(struct ctdb_context *ctdb, struct ctdb_node_map *nodemap)
+uint32_t ip_distance(ctdb_sock_addr *ip1, ctdb_sock_addr *ip2)
{
- int i, num_healthy, retries, num_ips;
- struct ctdb_public_ip ip;
- struct ctdb_public_ipv4 ipv4;
- uint32_t mask, *nodes;
- struct ctdb_public_ip_list *all_ips, *tmp_ip;
- int maxnode, maxnum=0, minnode, minnum=0, num;
- TDB_DATA data;
- struct timeval timeout;
- struct client_async_data *async_data;
- struct ctdb_client_control_state *state;
- TALLOC_CTX *tmp_ctx = talloc_new(ctdb);
-
- /*
- * ip failover is completely disabled, just send out the
- * ipreallocated event.
- */
- if (ctdb->tunable.disable_ip_failover != 0) {
- goto ipreallocated;
- }
+ uint32_t ip1_k[IP_KEYLEN];
+ uint32_t *t;
+ int i;
+ uint32_t x;
- ZERO_STRUCT(ip);
+ uint32_t distance = 0;
- /* Count how many completely healthy nodes we have */
- num_healthy = 0;
- for (i=0;i<nodemap->num;i++) {
- if (!(nodemap->nodes[i].flags & (NODE_FLAGS_INACTIVE|NODE_FLAGS_DISABLED))) {
- num_healthy++;
+ memcpy(ip1_k, ip_key(ip1), sizeof(ip1_k));
+ t = ip_key(ip2);
+ for (i=0; i<IP_KEYLEN; i++) {
+ x = ip1_k[i] ^ t[i];
+ if (x == 0) {
+ distance += 32;
+ } else {
+ /* Count number of leading zeroes.
+ * FIXME? This could be optimised...
+ */
+ while ((x & (1U << 31)) == 0) {
+ x <<= 1;
+ distance += 1;
+ }
}
}
- if (num_healthy > 0) {
- /* We have healthy nodes, so only consider them for
- serving public addresses
- */
- mask = NODE_FLAGS_INACTIVE|NODE_FLAGS_DISABLED;
- } else {
- /* We didnt have any completely healthy nodes so
- use "disabled" nodes as a fallback
- */
- mask = NODE_FLAGS_INACTIVE;
- }
-
- /* since nodes only know about those public addresses that
- can be served by that particular node, no single node has
- a full list of all public addresses that exist in the cluster.
- Walk over all node structures and create a merged list of
- all public addresses that exist in the cluster.
+ return distance;
+}
- keep the tree of ips around as ctdb->ip_tree
- */
- all_ips = create_merged_ip_list(ctdb);
+/* Calculate the IP distance for the given IP relative to IPs on the
+ given node. The ips argument is generally the all_ips variable
+ used in the main part of the algorithm.
+ * Not static, so we can easily link it into a unit test.
+ */
+uint32_t ip_distance_2_sum(ctdb_sock_addr *ip,
+ struct ctdb_public_ip_list *ips,
+ int pnn)
+{
+ struct ctdb_public_ip_list *t;
+ uint32_t d;
- /* Count how many ips we have */
- num_ips = 0;
- for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
- num_ips++;
- }
+ uint32_t sum = 0;
- /* If we want deterministic ip allocations, i.e. that the ip addresses
- will always be allocated the same way for a specific set of
- available/unavailable nodes.
- */
- if (1 == ctdb->tunable.deterministic_public_ips) {
- DEBUG(DEBUG_NOTICE,("Deterministic IPs enabled. Resetting all ip allocations\n"));
- for (i=0,tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next,i++) {
- tmp_ip->pnn = i%nodemap->num;
+ for (t=ips; t != NULL; t=t->next) {
+ if (t->pnn != pnn) {
+ continue;
}
- }
-
- /* mark all public addresses with a masked node as being served by
- node -1
- */
- for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
- if (tmp_ip->pnn == -1) {
+ /* Optimisation: We never calculate the distance
+ * between an address and itself. This allows us to
+ * calculate the effect of removing an address from a
+ * node by simply calculating the distance between
+ * that address and all of the existing addresses.
+ * Moreover, we assume that we're only ever dealing
+ * with addresses from all_ips so we can identify an
+ * address via a pointer rather than doing a more
+ * expensive address comparison. */
+ if (&(t->addr) == ip) {
continue;
}
- if (nodemap->nodes[tmp_ip->pnn].flags & mask) {
- tmp_ip->pnn = -1;
- }
+
+ d = ip_distance(ip, &(t->addr));
+ sum += d * d; /* Cheaper than pulling in math.h :-) */
}
- /* verify that the assigned nodes can serve that public ip
- and set it to -1 if not
- */
- for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
- if (tmp_ip->pnn == -1) {
+ return sum;
+}
+
+/* Return the LCP2 imbalance metric for addresses currently assigned
+ to the given node.
+ * Not static, so we can easily link it into a unit test.
+ */
+uint32_t lcp2_imbalance(struct ctdb_public_ip_list * all_ips, int pnn)
+{
+ struct ctdb_public_ip_list *t;
+
+ uint32_t imbalance = 0;
+
+ for (t=all_ips; t!=NULL; t=t->next) {
+ if (t->pnn != pnn) {
continue;
}
- if (can_node_serve_ip(ctdb, tmp_ip->pnn, tmp_ip) != 0) {
- /* this node can not serve this ip. */
- tmp_ip->pnn = -1;
- }
+ /* Pass the rest of the IPs rather than the whole
+ all_ips input list.
+ */
+ imbalance += ip_distance_2_sum(&(t->addr), t->next, pnn);
}
+ return imbalance;
+}
+
+/* Allocate any unassigned IPs just by looping through the IPs and
+ * finding the best node for each.
+ * Not static, so we can easily link it into a unit test.
+ */
+void basic_allocate_unassigned(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips)
+{
+ struct ctdb_public_ip_list *tmp_ip;
- /* now we must redistribute all public addresses with takeover node
- -1 among the nodes available
- */
- retries = 0;
-try_again:
/* loop over all ip's and find a physical node to cover for
each unassigned ip.
*/
@@ -1372,26 +1373,26 @@ try_again:
}
}
}
+}
- /* If we dont want ips to fail back after a node becomes healthy
- again, we wont even try to reallocat the ip addresses so that
- they are evenly spread out.
- This can NOT be used at the same time as DeterministicIPs !
- */
- if (1 == ctdb->tunable.no_ip_failback) {
- if (1 == ctdb->tunable.deterministic_public_ips) {
- DEBUG(DEBUG_ERR, ("ERROR: You can not use 'DeterministicIPs' and 'NoIPFailback' at the same time\n"));
- }
- goto finished;
- }
-
+/* Basic non-deterministic rebalancing algorithm.
+ * Not static, so we can easily link it into a unit test.
+ */
+bool basic_failback(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ int num_ips,
+ int *retries)
+{
+ int i;
+ int maxnode, maxnum=0, minnode, minnum=0, num;
+ struct ctdb_public_ip_list *tmp_ip;
- /* now, try to make sure the ip adresses are evenly distributed
- across the node.
- for each ip address, loop over all nodes that can serve this
- ip and make sure that the difference between the node
- serving the most and the node serving the least ip's are not greater
- than 1.
+ /* for each ip address, loop over all nodes that can serve
+ this ip and make sure that the difference between the node
+ serving the most and the node serving the least ip's are
+ not greater than 1.
*/
for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
if (tmp_ip->pnn == -1) {
@@ -1455,7 +1456,7 @@ try_again:
want to spend too much time balancing the ip coverage.
*/
if ( (maxnum > minnum+1)
- && (retries < (num_ips + 5)) ){
+ && (*retries < (num_ips + 5)) ){
struct ctdb_public_ip_list *tmp;
/* mark one of maxnode's vnn's as unassigned and try
@@ -1464,14 +1465,403 @@ try_again:
for (tmp=all_ips;tmp;tmp=tmp->next) {
if (tmp->pnn == maxnode) {
tmp->pnn = -1;
- retries++;
- goto try_again;
+ (*retries)++;
+ return true;
+ }
+ }
+ }
+ }
+
+ return false;
+}
+
+/* Do necessary LCP2 initialisation. Bury it in a function here so
+ * that we can unit test it.
+ * Not static, so we can easily link it into a unit test.
+ */
+void lcp2_init(struct ctdb_context * tmp_ctx,
+ struct ctdb_node_map * nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ uint32_t **lcp2_imbalances,
+ bool **newly_healthy)
+{
+ int i;
+ struct ctdb_public_ip_list *tmp_ip;
+
+ *newly_healthy = talloc_array(tmp_ctx, bool, nodemap->num);
+ CTDB_NO_MEMORY_FATAL(tmp_ctx, *newly_healthy);
+ *lcp2_imbalances = talloc_array(tmp_ctx, uint32_t, nodemap->num);
+ CTDB_NO_MEMORY_FATAL(tmp_ctx, *lcp2_imbalances);
+
+ for (i=0;i<nodemap->num;i++) {
+ (*lcp2_imbalances)[i] = lcp2_imbalance(all_ips, i);
+ /* First step: is the node "healthy"? */
+ (*newly_healthy)[i] = ! (bool)(nodemap->nodes[i].flags & mask);
+ }
+
+ /* 2nd step: if a node has IPs assigned then it must have been
+ * healthy before, so we remove it from consideration... */
+ for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
+ if (tmp_ip->pnn != -1) {
+ (*newly_healthy)[tmp_ip->pnn] = false;
+ }
+ }
+}
+
+/* Allocate any unassigned addresses using the LCP2 algorithm to find
+ * the IP/node combination that will cost the least.
+ * Not static, so we can easily link it into a unit test.
+ */
+void lcp2_allocate_unassigned(struct ctdb_context *ctdb,
+ struct ctdb_node_map *nodemap,
+ uint32_t mask,
+ struct ctdb_public_ip_list *all_ips,
+ uint32_t *lcp2_imbalances)
+{
+ struct ctdb_public_ip_list *tmp_ip;
+ int dstnode;
+
+ int minnode;
+ uint32_t mindsum, dstdsum, dstimbl, minimbl;
+ struct ctdb_public_ip_list *minip;
+
+ bool should_loop = true;
+ bool have_unassigned = true;
+
+ while (have_unassigned && should_loop) {
+ should_loop = false;
+
+ DEBUG(DEBUG_DEBUG,(" ----------------------------------------\n"));
+ DEBUG(DEBUG_DEBUG,(" CONSIDERING MOVES (UNASSIGNED)\n"));
+
+ minnode = -1;
+ mindsum = 0;
+ minip = NULL;
+
+ /* loop over each unassigned ip. */
+ for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
+ if (tmp_ip->pnn != -1) {
+ continue;
+ }
+
+ for (dstnode=0; dstnode < nodemap->num; dstnode++) {
+ /* only check nodes that can actually serve this ip */
+ if (can_node_serve_ip(ctdb, dstnode, tmp_ip)) {
+ /* no it couldn't so skip to the next node */
+ continue;
+ }
+ if (nodemap->nodes[dstnode].flags & mask) {
+ continue;
}
+
+ dstdsum = ip_distance_2_sum(&(tmp_ip->addr), all_ips, dstnode);
+ dstimbl = lcp2_imbalances[dstnode] + dstdsum;
+ DEBUG(DEBUG_DEBUG,(" %s -> %d [+%d]\n",
+ ctdb_addr_to_str(&(tmp_ip->addr)),
+ dstnode,
+ dstimbl - lcp2_imbalances[dstnode]));
+
+
+ if ((minnode == -1) || (dstdsum < mindsum)) {
+ minnode = dstnode;
+ minimbl = dstimbl;
+ mindsum = dstdsum;
+ minip = tmp_ip;
+ should_loop = true;
+ }
+ }
+ }
+
+ DEBUG(DEBUG_DEBUG,(" ----------------------------------------\n"));
+
+ /* If we found one then assign it to the given node. */
+ if (minnode != -1) {
+ minip->pnn = minnode;
+ lcp2_imbalances[minnode] = minimbl;
+ DEBUG(DEBUG_INFO,(" %s -> %d [+%d]\n",
+ ctdb_addr_to_str(&(minip->addr)),
+ minnode,
+ mindsum));
+ }
+
+ /* There might be a better way but at least this is clear. */
+ have_unassigned = false;
+ for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
+ if (tmp_ip->pnn == -1) {
+ have_unassigned = true;
+ }
+ }
+ }
+
+ /* We know if we have any unassigned addresses so we might as
+ * well optimise.
+ */
+ if (have_unassigned) {
+ for (tmp_ip=all_ips;tmp_ip;tmp_ip=tmp_ip->next) {
--
CTDB repository