[RFC] Performance improvements

Swen Schillig swen at vnet.ibm.com
Thu Jul 19 08:06:40 UTC 2018


Hi Amitay

Thanks for your extensive email commenting and explaining CTDB.

Am 2018-07-18 07:46, schrieb Amitay Isaacs:
> On Tue, Jul 17, 2018 at 9:07 PM, Swen Schillig via samba-technical
> <samba-technical at lists.samba.org> wrote:
> > Hi Martin
> > 
> > As suggested by you, I added some debug messages to see how CTDB
> > performs under high load in the area of change.
> > 
> > The code is updated as discussed before
> >         - no patch-10 (leaving tevent immediate scheduling as is)
> >         - no alignment patches
> >         - buffer_size raised to 32k (see comment below)
> > 
> > As a prelimenary statement, even though the debug output is slowing
> > down the system it can be seen that the positive effects reported
> > earlier are still existing.
> > 
> > 
> 
> I have spent few years trying to figure out a better way of dealing
> with this problem (may be I am just too dumb!).

I never said that nor did I intend to give that impression
with any of my patches or mails.
If you felt that way, I'm really sorry, but it was never my intention.

> Just increasingbuffer size is *definitely* the wrong thing to do
> here.

What makes you say that " just increasing the buffer size" ?
I believe the patches I posted show that it is "slightly" more than 
that.

> There are couple of reasons for it.
> 
> 1. Even though it's not clear from the output if the output relates
> to
> a single fd or multiple fds, that's irrelevant.  On a large system,
> you can have thousands of smbd connected to ctdb.  Each smbd opens
> multiple connections to ctdb (last time I checked it was 3, but could
> be down to 2).  This means there is a huge increase in the amount of
> memory pre-allocated (1k to 32k is 32 times increase).

I stated that I'm NOT planning an increase to 32k but to 8k and
that's only talking about the default of an otherwise configurable 
value. Therefore I think this is a bit too much noise on that part.

> be a problem for big servers, but could be an issue for embedded
> systems.

Taking the numbers from above and assuming that we're talking about
3000 clients for a node that would result in a memory requirement of
(3000 clients * 2 queues * 8k buffer) = 48MB
not that much isn't it.

... embedded systems run clustered SAMBA ?

> 
> 2. The main issue is unfair scheduling and in the worst case
> starvation (which is often seen as "Handling event took xx seconds"
> issue).  This has to do with the way tevent handles events.

Martin mentioned those, however, I haven't seen any of them
in the logs during the tests.

> 
> If you read large buffer from an fd, then potentially there are large
> number of packets.  queue_process() schedules an immediate event if
> there are more than one packet in the buffer. If you look at the way
> loop_once() is coded, it will first try to finish all the pending
> immediate and timer events before processing fd events.  So a large
> buffer results (generally) in large number of packets, and ctdb
> spends
> more time processing them before it can do anything else.
> 
> The next problem is that if lots of smbd processes are sending
> packets
> to ctdb, then in the event_loop() function epoll_wait() can return
> multiple fds (this can be multiple smbd clients and tcp sockets to
> other nodes).  Now event_loop() processes each fd in a loop, which
> will end up calling queue_process() in ctdb.  This means now for each
> fd, queue_process() would end up setting up immediate events.  That
> means now ctdb has to do lot more work before the next epoll_wait()
> will get called.  In such a case, if epoll_wait() did not return any
> of the tcp sockets for the other nodes, the packets from that node
> will not get processed till all the packets for all the fds (returned
> by epoll_wait) are processed.  This is usually the culprit for
> "Handling event took xx seconds" (where xx is anywhere from tens to
> hundreds of seconds).  This unnecessarily causes a failover because
> one or more node(s) think that the node is unreachable.  This is the
> worst case starvation due to unfair scheduling of fd events.

Thanks for this really good and easy to understand description !

Ok, taking all your and Martins comments into consideration I changed
the patch-set the following way.

We only read up-to (configured) buffer_size from the socket and not
what could be read. This is the original behaviour but the code got
modified to keep as much as possible of the improvements 
(e.g. direct mem size calculation and not jumping back and forth until
we have the right size)

I removed all patches which either do not contribute directly to an
expected performance improvement or are otherwise not wanted by the
community. These are
	- patch-10 (tevent)
	- consolidate alignment
	- add tunable
	- increase buffer size to 8k

So, I hope I addressed all doubts and fears with the new tiny patch-
set. Please review and comment.

Thanks in advance for your support.

Cheers Swen.
-------------- next part --------------
From 99dbb4ad2b806bab1a060f95aad692af4fe3f9b8 Mon Sep 17 00:00:00 2001
From: Swen Schillig <swen at vnet.ibm.com>
Date: Wed, 7 Mar 2018 13:54:45 +0100
Subject: [PATCH 1/4] ctdb: calculate queue input buffer size correctly

The queue's input buffer is calculated in an iterative way.
This can result in a few back and forth jumping and
a few memory allocations and mem-free cycles.
This is very time consuming and not required, because the required
memory size can be calculated right away.

Signed-off-by: Swen Schillig <swen at vnet.ibm.com>
---
 ctdb/common/ctdb_io.c | 85 +++++++++++++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 36 deletions(-)

diff --git a/ctdb/common/ctdb_io.c b/ctdb/common/ctdb_io.c
index c6941241883..1a7a7de84f0 100644
--- a/ctdb/common/ctdb_io.c
+++ b/ctdb/common/ctdb_io.c
@@ -43,7 +43,6 @@ struct ctdb_buffer {
 	uint8_t *data;
 	uint32_t length;
 	uint32_t size;
-	uint32_t extend;
 };
 
 struct ctdb_queue_pkt {
@@ -103,16 +102,15 @@ static void queue_process(struct ctdb_queue *queue)
 		return;
 	}
 
+	/* Did we at least read the size into the buffer */
 	pkt_size = *(uint32_t *)queue->buffer.data;
 	if (pkt_size == 0) {
 		DEBUG(DEBUG_CRIT, ("Invalid packet of length 0\n"));
 		goto failed;
 	}
 
+	/* the buffer doesn't contain the full packet, return to get the rest */
 	if (queue->buffer.length < pkt_size) {
-		if (pkt_size > queue->buffer_size) {
-			queue->buffer.extend = pkt_size;
-		}
 		return;
 	}
 
@@ -158,10 +156,9 @@ failed:
 */
 static void queue_io_read(struct ctdb_queue *queue)
 {
-	int num_ready = 0;
+	size_t num_ready;
 	ssize_t nread;
-	uint8_t *data;
-	int navail;
+	struct ctdb_buffer *qb = &queue->buffer;
 
 	/* check how much data is available on the socket for immediately
 	   guaranteed nonblocking access.
@@ -177,41 +174,57 @@ static void queue_io_read(struct ctdb_queue *queue)
 		goto failed;
 	}
 
-	if (queue->buffer.data == NULL) {
-		/* starting fresh, allocate buf to read data */
-		queue->buffer.data = talloc_size(queue, queue->buffer_size);
-		if (queue->buffer.data == NULL) {
-			DEBUG(DEBUG_ERR, ("read error alloc failed for %u\n", num_ready));
-			goto failed;
+	if (num_ready > (qb->size - qb->length)) {
+		if (qb->size == 0) {
+			/* no data buffer available yet */
+			qb->data = talloc_size(queue, queue->buffer_size);
+			if (qb->data == NULL) {
+				DBG_ERR("read error alloc failed for %u\n",
+					queue->buffer_size);
+				queue->callback(NULL, 0, queue->private_data);
+				return;
+			}
+
+			qb->size = queue->buffer_size;
+			num_ready = MIN(num_ready, qb->size);
+			goto data_read;
 		}
-		queue->buffer.size = queue->buffer_size;
-	} else if (queue->buffer.extend > 0) {
-		/* extending buffer */
-		data = talloc_realloc_size(queue, queue->buffer.data, queue->buffer.extend);
-		if (data == NULL) {
-			DEBUG(DEBUG_ERR, ("read error realloc failed for %u\n", queue->buffer.extend));
-			goto failed;
+
+		if (qb->length < sizeof(qb->length)) {
+			/* data read is not sufficient to gather message size */
+			num_ready = MIN(num_ready, qb->size - qb->length);
+			goto data_read;
 		}
-		queue->buffer.data = data;
-		queue->buffer.size = queue->buffer.extend;
-		queue->buffer.extend = 0;
-	}
 
-	navail = queue->buffer.size - queue->buffer.length;
-	if (num_ready > navail) {
-		num_ready = navail;
+		/* get message length */
+		size_t m_length = *(uint32_t *)(qb->data);
+
+		if (m_length > qb->size) {
+			uint8_t *data = talloc_realloc_size(queue,
+							    qb->data,
+							    m_length);
+			if (data == NULL) {
+				DBG_ERR("read error realloc failed for %u\n",
+					m_length);
+				TALLOC_FREE(qb->data);
+				memset(qb, 0, sizeof(*qb));
+				queue->callback(NULL, 0, queue->private_data);
+				return;
+			}
+			qb->data = data;
+			qb->size = m_length;
+			num_ready = MIN(num_ready, qb->size - qb->length);
+		}
 	}
 
-	if (num_ready > 0) {
-		nread = sys_read(queue->fd,
-				 queue->buffer.data + queue->buffer.length,
-				 num_ready);
-		if (nread <= 0) {
-			DEBUG(DEBUG_ERR, ("read error nread=%d\n", (int)nread));
-			goto failed;
-		}
-		queue->buffer.length += nread;
+data_read:
+	nread = sys_read(queue->fd, qb->data + qb->length, num_ready);
+
+	if (nread <= 0) {
+		DEBUG(DEBUG_ERR, ("read error nread=%d\n", (int)nread));
+		goto failed;
 	}
+	queue->buffer.length += nread;
 
 	queue_process(queue);
 	return;
-- 
2.14.4


From 67b5ada970c8c2f2f575437584ecb4a6d1514f4c Mon Sep 17 00:00:00 2001
From: Swen Schillig <swen at vnet.ibm.com>
Date: Mon, 12 Mar 2018 11:00:55 +0100
Subject: [PATCH 2/4] ctdb: remove queue destructor as it isn't needed anymore

After

commit e097b7f8ff1a9992de1d11474dac4969e30cd679
Author: David Disseldorp <ddiss at suse.de>
Date:   Sun Jul 31 03:14:54 2011 +0200

    io: Make queue_io_read() safe for reentry

the destructor has no purpose anymore, therfore, remove it.

Signed-off-by: Swen Schillig <swen at vnet.ibm.com>
---
 ctdb/common/ctdb_io.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/ctdb/common/ctdb_io.c b/ctdb/common/ctdb_io.c
index 1a7a7de84f0..0fd17e5b162 100644
--- a/ctdb/common/ctdb_io.c
+++ b/ctdb/common/ctdb_io.c
@@ -64,7 +64,6 @@ struct ctdb_queue {
 	size_t alignment;
 	void *private_data;
 	ctdb_queue_cb_fn_t callback;
-	bool *destroyed;
 	const char *name;
 	uint32_t buffer_size;
 };
@@ -424,17 +423,6 @@ int ctdb_queue_set_fd(struct ctdb_queue *queue, int fd)
 	return 0;
 }
 
-/* If someone sets up this pointer, they want to know if the queue is freed */
-static int queue_destructor(struct ctdb_queue *queue)
-{
-	TALLOC_FREE(queue->buffer.data);
-	queue->buffer.length = 0;
-	queue->buffer.size = 0;
-	if (queue->destroyed != NULL)
-		*queue->destroyed = true;
-	return 0;
-}
-
 /*
   setup a packet queue on a socket
  */
@@ -467,7 +455,6 @@ struct ctdb_queue *ctdb_queue_setup(struct ctdb_context *ctdb,
 			return NULL;
 		}
 	}
-	talloc_set_destructor(queue, queue_destructor);
 
 	queue->buffer_size = ctdb->tunable.queue_buffer_size;
 	/* In client code, ctdb->tunable is not initialized.
-- 
2.14.4


From c83f700c325700f3c1ee7dfd607f12ad230fb9a4 Mon Sep 17 00:00:00 2001
From: Swen Schillig <swen at vnet.ibm.com>
Date: Mon, 12 Mar 2018 11:00:55 +0100
Subject: [PATCH 3/4] ctdb: Introduce buffer.offset to avoid memmove

The memmove operation is quiet expensive, therefore,
a new buffer attribute "offset" is introduced to support
an optimized buffer processing.

Signed-off-by: Swen Schillig <swen at vnet.ibm.com>
---
 ctdb/common/ctdb_io.c | 83 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 55 insertions(+), 28 deletions(-)

diff --git a/ctdb/common/ctdb_io.c b/ctdb/common/ctdb_io.c
index 0fd17e5b162..6c90b278831 100644
--- a/ctdb/common/ctdb_io.c
+++ b/ctdb/common/ctdb_io.c
@@ -43,6 +43,7 @@ struct ctdb_buffer {
 	uint8_t *data;
 	uint32_t length;
 	uint32_t size;
+	uint32_t offset;
 };
 
 struct ctdb_queue_pkt {
@@ -95,17 +96,18 @@ static void queue_process_event(struct tevent_context *ev, struct tevent_immedia
 static void queue_process(struct ctdb_queue *queue)
 {
 	uint32_t pkt_size;
-	uint8_t *data;
+	uint8_t *data = NULL;
 
 	if (queue->buffer.length < sizeof(pkt_size)) {
 		return;
 	}
 
 	/* Did we at least read the size into the buffer */
-	pkt_size = *(uint32_t *)queue->buffer.data;
+	pkt_size = *(uint32_t *)(queue->buffer.data + queue->buffer.offset);
 	if (pkt_size == 0) {
 		DEBUG(DEBUG_CRIT, ("Invalid packet of length 0\n"));
-		goto failed;
+		queue->callback(NULL, 0, queue->private_data);
+		return;
 	}
 
 	/* the buffer doesn't contain the full packet, return to get the rest */
@@ -114,20 +116,27 @@ static void queue_process(struct ctdb_queue *queue)
 	}
 
 	/* Extract complete packet */
-	data = talloc_memdup(queue, queue->buffer.data, pkt_size);
+	data = talloc_memdup(queue,
+			     queue->buffer.data + queue->buffer.offset,
+			     pkt_size);
+
 	if (data == NULL) {
 		D_ERR("read error alloc failed for %u\n", pkt_size);
 		return;
 	}
 
-	/* Shift packet out from buffer */
-	if (queue->buffer.length > pkt_size) {
-		memmove(queue->buffer.data,
-			queue->buffer.data + pkt_size,
-			queue->buffer.length - pkt_size);
-	}
+	queue->buffer.offset += pkt_size;
 	queue->buffer.length -= pkt_size;
 
+	if (queue->buffer.offset < pkt_size ||
+	    queue->buffer.offset > queue->buffer.size) {
+		D_ERR("buffer offset overflow\n");
+		TALLOC_FREE(queue->buffer.data);
+		memset(&queue->buffer, 0, sizeof(queue->buffer));
+		queue->callback(NULL, 0, queue->private_data);
+		return;
+	}
+
 	if (queue->buffer.length > 0) {
 		/* There is more data to be processed, schedule an event */
 		tevent_schedule_immediate(queue->im, queue->ctdb->ev,
@@ -135,20 +144,16 @@ static void queue_process(struct ctdb_queue *queue)
 	} else {
 		if (queue->buffer.size > queue->buffer_size) {
 			TALLOC_FREE(queue->buffer.data);
-			queue->buffer.size = 0;
+			memset(&queue->buffer, 0, sizeof(queue->buffer));
+		} else {
+			queue->buffer.offset = 0;
 		}
 	}
 
 	/* It is the responsibility of the callback to free 'data' */
 	queue->callback(data, pkt_size, queue->private_data);
-	return;
-
-failed:
-	queue->callback(NULL, 0, queue->private_data);
-
 }
 
-
 /*
   called when an incoming connection is readable
   This function MUST be safe for reentry via the queue callback!
@@ -170,10 +175,11 @@ static void queue_io_read(struct ctdb_queue *queue)
 	}
 	if (num_ready == 0) {
 		/* the descriptor has been closed */
-		goto failed;
+		queue->callback(NULL, 0, queue->private_data);
+		return;
 	}
 
-	if (num_ready > (qb->size - qb->length)) {
+	if (num_ready > (qb->size - (qb->length + qb->offset))) {
 		if (qb->size == 0) {
 			/* no data buffer available yet */
 			qb->data = talloc_size(queue, queue->buffer_size);
@@ -189,6 +195,16 @@ static void queue_io_read(struct ctdb_queue *queue)
 			goto data_read;
 		}
 
+		if (qb->offset + sizeof(qb->length) > qb->size) {
+			/* the remaining space in the buffer is not sufficient
+			 * move the data and try again.
+			 */
+			memmove(qb->data, qb->data + qb->offset, qb->length);
+			qb->offset = 0;
+			num_ready = MIN(num_ready, qb->size - qb->length);
+			goto data_read;
+		}
+
 		if (qb->length < sizeof(qb->length)) {
 			/* data read is not sufficient to gather message size */
 			num_ready = MIN(num_ready, qb->size - qb->length);
@@ -196,9 +212,12 @@ static void queue_io_read(struct ctdb_queue *queue)
 		}
 
 		/* get message length */
-		size_t m_length = *(uint32_t *)(qb->data);
+		size_t m_length = *(uint32_t *)(qb->data + qb->offset);
 
 		if (m_length > qb->size) {
+			/* the message is bigger than the buffer.size.
+			 * increase the buffer to read the full message
+			 */
 			uint8_t *data = talloc_realloc_size(queue,
 							    qb->data,
 							    m_length);
@@ -212,24 +231,32 @@ static void queue_io_read(struct ctdb_queue *queue)
 			}
 			qb->data = data;
 			qb->size = m_length;
-			num_ready = MIN(num_ready, qb->size - qb->length);
 		}
+
+		if (qb->offset) {
+			/* the only thing which could prevent a full message
+			 * read, is a buffer offset. Move data and try again.
+			 */
+			memmove(qb->data, qb->data + qb->offset, qb->length);
+			qb->offset = 0;
+		}
+
+		num_ready = MIN(num_ready, qb->size - qb->length);
 	}
 
 data_read:
-	nread = sys_read(queue->fd, qb->data + qb->length, num_ready);
+	nread = sys_read(queue->fd,
+			 qb->data + qb->length + qb->offset,
+			 num_ready);
 
 	if (nread <= 0) {
 		DEBUG(DEBUG_ERR, ("read error nread=%d\n", (int)nread));
-		goto failed;
+		queue->callback(NULL, 0, queue->private_data);
+		return;
 	}
-	queue->buffer.length += nread;
+	qb->length += nread;
 
 	queue_process(queue);
-	return;
-
-failed:
-	queue->callback(NULL, 0, queue->private_data);
 }
 
 
-- 
2.14.4


From 41fc7861eb7c702735fba5e64e58258b9cb48c0e Mon Sep 17 00:00:00 2001
From: Swen Schillig <swen at vnet.ibm.com>
Date: Mon, 12 Mar 2018 17:56:21 +0100
Subject: [PATCH 4/4] ctdb: Adding memory pool for queue callback

The received packet is copied into a newly allocated memory chunk for further
processing by the assigned callback. Once this is done, the memory is free'd.
This is repeated for each received packet making the memory allocation / free
an expensive task. To optimize this process, a memory pool is defined which
is sized identically to the queue's buffer and is therefore adjustable via
tunable.queue_buffer_size.

Signed-off-by: Swen Schillig <swen at vnet.ibm.com>
---
 ctdb/common/ctdb_io.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/ctdb/common/ctdb_io.c b/ctdb/common/ctdb_io.c
index 6c90b278831..d25d81b58bc 100644
--- a/ctdb/common/ctdb_io.c
+++ b/ctdb/common/ctdb_io.c
@@ -65,6 +65,7 @@ struct ctdb_queue {
 	size_t alignment;
 	void *private_data;
 	ctdb_queue_cb_fn_t callback;
+	char *data_pool;
 	const char *name;
 	uint32_t buffer_size;
 };
@@ -116,7 +117,7 @@ static void queue_process(struct ctdb_queue *queue)
 	}
 
 	/* Extract complete packet */
-	data = talloc_memdup(queue,
+	data = talloc_memdup(queue->data_pool,
 			     queue->buffer.data + queue->buffer.offset,
 			     pkt_size);
 
@@ -379,8 +380,9 @@ int ctdb_queue_send(struct ctdb_queue *queue, uint8_t *data, uint32_t length)
 		if (length2 == 0) return 0;
 	}
 
-	pkt = talloc_size(
-		queue, offsetof(struct ctdb_queue_pkt, buf) + length2);
+	pkt = talloc_size(queue->data_pool,
+			  offsetof(struct ctdb_queue_pkt, buf) + length2);
+
 	CTDB_NO_MEMORY(queue->ctdb, pkt);
 	talloc_set_name_const(pkt, "struct ctdb_queue_pkt");
 
@@ -491,5 +493,8 @@ struct ctdb_queue *ctdb_queue_setup(struct ctdb_context *ctdb,
 		queue->buffer_size = 1024;
 	}
 
+	queue->data_pool = talloc_pool(queue, queue->buffer_size);
+	CTDB_NO_MEMORY_NULL(ctdb, queue->data_pool);
+
 	return queue;
 }
-- 
2.14.4



More information about the samba-technical mailing list