Segfault in smbd: messaging_dgm_fde_active

Stefan Metzmacher metze at samba.org
Tue Jul 10 15:01:36 UTC 2018


Hi,

here's an updated patchset including a test that also demonstrates the
problem.

I've also included two fixes to avoid two other flaky tests.

See also https://gitlab.com/samba-team/devel/samba/pipelines/25487638

Please review and push :-)

Thanks!
metze

On 10.07.2018 at 14:39, Stefan Metzmacher via samba-technical wrote:
> Hi Jeremy,
> 
>>> OK, I'm reviewing this now, and I think I understand
>>> it, but once I'm done I might ask you to add some
>>> clarification to the commit message (or not, depending
>>> on how easy this is to understand :-).
>>
>> OK, I do think I understand it, and I think the patch hunk
>> in the destructor isn't quite right.
>>
>> Shouldn't it look like:
>>
>> @@ -150,6 +150,11 @@ static int msg_dgm_ref_destructor(struct msg_dgm_ref *r)
>>         if (refs == NULL) {
>>                 abort();
>>         }
>> +
>> +       if (r == next_ref) {
>> +               next_ref = r->next;
>> +       }
>> +
>>         DLIST_REMOVE(refs, r);
>>  
>>         TALLOC_FREE(r->fde);
>>
>> instead of:
>>
>> @ -152,6 +153,10 @@ static int msg_dgm_ref_destructor(struct msg_dgm_ref *r)
>>         }
>>         DLIST_REMOVE(refs, r);
>>  
>> +       if (r == next_ref) {
>> +               next_ref = NULL;
>> +       }
>> +
>>         TALLOC_FREE(r->fde);
>>  
>>         DBG_DEBUG("refs=%p\n", refs);
>>
>> i.e. rather than arbitrarily setting next_ref to NULL, make it
>> point to the 'r->next' of the element currently being
>> removed *before* we remove it. That way we don't prematurely
>> terminate the list walk in msg_dgm_ref_recv().
>>
>> If I'm wrong, can you explain why (because then I
>> didn't understand it :-) ?
> 
> You are right! Thanks for finding that!
> 
> metze
> 
> 

-------------- next part --------------
From 45db46607b63248310eb2ad3dd4a0ea783042202 Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Tue, 19 Jun 2018 10:35:04 +0200
Subject: [PATCH 1/4] ctdb: close the correct pipe fd in a test

This was discovered in an autobuild with a patched tevent that used the
"poll" backend by default. Test failure:

$ bin/sock_daemon_test /dev/shm/sock_daemon_test.pid /dev/shm/sock_daemon_test.sock 5
test5[28011]: daemon started, pid=28011
test5[28011]: listening on /dev/shm/sock_daemon_test.sock
sock_daemon_test: ../ctdb/tests/src/sock_daemon_test.c:980: test5: Assertion `ret == i+1' failed.
Aborted (core dumped)
metze at SERNOX14:~/devel/samba/4.0/master4-test$ test5[28011]: PID 28010 gone away, exiting
test5[28011]: Shutting down
sock_daemon_test: ../ctdb/tests/src/sock_daemon_test.c:964: test5:
Assertion `ret == EINTR' failed.

After an epic debugging session we spotted the problem: the test closed
the pipe's read end (fd[0]) instead of its write end (fd[1]).

Signed-off-by: Ralph Boehme <slow at samba.org>
Reviewed-by: Stefan Metzmacher <metze at samba.org>
---
 ctdb/tests/src/sock_daemon_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ctdb/tests/src/sock_daemon_test.c b/ctdb/tests/src/sock_daemon_test.c
index 916ba2236f03..80004b41c917 100644
--- a/ctdb/tests/src/sock_daemon_test.c
+++ b/ctdb/tests/src/sock_daemon_test.c
@@ -759,7 +759,7 @@ static int test5_client(const char *sockpath, int id, pid_t pid_server,
 			tevent_loop_once(ev);
 		}
 
-		close(fd[0]);
+		close(fd[1]);
 		state.fd = -1;
 
 		while (kill(pid_server, 0) == 0 || errno != ESRCH) {
-- 
2.17.1


From c3da01eb6b212e5c891c2d11a41333df5ab0b997 Mon Sep 17 00:00:00 2001
From: Stefan Metzmacher <metze at samba.org>
Date: Thu, 21 Jun 2018 06:31:03 +0200
Subject: [PATCH 2/4] python/tests: make the test_assoc_group_fail2() test more
 resilient against timing

On a busy system [e]poll() on the server will mark both the
old connection fd and also the listening fd as readable.

epoll() returns the events in order, so the server processes the
disconnect first.

With poll() we don't have an order of the events and the
server is likely to process the connect before the disconnect.

Signed-off-by: Stefan Metzmacher <metze at samba.org>
Reviewed-by: Ralph Boehme <slow at samba.org>
---
 python/samba/tests/dcerpc/raw_protocol.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/python/samba/tests/dcerpc/raw_protocol.py b/python/samba/tests/dcerpc/raw_protocol.py
index ff815e97ca63..7cc3a4a29b52 100755
--- a/python/samba/tests/dcerpc/raw_protocol.py
+++ b/python/samba/tests/dcerpc/raw_protocol.py
@@ -18,6 +18,7 @@
 
 import sys
 import os
+import time
 
 sys.path.insert(0, "bin/python")
 os.environ["PYTHONUNBUFFERED"] = "1"
@@ -4930,6 +4931,8 @@ class TestDCERPC_BIND(RawDCERPCTest):
         ack = self.do_generic_bind(ctx=ctx)
 
         self._disconnect("test_assoc_group_fail2")
+        self.assertNotConnected()
+        time.sleep(0.5)
         self.connect()
 
         ack2 = self.do_generic_bind(ctx=ctx,assoc_group_id=ack.u.assoc_group_id,
-- 
2.17.1


From 6310e77342d7bab6856aef1c343a060973a73c7b Mon Sep 17 00:00:00 2001
From: Stefan Metzmacher <metze at samba.org>
Date: Tue, 10 Jul 2018 16:21:55 +0200
Subject: [PATCH 3/4] s4:messaging: add local.messaging.multi_ctx.multi_ctx
 test

This tests the usage of multiple imessaging_contexts in one process
and also freeing two of them during a message handler.

Bug: https://bugzilla.samba.org/show_bug.cgi?id=13514

Signed-off-by: Stefan Metzmacher <metze at samba.org>
---
 selftest/knownfail.d/imessaging         |   1 +
 source4/lib/messaging/tests/messaging.c | 221 ++++++++++++++++++++++++
 2 files changed, 222 insertions(+)
 create mode 100644 selftest/knownfail.d/imessaging

diff --git a/selftest/knownfail.d/imessaging b/selftest/knownfail.d/imessaging
new file mode 100644
index 000000000000..af8fa8d615a6
--- /dev/null
+++ b/selftest/knownfail.d/imessaging
@@ -0,0 +1 @@
+^samba4.local.messaging.multi_ctx
diff --git a/source4/lib/messaging/tests/messaging.c b/source4/lib/messaging/tests/messaging.c
index ba5897882880..80c8583c21eb 100644
--- a/source4/lib/messaging/tests/messaging.c
+++ b/source4/lib/messaging/tests/messaging.c
@@ -393,6 +393,226 @@ static bool test_messaging_overflow_check(struct torture_context *tctx)
 	return true;
 }
 
+struct test_multi_ctx {
+	struct torture_context *tctx;
+	struct imessaging_context *server_ctx;
+	struct imessaging_context *client_ctx[4];
+	size_t num_missing;
+	bool got_server;
+	bool got_client_0_1;
+	bool got_client_2_3;
+	bool ok;
+};
+
+static void multi_ctx_server_handler(struct imessaging_context *msg,
+				     void *private_data,
+				     uint32_t msg_type,
+				     struct server_id server_id,
+				     DATA_BLOB *data)
+{
+	struct test_multi_ctx *state = private_data;
+	char *str = NULL;
+
+	torture_assert_goto(state->tctx, state->num_missing >= 1,
+			    state->ok, fail,
+			    "num_missing should be at least 1.");
+	state->num_missing -= 1;
+
+	torture_assert_goto(state->tctx, !state->got_server,
+			    state->ok, fail,
+			    "already got server.");
+	state->got_server = true;
+
+	/*
+	 * We free the context itself and most likely reuse
+	 * the memory immediately.
+	 */
+	TALLOC_FREE(state->server_ctx);
+	str = generate_random_str(state->tctx, 128);
+	torture_assert_goto(state->tctx, str != NULL,
+			    state->ok, fail,
+			    "generate_random_str()");
+
+fail:
+	return;
+}
+
+static void multi_ctx_client_0_1_handler(struct imessaging_context *msg,
+					 void *private_data,
+					 uint32_t msg_type,
+					 struct server_id server_id,
+					 DATA_BLOB *data)
+{
+	struct test_multi_ctx *state = private_data;
+	char *str = NULL;
+
+	torture_assert_goto(state->tctx, state->num_missing >= 2,
+			    state->ok, fail,
+			    "num_missing should be at least 2.");
+	state->num_missing -= 2;
+
+	torture_assert_goto(state->tctx, !state->got_client_0_1,
+			    state->ok, fail,
+			    "already got client_0_1.");
+	state->got_client_0_1 = true;
+
+	/*
+	 * We free two contexts and most likely reuse
+	 * the memory immediately.
+	 */
+	TALLOC_FREE(state->client_ctx[0]);
+	str = generate_random_str(state->tctx, 128);
+	torture_assert_goto(state->tctx, str != NULL,
+			    state->ok, fail,
+			    "generate_random_str()");
+	TALLOC_FREE(state->client_ctx[1]);
+	str = generate_random_str(state->tctx, 128);
+	torture_assert_goto(state->tctx, str != NULL,
+			    state->ok, fail,
+			    "generate_random_str()");
+
+fail:
+	return;
+}
+
+static void multi_ctx_client_2_3_handler(struct imessaging_context *msg,
+					 void *private_data,
+					 uint32_t msg_type,
+					 struct server_id server_id,
+					 DATA_BLOB *data)
+{
+	struct test_multi_ctx *state = private_data;
+	char *str = NULL;
+
+	torture_assert_goto(state->tctx, state->num_missing >= 2,
+			    state->ok, fail,
+			    "num_missing should be at least 2.");
+	state->num_missing -= 2;
+
+	torture_assert_goto(state->tctx, !state->got_client_2_3,
+			    state->ok, fail,
+			    "already got client_2_3.");
+	state->got_client_2_3 = true;
+
+	/*
+	 * We free two contexts and most likely reuse
+	 * the memory immediately.
+	 */
+	TALLOC_FREE(state->client_ctx[2]);
+	str = generate_random_str(state->tctx, 128);
+	torture_assert_goto(state->tctx, str != NULL,
+			    state->ok, fail,
+			    "generate_random_str()");
+	TALLOC_FREE(state->client_ctx[3]);
+	str = generate_random_str(state->tctx, 128);
+	torture_assert_goto(state->tctx, str != NULL,
+			    state->ok, fail,
+			    "generate_random_str()");
+
+fail:
+	return;
+}
+
+static bool test_multi_ctx(struct torture_context *tctx)
+{
+	struct test_multi_ctx state = {
+		.tctx = tctx,
+		.ok = true,
+	};
+	struct timeval tv;
+	NTSTATUS status;
+
+	lpcfg_set_cmdline(tctx->lp_ctx, "pid directory", "piddir.tmp");
+
+	/*
+	 * We use cluster_id(0, 0) as that receives
+	 * messages for all task ids.
+	 */
+	state.server_ctx = imessaging_init(tctx,
+					   tctx->lp_ctx,
+					   cluster_id(0, 0),
+					   tctx->ev);
+	torture_assert(tctx, state.server_ctx != NULL,
+		       "Failed to init messaging context");
+
+	status = imessaging_register(state.server_ctx, &state,
+				     MSG_TMP_BASE-1,
+				     multi_ctx_server_handler);
+	torture_assert(tctx, NT_STATUS_IS_OK(status), "imessaging_register failed");
+
+	state.client_ctx[0] = imessaging_init(tctx,
+					      tctx->lp_ctx,
+					      cluster_id(0, 0),
+					      tctx->ev);
+	torture_assert(tctx, state.client_ctx[0] != NULL,
+		       "msg_client_ctx imessaging_init() failed");
+	status = imessaging_register(state.client_ctx[0], &state,
+				     MSG_TMP_BASE-1,
+				     multi_ctx_client_0_1_handler);
+	torture_assert(tctx, NT_STATUS_IS_OK(status), "imessaging_register failed");
+	state.client_ctx[1] = imessaging_init(tctx,
+					      tctx->lp_ctx,
+					      cluster_id(0, 0),
+					      tctx->ev);
+	torture_assert(tctx, state.client_ctx[1] != NULL,
+		       "msg_client_ctx imessaging_init() failed");
+	status = imessaging_register(state.client_ctx[1], &state,
+				     MSG_TMP_BASE-1,
+				     multi_ctx_client_0_1_handler);
+	torture_assert(tctx, NT_STATUS_IS_OK(status), "imessaging_register failed");
+	state.client_ctx[2] = imessaging_init(tctx,
+					      tctx->lp_ctx,
+					      cluster_id(0, 0),
+					      tctx->ev);
+	torture_assert(tctx, state.client_ctx[2] != NULL,
+		       "msg_client_ctx imessaging_init() failed");
+	status = imessaging_register(state.client_ctx[2], &state,
+				     MSG_TMP_BASE-1,
+				     multi_ctx_client_2_3_handler);
+	torture_assert(tctx, NT_STATUS_IS_OK(status), "imessaging_register failed");
+	state.client_ctx[3] = imessaging_init(tctx,
+					      tctx->lp_ctx,
+					      cluster_id(0, 0),
+					      tctx->ev);
+	torture_assert(tctx, state.client_ctx[3] != NULL,
+		       "msg_client_ctx imessaging_init() failed");
+	status = imessaging_register(state.client_ctx[3], &state,
+				     MSG_TMP_BASE-1,
+				     multi_ctx_client_2_3_handler);
+	torture_assert(tctx, NT_STATUS_IS_OK(status), "imessaging_register failed");
+
+	/*
+	 * Send one message that needs to arrive at all 3 handlers (num_missing: 1 + 2 + 2 = 5).
+	 */
+	state.num_missing = 5;
+
+	status = imessaging_send(state.server_ctx,
+				 cluster_id(0, 0),
+				 MSG_TMP_BASE-1, NULL);
+	torture_assert_ntstatus_ok(tctx, status, "msg failed");
+
+	tv = timeval_current();
+	while (timeval_elapsed(&tv) < 30 && state.num_missing > 0 && state.ok) {
+		int ret;
+
+		ret = tevent_loop_once(tctx->ev);
+		torture_assert_int_equal(tctx, ret, 0, "tevent_loop_once()");
+	}
+
+	if (!state.ok) {
+		return false;
+	}
+
+	torture_assert_int_equal(tctx, state.num_missing, 0,
+				 "wrong message count");
+
+	torture_assert(tctx, state.got_client_0_1, "got_client_0_1");
+	torture_assert(tctx, state.got_client_2_3, "got_client_2_3");
+	torture_assert(tctx, state.got_server, "got_server");
+
+	return true;
+}
+
 struct torture_suite *torture_local_messaging(TALLOC_CTX *mem_ctx)
 {
 	struct torture_suite *s = torture_suite_create(mem_ctx, "messaging");
@@ -400,5 +620,6 @@ struct torture_suite *torture_local_messaging(TALLOC_CTX *mem_ctx)
 	torture_suite_add_simple_test(s, "overflow_check",
 				      test_messaging_overflow_check);
 	torture_suite_add_simple_test(s, "ping_speed", test_ping_speed);
+	torture_suite_add_simple_test(s, "multi_ctx", test_multi_ctx);
 	return s;
 }
-- 
2.17.1


From 50e2ae92c0b99f57a39a89a63d2dedccf8493ccb Mon Sep 17 00:00:00 2001
From: Stefan Metzmacher <metze at samba.org>
Date: Mon, 9 Jul 2018 12:33:34 +0200
Subject: [PATCH 4/4] s3:messages: make the loop in msg_dgm_ref_recv() more
 robust against stale pointers

The interaction between msg_dgm_ref_recv() and msg_dgm_ref_destructor()
doesn't allow two references from messaging_dgm_ref() to be freed
during the loop in msg_dgm_ref_recv().

In addition to the global 'refs' list, we also need to
have a global 'next_ref' pointer, which can be adjusted in
msg_dgm_ref_destructor().

As AD DC we hit this when using irpc in auth_winbind,
which uses imessaging_client_init().
In addition to the main messaging_dgm_ref() in smbd,
source3/auth/auth_samba4.c: prepare_gensec() and
make_auth4_context_s4() also generate a temporary
imessaging_context for auth_context->msg_ctx from within
auth_generic_prepare().

Bug: https://bugzilla.samba.org/show_bug.cgi?id=13514

Signed-off-by: Stefan Metzmacher <metze at samba.org>
---
 selftest/knownfail.d/imessaging |  1 -
 source3/lib/messages_dgm_ref.c  | 12 +++++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)
 delete mode 100644 selftest/knownfail.d/imessaging

diff --git a/selftest/knownfail.d/imessaging b/selftest/knownfail.d/imessaging
deleted file mode 100644
index af8fa8d615a6..000000000000
--- a/selftest/knownfail.d/imessaging
+++ /dev/null
@@ -1 +0,0 @@
-^samba4.local.messaging.multi_ctx
diff --git a/source3/lib/messages_dgm_ref.c b/source3/lib/messages_dgm_ref.c
index 39d22700740c..470dfbeabc74 100644
--- a/source3/lib/messages_dgm_ref.c
+++ b/source3/lib/messages_dgm_ref.c
@@ -35,6 +35,7 @@ struct msg_dgm_ref {
 
 static pid_t dgm_pid = 0;
 static struct msg_dgm_ref *refs = NULL;
+static struct msg_dgm_ref *next_ref = NULL;
 
 static int msg_dgm_ref_destructor(struct msg_dgm_ref *r);
 static void msg_dgm_ref_recv(struct tevent_context *ev,
@@ -121,16 +122,16 @@ static void msg_dgm_ref_recv(struct tevent_context *ev,
 			     const uint8_t *msg, size_t msg_len,
 			     int *fds, size_t num_fds, void *private_data)
 {
-	struct msg_dgm_ref *r, *next;
+	struct msg_dgm_ref *r;
 
 	/*
 	 * We have to broadcast incoming messages to all refs. The first ref
 	 * that grabs the fd's will get them.
 	 */
-	for (r = refs; r != NULL; r = next) {
+	for (r = refs; r != NULL; r = next_ref) {
 		bool active;
 
-		next = r->next;
+		next_ref = r->next;
 
 		active = messaging_dgm_fde_active(r->fde);
 		if (!active) {
@@ -150,6 +151,11 @@ static int msg_dgm_ref_destructor(struct msg_dgm_ref *r)
 	if (refs == NULL) {
 		abort();
 	}
+
+	if (r == next_ref) {
+		next_ref = r->next;
+	}
+
 	DLIST_REMOVE(refs, r);
 
 	TALLOC_FREE(r->fde);
-- 
2.17.1
