tfork() test hanging in socket_wrapper lock

Ralph Böhme slow at samba.org
Tue Nov 20 15:26:52 UTC 2018


On Mon, Nov 19, 2018 at 07:12:17PM +0100, Andreas Schneider via samba-technical wrote:
>On Monday, 19 November 2018 19:01:18 CET Andrew Bartlett wrote:
>> On Mon, 2018-11-19 at 11:54 +0100, Andreas Schneider wrote:
>> > On Monday, 19 November 2018 03:59:35 CET Andrew Bartlett wrote:
>> > > Today on sn-devel my autobuild hung running the tfork test:
>> > >
>> > > It appears to be waiting forever for a mutex lock.
>> >
>> > Anoop and myself tried to reproduce it, but we couldn't. However helgrind
>> > reports several errors in tfork itself.
>>
>> I should have mentioned, I've left them stuck in case they can be
>> debugged.
>
>I think Ralph already nailed it. It is an issue with SIGCHLD handling in
>tfork.

in fact it's not. :) Sorry for the red herring, but it took some time to swap 
the tfork details back into main memory.

Taking a closer look at the data races reported by helgrind, I came to the 
conclusion that they are all false positives.

The reported data races are in code that is globally serialized by 
tfork_acquire_sighandling() and tfork_release_sighandling(), ensuring there's 
always only one writer to the reported objects and all read accesses are 
sequenced after the write because there are a bunch of syscalls in between which 
serve as implicit memory barrier.

I've added valgrind annotations to tfork.c to suppress the more subtle false 
positives. A few other false positives can't really be marked in the code, I've 
therefor added suppressions files for helgrind and dbd.

metze, can you please take a close look at the attached (WIPish) patchset? 
Thanks!

-slow

-- 
Ralph Boehme, Samba Team       https://samba.org/
Samba Developer, SerNet GmbH   https://sernet.de/en/samba/
GPG Key Fingerprint:           FAE2 C608 8A24 2520 51C5
                               59E4 AA1E 9B71 2639 9E46
-------------- next part --------------
From 617f55649e441e1863f08a0bd0219e2ac4c18598 Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Mon, 19 Nov 2018 16:47:33 +0100
Subject: [PATCH 1/5] tfork/test: ensure all threads start with SIGCHLD
 unblocked

---
 lib/util/tests/tfork.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/lib/util/tests/tfork.c b/lib/util/tests/tfork.c
index 3c73355b3f0..a74f7e8d7e2 100644
--- a/lib/util/tests/tfork.c
+++ b/lib/util/tests/tfork.c
@@ -470,12 +470,29 @@ static bool test_tfork_threads(struct torture_context *tctx)
 	bool ok = true;
 	const int num_threads = 64;
 	pthread_t threads[num_threads];
+	sigset_t set;
 	int i;
 
 #ifndef HAVE_PTHREAD
 	torture_skip(tctx, "no pthread support\n");
 #endif
 
+	/*
+	 * Be nasty and taste for the worst case: ensure all threads start with
+	 * SIGCHLD unblocked so we have the most fun with SIGCHLD being
+	 * delivered to a random thread. :)
+	 */
+	sigemptyset(&set);
+	sigaddset(&set, SIGCHLD);
+#ifdef HAVE_PTHREAD
+	ret = pthread_sigmask(SIG_UNBLOCK, &set, NULL);
+#else
+	ret = sigprocmask(SIG_UNBLOCK, &set, NULL);
+#endif
+	if (ret != 0) {
+		return -1;
+	}
+
 	for (i = 0; i < num_threads; i++) {
 		ret = pthread_create(&threads[i], NULL, tfork_thread, NULL);
 		torture_assert_goto(tctx, ret == 0, ok, done,
-- 
2.17.2


From 713bd933de69d28714bc2cc31c7002b9126524b9 Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Mon, 19 Nov 2018 23:07:55 +0100
Subject: [PATCH 2/5] tfork: TFORK_ANNOTATE_BENIGN_RACE

---
 lib/util/tfork.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/lib/util/tfork.c b/lib/util/tfork.c
index 8ed5811c536..4a5c08f7d79 100644
--- a/lib/util/tfork.c
+++ b/lib/util/tfork.c
@@ -58,6 +58,18 @@
  *   +----------+
  */
 
+#ifdef HAVE_VALGRIND_HELGRIND_H
+#include <valgrind/helgrind.h>
+#endif
+#ifndef ANNOTATE_BENIGN_RACE_SIZED
+#define ANNOTATE_BENIGN_RACE_SIZED(obj, size, description)
+#endif
+
+#define TFORK_ANNOTATE_BENIGN_RACE(obj)					\
+	ANNOTATE_BENIGN_RACE_SIZED(					\
+		(obj), sizeof(*(obj)),					\
+		"no race, serialized by tfork_[un]install_sigchld_handler");
+
 /*
  * The resulting (private) state per tfork_create() call, returned as a opaque
  * handle to the caller.
@@ -221,6 +233,14 @@ static void tfork_atfork_child(void)
 	ret = pthread_key_create(&tfork_global_key, tfork_global_destructor);
 	assert(ret == 0);
 
+	/*
+	 * There's no data race on the cond variable from the signal state, we
+	 * are writing here, but there are no readers yet. Some data race
+	 * detection tools report a race, but the readers are in the parent
+	 * process.
+	 */
+	TFORK_ANNOTATE_BENIGN_RACE(&signal_state.cond);
+
 	/*
 	 * There's no way to destroy a condition variable if there are waiters,
 	 * pthread_cond_destroy() will return EBUSY. Just zero out memory and
@@ -266,6 +286,13 @@ static void tfork_global_initialize(void)
 
 	ret = pthread_cond_init(&signal_state.cond, NULL);
 	assert(ret == 0);
+
+	/*
+	 * In a threaded process there's no data race on t->waiter_pid as
+	 * we're serializing globally via tfork_acquire_sighandling() and
+	 * tfork_release_sighandling().
+	 */
+	TFORK_ANNOTATE_BENIGN_RACE(&signal_state.pid);
 #endif
 
 	signal_state.available = true;
@@ -496,6 +523,13 @@ static pid_t tfork_start_waiter_and_worker(struct tfork_state *state,
 	if (pid != 0) {
 		/* The caller */
 
+		/*
+		 * In a threaded process there's no data race on
+		 * state->waiter_pid as we're serializing globally via
+		 * tfork_acquire_sighandling() and tfork_release_sighandling().
+		 */
+		TFORK_ANNOTATE_BENIGN_RACE(&state->waiter_pid);
+
 		state->waiter_pid = pid;
 
 		close(status_sp_waiter_fd);
@@ -757,6 +791,13 @@ struct tfork *tfork_create(void)
 		return t;
 	}
 
+	/*
+	 * In a threaded process there's no data race on t->waiter_pid as
+	 * we're serializing globally via tfork_acquire_sighandling() and
+	 * tfork_release_sighandling().
+	 */
+	TFORK_ANNOTATE_BENIGN_RACE(&t->waiter_pid);
+
 	t->waiter_pid = pid;
 	t->worker_pid = state->worker_pid;
 
-- 
2.17.2


From 080bee69c0b875a7d1bd77ee7b8ecd70b0f580cf Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Mon, 19 Nov 2018 15:18:34 +0100
Subject: [PATCH 3/5] tfork: add a suppresssions file for helgrind

tfork_atexit_unknown[1|2]:

  No idea what triggers this, definitely not tfork itself.

tfork_pthread_get_specific:

 Helgrind reports:

 Possible data race during read of size 4 at 0x5141304 by thread #3
 Locks held: none
    at 0x50E602E: tfork_global_get (tfork.c:301)
    by 0x50E69B1: tfork_create (tfork.c:737)
    by 0x2F7419: tfork_thread (tfork.c:431)
    by 0x4C35AC5: mythread_wrapper (hg_intercepts.c:389)
    by 0x8D38593: start_thread (pthread_create.c:463)
    by 0x9A70E6E: clone (clone.S:95)

 This conflicts with a previous write of size 4 by thread #2
 Locks held: none
    at 0x8D3F7B7: pthread_key_create (pthread_key_create.c:41)
    by 0x50E5F79: tfork_global_initialize (tfork.c:280)
    by 0x8D3FEA6: __pthread_once_slow (pthread_once.c:116)
    by 0x50E6999: tfork_create (tfork.c:728)
    by 0x2F7419: tfork_thread (tfork.c:431)
    by 0x4C35AC5: mythread_wrapper (hg_intercepts.c:389)
    by 0x8D38593: start_thread (pthread_create.c:463)
    by 0x9A70E6E: clone (clone.S:95)
  Location 0x5141304 is 0 bytes inside global var "tfork_global_key"
  declared at tfork.c:122

  This is nonsense, tfork_global_get() calls pthread_getspecific, so
  we're looking at the pthread_key_create()/pthread_[g|s]etspecific()
  API here which works with threads by design.
---
 lib/util/tests/tfork-helgrind.supp | 32 ++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 lib/util/tests/tfork-helgrind.supp

diff --git a/lib/util/tests/tfork-helgrind.supp b/lib/util/tests/tfork-helgrind.supp
new file mode 100644
index 00000000000..4b62b2a1a93
--- /dev/null
+++ b/lib/util/tests/tfork-helgrind.supp
@@ -0,0 +1,32 @@
+{
+   tfork_atexit_unknown1
+   Helgrind:Misc
+   fun:mutex_destroy_WRK
+   fun:pthread_mutex_destroy
+   obj:/usr/lib64/libp11-kit.so.0.3.0
+   fun:_dl_fini
+   fun:__run_exit_handlers
+   fun:exit
+   fun:(below main)
+}
+
+{
+   tfork_atexit_unknown2
+   Helgrind:Misc
+   fun:mutex_destroy_WRK
+   fun:pthread_mutex_destroy
+   fun:_dl_fini
+   fun:__run_exit_handlers
+   fun:exit
+   fun:(below main)
+}
+{
+   tfork_pthread_get_specific
+   Helgrind:Race
+   fun:tfork_global_get
+   fun:tfork_create
+   fun:tfork_thread
+   fun:mythread_wrapper
+   fun:start_thread
+   fun:clone
+}
-- 
2.17.2


From 8a2abc9cd4c240a0d7757e7ad89fe7ca27f60030 Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Tue, 20 Nov 2018 16:03:03 +0100
Subject: [PATCH 4/5] tfork: add a suppresssions file for drd

drd reports:

 initialized twice: cond 0x514f188
    at 0x4C3A399: pthread_cond_init_intercept (drd_pthread_intercepts.c:1022)
    by 0x4C3A399: pthread_cond_init@* (drd_pthread_intercepts.c:1030)
    by 0x50F3FF3: tfork_atfork_child (tfork.c:250)
    by 0x9A4B95D: fork (fork.c:204)
    by 0x50F4834: tfork_start_waiter_and_worker (tfork.c:581)
    by 0x50F4CDB: tfork_create (tfork.c:780)
    by 0x2F7469: tfork_thread (tfork.c:431)
    by 0x4C358F8: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
    by 0x8D46593: start_thread (pthread_create.c:463)
    by 0x9A7EE6E: clone (clone.S:95)
 cond 0x514f188 was first observed at:
    at 0x4C3A399: pthread_cond_init_intercept (drd_pthread_intercepts.c:1022)
    by 0x4C3A399: pthread_cond_init@* (drd_pthread_intercepts.c:1030)
    by 0x50F413A: tfork_global_initialize (tfork.c:287)
    by 0x8D4DEA6: __pthread_once_slow (pthread_once.c:116)
    by 0x4C377FD: pthread_once_intercept (drd_pthread_intercepts.c:800)
    by 0x4C377FD: pthread_once (drd_pthread_intercepts.c:806)
    by 0x50F4C0E: tfork_create (tfork.c:743)
    by 0x2F7469: tfork_thread (tfork.c:431)
    by 0x4C358F8: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
    by 0x8D46593: start_thread (pthread_create.c:463)
    by 0x9A7EE6E: clone (clone.S:95)

This is intentional, the reinit is in a child process. Cf the comment in
tfork.c.
---
 lib/util/tests/tfork-drd.supp | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
 create mode 100644 lib/util/tests/tfork-drd.supp

diff --git a/lib/util/tests/tfork-drd.supp b/lib/util/tests/tfork-drd.supp
new file mode 100644
index 00000000000..7d0544b6b43
--- /dev/null
+++ b/lib/util/tests/tfork-drd.supp
@@ -0,0 +1,14 @@
+{
+   tfork_pthread_cond_init
+   drd:CondErr
+   fun:pthread_cond_init_intercept
+   fun:pthread_cond_init@*
+   fun:tfork_atfork_child
+   fun:fork
+   fun:tfork_start_waiter_and_worker
+   fun:tfork_create
+   fun:tfork_thread
+   fun:vgDrd_thread_wrapper
+   fun:start_thread
+   fun:clone
+}
-- 
2.17.2


From a3c2eac7637a5275069952e2750a058df045e93e Mon Sep 17 00:00:00 2001
From: Ralph Boehme <slow at samba.org>
Date: Tue, 20 Nov 2018 15:50:52 +0100
Subject: [PATCH 5/5] tfork: add a README how to run test torture test under
 valgrind

---
 lib/util/tests/README | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 lib/util/tests/README

diff --git a/lib/util/tests/README b/lib/util/tests/README
new file mode 100644
index 00000000000..c1337d5cf58
--- /dev/null
+++ b/lib/util/tests/README
@@ -0,0 +1,22 @@
+tfork tests
+===========
+
+To run the tfork torture testsuite under valgrind with the helgrind or drd
+thread checkers, run valgrind with the --suppress option passing a suppressions
+file.
+
+For helgrind:
+
+$ valgrind \
+  --trace-children=yes \
+  --tool=helgrind \
+  --suppressions=lib/util/tests/tfork-helgrind.supp \
+  ./bin/smbtorture ncalrpc:localhost local.tfork.tfork_threads
+
+For drd:
+
+$ valgrind \
+  --trace-children=yes \
+  --tool=drd \
+  --suppressions=lib/util/tests/tfork-drd.supp \
+  ./bin/smbtorture ncalrpc:localhost local.tfork.tfork_threads
-- 
2.17.2



More information about the samba-technical mailing list