recursive mutexes in appl_head winbindd_cm.c?

Martin Pool mbp at
Fri Jan 10 03:40:01 GMT 2003

I'm looking at jra's change to winbindd_cmd.c in relation to
hp CR1501.

I think there are some problems with the way the mutex reference count
is handled.  I'm not sure what is the cleanest way to fix it.

The mutexes are implemented on top of fcntl locks, which cannot be
nested.  Therefore winbindd holds an in-memory reference count for
each lock.  When this increments from zero, the OS lock is taken; when
it decreases to zero the OS lock is released.  So far so good.  

jra, can you explain what the "recursion" thing in this patch is for?

Tim says the point of the mutex is to protect against an NT bug that
causes failures if more than one connection tries to authenticate at
the same time.  

In cm_open_connection:

	for (i = 0; retry && (i < NUM_CLI_AUTH_CONNECT_RETRIES); i++) {

		if (!secrets_named_mutex(new_conn->controller, WINBIND_SERVER_MUTEX_WAIT_TIME, &new_conn->mutex_ref_count)) {
			DEBUG(0,("cm_open_connection: mutex grab failed for %s\n", new_conn->controller));

		result = cli_full_connection(&new_conn->cli, global_myname_unix(), new_conn->controller, 
			     &dc_ip, 0, CLI_AUTH_TIMEOUT, "IPC$", 
			     "IPC", ipc_username, ipc_domain, 
			     ipc_password, strlen(ipc_password), &retry);
		if (NT_STATUS_IS_OK(result))

If we fail to acquire the mutex, then we continue trying a few times,
which is probably OK.  However, if we never get the mutex after three
times, then the loop terminates and we proceed on through the function
with 'result' uninitialized, which would cause trouble.

In another case, suppose that our first attempt to call
cli_full_connection() fails.  (I think this is the case I'm seeing --
because of something to do with "restrict anonymous", we can't get in
to the PDC.)  We therefore end up with 3 acquisitions of the mutex,
and one of them is released when we exit the function, so the fctnl
lock is never freed, which presumably causes trouble with other things
later -- we have leaked two mutex reservations.

One way to cope better would be for the function to fail if it doesn't
get the mutex after the timeout.  However, since the mutex is only a
safeguard against an NT bug, we might be better off taking our chances
and proceeding anyhow -- this is what the code does at the moment.
However, it still tries to release the mutex even though it was not
actually acquired.  This causes panic()s in secrets.c.

cm_open_connection and get_connection_from_cache in appliance_head
both have a keep_mutex flag that is used by cm_get_netlogon_cli to
hold onto the mutex for a longer period so that it can also guard the
NetLogon phase.  There seem to be two problems with this.

If the connection is returned from cache, then the mutex count is in
fact not acquired, and it is incorrect for cm_get_netlogon_cli() to
release it:

	if (conn->mutex_ref_count)
		secrets_named_mutex_release(conn->controller, &conn->mutex_ref_count);

Examining the refcount seems to me not to be a safe protection against
this: it might already be >1 because some other caller has acquired
it, but that doesn't mean we have the right to release it.  We're
giving up somebody else's lock.  This happens in a couple of places.

Also, as noted above, sometimes cm_open_connection() completes without
acquiring the mutex, but cm_get_netlogon_cli() assumes that it's
always taken.   


More information about the samba-technical mailing list