[EXTERNAL] Re: Need tips on debugging assert_no_pending_aio() cores

Ashok Ramakrishnan aramakrishnan at nasuni.com
Thu Sep 24 18:54:38 UTC 2020

Thanks, Jeremy, for the tip. We are able to reproduce the issue after a few hours of IO. I re-read the comments and the code and have one follow-up question.

Is it possible for the talloc_realloc() in aio_add_req_to_fsp() and aio_del_req_from_fsp() to race? The array may be copied to a new location each time its size is increased, which happens 10 entries at a time.

I am adding some instrumentation to the code to see whether we are running into this situation. We seem to end up in a state where fsp->num_aio_requests == 1 while fsp->aio_requests has already been freed (because all the outstanding aio requests have been destroyed).
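
A minimal sketch of the kind of consistency check such instrumentation could run after every add and delete. This is a standalone model, not Samba code: struct model_fsp, model_fsp_consistent() and the array_len field are hypothetical stand-ins, and the allocated length is tracked in an explicit field here, whereas the real code would presumably ask talloc for the array length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fsp fields discussed in this thread. */
struct model_fsp {
	void **aio_requests;       /* array of outstanding request pointers */
	size_t array_len;          /* allocated slots (assumption: tracked explicitly) */
	unsigned num_aio_requests; /* slots currently in use */
};

/*
 * Return true if the counter and the array agree:
 *  - a freed (NULL) array implies a zero counter,
 *  - the counter never exceeds the allocated length,
 *  - every slot below the counter holds a non-NULL request.
 */
static bool model_fsp_consistent(const struct model_fsp *fsp)
{
	unsigned i;

	assert(fsp != NULL);

	if (fsp->aio_requests == NULL) {
		return fsp->num_aio_requests == 0;
	}
	if (fsp->num_aio_requests > fsp->array_len) {
		return false;
	}
	for (i = 0; i < fsp->num_aio_requests; i++) {
		if (fsp->aio_requests[i] == NULL) {
			return false;
		}
	}
	return true;
}
```

The state described above (counter at 1, array already freed) fails the first test, so logging the caller whenever this returns false should point at the operation that left the two out of step.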

-----Original Message-----
From: Jeremy Allison <jra at samba.org>
Sent: Thursday, September 24, 2020 12:05 PM
To: Ashok Ramakrishnan <aramakrishnan at nasuni.com>
Cc: samba-technical at lists.samba.org
Subject: [EXTERNAL] Re: Need tips on debugging assert_no_pending_aio() cores

On Thu, Sep 24, 2020 at 02:44:53PM +0000, Ashok Ramakrishnan via samba-technical wrote:
> Hi:
> We use Samba on top of our user-space (FUSE) file system. We recently updated to Samba 4.12.6 (still in pre-release testing internally) and we are running into these smbd cores after very heavy IO load. Looking at the core, I see what seems to be a race (or a mismatch) between the num_aio_requests accounting and the actual requests linked to the fsp structure (fsp->aio_requests). Since we are on 4.12.6, we already have the fixes for https://bugzilla.samba.org/show_bug.cgi?id=14301.
>
> My question is: how do I debug this issue further? Is it just code inspection and adding additional debug logging, or is there a better way?
>
> Also, I could use some help understanding this code block in aio_del_req_from_fsp():
>         if (i == fsp->num_aio_requests) {
>                 DEBUG(1, ("req %p not found in fsp %p\n", req, fsp));
>                 return 0;
>         }
> Why is it OK to not find an aio request attached to the fsp while destructing it? Is there a valid use case where this is expected to happen? I am not sure we are running into the above code block; I plan to set log level 1 to see if that is the case. I just noticed this during code inspection and am trying to understand the logic there.

That's the destructor for the lnk struct, created as a talloc child of the outstanding tevent_req.

The fsp->aio_requests[index] can be deleted in a SHUTDOWN_CLOSE independently of the lnk struct, so the lnk struct needs to allow the associated fsp->aio_requests[] value to have been freed.
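
As a rough illustration of why the "not found" branch has to be harmless, here is a standalone model using plain realloc()/free() rather than talloc. All names (struct model_fsp, model_add_req(), model_del_req(), model_drop_all()) are hypothetical, and model_drop_all() is only an assumed stand-in for a close path that releases the tracking before the link destructor runs; how the real close path does this is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fsp fields discussed in this thread. */
struct model_fsp {
	void **aio_requests;       /* array of outstanding request pointers */
	size_t array_len;          /* allocated slots */
	unsigned num_aio_requests; /* slots currently in use */
};

/* Append req, growing the array 10 slots at a time. */
static bool model_add_req(struct model_fsp *fsp, void *req)
{
	assert(req != NULL);
	if (fsp->array_len <= fsp->num_aio_requests) {
		size_t new_len = fsp->array_len + 10;
		/* realloc may move the array; the stored pointers are preserved. */
		void **tmp = realloc(fsp->aio_requests, new_len * sizeof(*tmp));
		if (tmp == NULL) {
			return false;
		}
		fsp->aio_requests = tmp;
		fsp->array_len = new_len;
	}
	fsp->aio_requests[fsp->num_aio_requests] = req;
	fsp->num_aio_requests += 1;
	return true;
}

/* Assumed close path: drops all tracking independently of the links. */
static void model_drop_all(struct model_fsp *fsp)
{
	free(fsp->aio_requests);
	fsp->aio_requests = NULL;
	fsp->array_len = 0;
	fsp->num_aio_requests = 0;
}

/* Destructor-style removal: true if removed, false if already gone. */
static bool model_del_req(struct model_fsp *fsp, void *req)
{
	unsigned i;

	for (i = 0; i < fsp->num_aio_requests; i++) {
		if (fsp->aio_requests[i] == req) {
			break;
		}
	}
	if (i == fsp->num_aio_requests) {
		return false; /* entry was released elsewhere: nothing to undo */
	}
	/* Fill the hole with the last entry, then shrink the count. */
	fsp->num_aio_requests -= 1;
	fsp->aio_requests[i] = fsp->aio_requests[fsp->num_aio_requests];
	if (fsp->num_aio_requests == 0) {
		free(fsp->aio_requests);
		fsp->aio_requests = NULL;
		fsp->array_len = 0;
	}
	return true;
}
```

In this model, calling model_del_req() after model_drop_all() finds nothing and leaves the counter at zero, which is the case the early return is there to tolerate.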

Check the code and comment in:


for details.

I wrote much of this logic, so I can help you track this down if you can reproduce it.