Socket draining for sys_recvfile() (patch for 'reverse sendfile')

Andrew Bartlett abartlet at
Thu Dec 22 14:14:17 MST 2011

On Thu, 2011-12-22 at 11:12 -0800, Jeremy Allison wrote:
> On Thu, Dec 22, 2011 at 11:09:20PM +1100, Andrew Bartlett wrote:
> > Volker and Jeremy,
> > 
> > I've been working on the attached patch to implement sys_recvfile() for
> > a kernel that has been patched to allow sendfile from a socket to a
> > file.  (Patch for 3.5.11). 
> > 
> > I notice in the git history that this seems to be your area, and that
> > you have had trouble with recvfile in practice, with concerns about
> > blocking and the current code being effectively disabled in master.  
> The only problems we've had here are with the splice() implementation
> in Linux being unsuitable for this purpose initially, or later being
> slower than standard read/pwrite (hence the patched kernel you're
> working with :-).
> > As such, I'm hoping you might be able to help me out in understanding
> > this area.  
> I'll try !
> > In particular, what is the purpose of the 'socket draining' code?  I'm a
> > little confused, because the comments at the callers
> > (source3/smbd/vfs.c) indicate it is expected, but it is only explicitly
> > in the splice() code path in source3/lib/recvfile.c and not in the
> > read/write 'fallback' (which is actually what we always use). 
> Ok, when recvfile() returns we *must* have drained the incoming
> socket of the remainder of the SMB packet, else we'll lose packet
> boundary coherence.
> The socket draining code was added as it is (theoretically?) possible
> for the splice() code path to return a short read, leaving data
> in the socket buffers (the aforementioned packet boundary coherence
> problem) so we need to drain the socket in this case.
> It can't happen in the "normal" case as the loop takes care of
> this inline - notice the section:
>                                 if (write_ret <= 0) {
>                                         /* write error - stop writing. */
>                                         tofd = -1;
>                                         saved_errno = errno;
>                                         continue;
>                                 }
> which sets "tofd = -1" - and enables us to still do the read()
> call (draining the socket) but not the write call (as we know
> it will fail - disk full or EIO error maybe?).
> That's the "normal" case equivalent to the drain socket code.
> > Finally, is there anything else should I know or be looking out for?
> Buggy kernel patches mainly :-). That recvfile() code has been
> used and shipped in production product (Isilon with modified *BSD
> kernel) so I think it's pretty safe (all IMHO of course :-).
> > (As a hint to others, I've found that Samba4 client libs underpinning
> > smbtorture do not trigger the recvfile path, but smbclient3 does.  This
> > is apparently a well known behaviour difference of one padding/alignment
> > byte in the WriteX packet)
> Don't know about well known :-), but the recvfile() code was
> written explicitly to work with Windows clients that do that
> one byte alignment (although not mandated by the protocol) so
> the s3 client code was written to do the same thing (to ensure
> we can benefit from recvfile).

I think what happened here is the Samba4 client libs didn't get updated,
so smbtorture bench.nbench does not trigger the faster codepath.  (I
ended up moving to a simple smbclient3 'put' to run the code).

> Good luck with it !

With those clues, I think I can sort this out pretty well.  

> FYI: I'm on Holiday from now until Jan 3rd so email response
> times can be a little sketchy !
> Happy Holidays !

Thanks, and I do hope you have an enjoyable break yourself. 

Andrew Bartlett

Andrew Bartlett                      
Authentication Developer, Samba Team 

More information about the samba-technical mailing list