[linux-cifs-client] [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

Thu Jul 23 11:25:30 MDT 2009

On Thu, 23 Jul 2009 12:00:25 -0500
Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:

> On Thu, Jul 23, 2009 at 10:34 AM, Jeff Layton<jlayton at redhat.com> wrote:
> > On Thu, 23 Jul 2009 09:51:32 -0500
> > Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
> >
> >> On Thu, Jul 23, 2009 at 7:05 AM, Jeff Layton<jlayton at redhat.com> wrote:
> >> > On Wed, 22 Jul 2009 20:14:38 -0500
> >> > Shirish Pargaonkar <shirishpargaonkar at gmail.com> wrote:
> >> >
> >> >> In spite of a set of data integrity patches in cifs last year,
> >> >> errors still persist.
> >> >> They are caused by timeouts that result in incomplete data being
> >> >> sent, and hence in data integrity errors.
> >> >>
> >> >> The proposed socket send timeout is large enough to eliminate that possibility.
> >> >
> >> > On what evidence do you base the above statement? Who's to say that 30s
> >> > is long enough if someone has a sufficiently high-latency connection?
> >> >
> >> >> Tests with this patch eliminated data integrity errors over an 80-hour
> >> >> test run; without it, they show up within a matter of hours.
> >> >>
> >> >
> >> > Also, can you give some details about these data integrity errors? Were
> >> > writes failing? If so, were they not reported at fsync or close?
> >>
> >> The cifs client logged errors like the ones below. This is what I
> >> saw last year when the patches were developed.
> >> The entire write could not be sent because of the socket timeout; another
> >> thread's data then fills in the rest of the 56K write, so the second 56K
> >> write gets no response and the client logs 'No response for cmd'.
> >> The longer timeout seems to be long enough for the server to receive the
> >> entire smbwrite (56K).
> >>
> >> May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
> >> May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 20646
> >> May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 47 mid 20647
> >> May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >> May 12 05:17:11 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -9, wrote 0
> >> May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
> >> May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 21347
> >> May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 47 mid 21348
> >> May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
> >> May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -9, wrote 0
> >> May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
> >> May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 46 mid 24859
> >> May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: Send error in read = -11
> >> May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 24858
> >>
> >>
> >
> > It sounds like the original bug was never fixed then, only made less
> > likely by changing the timing. This patch looks like it just does the
> > same thing.
> 
> The first step was to change the socket from non-blocking to blocking
> to prevent interleaved sends.
> A longer send timeout gives the send enough time to complete instead
> of returning prematurely.
> 
> I cannot think of a way to abort a partially sent request to the server, and
> I do not know whether it is possible to be sure that the entire 56K buffer is
> available before dispatching a send on a (test-induced) stressed socket.
> 

I think we already discussed this several months ago and agreed that the
right thing to do is to detect when a partial send has occurred and to
reconnect the socket when it does. I can dig up the discussion again,
but you probably remember it...

The question I have is -- why didn't that happen here? That should have
prevented these interleaved sends...right?

Increasing the send timeout will have other effects too that you're not
accounting for here. You're increasing the total send timeout from 15s
to 90s (since Steve wanted to keep this loop in smb_sendv instead
of just letting the socket layer handle it). That potentially changes
the overall timeout for SMB calls.
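
As a rough sketch of that arithmetic: the 15s and 90s totals are from
above and 30s is the proposed per-send timeout. The retry count of 3
and the current per-attempt value of 5s are assumptions inferred from
those numbers, not read out of the code, and the names below are
hypothetical. Sleeps between attempts are ignored.

```c
#include <assert.h>

enum {
	SEND_ATTEMPTS     = 3,	/* assumed retries in the send loop */
	OLD_SNDTIMEO_SECS = 5,	/* assumed current per-attempt timeout */
	NEW_SNDTIMEO_SECS = 30,	/* proposed per-attempt timeout */
};

/* Worst case: every attempt blocks for the full socket send timeout. */
static unsigned int worst_case_send_secs(unsigned int attempts,
					 unsigned int per_attempt_secs)
{
	return attempts * per_attempt_secs;
}

/* Compile-time check that the assumed values match the totals above. */
static_assert(SEND_ATTEMPTS * OLD_SNDTIMEO_SECS == 15, "current total");
static_assert(SEND_ATTEMPTS * NEW_SNDTIMEO_SECS == 90, "proposed total");
```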

I'm very leery of increasing the send timeout and hoping for the best.
Since the consequences of getting this wrong are data corruption, we
need a real fix or a detailed explanation of how this is guaranteed to
prevent the problem in the future.

-- 
Jeff Layton <jlayton at redhat.com>