[linux-cifs-client] error while writing large file to Windows share

Sat Oct 27 12:33:20 GMT 2007

On Fri, 26 Oct 2007 16:10:28 -0700
Noah Romer <nromer at arcmailtech.com> wrote:

> Ok, here's the setup. We have a backup process running on a Fedora
> Core 6 box (2.6.18-1.2257smp kernel, samba-3.0.23c-1.fc5 userland
> code) that writes several files to a temporary directory on a Windows
> share and then tars them all up into a single tarball, also on the
> Windows share. When it gets about 63GB-76GB into writing the tarball,
> we get an error that causes it to be incomplete/corrupt.
> 
> This shows up on the console as a tar error:
> tar: var.tar: File shrank by 17045487616 bytes; padding with zeros
> tar: Error exit delayed from previous errors
> 
> The tar error corresponds with a series of error messages
> in /var/log/messages:
> 
> Oct 26 11:42:50 [system name] kernel:  CIFS VFS: server not responding
> Oct 26 11:42:50 [system name] kernel:  CIFS VFS: server not responding
> Oct 26 11:42:50 [system name] kernel:  CIFS VFS: No response to cmd 47 mid 125 93
> Oct 26 11:42:50 [system name] kernel:  CIFS VFS: No response to cmd 47 mid 140 04
> Oct 26 11:42:53 [system name] kernel:  CIFS VFS: No response to cmd 47 mid 170 26
> Oct 26 11:42:53 [system name] kernel:  CIFS VFS: Write2 ret -11, written = 0
> Oct 26 11:42:53 [system name] kernel:  CIFS VFS: Write2 ret -9, written = 0
> 

-11 == -EAGAIN

Probably because the server isn't responding in a timely fashion.

-9 == -EBADF

This error seems to be pretty common on older kernels with an
unresponsive server. I think it has something to do with the socket
being closed and reconnected, but I'm not sure.
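
For anyone decoding these by hand: the numbers in the CIFS messages are
negated errno values. A minimal sketch for looking them up follows. The
helper name decode_kernel_ret is made up for illustration, and the
lookup uses the errno table of whatever machine runs the script, so it
only matches the kernel's values when run on Linux.

```python
import errno
import os


def decode_kernel_ret(ret):
    """Return (symbolic name, description) for a negated errno value.

    Hypothetical helper: the kernel logs e.g. "ret -9", so we take the
    absolute value before looking it up in the local errno table.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Decode the two Write2 return codes discussed above.
for ret in (-errno.EAGAIN, -errno.EBADF):
    name, text = decode_kernel_ret(ret)
    print(f"{ret} == -{name} ({text})")
```

The same call works for any other "ret" or "error = " value that shows
up in the log.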

> [other log messages snipped]
> 
> Oct 26 14:28:00 [system name] kernel:  CIFS VFS: Send error in read = -12
> Oct 26 14:28:39 [system name] last message repeated 2 times
> Oct 26 14:31:04 [system name] kernel:  CIFS VFS: Send error in read = -12
> Oct 26 14:33:32 [system name] kernel:  CIFS VFS: Send error in read = -12
> Oct 26 14:34:53 [system name] last message repeated 4 times
> Oct 26 14:36:30 [system name] last message repeated 2 times
> Oct 26 14:39:29 [system name] last message repeated 4 times
> Oct 26 14:40:47 [system name] last message repeated 3 times
> Oct 26 14:40:47 [system name] last message repeated 2 times
> 

Yikes -- -12 is -ENOMEM. Sounds like you're low on kernel memory or
it's heavily fragmented. This error has possibly been propagated up
from lower in the networking stack.
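
One way to get a rough picture of fragmentation is /proc/buddyinfo,
which lists the number of free blocks per allocation order for each
memory zone. The sketch below parses that format and counts the free
pages available in blocks of a given order or larger. The function
names are made up for illustration, the sample line uses invented
numbers, and whether this explains the -ENOMEM seen here is only a
guess.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal 12 8 4 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_pages_at_or_above(counts, order):
    """Count free pages held in blocks of at least 2**order pages."""
    return sum(count << o for o, count in enumerate(counts) if o >= order)


# Illustrative sample only; on a live box use open("/proc/buddyinfo").read().
sample = "Node 0, zone   Normal    100     50     10      2      0      0\n"
zones = parse_buddyinfo(sample)
counts = zones[(0, "Normal")]
print("free pages in order >= 2 blocks:", free_pages_at_or_above(counts, 2))
```

If the higher-order columns are at or near zero while the low-order
ones are large, memory is free but fragmented, so larger contiguous
allocations may fail.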

> Because tar doesn't report its error message immediately I can't
> be sure which of the cifs error sequences corresponds directly to it.
> 
> Any ideas on what's causing this, or suggestions for further
> debugging? This is occurring on a customer's network, so I can't get
> all ham-fisted but I'm willing to dig for further info if it will
> help.
> 
> Thanks,
> Noah Romer

I'd highly recommend updating the kernel and seeing if you're still
having problems. 2.6.18-ish kernels are pretty ancient by now, and with
Fedora we don't generally backport fixes.

There's also a very recent patch that went in to fix a long-standing
memory corruption bug with CIFS when kernel_recvmsg returns -EAGAIN.
You'll most definitely want that as well, but I don't think that's made
it into any Fedora kernels yet -- I'll see if I can get that included
in the next set of builds...

-- 
Jeff Layton <jlayton at redhat.com>