Data corruption problem

Wayne Walker wwalker at solid-constructs.com
Thu Feb 10 22:34:03 MST 2011


Thank you for all the work you guys do!

First, I'm not certain whether this is samba, the Linux cifs driver, or
something else; I've done as much as I can to narrow it down.  Here is
what I've done so far, with a few questions at the bottom about how to
gather more data.

During testing, one of my QA guys was running an in-house program that
generates pseudo-random, but fully reproducible, data and writes it to
a file.  The file name is essentially the seed of the pseudo-random
stream, so, given a filename, the program can read the file back and
verify that the data is correct.
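
For context, here is a rough sketch of what the test app does.  This is
not our actual code; the naming scheme, seed handling, and 1 MB write
size are just illustrative:

    import os
    import random

    CHUNK = 1024 * 1024  # illustrative write size; the real app uses its own

    def _seed_from_name(path):
        # Assume the file is simply named with the decimal seed; the real
        # naming scheme differs, but the idea is the same: the name alone
        # is enough to regenerate the data.
        return int(os.path.basename(path))

    def write_file(path, size):
        rng = random.Random(_seed_from_name(path))
        with open(path, 'wb') as f:
            remaining = size
            while remaining > 0:
                n = min(CHUNK, remaining)
                f.write(bytes(rng.getrandbits(8) for _ in range(n)))
                remaining -= n

    def verify_file(path):
        # Regenerate the identical pseudo-random stream and compare it
        # against what actually came back from the share.
        rng = random.Random(_seed_from_name(path))
        with open(path, 'rb') as f:
            while True:
                data = f.read(CHUNK)
                if not data:
                    return True
                expected = bytes(rng.getrandbits(8) for _ in range(len(data)))
                if data != expected:
                    return False

The point is just that any byte that differs from the regenerated
stream flags the file as bad.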

The file he created was on a CentOS 5.5 machine mounting a cifs share
from another CentOS 5.5 host running samba.  After 150K individual
files ranging from 35 bytes to 9 GB, he created a 9 GB file that failed
validation.  He ran the test again with the same seed and it succeeded;
a third run failed again.

He got me involved.  I found no useful messages (cifs, I/O, kernel
memory, network, or samba) in any log on either the client or the
server anywhere near the times the files were created.

I cmp'd the files, then used "od -A x -t a" and diffed the three
dumps.  Each of the two failed files contains a single 56K (57344-byte)
block of NULs, and the two files have them at different offsets.  Each
56K NUL block starts at an offset x where x % 57344 == 0.

first file:
>>> 519995392 / 57344.
9068.0 # matching 56K blocks before the one null 56K block
# followed by the rest of the 9 GB file, all matching

second file:
>>> 7910088704/57344.
137941.0 # matching 56K blocks before the one null 56K block
# followed by the rest of the 9 GB file, all matching
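
To automate what I did by hand with od and diff, something like this
(a quick sketch, assuming the pattern is always a whole 57344-byte
block of NULs on a 57344-byte boundary) finds the corrupt regions:

    import sys

    BLOCK = 57344  # 56K -- the size and alignment of the corrupt regions

    def find_nul_blocks(path):
        # Walk the file in BLOCK-sized, BLOCK-aligned steps and report
        # every block that is entirely NUL bytes.
        zero = b'\x00' * BLOCK
        hits = []
        with open(path, 'rb') as f:
            offset = 0
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                if chunk == zero:
                    hits.append(offset)
                offset += len(chunk)
        return hits

    if __name__ == '__main__':
        for off in find_nul_blocks(sys.argv[1]):
            print('all-NUL 56K block at offset %d (block #%d)' % (off, off // BLOCK))

Against the first bad file this should report a single hit at offset
519995392, i.e. block #9068, matching the arithmetic above.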

So I searched the kernel source, expecting to find 56K in the SATA
driver code.  Instead, the only place I found it that seemed relevant
was:

	./fs/cifs/README:  wsize default write size (default 57344)
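
That at least suggests checking what wsize the client is actually
using.  Something like this would show the cifs mount options in
effect on the client (a sketch; I'm not sure this kernel records wsize
in /proc/mounts when it is left at the default):

    # Print mountpoint and options for every cifs mount on the client.
    # If a wsize= option shows up, it should match the 57344 above; if
    # not, the fs/cifs default is presumably in effect.
    with open('/proc/mounts') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[2] == 'cifs':
                print(fields[1], fields[3])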

I have since used cp to copy the file 4 times with tcpdump running at
both ends.  All 4 copies came through intact.  I don't know whether
that is because tcpdump slows things down or whether our test app is at
fault.  The test app talks to the local file system, and not with a 56K
block size, so I don't think it is our app.

Unfortunately, the tcpdumps at both ends report the kernel dropping
about 50% of the packets, so even if I can get it to fail, I'm still
unsure whether the problem is on the client or the samba server, and
"client" would still leave me choosing between our app and fs/cifs.

The only other thing I can think of is the Ethernet devices, but since
each 56K write is spread across 30+ Ethernet frames (57344 bytes at a
~1460-byte MSS is roughly 40 frames) and TCP checksums the payload, I
can't see the network layers being the culprit.  But just in case:

client w/ fs/cifs:
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

samba server:
01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller

A few questions:

0. Anyone already know of a bug in fs/cifs or samba that has this
symptom?

1. Anyone know how to get the kernel to not drop the packets?

2. Any other ideas on what I can do to gather more data to differentiate
between bad-app, fs/cifs, samba, or other-element-in-the-chain?

-- 

Wayne Walker
wwalker at solid-constructs.com
(512) 633-8076
Senior Consultant
Solid Constructs, LLC

> A: Because it messes up the order in which people normally read text.
> > Q: Why is top-posting such a bad thing?
> > > A: Top-posting.
> > > > Q: What is the most annoying thing in e-mail?

