Samba and AIX slightly OT.

Wed Feb 25 00:35:30 GMT 2004

William Jojo [mailto:jojowil at hvcc.edu]  wrote:
> Bottom line is I have ... JFS meta-corruption when using the 64-bit kernel
> within 8-11 days after a fsck repair and reboot. The 32-bit kernel NEVER
> exhibits this problem after months of testing.
...
> I'm battling with Software Support Line at IBM ... of course once I
> mention Samba, I might as well face eastward on a full-moon at midnight
> for all the good my talking will do.
...
> ... ultimately Samba is calling IBM provided APIs for filesystems
> management.

Precisely correct. Samba is just another (somewhat complex) application
as far as the operating system is concerned. Samba might be doing a series
of operations that exposes a latent bug, but you can be assured that the bug
is in the file system or disk driver or hardware, not in Samba. In fact, by
your own words you have already proven this beyond a reasonable doubt. You
said you have never seen it on the 32-bit kernel (presumably on the exact
same hardware). QED.

The key to tracking down a mysterious corruption problem is narrowing down the
possibilities.

Are you able to reliably detect the corruption soon after it happens?  Does
it always happen to the same file or directory?  Or same logical disk block?
Does it always bite the same user?  Does it happen at the same time of day?
Are people hitting their quota limits?  Are your disks filling up?  What is
the data that is corrupted?  What is the data that is the source of
corruption?  Both the source data (good data written to the wrong location)
and the target data (innocent victim of corruption) can often provide strong
clues.

In my 30-odd years of experience working with operating systems, I have seen
a number of corruption problems and many of them are triggered by boundary
conditions.  Files that hit the file size limit, directories that hit the
directory size limit, disks that fill up, quota systems that misfire, files
that grow beyond "magic" size boundaries (say, 2GB), etc.

Your best hope is to find the corruption quickly and then come up with a
hypothesis as to why that particular item got corrupted at that particular
time.  Then, you can try experiments to test hypotheses.
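
One way to find corruption quickly is to checksum a tree of files on a
schedule and compare each run against the previous one.  What follows is
only a minimal sketch, assuming Python is available on the server; the
function names snapshot() and compare() are made up for illustration.
Note that it can only see damage that shows up as changed, missing, or
unreadable files, and that on a live share files also change for
legitimate reasons, so it is most useful on data that should be static.

```python
import hashlib
import os


def snapshot(root):
    """Return {relative_path: digest} for every regular file under root.

    Files that cannot be read are recorded with an "ERROR: ..." string
    instead of a digest, since a read error is itself a useful clue.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, devices, sockets, etc.
            rel = os.path.relpath(path, root)
            try:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    # Read in 1 MiB chunks so large files do not fill memory.
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digests[rel] = h.hexdigest()
            except OSError as exc:
                digests[rel] = "ERROR: %s" % exc
    return digests


def compare(old, new):
    """Return (changed, missing, added) sorted path lists between snapshots."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    return changed, missing, added
```

Run snapshot() on a quiet part of the filesystem, save the result, and
compare() it against the next run.  Anything reported as changed or
missing that nobody touched narrows down when and where the damage
happened.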

For example, you might try splitting your load over two servers.  If the
situation is being driven by a specific user scenario, then one server will
see it and the other will not.  If the situation is just load-related, then
reducing the load by half should reduce the chances of encountering the
problem.  If it always hits the same user, then find out what they are
doing.  Enlist your user community to help you identify when and where it is
happening.  If you think you might be hitting some of the limits I mentioned
above, then write trivial synthetic test programs to stress those limits.
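
As an illustration of what such a trivial synthetic test might look
like, here is a sketch in Python that writes a known byte pattern
straddling a size boundary and reads it back.  The function name
check_boundary() and its parameters are made up for this example, and it
assumes the filesystem permits files of the size being tested (a file
crossing 2GB needs large-file support).

```python
import os


def check_boundary(path, boundary, span=4096):
    """Write a known pattern straddling `boundary` and verify it.

    Writes bytes covering [boundary - span, boundary + span) in the file
    at `path`, forces them to disk, reads them back, and returns the list
    of absolute file offsets that differ (an empty list means no mismatch).
    """
    start = max(0, boundary - span)
    length = boundary + span - start
    # Position-dependent pattern, so misplaced data is recognizable.
    pattern = bytes((start + i) % 251 for i in range(length))

    with open(path, "wb") as f:
        f.seek(start)           # leaves a hole before `start` if start > 0
        f.write(pattern)
        f.flush()
        os.fsync(f.fileno())    # ask the OS to push the data to disk

    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(length)

    bad = [start + i for i, (a, b) in enumerate(zip(pattern, data)) if a != b]
    # A short read means the tail of the pattern is missing entirely.
    bad.extend(range(start + len(data), start + length))
    return bad
```

For example, check_boundary("/some/test/file", 2**31) exercises the 2GB
mark (the path is a placeholder).  A read-back right after the write is
probably served from cache, so a stricter test would verify again after
a remount or reboot, and would run many such files in a loop to
approximate real load.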

But regardless, tell those support people that you are paying for support
services and you'd like some!  Escalate to their management, quote them back
their Terms & Conditions, and hold them accountable.  They are giving you
the runaround and hoping you just go away.  Ask to talk to someone from
Engineering.  That should shut them up.  In my experience working in
Engineering for a vendor, CS types hate to call in Engineering.  The
customer usually has to appeal to the account team or to senior CS
management to turn the issue over to the experts.  (We just love this....we
only get to talk to really upset customers....yummy).  So go appeal.  It's
your right.

HTH

Thanks
PG
--
Paul Green, Senior Technical Consultant,
Stratus Technologies, Maynard, MA USA
Voice: +1 978-461-7557; FAX: +1 978-461-3610
Speaking from Stratus not for Stratus