Skipped groups in idmap2 on a cluster (with CTDB and GPFS)

Thu May 26 10:22:09 MDT 2011

On Thu, May 26, 2011 at 09:06:15AM -0700, Richard Sharpe wrote:
> > It should never happen that the idmap2.tdb is different on
> > different nodes. That file is being covered by ctdb
> > transactions, which are supposed to take care of making sure
> > that they are the same everywhere. Can you say when this
> > happened?
> 
> It happened on a customer site on 18-May-2011 or thereabouts.
> 
> The customer was using robocopy to copy a share from a Win2k8 (I
> think) node to the cluster.
> 
> There are two issues that I see:
> 
> 1. The idmap2 files differ between the nodes of the cluster
> 2. The SID mapping info is partial; even worse, it is the
> GID-to-SID entry, and that is partial as well.

There must have been transaction problems. With recent
versions of ctdb and Samba we believe they are gone. Is it
possible that you have messages containing "rsn" errors in
/var/log/messages around that time or later? I don't recall
the exact messages.
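
A minimal sketch of such a log search, assuming only that the
relevant lines contain the substring "rsn" in some letter case. The
helper name find_rsn_lines is made up for illustration, and because
the exact message text is not known, the match is deliberately loose
and may return unrelated lines.

```python
def find_rsn_lines(lines, needle="rsn"):
    """Return the lines that contain `needle`, ignoring letter case.

    This is a plain substring filter. It does not know the real
    ctdb/Samba message format, so every hit needs manual review.
    """
    needle = needle.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_log(path="/var/log/messages"):
    """Read the log at `path` and return the candidate lines."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, errors="replace") as handle:
        return find_rsn_lines(handle)
```

Calling scan_log() on each node and comparing the timestamps of the
hits against the time of the incident would show whether such
messages appeared around then or later.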

> There is a failed drive on one node which GPFS has kicked out and our
> data is on a partition at the beginning of that drive, but it is
> mirrored across four drives in the system, so MD should have taken
> care of that and in any event, the failed drive is not on the node
> that disagrees with the other two nodes.
> 
> I am investigating some more and will trawl through some more code.
> 
> What is vacuuming used for, BTW?

When Samba/ctdb want to delete a record from a tdb, they
overwrite it with a 0-length record for consistency and
failover reasons. Those empty records have to be cleaned up
eventually. This process is called vacuuming. Also, vacuuming
has seen a large number of fixes lately. Michael Adam will be
able to tell more about that.
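
The idea can be illustrated with a toy model. This is only a sketch
of the concept described above, not the ctdb or tdb implementation:
the class ToyStore and its methods are invented for illustration, it
runs in a single process, and it ignores everything that makes real
vacuuming hard (multiple nodes, record ownership, failover).

```python
class ToyStore:
    """In-memory stand-in for a tdb; keys and values are bytes."""

    def __init__(self):
        self.records = {}

    def store(self, key, value):
        self.records[key] = value

    def delete(self, key):
        # A delete does not remove the key. It overwrites the value
        # with a 0-length record, which acts as a deletion marker.
        self.records[key] = b""

    def fetch(self, key):
        # A 0-length record is reported the same way as a missing key.
        value = self.records.get(key)
        return value if value else None

    def vacuum(self):
        # Vacuuming physically removes the 0-length records and
        # returns how many were cleaned up.
        dead = [k for k, v in self.records.items() if len(v) == 0]
        for k in dead:
            del self.records[k]
        return len(dead)


store = ToyStore()
store.store(b"key-a", b"value-a")
store.store(b"key-b", b"value-b")
store.delete(b"key-a")

# The deleted key still occupies a slot until vacuum() runs.
slots_before = len(store.records)
removed = store.vacuum()
slots_after = len(store.records)
```

In this toy run the deleted key stays in the store as an empty
record until vacuum() is called, which is the cleanup step the
paragraph above refers to.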

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen