[Samba] Samba in Pacemaker-Cluster: CTDB fails to get recovery lock

Mon Mar 14 12:23:03 MDT 2011

On Fri, Mar 11, 2011 at 8:13 AM, Uwe Ritzschke
<uwe.ritzschke.2 at cms.hu-berlin.de> wrote:
> I'm currently testing fail-over with a two-node active-active cluster (with
> node dig and node dag): Both nodes are up, one is manually killed. CTDB on
> the node that's still alive should perform a recovery and everything should
> be working again.
>
> What occasionally happens is:
>
> After killing the pacemaker-process on dag (and dag consequently being
> fenced), dig's CTDB tries to get the recovery lock and fails. As there is no
> other node online to get the recovery lock and thus finish CTDB's
> recovery, dig's CTDB keeps trying to get the recovery lock until manually
> stopped.
> The only way to get CTDB back to work is to restart OCFS2's distributed lock
> manager.
>
>
> Our setting:
>
> two nodes directly connected via LAN running openSUSE 11.3 and sharing a
> SAN-drive that is connected via two interfaces using multipath.
>
> pacemaker 1.1.2
> corosync 1.2.1
> cluster-glue 1.0.5-1.4
> ctdb 1.0.114-2.20
> ocfs2 1.4.3-1.4
> multipath 0.4.8-51.3
>
You might want to try updated packages from the repository:
http://download.opensuse.org/repositories/network:/ha-clustering/openSUSE_11.3/

This would give you newer code levels on the HA packages.
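
If it helps to narrow things down, it may be worth checking whether a plain
byte-range lock on the OCFS2 mount can be taken at all while CTDB is stuck.
The sketch below is only an illustration, not CTDB's own code: it assumes the
recovery lock behaves like a non-blocking exclusive fcntl lock on the first
byte of a file on the shared filesystem, and the function name try_reclock
and the path passed to it are hypothetical. Point it at a scratch file on the
same mount, not at the live recovery lock file.

```python
import fcntl
import os
import sys


def try_reclock(path):
    """Try a non-blocking exclusive fcntl lock on byte 0 of `path`.

    Returns True if the lock was granted (it is released again right away),
    False if the lock request was refused. Errors from opening the file
    (for example a missing directory) are raised to the caller.
    """
    # Assumption: the lock file is opened read/write and created if missing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Lock 1 byte starting at offset 0, without blocking.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    except OSError:
        return False
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
        return True
    finally:
        os.close(fd)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass a scratch file on the shared OCFS2 mount.
    granted = try_reclock(sys.argv[1])
    print("lock granted" if granted else "lock refused")
```

If the lock is refused on the surviving node while no other node is up, that
would at least suggest the problem sits in the filesystem's lock manager
rather than in CTDB itself.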

-- 
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org