CTDB with GlusterFS

Matthew Sellers matt at indigo.nu
Tue Nov 10 21:39:23 UTC 2015


Hi Guys,

I am testing CTDB with GlusterFS and am having issues with the
recovery lock functionality.  I am using CTDB with a GlusterFS 3.7.5
mount (replica volume type) hosting my CTDB_RECOVERY_LOCK file.  Once
I start the service, I acquire the lock, CTDB assigns the VIP, and
life is happy.  Shortly after startup (~30 seconds), CTDB complains of
a slow RECLOCK until a gluster error is emitted, and this process
repeats ad infinitum.  I ran the ping_pong test to exercise POSIX
locking, as suggested by other threads on this topic, and saw no
errors.  This is all with a single CTDB instance; the other nodes are
left down during my testing.
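
For reference, the relevant piece of the setup looks roughly like this
(the lock path is the one that shows up in the logs below; the
ping_pong data file name and lock count are just the illustrative
values I used, not anything prescribed):

    # /etc/sysconfig/ctdb
    CTDB_RECOVERY_LOCK=/mnt/glusterfs/ctdb.lock

    # POSIX locking sanity check, run against the same mount;
    # the second argument is the number of lock slots, usually
    # number of nodes + 1 (here: one node under test).
    ping_pong /mnt/glusterfs/ping_pong.dat 2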

* Has anyone tried CTDB with modern Gluster and had success?

* I am fairly confident this is a gluster bug, given the "Transport
endpoint is not connected" errors gluster emits (logs below).  I will
fire a message off to their list as well, but wanted to ask whether
anybody on #samba-technical has shared my experience.  I have not yet
discovered what sequence is causing gluster to fail, so I have not
reported it yet....more digging required :-)
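
In case it helps anyone reproduce this outside of CTDB, here is a
minimal standalone sketch of what I understand the recovery daemon's
reclock check to boil down to: take an exclusive fcntl byte-range lock
on the file, then keep reading through the held fd.  The loop below is
my approximation from the log messages, not verified line-for-line
against the CTDB source, and the default path is just the one from my
setup:

    /* reclock_probe.c - build with: cc -o reclock_probe reclock_probe.c */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* Default to the lock file from my setup; pass another
         * path on the command line to test a different mount. */
        const char *path = (argc > 1) ? argv[1] : "/mnt/glusterfs/ctdb.lock";
        char c;

        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Exclusive byte-range lock on the first byte of the file. */
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 1,
        };
        if (fcntl(fd, F_SETLK, &fl) != 0) {
            perror("fcntl(F_SETLK)");
            return 1;
        }
        printf("lock held on %s, polling reads...\n", path);

        /* Keep reading through the held fd.  On a healthy mount this
         * loops forever; if the mount dies underneath us I would
         * expect ENOTCONN here, matching the "failed read from
         * recovery_lock_fd - Transport endpoint is not connected"
         * error in the logs. */
        for (;;) {
            if (pread(fd, &c, 1, 0) < 0) {
                fprintf(stderr, "pread failed: %s\n", strerror(errno));
                return 1;
            }
            sleep(5);
        }
    }

Running this against the gluster mount while leaving CTDB out of the
picture should show whether the ENOTCONN comes from gluster on its
own.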

Please share any thoughts! :-)

Thanks,
Matt


----- LOGS ------

2015/11/09 17:04:51.653042 [recoverd:32594]: Takeover run starting
2015/11/09 17:04:51.653328 [32384]: Takeover of IP 172.20.63.61/24 on interface eth0
2015/11/09 17:04:51.653868 [32384]: Takeover of IP 172.20.63.60/24 on interface eth0
2015/11/09 17:04:52.112887 [recoverd:32594]: Takeover run completed successfully
2015/11/09 17:04:53.598101 [32384]: monitor event OK - node re-enabled
2015/11/09 17:04:53.598152 [32384]: Node became HEALTHY. Ask recovery master 0 to perform ip reallocation
2015/11/09 17:04:53.598474 [recoverd:32594]: Node 0 has changed flags - now 0x0  was 0x2
2015/11/09 17:05:07.666713 [recoverd:32594]: server/ctdb_recoverd.c:3333 check_reclock child process hung/timedout CFS slow to grant locks?
2015/11/09 17:05:07.666843 [32384]: High RECLOCK latency 15.014489s for operation recd reclock
2015/11/09 17:05:07.667011 [recoverd:32594]: Takeover run starting
2015/11/09 17:05:07.874068 [recoverd:32594]: Takeover run completed successfully
2015/11/09 17:05:22.889932 [recoverd:32594]: server/ctdb_recoverd.c:3333 check_reclock child process hung/timedout CFS slow to grant locks?
2015/11/09 17:05:22.890061 [32384]: High RECLOCK latency 15.015322s for operation recd reclock
2015/11/09 17:05:24.141909 [32384]: 01.reclock: ERROR: 4 consecutive failures for 01.reclock, marking node unhealthy
2015/11/09 17:05:24.142173 [32384]: monitor event failed - disabling node
2015/11/09 17:05:24.142208 [32384]: Node became UNHEALTHY. Ask recovery master 0 to perform ip reallocation
2015/11/09 17:05:24.142534 [recoverd:32594]: Node 0 has changed flags - now 0x2  was 0x0
2015/11/09 17:05:29.188326 [32384]: 01.reclock: ERROR: 5 consecutive failures for 01.reclock, marking node unhealthy
2015/11/09 17:05:37.905513 [recoverd:32594]: server/ctdb_recoverd.c:3333 check_reclock child process hung/timedout CFS slow to grant locks?
2015/11/09 17:05:37.905608 [32384]: High RECLOCK latency 15.014795s for operation recd reclock
2015/11/09 17:05:37.905850 [recoverd:32594]: Takeover run starting
2015/11/09 17:05:38.117459 [recoverd:32594]: Takeover run completed successfully
2015/11/09 17:05:39.233709 [32384]: 01.reclock: ERROR: 6 consecutive failures for 01.reclock, marking node unhealthy
2015/11/09 17:05:53.127048 [recoverd:32594]: server/ctdb_recoverd.c:3333 check_reclock child process hung/timedout CFS slow to grant locks?
2015/11/09 17:05:53.127204 [32384]: High RECLOCK latency 15.009084s for operation recd reclock
2015/11/09 17:05:54.280983 [32384]: 01.reclock: ERROR: 7 consecutive failures for 01.reclock, marking node unhealthy
2015/11/09 17:06:01.935111 [recovery-lock: 1208]: failed read from recovery_lock_fd - Transport endpoint is not connected
2015/11/09 17:06:01.935284 [recoverd:32594]: server/ctdb_recoverd.c:3356 reclock child process returned error 2
2015/11/09 17:06:01.935326 [recoverd:32594]: server/ctdb_recoverd.c:3456 reclock child failed when checking file
2015/11/09 17:06:01.936010 [32384]: High RECLOCK latency 8.807912s for operation recd reclock
2015/11/09 17:06:01.936257 [recoverd:32594]: Failed check_recovery_lock. Force a recovery
2015/11/09 17:06:01.936293 [recoverd:32594]: server/ctdb_recoverd.c:1765 Starting do_recovery
2015/11/09 17:06:01.936305 [recoverd:32594]: Taking out recovery lock from recovery daemon
2015/11/09 17:06:01.936314 [recoverd:32594]: Take the recovery lock
2015/11/09 17:06:01.958604 [recoverd:32594]: ctdb_recovery_lock: Unable to open /mnt/glusterfs/ctdb.lock - (Transport endpoint is not connected)
2015/11/09 17:06:01.958669 [recoverd:32594]: Unable to get recovery lock - aborting recovery and ban ourself for 120 seconds
2015/11/09 17:06:01.958689 [recoverd:32594]: Banning node 0 for 120 seconds
2015/11/09 17:06:01.958762 [32384]: Banning this node for 120 seconds
2015/11/09 17:06:01.958788 [32384]: This node has been banned - forcing freeze and recovery


