[Samba] Re: [ctdb] Unable to take recovery lock - contention

Mon Feb 26 22:27:06 UTC 2018

[Thanks Harry!]

On Monday, 26 February 2018, 17:26:06 CET, zhu.shangzhong--- via
samba wrote:

> ------------------Original message------------------
> From: 朱尚忠10137461
> To: samba at lists.samba.org <samba at lists.samba.org>
> Date: 2018-02-26 17:10
> Subject: [ctdb] Unable to take recovery lock - contention
> When ctdb is starting, the "Unable to take recovery lock - contention" message is logged continuously.
> In which cases is the "unable to take lock" error logged?
> Thanks!
> 
> The following are the ctdb logs:
> 2018/02/12 19:38:51.147959 ctdbd[5615]: CTDB starting on node
> 2018/02/12 19:38:51.528921 ctdbd[6602]: Starting CTDBD (Version 4.6.10) as PID: 6602
> 2018/02/12 19:38:51.529060 ctdbd[6602]: Created PID file /run/ctdb/ctdbd.pid
> 2018/02/12 19:38:51.529120 ctdbd[6602]: Listening to ctdb socket /var/run/ctdb/ctdbd.socket
> 2018/02/12 19:38:51.529146 ctdbd[6602]: Set real-time scheduler priority
> 2018/02/12 19:38:51.648117 ctdbd[6602]: Starting event daemon /usr/libexec/ctdb/ctdb_eventd -e /etc/ctdb/events.d -s /var/run/ctdb/eventd.sock -P 6602 -l file:/var/log/log.ctdb -d NOTICE
> 2018/02/12 19:38:51.648390 ctdbd[6602]: connect() failed, errno=2
> 2018/02/12 19:38:51.693790 ctdb-eventd[6633]: listening on /var/run/ctdb/eventd.sock
> 2018/02/12 19:38:51.693893 ctdb-eventd[6633]: daemon started, pid=6633
> 2018/02/12 19:38:52.648474 ctdbd[6602]: Set runstate to INIT (1)
> 2018/02/12 19:38:54.505780 ctdbd[6602]: PNN is 1
> 2018/02/12 19:38:54.574993 ctdbd[6602]: Vacuuming is disabled for persistent database ctdb.tdb
> 2018/02/12 19:38:54.576297 ctdbd[6602]: Attached to database '/var/lib/ctdb/persistent/ctdb.tdb.1' with flags 0x400
> 2018/02/12 19:38:54.576322 ctdbd[6602]: Ignoring persistent database 'ctdb.tdb.2'
> 2018/02/12 19:38:54.576339 ctdbd[6602]: Ignoring persistent database 'ctdb.tdb.0'
> 2018/02/12 19:38:54.576364 ctdbd[6602]: Freeze db: ctdb.tdb
> 2018/02/12 19:38:54.576393 ctdbd[6602]: Set lock helper to "/usr/libexec/ctdb/ctdb_lock_helper"
> 2018/02/12 19:38:54.579527 ctdbd[6602]: Set runstate to SETUP (2)
> 2018/02/12 19:38:54.881828 ctdbd[6602]: Keepalive monitoring has been started
> 2018/02/12 19:38:54.881873 ctdbd[6602]: Set runstate to FIRST_RECOVERY (3)
> 2018/02/12 19:38:54.882020 ctdb-recoverd[7182]: monitor_cluster starting
> 2018/02/12 19:38:54.882620 ctdb-recoverd[7182]: Initial recovery master set - forcing election
> 2018/02/12 19:38:54.882702 ctdbd[6602]: This node (1) is now the recovery master
> 2018/02/12 19:38:55.882735 ctdbd[6602]: CTDB_WAIT_UNTIL_RECOVERED
> 2018/02/12 19:38:56.902874 ctdbd[6602]: CTDB_WAIT_UNTIL_RECOVERED
> 2018/02/12 19:38:57.885800 ctdb-recoverd[7182]: Election period ended
> 2018/02/12 19:38:57.886134 ctdb-recoverd[7182]: Node:1 was in recovery mode. Start recovery process
> 2018/02/12 19:38:57.886160 ctdb-recoverd[7182]: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
> 2018/02/12 19:38:57.886187 ctdb-recoverd[7182]: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
> 2018/02/12 19:38:57.886243 ctdb-recoverd[7182]: Set cluster mutex helper to "/usr/libexec/ctdb/ctdb_mutex_fcntl_helper"
> 2018/02/12 19:38:57.899722 ctdb-recoverd[7182]: Unable to take recovery lock - contention
> 2018/02/12 19:38:57.899763 ctdb-recoverd[7182]: Unable to get recovery lock - retrying recovery
> [...]

First, I would check that the recovery lock file actually exists, to
make sure the error message is sane. For example, using:

  ls -l /share-fs/export/ctdb/.ctdb/reclock

If the file doesn't exist, does the directory exist?
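
The file and directory checks can be folded into one small shell
function. This is only a sketch: the name check_reclock is made up for
illustration and is not part of CTDB.

```shell
# Hypothetical helper (not part of CTDB): report whether the recovery
# lock file exists, or failing that, whether its directory exists.
check_reclock() {
    reclock=$1
    if [ -e "$reclock" ]; then
        echo "file exists"
    elif [ -d "$(dirname "$reclock")" ]; then
        echo "directory exists, file missing"
    else
        echo "directory missing"
    fi
}
```

With the path from the log it would be called as:

  check_reclock /share-fs/export/ctdb/.ctdb/reclock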

If the file does exist, the next step would be to see what processes
have that file open (and potentially locked). Try:

  fuser -v /share-fs/export/ctdb/.ctdb/reclock

You should find that a /usr/libexec/ctdb/ctdb_mutex_fcntl_helper
process has it open.  This process should exit if its parent (CTDB
recovery daemon) goes away, so check what the parent process is.
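
One way to check the parent, assuming Linux with /proc mounted, is to
read the PPid field from /proc/<PID>/status. The function name
show_parent below is invented for this sketch and is not a CTDB tool.

```shell
# Hypothetical helper (not part of CTDB): print the parent PID of a
# process and the parent's command name, using Linux /proc.
show_parent() {
    pid=$1
    # PPid: line of /proc/<PID>/status holds the parent PID.
    ppid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status") || return 1
    [ -n "$ppid" ] || return 1
    echo "ppid=$ppid comm=$(cat "/proc/$ppid/comm")"
}
```

Given the PID that fuser reports for the helper, run "show_parent PID".
If pgrep is available, something like this should cover every running
helper, though the pattern may also match unrelated processes whose
command line mentions the helper name:

  for pid in $(pgrep -f ctdb_mutex_fcntl_helper); do show_parent "$pid"; done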

You can also use

  ls -i /share-fs/export/ctdb/.ctdb/reclock

to determine the inode of the file and then something like this to
determine if a process has the file locked:

  awk -v inode=<INODE#> '$6 ~ (":" inode "$") { print $5 }' /proc/locks
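
The inode lookup and the /proc/locks search can be combined into one
function. This is a sketch under a few assumptions: lock_holders is a
made-up name, the optional second argument exists only so the function
can be tried against a saved copy of /proc/locks, and only locks known
to the local kernel will show up.

```shell
# Hypothetical helper (not part of CTDB): print the PIDs that hold a
# lock on a file. Column 6 of /proc/locks is MAJOR:MINOR:INODE and
# column 5 is the PID of the lock holder.
lock_holders() {
    file=$1
    locks=${2:-/proc/locks}
    inode=$(ls -i -- "$file" | awk '{ print $1 }')
    [ -n "$inode" ] || return 1
    # Compare the inode as a string and require an exact match, so that
    # inode 123 does not also match 1234. Lines for blocked waiters
    # ("->") have a different column layout and are skipped.
    awk -v inode="$inode" '
        { n = split($6, id, ":") }
        n == 3 && (id[3] "") == (inode "") { print $5 }
    ' "$locks"
}
```

For the recovery lock in the log this would be:

  lock_holders /share-fs/export/ctdb/.ctdb/reclock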

If nothing has the file locked then there might be something weird
about your cluster filesystem.  Try running the helper by hand under
strace and seeing what fails:

  strace /usr/libexec/ctdb/ctdb_mutex_fcntl_helper /share-fs/export/ctdb/.ctdb/reclock

Good luck!

peace & happiness,
martin