[Samba] Troubleshooting a suspected ctdb performance issue

Smith, Jarrod A jarrod.smith at Vanderbilt.Edu
Tue Mar 8 02:09:33 UTC 2016


We have a three-node ctdb/samba cluster (8 Sandy Bridge cores and 64 GB RAM per node) running on top of GPFS 4.1.0.8 and serving 500-600 CIFS clients.  We use sernet-samba-4.1.6 with ctdb 1.0.114 on CentOS 6.6.  Unfortunately, the administrator who originally installed the ctdb/samba solution left some time ago, and we are still learning it.

Users have been reporting intermittent "latency" issues that have occurred multiple times per day over the past several months.  Typical complaints include folders or files taking 30-60 seconds to open, and sometimes being disconnected from the service.  The samba and ctdb logs show nothing at debug level WARNING.  We have taken tcpdump/wireshark packet captures during such events and analyzed them; they showed no obvious ill behavior in the network.

I have recently been probing ctdb itself, and today I realized that the number of ctdbd processes on a node periodically grows quickly from 2 to 250+.  This lasts anywhere from 30 seconds to single-digit minutes, at which point it corrects itself.  It seems to be correlated with a large increase in the number of lines in /proc/locks.  We also see what I feel are fairly high max_lockwait_latency and max_call_latency values (see our statistics outputs below).  I don't know what causes this, or how to fix it (if it indeed needs fixing).  Keeping in mind that I am new to samba and ctdb, do you have any recommendations for how we can further troubleshoot and/or fix the issue, assuming I have in fact hit upon it?
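
For anyone wanting to reproduce the observation above, here is a minimal Python sketch of how the two quantities could be sampled together, so that spikes in the ctdbd process count can be lined up against the size of /proc/locks. It is an illustration only, not the tooling actually used on this cluster: the function names, the `proc_root` parameter, and the 5-second interval are all assumptions. The process name is read from the "(comm)" field of /proc/<pid>/stat.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count processes whose command name in /proc/<pid>/stat equals name."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
            # The stat line looks like "pid (comm) state ..."; take the text
            # between the first "(" and the last ")".
            comm = data[data.index("(") + 1:data.rindex(")")]
        except (OSError, ValueError):
            # Process exited while we were reading, or the line was unexpected.
            continue
        if comm == name:
            count += 1
    return count


def count_lock_lines(proc_root="/proc"):
    """Count the lines currently present in /proc/locks."""
    with open(os.path.join(proc_root, "locks")) as f:
        return sum(1 for _ in f)


def monitor(samples, interval=5.0, proc_root="/proc"):
    """Print `samples` timestamped readings, `interval` seconds apart."""
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print("%s ctdbd=%d lock_lines=%d" % (
            stamp,
            count_processes("ctdbd", proc_root),
            count_lock_lines(proc_root),
        ))
        if i + 1 < samples:
            time.sleep(interval)
```

Calling `monitor(720)` on a node would log roughly one hour of readings at the default interval; redirecting that output to a file on each node makes it possible to compare the timestamps against user reports.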

Thanks for your advice,

--
Jarrod A. Smith, Ph.D.
Asst. Director, Center for Structural Biology
Research Assoc. Professor, Biochemistry
Vanderbilt University - 5135 MRB III
615-322-1739



-----------------------------------
CTDB statistics for each node.
The counters were reset a week or two ago.
-----------------------------------

CTDB version 1
num_clients                      136
frozen                             0
recovering                         0
client_packets_sent        286501416
client_packets_recv        325458087
node_packets_sent          354199901
node_packets_recv          266253799
keepalive_packets_sent        394382
keepalive_packets_recv        394374
node
    req_call               143496620
    reply_call                 90253
    req_dmaster             55005319
    reply_dmaster           60629271
    reply_error                    0
    req_message              1720274
    req_control             74416680
    reply_control           29674368
client
    req_call               253163227
    req_message              1113562
    req_control             71335811
timeouts
    call                           0
    control                        1
    traverse                       3
total_calls                253163227
pending_calls                      0
lockwait_calls              11138044
pending_lockwait_calls             0
childwrite_calls                   6
pending_childwrite_calls             0
memory_used                   210352
max_hop_count                   2162
max_reclock_ctdbd                  0.141385 sec
max_reclock_recd                   169.497819 sec
max_call_latency                   310.868259 sec
max_lockwait_latency               214.839209 sec
max_childwrite_latency             0.014314 sec

-----------------------------------

CTDB version 1
num_clients                      132
frozen                             0
recovering                         0
client_packets_sent        247024177
client_packets_recv        286512929
node_packets_sent          336526909
node_packets_recv          255250235
keepalive_packets_sent        394339
keepalive_packets_recv        394328
node
    req_call               128153759
    reply_call                 70305
    req_dmaster             60194286
    reply_dmaster           53335499
    reply_error                    0
    req_message              1521121
    req_control             73830804
    reply_control           24189543
client
    req_call               219206250
    req_message              1037383
    req_control             66378108
timeouts
    call                           0
    control                        3
    traverse                       5
total_calls                219206250
pending_calls                      0
lockwait_calls               3265686
pending_lockwait_calls             0
childwrite_calls                   6
pending_childwrite_calls             0
memory_used                   253340
max_hop_count                   2163
max_reclock_ctdbd                  0.342660 sec
max_reclock_recd                   0.000000 sec
max_call_latency                   437.201033 sec
max_lockwait_latency               67.572988 sec
max_childwrite_latency             0.015522 sec

-----------------------------------

CTDB version 1
num_clients                      205
frozen                             0
recovering                         0
client_packets_sent        299537550
client_packets_recv        349951795
node_packets_sent          376914119
node_packets_recv          272621669
keepalive_packets_sent        417794
keepalive_packets_recv        417782
node
    req_call               139163332
    reply_call                154848
    req_dmaster             59987264
    reply_dmaster           58997998
    reply_error                    0
    req_message              1083367
    req_control             85827912
    reply_control           34427476
client
    req_call               262120527
    req_message              2169333
    req_control             85858251
timeouts
    call                           0
    control                        2
    traverse                       5
total_calls                262120527
pending_calls                      0
lockwait_calls               5550667
pending_lockwait_calls             0
childwrite_calls                   6
pending_childwrite_calls             0
memory_used                   250736
max_hop_count                   2152
max_reclock_ctdbd                  0.016747 sec
max_reclock_recd                   166.169447 sec
max_call_latency                   16350.672816 sec
max_lockwait_latency               74.970163 sec
max_childwrite_latency             0.016126 sec
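
To compare the three dumps above side by side (for example the max_call_latency and max_lockwait_latency values), the text can be parsed into a dictionary per node. The following Python sketch assumes only the layout visible in the output above: "name value" lines, an optional trailing "sec", and indented counters grouped under a one-word heading such as "node" or "timeouts". The function name and the "section.key" naming for indented counters are illustrative choices, not anything defined by ctdb.

```python
import re

# "name  value" with an optional trailing "sec"; leading indent is captured.
_LINE = re.compile(r"^(\s*)(\S+)\s+(\d+(?:\.\d+)?)(?:\s+sec)?\s*$")


def parse_statistics(text):
    """Parse one block of statistics text into a {name: number} dict."""
    stats = {}
    section = None
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            indent, key, value = match.groups()
            if indent and section:
                key = section + "." + key  # e.g. "timeouts.control"
            else:
                section = None  # an unindented counter ends the section
            stats[key] = float(value) if "." in value else int(value)
        elif line.strip() and len(line.split()) == 1 and not line[0].isspace():
            # A lone unindented word ("node", "client", "timeouts")
            # starts a group of indented counters.
            section = line.strip()
    return stats
```

With one dictionary per node, a quick comparison such as `stats["lockwait_calls"] / stats["total_calls"]` shows what fraction of calls on each node had to wait for a lock.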





