CTDB woes

Orlando Richards orlando.richards at ed.ac.uk
Mon Apr 15 04:16:44 MDT 2013


On 13/04/13 01:21, Amitay Isaacs wrote:
>
>
>
> On Sat, Apr 13, 2013 at 1:41 AM, Orlando Richards
> <orlando.richards at ed.ac.uk> wrote:
>
>     On 12/04/13 16:35, Amitay Isaacs wrote:
>
>
>         On Fri, Apr 12, 2013 at 10:39 PM, Orlando Richards
>         <orlando.richards at ed.ac.uk> wrote:
>
>
>              Hi folks,
>
>              We've long been using CTDB and Samba for our NAS service,
>              servicing ~500 users. We've been suffering from some
>              problems with the CTDB performance over the last few weeks,
>              likely triggered either by an upgrade of samba from 3.5 to
>              3.6 (and enabling of SMB2 as a result), or possibly by
>              additional users coming on with a new workload.
>
>              We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44
>              (again, from sernet). Before we roll back, we'd like to make
>              sure we can't fix the problem and stick with Samba 3.6 (and
>              we don't even know that a roll back would fix the issue).
>
>              The symptoms are a complete freeze of the service for CIFS
>              users for 10-60 seconds, and on the servers a corresponding
>              spawning of large numbers of CTDB processes, which seem to
>              be created in a "big bang", and then do what they do and
>              exit in the subsequent 10-60 seconds.
>
>              We also serve up NFS from the same ctdb-managed frontends,
>              and GPFS from the cluster - and these are both fine
>              throughout.
>
>              This was happening 5-10 times per hour, though not at exact
>              intervals. When we added a third node to the CTDB cluster,
>              it "got worse", and when we dropped the CTDB cluster down
>              to a single node, everything started behaving fine - which
>              is where we are now.
>
>              So, I've got a bunch of questions!
>
>                - does anyone know why ctdb would be spawning these
>              processes, and if there's anything we can do to stop it
>              needing to do it? Also - any idea how we might reproduce
>              this kind of behaviour in a dev/test lab?
>
>
>     Hi Amitay,
>
>         It looks like there is contention for some record(s) which
>         results in CTDB creating lockwait child processes to wait for
>         the record. I would suggest you try CTDB 1.2.61.
>
>
>     Is that the current "stable" release? I must admit to getting a bit
>     confused around release numbers for ctdb! The sernet release we're
>     on has proved to be very stable for us (it can't be said enough -
>     thanks Sernet!).
>
> Yes. The current development release is 2.1.
>
>                - has anyone done any more general performance / config
>              optimisation of CTDB/Samba/GPFS/Linux?
>
>
>         For general performance tracking you will have to check if there is
>         heavy CPU load, high memory pressure, or lots of processes in wait
>         state. That will give you clues as to what the next bottleneck is.
>
>
>      From what we could see at the time, I'd have characterised it as
>     typical of "lots of processes in wait state", but I couldn't figure
>     out what they were waiting for.
>
>
> Yes. Those are lockwait processes waiting for fcntl locks.
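
If I've understood that right, each of those children is essentially
doing a blocking fcntl byte-range lock on the tdb file and exiting once
it has it. A rough Python sketch of that kind of wait - the file name
and byte range below are made up purely for illustration, ctdb does the
real thing in C against the actual record offset:

import fcntl, os

# Illustration only: block until an exclusive lock on one byte of a tdb
# file can be taken, the way a lockwait child waits for a contended
# record. Path and offset are invented for this sketch.
TDB = "/var/ctdb/locking.tdb.0"

fd = os.open(TDB, os.O_RDWR)
# LOCK_EX without LOCK_NB means "wait as long as it takes" - which would
# explain a burst of contention showing up as a pile of processes that
# all sit there for 10-60 seconds.
fcntl.lockf(fd, fcntl.LOCK_EX, 1, 4096, os.SEEK_SET)
# ... tell the parent the record is free, then release and exit ...
fcntl.lockf(fd, fcntl.LOCK_UN, 1, 4096, os.SEEK_SET)
os.close(fd)

If that's the shape of it, then /proc/<pid>/wchan on one of the spawned
processes during a freeze should show the kernel function it's sleeping
in, which might be a quick way to confirm what they're blocked on.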


Thanks Amitay,

I've got CTDB 1.2.61 built (from the 1.2.40 git branch) and ready to go,
along with samba recompiled against it. However, this morning we tried
re-initialising the tdb databases in our two-node cluster, roughly as
sketched below:
  - shut down ctdb on one node, move /var/ctdb out of the way, start it
back up again
  - shut down ctdb on the other node, move /var/ctdb out of the way,
start it back up again
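
In script form the per-node step is roughly the sketch below - the
"service ctdb" name and the habit of moving the old directory aside
rather than deleting it are just how our sernet-based install is laid
out, so adjust to taste:

import subprocess, time

# Roughly the per-node re-init step from the list above. The "ctdb"
# service name and the /var/ctdb location match our sernet-based
# install; adjust for other layouts.
def reinit_node():
    subprocess.call(["service", "ctdb", "stop"])
    # Move the old databases aside rather than deleting them, so they
    # can be put back if the re-init turns out to be a bad idea.
    subprocess.call(["mv", "/var/ctdb", "/var/ctdb.old.%d" % int(time.time())])
    subprocess.call(["service", "ctdb", "start"])

# Run on one node, wait for "ctdb status" to report it healthy again,
# then repeat on the other node.
if __name__ == "__main__":
    reinit_node()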

And we've been in production today with no obvious recurrence of the
problem. Of course, that could easily be down to reduced load (it's
Monday morning!), so we're continuing to monitor things. We have seen a
few spikes in the number of ctdb processes, but they lasted less than 10
seconds, so I don't know whether they caused any service "pauses".
Associated with those spikes were these messages from ctdb (we've got it
on DEBUG level logging just now):

Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_call.c:475 deferred ctdb_request_call
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_ltdb_server.c:278 PACKET REQUEUED
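
Counting ctdbd processes once a second is enough to catch spikes like
that - something along these lines, where the one-second interval is an
arbitrary choice rather than anything ctdb-specific:

import os, time

# Count processes whose name is "ctdbd" - the extra processes seen
# during the spikes are forks of ctdbd, so they show up under the same
# name.
def count_ctdbd():
    n = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/status" % pid) as f:
                # First line is of the form "Name:\tctdbd"
                if f.readline().split()[-1] == "ctdbd":
                    n += 1
        except (IOError, OSError, IndexError):
            # Process went away between listdir() and open()
            continue
    return n

while True:
    print("%s %d" % (time.strftime("%H:%M:%S"), count_ctdbd()))
    time.sleep(1)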

We're also seeing loads of these vacuuming messages all the time, as one
would probably expect:
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for deletion: db[locking.tdb] db_id[0x42fe72c5] key_hash[0xd1218614] lmaster[1] migrated_with_data[no]
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for deletion: db[brlock.tdb] db_id[0x1421fb78] key_hash[0x21dbd0cc] lmaster[1] migrated_with_data[yes]

I'll follow up with how things go today - our next intervention is 
likely to be to deploy the updated ctdb, as you suggested.

Cheers,
Orlando.


-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.

