CTDB woes

Orlando Richards orlando.richards at ed.ac.uk
Mon Apr 22 08:51:48 MDT 2013


On 15/04/13 11:16, Orlando Richards wrote:
> On 13/04/13 01:21, Amitay Isaacs wrote:
>>
>>
>>
>> On Sat, Apr 13, 2013 at 1:41 AM, Orlando Richards
>> <orlando.richards at ed.ac.uk> wrote:
>>
>>     On 12/04/13 16:35, Amitay Isaacs wrote:
>>
>>
>>         On Fri, Apr 12, 2013 at 10:39 PM, Orlando Richards
>>         <orlando.richards at ed.ac.uk> wrote:
>>
>>
>>              Hi folks,
>>
>>              We've long been using CTDB and Samba for our NAS service,
>>         servicing
>>              ~500 users. We've been suffering from some problems with
>>         the CTDB
>>              performance over the last few weeks, likely triggered
>>         either by an
>>              upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a
>>         result),
>>              or possibly by additional users coming on with a new
>> workload.
>>
>>              We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44
>>         (again,
>>              from sernet). Before we roll back, we'd like to make sure
>>         we can't
>>              fix the problem and stick with Samba 3.6 (and we don't even
>>         know
>>              that a roll back would fix the issue).
>>
>>              The symptoms are a complete freeze of the service for CIFS
>>         users for
>>              10-60 seconds, and on the servers a corresponding spawning
>>         of large
>>              numbers of CTDB processes, which seem to be created in a
>>         "big bang",
>>              and then do what they do and exit in the subsequent 10-60
>>         seconds.
>>
>>              We also serve up NFS from the same ctdb-managed frontends,
>>         and GPFS
>>              from the cluster - and these are both fine throughout.
>>
>>              This was happening 5-10 times per hour, though not at
>>              exact intervals. When we added a third node to the CTDB
>>              cluster, it "got worse", and when we dropped the CTDB
>>              cluster down to a single node, everything started
>>              behaving fine - which is where we are now.
>>
>>              So, I've got a bunch of questions!
>>
>>                - does anyone know why ctdb would be spawning these
>>         processes, and
>>              if there's anything we can do to stop it needing to do it?
>>         Also -
>>              any idea how we might reproduce this kind of behaviour in a
>>         dev/test
>>              lab?
>>
>>
>>     Hi Amitay,
>>
>>
>>
>>
>>         It looks like there is contention for some record(s) which
>>         results in
>>         CTDB creating lockwait child processes to wait for the record. I
>>         would
>>         suggest you try CTDB 1.2.61.
>>
>>
>>     Is that the current "stable" release? I must admit to getting a bit
>>     confused around release numbers for ctdb! The sernet release we're
>>     on has proved to be very stable for us (it can't be said enough -
>>     thanks Sernet!).
>>
>>     Yes. The current development release is 2.1.
>>
>>                - has anyone done any more general performance / config
>>              optimisation of CTDB/Samba/GPFS/Linux?
>>
>>
>>         For general performance tracking you will have to check if
>> there is
>>         heavy CPU load, high memory pressure, or lots of processes in
>> wait
>>         state. That will give you clues as to what the next bottleneck
>> is.
>>
>>
>>      From what we could see at the time, I'd have characterised it as
>>     typical of "lots of processes in wait state", but I couldn't figure
>>     out what they were waiting for.
>>
>>
>> Yes. Those are lockwait processes waiting for fcntl locks.
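
[Aside: to make that contention concrete, here is a minimal, purely
illustrative sketch - not ctdb code; the file name and timings are made
up - of two processes competing for the same fcntl byte-range lock. The
second one simply blocks until the holder releases, which is the wait
that ctdb hands off to its lockwait children:

  import fcntl, multiprocessing, tempfile, time

  def holder(path, hold_secs):
      # Take an exclusive lock on byte 0 and sit on it, like a busy client.
      with open(path, "r+b") as f:
          fcntl.lockf(f, fcntl.LOCK_EX, 1, 0)
          time.sleep(hold_secs)

  def waiter(path):
      # A second process blocks here until the holder lets go.
      with open(path, "r+b") as f:
          t0 = time.time()
          fcntl.lockf(f, fcntl.LOCK_EX, 1, 0)
          print("lock acquired after %.1fs" % (time.time() - t0))

  if __name__ == "__main__":
      with tempfile.NamedTemporaryFile() as tmp:
          tmp.write(b"\0")
          tmp.flush()
          p = multiprocessing.Process(target=holder, args=(tmp.name, 5))
          p.start()
          time.sleep(0.5)    # let the holder grab the lock first
          waiter(tmp.name)   # blocks until the holder releases
          p.join()

Scale that up to many smbds hammering the same record and you get bursts
of waiters like the ones described above.]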
>
>
> Thanks Amitay,
>
> I've got CTDB 1.2.61 built (from the 1.2.40 git branch) and ready to go,
> and a recompiled samba built against it too. However, this morning we
> tried re-initialising the tdb databases in our two-node cluster:
>   - shut down ctdb on one node, move /var/ctdb out the way, start it
> back up again
>   - shut down ctdb on the other node, move /var/ctdb out the way, start
> it back up again
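
[Aside: if you ever want to script that rolling re-initialisation, the
helper below captures the same steps. It is a hypothetical sketch, not
anything from the ctdb tree - the init script path and the /var/ctdb
location are whatever your packaging uses, and you would want "ctdb
status" to report the node healthy before moving on to the next one:

  import datetime, os, subprocess

  def reinit_node(db_dir="/var/ctdb", init_script="/etc/init.d/ctdb"):
      # Stop ctdb on this node only.
      subprocess.check_call([init_script, "stop"])
      # Move the old TDBs aside rather than deleting them, so they can
      # still be inspected (or put back) later.
      backup = "%s.old-%s" % (db_dir, datetime.date.today().isoformat())
      os.rename(db_dir, backup)
      # Start ctdb again; as above, it rebuilds its databases and
      # rejoins the cluster.
      subprocess.check_call([init_script, "start"])
      return backup

Run it on one node at a time, as described above.]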
>
> And we've been in production today with no obvious recurrence of the
> problem. Of course, that could easily be down to a reduced load (it's
> Monday morning!), so we're continuing to monitor things. We have seen a
> few spikes in the number of ctdb processes, but they lasted less than 10
> seconds so I don't know if they caused any service "pauses". Associated
> with those spikes were these messages from ctdb (we've got it on DEBUG
> level logging just now):
>
> Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_call.c:475 deferred
> ctdb_request_call
> Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_ltdb_server.c:278 PACKET
> REQUEUED
>
> We're also seeing loads of these all the time, as one would probably
> expect:
> Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for
> deletion: db[locking.tdb] db_id[0x42fe72c5] key_hash[0xd1218614]
> lmaster[1] migrated_with_data[no]
> Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for
> deletion: db[brlock.tdb] db_id[0x1421fb78] key_hash[0x21dbd0cc]
> lmaster[1] migrated_with_data[yes]
>
> I'll follow up with how things go today - our next intervention is
> likely to be to deploy the updated ctdb, as you suggested.
>
Hi all,

We did see the problem recur with the old ctdb even after 
re-initialising the databases, so we upgraded to ctdb 1.2.61 (and 
rebuilt samba against that ctdb version), and things have been fine 
ever since - many thanks indeed, Amitay!

--
Orlando





-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.

