Patch for a ctdb reloadnodes crash Bugzilla ID 10366

Kevin Osborn kosborn at overlandstorage.com
Wed Jan 22 10:32:26 MST 2014


Thanks Amitay. One more question...

Your patch didn't incorporate a change that we had proposed on the first allocation. We had changed the size being passed in to be just struct ctdb_tcp_array instead of calculating the amount of memory needed to hold the array. The ctdb_tcp_array struct includes a pointer to an array but not an array itself so we thought this allocation unnecessary.  An allocation further down in your patch did not calculate the space needed so at the very least these allocations are inconsistent.

I propose this for the first part of the patch
---- snip ----

    /* If this is the first tickle */

    if (tcparray == NULL) {

-       tcparray = talloc_size(ctdb->nodes,

+       tcparray = talloc_size(vnn, struct ctdb_tcp_array);
---- snip ----

Thanks again,

-Kevin



From: Amitay Isaacs [mailto:amitay at gmail.com]
Sent: Wednesday, January 22, 2014 1:36 AM
To: Kevin Osborn
Cc: samba-technical at lists.samba.org
Subject: Re: Patch for a ctdb reloadnodes crash Bugzilla ID 10366


On Wed, Jan 22, 2014 at 10:37 AM, Kevin Osborn <kosborn at overlandstorage.com<mailto:kosborn at overlandstorage.com>> wrote:
Hi,

We have been seeing a lot of CTDB crashes when testing adding and removing nodes from our cluster. The details are described in
https://bugzilla.samba.org/show_bug.cgi?id=10366

The good news is that we believe that have found a fix and we would like to make sure the community knows about it. The problem seems to go back at least as far as CTDB 1.1.

Here is our proposed fix.  Apply the following patch to the 2.5.1 ctdb source code:

--- ctdb-2.5.1/server/ctdb_takeover.c.orig      2014-01-16 09:24:59.000000000-0800
+++ ctdb-2.5.1/server/ctdb_takeover.c   2014-01-16 09:26:13.000000000 -0800 @@ -3051,11 +3051,9 @@

        /* If this is the first tickle */
        if (tcparray == NULL) {
-               tcparray = talloc_size(ctdb->nodes,
-                       offsetof(struct ctdb_tcp_array, connections) +
-                       sizeof(struct ctdb_tcp_connection) * 1);
+               tcparray = talloc(ctdb->nodes, struct ctdb_tcp_array);
                CTDB_NO_MEMORY(ctdb, tcparray);
-               vnn->tcp_array = tcparray;
+               vnn->tcp_array = talloc_steal(vnn, tcparray);

                tcparray->num = 0;
                tcparray->connections = talloc_size(tcparray, sizeof(struct ctdb_tcp_connection));

It seems like the original code is making ctdb->nodes the owner of the memory talloc'd for the tcparray. When reload nodes is called as part of the add nodes process, the old ctdb->nodes list is eventually talloc_free()'d, which then frees the tcparray that it "owns". Then later on, the tcp_array pointer is referenced but the memory has been recycled and the values inside are no longer valid. This is when the crash occurs.

Please let me know if the patch suggested is good, or if you have any other suggestions for how this problem should be addressed.

I hope this helps!

Thanks,

-Kevin Osborn

Hi Kevin,

Nice catch. There are two places where tcparray is allocated off ctdb->nodes. Here's the patch which fixes both places and removes unnecessary talloc_steal().  Please let me know if it works.

Thanks.
Amitay.




More information about the samba-technical mailing list