CTDB asymmetric (non-)recovery

Fri Jun 8 07:43:56 MDT 2012

On 07/06/2012 14:15, Nicolas Ecarnot wrote:
> On 07/06/2012 12:22, Nicolas Ecarnot wrote:
>> I increased the log level to 9 (damn, this IS verbose), and I tried to
>> extract the relevant part of the loop on the failing node (though so
>> far nothing proves that the unhealthy node _is_ the faulty one).
>>
>> The log file is here : http://pastebin.com/YEwrkmPx
>
> Comparing the verbose log files between a nice recovery and a failing
> one, I see in the good situation that :
>
> | [recoverd: 3351]: The interfaces status has changed on local node 1 - force takeover run
>
> Followed by :
>
> | [recoverd: 3351]: Trigger takeoverrun
>
> In the bad case, this takeover run never gets triggered (nothing in the
> log file).
> Reading the source code, I see that the function involved is
> verify_local_ip_allocation, but I don't understand how one could leave
> this function without it emitting an error message along the way.

Ok, I see now.
In server/ctdb_recoverd.c, in the function verify_local_ip_allocation, 
the relevant part is :

        /* skip the check if we have started but not finished recovery */
        if (timeval_compare(&uptime1->last_recovery_finished,
                            &uptime1->last_recovery_started) != 1) {
                DEBUG(DEBUG_INFO, (__location__ " in the middle of recovery or ip reallocation. skipping public ip address check\n"));
                talloc_free(mem_ctx);

                return 0;
        }

It is _this_ "return 0" I was looking for.
Now, I must admit I need help: I'm looking for an explanation of the
whole process. Is it normal that this check happens at this point?
If so, in the looping case, does that mean something is delaying the
recovery so much that, when the local IP allocation is verified, the
recovery has not finished yet?
That sounds plausible.
What calls verify_local_ip_allocation?
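
To check that I read the condition correctly, here is a minimal standalone sketch of it. It is not the CTDB code: tv_compare is my own stand-in for timeval_compare, which I assume returns -1, 0 or 1 when its first argument is earlier than, equal to, or later than its second, and ip_check_skipped is a hypothetical helper name.

```c
#include <stdbool.h>
#include <sys/time.h>

/* Stand-in for timeval_compare (assumed semantics): returns -1, 0 or 1
 * when a is earlier than, equal to, or later than b. */
static int tv_compare(const struct timeval *a, const struct timeval *b)
{
        if (a->tv_sec != b->tv_sec) {
                return a->tv_sec > b->tv_sec ? 1 : -1;
        }
        if (a->tv_usec != b->tv_usec) {
                return a->tv_usec > b->tv_usec ? 1 : -1;
        }
        return 0;
}

/* Hypothetical helper mirroring the "!= 1" test in the excerpt: true
 * when the last recovery did not finish strictly after it last started,
 * which is the case where the public IP check is skipped. */
static bool ip_check_skipped(const struct timeval *finished,
                             const struct timeval *started)
{
        return tv_compare(finished, started) != 1;
}
```

If that reading is right, the check is skipped both while a recovery is still running (finished earlier than started) and when the two timestamps are equal, and it only runs once last_recovery_finished is strictly later than last_recovery_started.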

-- 
Nicolas Ecarnot