Summary of flapping tests this year

Tim Beale timbeale at catalyst.net.nz
Mon Feb 25 03:00:39 UTC 2019


So we have a theory behind the drs.replica_sync failures. I think it's a
timing problem, where the periodic background replication occasionally
conflicts with what the test is trying to do.

The test disables replication, creates the same object on each DC,
then re-enables replication and checks that the DCs resolve the
conflict correctly. The failure occurs when trying to create the
object on the 2nd DC - the test finds that the object already exists
(presumably because it has already been replicated in the background).
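
To make the sequence concrete, here's a rough sketch of the scenario,
driven through samba-tool rather than the actual replica_sync test
code - the DC names, base DN and credentials below are just
placeholder values:

import subprocess

DC1 = "dc1.samdom.example.com"
DC2 = "dc2.samdom.example.com"
BASEDN = "DC=samdom,DC=example,DC=com"
CREDS = ["-U", "Administrator%Passw0rd"]

def drs(*args):
    subprocess.check_call(["samba-tool", "drs"] + list(args) + CREDS)

# 1. Stop both DCs from pulling changes from each other.
drs("options", DC1, "--dsa-option=+DISABLE_INBOUND_REPL")
drs("options", DC2, "--dsa-option=+DISABLE_INBOUND_REPL")

# 2. Create an object with the same DN independently on each DC. This
#    is the step that fails if a replication cycle was still in flight
#    when inbound replication was disabled.
for dc in (DC1, DC2):
    subprocess.check_call(["samba-tool", "ou", "create", "OU=conflict-test",
                           "-H", "ldap://%s" % dc] + CREDS)

# 3. Re-enable replication, force a sync both ways, then check that one
#    copy keeps the original DN and the losing copy ends up renamed
#    with a CNF:<GUID> conflict name.
drs("options", DC1, "--dsa-option=-DISABLE_INBOUND_REPL")
drs("options", DC2, "--dsa-option=-DISABLE_INBOUND_REPL")
drs("replicate", DC1, DC2, BASEDN)
drs("replicate", DC2, DC1, BASEDN)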

Our theory is that although the test disables replication at the
start, a periodic replication may already be in progress at that
point, i.e. the DC has already sent a GetNCChanges request, but the
peer DC hasn't actioned it yet. Possible solutions might be either:
- Update the test so that after it disables replication, it waits until
the drepl process is idle (I think Andrew has some patches on a WIP
abartlet-repl-flapping branch that did this). Or,
- More aggressively discard replication info in
dreplsrv_pending_op_callback(), if DS_NTDSDSA_OPT_DISABLE_INBOUND_REPL
is set.

It'd be nice if we could reproduce the problem a bit more reliably. I
tried lowering the dreplsrv:periodic_interval, but couldn't reproduce it
locally.
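
For anyone else who wants to have a go: the knob is a parametric
option in the smb.conf of the DCs in the test environment, along
these lines (I believe the value is in seconds; 10 is just an
arbitrarily low choice):

[global]
    # run the drepl server's periodic pull cycle this often
    dreplsrv:periodic_interval = 10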

On 22/02/19 4:56 PM, Scott Lovenberg via samba-technical wrote:
>> On Feb 21, 2019, at 21:24, Douglas Bagnall via samba-technical <samba-technical at lists.samba.org> wrote:
>>
>> For those who don't know, we run 6-hourly tests on our build host;
>> when those tests fail it sends an email to the samba-cvs list. All
>> these tests should pass, and passed at least once to get into the
>> tree. Every so often I run scripts to gather statistics from the
>> samba-cvs emails, and post the results here like this.
>>
>>
>> Since the beginning of the year we have had 38 test failures. 17 of
>> those tests failed only once, while these ones recur (the count is
>> shown at the start of each line):
>>
>>  11 UNEXPECTED(failure): samba.wbinfo_simple.check-secret.domain=SAMBA-TEST.wbinfo(nt4_member:local)
>>   3 UNEXPECTED(failure): samba3.raw.notify.mask(nt4_dc)
>>   3 UNEXPECTED(failure): samba3.smb2.notify.mask(nt4_dc)
>>   2 UNEXPECTED(error): samba4.drs.replica_sync.python(promoted_dc).replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRenamedVsNewRemoteWin(promoted_dc:local)
>>   2 UNEXPECTED(error): samba4.drs.replica_sync.python(promoted_dc).python2.replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRemoteWin_with_child(promoted_dc:local)
>>
>>
>> Of the ones that occurred once, six had something to do with
>> replica_sync (along with the four above):
>>
>>   1 UNEXPECTED(error): samba4.drs.replica_sync.python(promoted_dc).replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRemoteWin(promoted_dc:local)
>>   1 UNEXPECTED(error): samba4.drs.replica_sync.python(vampire_dc).replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRemoteWin_with_child(vampire_dc:local)
>>   1 UNEXPECTED(failure): samba4.drs.getnc_exop.python(promoted_dc).getnc_exop.DrsReplicaSyncTestCase.test_link_utdv_hwm(promoted_dc)
>>   1 UNEXPECTED(error): samba4.drs.replica_sync.python(promoted_dc).replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRemoteWin_with_child(promoted_dc:local)
>>   1 UNEXPECTED(error): samba4.drs.replica_sync.python(vampire_dc).replica_sync.DrsReplicaSyncTestCase.test_ReplConflictsRemoteWin(vampire_dc:local)
>>   1 UNEXPECTED(failure): samba4.drs.ridalloc_exop.python(vampire_dc).python2.ridalloc_exop.DrsReplicaSyncTestCase.test_rid_set_dbcheck(vampire_dc)
>>
>>
>> Three can be blamed on samba_kcc:
>>
>>   1 UNEXPECTED(error): samba4.drs.samba_tool_drs.python(promoted_dc).python2.samba_tool_drs.SambaToolDrsTests.test_samba_tool_kcc(promoted_dc:local)
>>   1 UNEXPECTED(error): samba4.drs.samba_tool_drs.python(promoted_dc).samba_tool_drs.SambaToolDrsTests.test_samba_tool_kcc(promoted_dc:local)
>>   1 UNEXPECTED(error): samba4.drs.samba_tool_drs.python(vampire_dc).python2.samba_tool_drs.SambaToolDrsTests.test_samba_tool_kcc(vampire_dc:local)
>>
>>
>> Two had something to do with notify (along with the six above):
>>
>>   1 UNEXPECTED(failure): samba3.smb2.notify-inotify.inotify-rename(fileserver)
>>   1 UNEXPECTED(failure): samba3.raw.notify.dir(nt4_dc)
>>
>>
>> One ctdb:
>>
>>   1 *FAILED* tests/simple/60_recoverd_missing_ip.sh
>>
>>
>> There are four misfits:
>>
>>   1 UNEXPECTED(failure): samba4.ldap.password_lockout.python(ad_dc_ntvfs).__main__.PasswordTestsWithSleep.test_login_lockout_krb5(ad_dc_ntvfs)
>>   1 UNEXPECTED(failure): samba.tests.samba_tool.user_wdigest.samba.tests.samba_tool.user_wdigest.UserCmdWdigestTestCase.test_Wdigest01(ad_dc_ntvfs:local)
>>   1 UNEXPECTED(failure): lib.audit_logging.audit_logging.test_audit_get_timestamp(none)
>>   1 UNEXPECTED(failure): samba4.rpc.altercontext on ncalrpc with bigendian.altercontext(ad_dc_ntvfs:local)
>>
>>
>> And there was one (2019-01-15-0032) where the samba-xc test ended with:
>>
>>     OSError: [Errno 28] No space left on device
>>
>>
>> So at the crudest level we have:
>>
>> wbinfo:         11
>> replica_sync:   10
>> notify:          8
>> samba_kcc:       3
>> all others:      5
>>
>>
>> We are possibly in a comparative lull (tilt your head to the right to
>> read the histogram):
>>
>> 2015-11   5 #####
>> 2015-12  18 ##################
>> 2016-01  36 ####################################
>> 2016-02  35 ###################################
>> 2016-03  47 ###############################################
>> 2016-04  49 #################################################
>> 2016-05  55 #######################################################
>> 2016-06  58 ##########################################################
>> 2016-07  53 #####################################################
>> 2016-08  50 ##################################################
>> 2016-09  24 ########################
>> 2016-10  22 ######################
>> 2016-11  23 #######################
>> 2016-12  22 ######################
>> 2017-01  44 ############################################
>> 2017-02  29 #############################
>> 2017-03  22 ######################
>> 2017-04  35 ###################################
>> 2017-05  45 #############################################
>> 2017-06  64 ################################################################
>> 2017-07  26 ##########################
>> 2017-08  21 #####################
>> 2017-09  27 ###########################
>> 2017-10  38 ######################################
>> 2017-11  25 #########################
>> 2017-12  50 ##################################################
>> 2018-01  35 ###################################
>> 2018-02  17 #################
>> 2018-03  40 ########################################
>> 2018-04  23 #######################
>> 2018-05  25 #########################
>> 2018-06  40 ########################################
>> 2018-07  66 ##################################################################
>> 2018-08  29 #############################
>> 2018-09  56 ########################################################
>> 2018-10  50 ##################################################
>> 2018-11  33 #################################
>> 2018-12  32 ################################
>> 2019-01  24 ########################
>> 2019-02  15 ############### <-- unfinished month
>>
>> The noise is mostly due to flapping tests that appear and are fixed
>> within a few weeks. We can at least say the underlying rate of
>> flapping tests is not getting worse, even though we keep adding more
>> tests and code that needs testing.
>>
>> Douglas
>>
> If I'm understanding correctly, it also seems that the majority of these flapping tests are due to race conditions or some sort of asynchronous behaviour, either within the actual code or in the test framework itself?


