[PATCH] Assert that the objectClass is always present

Wed Mar 19 20:56:01 MDT 2014

On Wed, 2014-03-12 at 12:52 +0100, Arvid Requate wrote:
> On Wed, 2014-03-12 at 07:56:35 +1300 Andrew Bartlett wrote:
> > On Tue, 2014-03-11 at 17:37 +0100, Arvid Requate wrote:
> > [...]
> > > Ok, how will drepl behave after the new assertion has been triggered? Will
> > > this block replication full stop or will it cause the object to be
> > > neglected or will it try to re-replicate the object in the next
> > > replication run?
> > > 
> > > * If the patch brings replication to a grinding halt that would be a show
> > > stopper.
> > 
> > Yes, it would do exactly that.  I agree it stops the show, and in the
> > case of corruption, I think that is the only safe action.
> > 
> > From here what I would suggest is a new forced replication, either
> > overwriting local changes totally or applying the replication merge
> > logic, but asking the remote server for all objects by suggesting our
> > USN is actually 0.
> > 
> > > [...]
> > 
> > I would suggest that this, run manually by the administrator, is the
> > only safe option.  It wouldn't lead to an infinite cycle because it
> > would need to be administrator-run.
> > 
> > Naturally, this needs to be combined with actually finding and fixing
> > whatever can cause this in the first place - it should not be a natural
> > part of operating a Samba domain.
> 
> Ok, I understand your points. Would it be possible to signal this situation in 
> some way, e.g. making "samba-tool drs showrepl" issue a specific warning/error 
> message?
> 
> For us it would be important that we can automate the detection of this error 
> condition, maybe run some information gathering script and finally trigger the 
> re-replication. We might do this e.g. by some cron job or nagios check. This 
> is what we as distributors need to be able to do.
> 
> This might also speed up the gathering of evidence around this issue, if we 
> could extract relevant information on the spot, possibly even from other 
> replicating DCs, rather than manually digging through the logs and comparing 
> logs on two or more DCs.

So, where I've got to is to make the DRS code give some pretty specific
error messages, and to write things very clearly in the logs.  This much
is in master. 

We still need an example of this failing that we can poke and probe, so
I hope you can continue to try and reproduce on some servers we can
later grab the databases from.

I'm not confident that the error message will currently make it to
showrepl, but I can look into that further if you like.  What I do know
is that the specific error is currently lost at:

replicated_objects.c:
	ret = ldb_extended(ldb, DSDB_EXTENDED_REPLICATED_OBJECTS_OID, objects,
			   &ext_res);
	if (ret != LDB_SUCCESS) {
		/* restore previous schema */
		if (used_global_schema) { 
			dsdb_set_global_schema(ldb);
		} else if (cur_schema) {
			dsdb_reference_schema(ldb, cur_schema, false);
		}

		DEBUG(0,("Failed to apply records: %s: %s\n",
			 ldb_errstring(ldb), ldb_strerror(ret)));
		ldb_transaction_cancel(ldb);
		TALLOC_FREE(tmp_ctx);
		return WERR_FOOBAR;
	}
	talloc_free(ext_res);

We could map the LDB errors onto WERR_DS_ codes instead of returning
WERR_FOOBAR.  We don't currently have a mapping table for that, but it
looks like all LDB errors have a matching WERR_DS error.
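
To make the idea concrete, here is a rough sketch of what such a mapping
table might look like.  This is not existing Samba code: the SK_ enums,
the table and sketch_ldb_to_werror() are all hypothetical stand-ins so
that the example compiles on its own.  The LDB-side numbers are the LDAP
result codes that LDB reuses; the WERROR-side values are placeholders,
not the real Windows error numbers.

```c
#include <stddef.h>

/* Stand-ins for LDB result codes (same numbers as LDAP result codes). */
enum sketch_ldb_err {
	SK_LDB_SUCCESS = 0,
	SK_LDB_ERR_OPERATIONS_ERROR = 1,
	SK_LDB_ERR_CONSTRAINT_VIOLATION = 19,
	SK_LDB_ERR_NO_SUCH_OBJECT = 32,
	SK_LDB_ERR_BUSY = 51,
	SK_LDB_ERR_OBJECT_CLASS_VIOLATION = 65
};

/* Hypothetical stand-ins for WERR_DS_* codes; values are placeholders. */
enum sketch_werror {
	SK_WERR_OK,
	SK_WERR_DS_OPERATIONS_ERROR,
	SK_WERR_DS_CONSTRAINT_VIOLATION,
	SK_WERR_DS_NO_SUCH_OBJECT,
	SK_WERR_DS_BUSY,
	SK_WERR_DS_OBJ_CLASS_VIOLATION
};

struct sketch_err_map {
	int ldb_err;
	enum sketch_werror werr;
};

/* One row per LDB error we want to surface to the DRS caller. */
static const struct sketch_err_map sketch_map[] = {
	{ SK_LDB_SUCCESS,                    SK_WERR_OK },
	{ SK_LDB_ERR_OPERATIONS_ERROR,       SK_WERR_DS_OPERATIONS_ERROR },
	{ SK_LDB_ERR_CONSTRAINT_VIOLATION,   SK_WERR_DS_CONSTRAINT_VIOLATION },
	{ SK_LDB_ERR_NO_SUCH_OBJECT,         SK_WERR_DS_NO_SUCH_OBJECT },
	{ SK_LDB_ERR_BUSY,                   SK_WERR_DS_BUSY },
	{ SK_LDB_ERR_OBJECT_CLASS_VIOLATION, SK_WERR_DS_OBJ_CLASS_VIOLATION }
};

/*
 * Look up the DS error for an LDB result code.  Anything not in the
 * table falls back to the generic operations error (an assumption of
 * this sketch; the real fallback would need discussion).
 */
enum sketch_werror sketch_ldb_to_werror(int ldb_err)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_map) / sizeof(sketch_map[0]); i++) {
		if (sketch_map[i].ldb_err == ldb_err) {
			return sketch_map[i].werr;
		}
	}
	return SK_WERR_DS_OPERATIONS_ERROR;
}
```

With something along these lines, the failure path above could return
the mapped code rather than WERR_FOOBAR, which would give the caller a
chance to report the specific cause.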

I do look forward to any more clues that would help us close this down
forever, rather than just lock up a replication partner. 

Thanks,

Andrew Bartlett

-- 
Andrew Bartlett
http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba