patch set for kcc topology comparison

Wed Oct 5 00:05:42 MDT 2011

Hi Dave,

 > The patch appears at:
 >         https://github.com/wimberosa/samba/tree/kcc-compare
 > 
 > Review is requested.

thanks!

 > The patch does not include the newer intra-site topology (as that
 > still has pending work in my sandbox) but this is needed as a step
 > toward verifying ANY topology algorithm.

being able to verify our topology algorithm is great, but I don't
think that this patch is the best approach. I think we're going to
need something a bit more flexible.

With this patch, when I run it against one of my windows DCs I get
this:

$ bin/samba-tool drs kcccmp w2k8r2d.t2.bludom.tridgell.net -Uadministrator@T2.BLUDOM.TRIDGELL.NET%p@ssw0rd
Mismatch! repsFrom flags for
        CN=Schema,CN=Configuration,DC=bludom,DC=tridgell,DC=net
        43c78d1f-d48d-481c-bc93-2fe579f249cb._msdcs.bludom.tridgell.net
Mismatch! repsFrom flags for
        CN=Configuration,DC=bludom,DC=tridgell,DC=net
        43c78d1f-d48d-481c-bc93-2fe579f249cb._msdcs.bludom.tridgell.net
KCC compare successful

That output doesn't actually tell me very much. It says that our kcc
topology generator calculates different repsFrom replica_flags from
the ones that Windows has calculated, but it doesn't tell me how the
flags differ, and it doesn't tell me if there are any other partitions
that the windows box thinks should be replicated but that our kcc
doesn't know of (in this case there is one, as this windows DC is the
master for a subdomain NC, and our kcc doesn't seem to understand the
need to replicate that with other GC servers).
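
To illustrate the kind of output I mean for the flags, here is a
minimal sketch that names the bits that differ between two
replica_flags values. The three flag values are as I recall them from
MS-DRSR and drsuapi.idl, so check them before relying on this, and
the function names are just made up for the example.

```python
# Illustrative subset of the DRS replica flag bits. Values are as recalled
# from MS-DRSR / drsuapi.idl and should be verified; the table is incomplete.
REPLICA_FLAGS = {
    0x00000010: "DRSUAPI_DRS_WRIT_REP",
    0x00000020: "DRSUAPI_DRS_INIT_SYNC",
    0x00000040: "DRSUAPI_DRS_PER_SYNC",
}


def flag_names(bits):
    """Return the names of the known flags set in bits, plus any leftovers."""
    names = []
    remaining = bits
    for value, name in sorted(REPLICA_FLAGS.items()):
        if bits & value:
            names.append(name)
            remaining &= ~value
    if remaining:
        # Bits we have no name for in the table above.
        names.append("unknown(0x%08x)" % remaining)
    return names


def describe_flag_mismatch(windows_flags, samba_flags):
    """Describe which flags are set on only one side (hypothetical helper)."""
    only_windows = windows_flags & ~samba_flags
    only_samba = samba_flags & ~windows_flags
    lines = []
    if only_windows:
        lines.append("set only by Windows: " + ", ".join(flag_names(only_windows)))
    if only_samba:
        lines.append("set only by Samba: " + ", ".join(flag_names(only_samba)))
    return lines


if __name__ == "__main__":
    for line in describe_flag_mismatch(0x70, 0x30):
        print(line)
```

Output along these lines would at least say which side thinks a
partition is writable or needs periodic sync, rather than just
"Mismatch!".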

This isn't really the main problem though. The tool uses an ldap
connection to the other DC, and then directly produces its 'Mismatch'
output on the console. Imagine I was a Samba end user and you were the
Samba developer trying to track down the problem with our kcc. Asking
the user to run that command and sending you the output will tell you
that something is wrong with our kcc algorithm, but I don't think it
will give you the information you need to fix it.

In that scenario I think what you would ask the user to do is to run
a few commands like this:

 bin/ldbsearch -H ldap://w2k8r2d.t2.bludom.tridgell.net -Uadministrator@BLUDOM.TRIDGELL.NET%p@ssw0rd -b CN=Configuration,DC=bludom,DC=tridgell,DC=net '(|(objectclass=ntdsdsa)(objectclass=crossref))'
 bin/ldbsearch -H ldap://w2k8r2d.t2.bludom.tridgell.net -Uadministrator@BLUDOM.TRIDGELL.NET%p@ssw0rd -s base -b CN=Configuration,DC=bludom,DC=tridgell,DC=net repsFrom
 bin/ldbsearch -H ldap://w2k8r2d.t2.bludom.tridgell.net -Uadministrator@BLUDOM.TRIDGELL.NET%p@ssw0rd -s base -b CN=Configuration,DC=bludom,DC=tridgell,DC=net repsFrom
 bin/ldbsearch -H ldap://w2k8r2d.t2.bludom.tridgell.net:3268 -Uadministrator@BLUDOM.TRIDGELL.NET%p@ssw0rd -s base -b DC=bludom,DC=tridgell,DC=net repsFrom
 bin/ldbsearch -H ldap://w2k8r2d.t2.bludom.tridgell.net -Uadministrator@BLUDOM.TRIDGELL.NET%p@ssw0rd -s base -b CN=Schema,CN=Configuration,DC=bludom,DC=tridgell,DC=net repsFrom

I think that would give the developer all of the information that goes
into the kcc algorithm, plus the result that this DC has
calculated. The developer could potentially use that to step through
the Samba kcc algorithm without having access to the user's machines,
and diagnose/fix any discrepancies.

So what I think we need is a tool which we can use to capture all the
information that goes into the kcc algorithm, and save it in a form
that we can use as input to our own kcc code. I think the simplest
solution is a tool that does the right ldap searches (ie. automating
the searches I list above) and then saves that in its normal ldif
form. Then we need a corresponding tool that reads that ldif, and
loads it as a set of ldb messages. Those ldb messages would then be
passed to our kcc algorithm, which would produce a set of repsFrom
records. A comparison tool that compares the resulting repsFrom
records to the ones from the dump tool would then complete the tool
set, allowing confirmation that the algorithms match, or showing in
what way they don't match.
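
As a rough idea of the "read the ldif back in" step, here is a small
sketch using only the Python standard library. It handles just the
subset of LDIF that ldbsearch output normally uses (comments, folded
lines, and base64 "::" values such as repsFrom), and parse_ldif is a
made-up name. A real tool would more likely use ldb's own ldif parser
to get proper ldb messages.

```python
import base64


def parse_ldif(text):
    """Parse a small subset of LDIF into a list of {attribute: [values]} dicts.

    Attribute names are lower-cased and all values are returned as bytes.
    """
    # Unfold continuation lines: a line starting with a single space
    # continues the previous (non-blank) line.
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines and lines[-1]:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)

    records = []
    current = {}
    for line in lines:
        if line.startswith("#"):
            continue  # comments such as "# record 1"
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        attr, sep, rest = line.partition(":")
        if not sep:
            continue  # not an "attr: value" line, ignore it in this sketch
        if rest.startswith(":"):
            # "attr:: base64" form, used for binary values like repsFrom
            value = base64.b64decode(rest[1:].strip())
        else:
            value = rest.strip().encode("utf-8")
        current.setdefault(attr.lower(), []).append(value)
    if current:
        records.append(current)
    return records
```

The resulting list of dicts stands in for the set of ldb messages that
would be handed to the kcc core.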

To support this, what we'd need in the kcc code is:

 - change the core of the kcc algorithm to take two ldb_result
   structures, one for the ntdsdsa objects, and one for the crossref
   objects. It would also need to take the NTDS GUID of the DC that
   the kcc is being run for (identifying which DC we are trying to
   generate repsFrom records for)

 - the output would be a set of repsFrom attributes, probably also
   encapsulated as a ldb_result (just as a convenient structure that
   holds a set of ldb_message structures).

 - the main kcc would do the searches, then call the kcc core, and
   then would compare and apply the repsFrom attributes as needed,
   taking care to preserve things like the success counts, error codes
   etc that shouldn't change when the kcc runs

 - the dump tool would just be a simple python script (maybe
   source4/scripting/devel/kcc_dump.py ?) which we would use to
   capture the inputs to the kcc algorithm

 - the comparison tool (maybe source4/scripting/devel/kcc_compare.py
   ?) would take the ldif output from the kcc_dump.py and would run
   our kcc algorithm and display any discrepancies
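
The comparison step in that last item could look something like the
sketch below. It assumes the repsFrom blobs from both sides have
already been unpacked into a dict keyed by (NC DN, source DSA GUID)
with the replica flags as the value; the unpacking itself is not
shown, and compare_reps_from is a hypothetical name.

```python
def compare_reps_from(expected, computed):
    """Compare two repsFrom sets and return a list of human-readable problems.

    Both arguments map (nc_dn, source_dsa_guid) -> replica_flags.
    'expected' is what the dumped DC holds, 'computed' is what our kcc
    produced from the same inputs.
    """
    problems = []
    for key in sorted(set(expected) | set(computed)):
        nc_dn, guid = key
        if key not in computed:
            # The dumped DC replicates this NC from this source, we do not.
            problems.append("missing: %s from %s" % (nc_dn, guid))
        elif key not in expected:
            # We generated a link the dumped DC does not have.
            problems.append("unexpected: %s from %s" % (nc_dn, guid))
        elif expected[key] != computed[key]:
            problems.append(
                "flags differ: %s from %s: expected 0x%08x, got 0x%08x"
                % (nc_dn, guid, expected[key], computed[key])
            )
    return problems
```

An empty list means the two algorithms agree for that dump. Anything
else says which partition and which source DSA to look at, including
partitions that our kcc missed entirely.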

Once we have these tools we could then include in the source tree a
set of captured dumps of complex domains, which we could then compare
as part of make test. Putting them in source4/dsdb/kcc/test/ would
make sense I think. When we find a domain where we get things wrong,
we could then add the failing example into that test set, so that we
build up a good set of complex tests for the kcc.
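
Running the captured dumps as a test could then be as simple as the
sketch below. The ".ldif" naming convention and the check_dump
callable (which would load a dump, run our kcc on it and return a
list of discrepancies) are assumptions for illustration, not existing
code.

```python
import os


def find_failing_dumps(dump_dir, check_dump):
    """Run check_dump over every .ldif file in dump_dir.

    check_dump takes the text of one dump and returns a list of problems
    (empty if our kcc matches). Returns {filename: problems} for the
    dumps that failed.
    """
    failures = {}
    for name in sorted(os.listdir(dump_dir)):
        if not name.endswith(".ldif"):
            continue  # skip READMEs and other non-dump files
        path = os.path.join(dump_dir, name)
        with open(path, "r", encoding="utf-8") as f:
            problems = check_dump(f.read())
        if problems:
            failures[name] = problems
    return failures
```

Adding a new failing domain to the test set is then just a matter of
dropping its dump file into the directory.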

I think the kcc algorithms are going to end up being pretty complex,
especially once we take account of subdomain and inter-site issues,
and having a good toolset that allows us to test our algorithms will
be pretty important. If we can do something like what I've described
above then I think we'll be able to be confident we've got it right.

This all assumes of course that the searches I've suggested
(ie. ntdsdsa and crossref objects) are enough input for the kcc
algorithm. I think they are, but if it turns out we need additional
information then we will need to expand what search output we include
in the dump.

I also have a few minor nit-pick comments on the patch that may be
useful for future work. 

 - we generally prefer several small patches that just do one thing
   rather than one large patch, as it makes it much easier to
   review. In this case I think 4 patches would have been better:

    1) split up and refactoring of the kcc code
    2) adding the "check mode" flags and support
    3) adding pykcc
    4) using pykcc in samba-tool

 - I would prefer that our server core code doesn't have printf lines in it
   that produce output that is meant for users. DEBUG() calls are
   fine, but those are for debugging the code, not displaying
   output. So it would have been better to have the "display what is
   wrong" logic separated out from the server logic (with the
   different approach I've suggested above, the "display what is
   wrong" logic would be in the kcc_compare.py script)

 - we try to use C99 types where available in Samba, so "unsigned int"
   or "uint32_t" not "uint"

Cheers, Tridge