[SCM] Samba Shared Repository - branch master updated

Tim Beale timbeale at samba.org
Mon Nov 5 02:44:02 UTC 2018


The branch, master has been updated
       via  3338a3e traffic: Machine accounts were generated as critical objects
       via  be51b51 traffic_replay: Generate machine accounts as well as users
       via  1906312 traffic_replay: Improve user generation debug
       via  71c6641 traffic_replay: Convert print() to logger.info()
       via  32e5822 traffic_replay: Write group memberships once per group
       via  a29ee3a traffic_replay: Re-organize assignments to be group-based
       via  5ad7fc7 traffic_replay: Prevent users having 1000+ memberOf links
       via  fdd7540 traffic_replay: Change user distribution to use Pareto Distribution
       via  898e6b4 traffic_replay: Improve assign_groups() performance with large domains
       via  18740ec traffic_replay: Split out random group membership generation logic
       via  e3e84b0 traffic_replay: Add helper class for group assignments
      from  7dd3585 selftest: Run smb2.delete-on-close-perms also with "delete readonly = yes"

https://git.samba.org/?p=samba.git;a=shortlog;h=master


- Log -----------------------------------------------------------------
commit 3338a3e257fa9f285ae639d6ac382e3e234be90e
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 30 16:14:33 2018 +1300

    traffic: Machine accounts were generated as critical objects
    
    Due to the userAccountControl flags we were specifying, the machine
    accounts were all created as critical objects. When trying to populate
    1000s of machine accounts in a DB, this makes replication unnecessarily
    slow (because it has to replicate them all twice).
    
    This patch changes it so when we're just creating machine accounts for
    the purpose of populating a semi-realistic DB, we just use the default
    WORKSTATION_TRUST_ACCOUNT flag.
    
    Note that for the accounts used for traffic-replay, we apparently need
    the existing flags in order for the DC to accept certain requests.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
    
    Autobuild-User(master): Tim Beale <timbeale at samba.org>
    Autobuild-Date(master): Mon Nov  5 03:43:24 CET 2018 on sn-devel-144

commit be51b51263a460e9790b8a210914a68499af7953
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 23 11:16:31 2018 +1300

    traffic_replay: Generate machine accounts as well as users
    
    Currently the tool only generates the machine accounts needed for
    traffic generation. However, this isn't realistic if we're trying to use
    the tool to generate users to simulate a large network.
    
    This patch generates machine accounts along with the user accounts.
    Note we assume there will be more computer accounts than users in a real
    network (e.g. work laptops, servers, etc), so we generate slightly more
    computer accounts.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

commit 1906312c097812fbdca1e04f6fe70f5af6bbd596
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 23 10:46:17 2018 +1300

    traffic_replay: Improve user generation debug
    
    When creating 1000s of users you currently get a lot of debug output,
    but at the same time you have no idea how far through creating the
    users you actually are.
    
    Instead of logging every single user account that's created, log every
    50th (as well as how far through the overall generation we are).
    
    Logger already includes timestamps, so we can remove generating the
    timestamp diff manually. User creation is the slowest operation - adding
    groups/memberships is much faster, so we don't need to log as
    frequently.
    
    Note that there is a usability trade-off on how frequently we log
    depending on whether the user is using the slower (but more common)
    method of going via LDAP, vs the much faster (but more obscure) method
    of writing directly to sam.ldb with ldb:nosync=true. In my tests, we end
    up logging every ~30-ish secs with LDAP, and every ~3 seconds with
    direct file writes.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

commit 71c66419bb243605ff240e0b7cd5d5a32ba2441f
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 23 10:24:51 2018 +1300

    traffic_replay: Convert print() to logger.info()
    
    Using logger is more helpful here because it includes timestamps, so we
    can see how long things are taking. It's also more consistent with the
    rest of the traffic_replay logging.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

commit 32e58227cd10d9a91c349433fc20bd0cb36869f1
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Thu Nov 1 09:42:33 2018 +1300

    traffic_replay: Write group memberships once per group
    
    Each user-group membership was being written to the DB in a single
    operation. With large numbers of users (e.g. 10,000 in average 15 groups
    each), this becomes a lot of operations (e.g. 150,000). This patch
    reworks the code so that we write the memberships for a group in
    one operation. E.g. instead of 150,000 DB operations, we might make
    1,500. This makes writing the group memberships several times
    faster.
    
    Note that there is a performance vs memory tradeoff. When we hit
    10,000+ members in a group, memory-usage in the underlying DB modify
    operation becomes very inefficient/costly. So we avoid potential memory
    usage problems by writing no more than 1,000 users to a group at once.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
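The batching idea above can be sketched as follows. This is a minimal illustration, not the actual traffic_replay code; the helper name and sizes are made up for the example:

```python
# Sketch: write group members in batches of at most 1000, so that a
# single group modify never carries an unbounded member list (very
# large member lists make the underlying DB modify memory-hungry).

def chunked(members, chunk_size=1000):
    """Yield successive batches of at most chunk_size members."""
    for start in range(0, len(members), chunk_size):
        yield members[start:start + chunk_size]

users = ["STGU-1-%d" % i for i in range(2500)]
batches = list(chunked(users))
# 2500 users split into batches of 1000, 1000 and 500
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each batch then becomes one DB modify, instead of one modify per member.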

commit a29ee3a7458de89ae548077013c31cc08cb747f9
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Wed Oct 31 16:50:27 2018 +1300

    traffic_replay: Re-organize assignments to be group-based
    
    We can speed up writing the group memberships by adding multiple users
    to a group in a single DB modify operation.
    
    To do this, we first need to reorganize the assignments so instead
    of being a set of tuples, it's a dictionary where key=group and
    value=list-of-users-in-group.
    
    add_users_to_groups() now iterates through the users/groups slightly
    differently, but mostly it's just indentation changes. We haven't
    changed the number of DB operations yet - we'll do that in the next
    patch.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
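The reorganization described above can be sketched like this. The variable names are illustrative, not taken from traffic_replay; the point is only the shape change from a set of tuples to a per-group dictionary:

```python
# Sketch: turn a set of (user, group) tuples into a dict keyed by
# group, so that all members of a group can later be written to the DB
# in one modify operation per group.
from collections import defaultdict

assignments = {(1, 10), (2, 10), (3, 11)}

by_group = defaultdict(list)
for user, group in assignments:
    by_group[group].append(user)

print(sorted(by_group[10]))  # [1, 2]
print(by_group[11])          # [3]
```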

commit 5ad7fc73355604daafddcf074f6b998a43aac9f7
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 16 16:01:25 2018 +1300

    traffic_replay: Prevent users having 1000+ memberOf links
    
    When adding 10,000 users, one user would end up in over 1000 groups.
    With 100,000 users, it would be more like 10,000 groups. While it makes
    sense to have groups with large numbers of users, having a single user
    in 1000s of groups is probably less realistic.
    
    This patch changes the shape of the Pareto distribution that we use to
    assign users to groups. The aim is to cap users at belonging to at most
    ~500 groups. Increasing the shape of the Pareto distribution pushes the
    user assignments closer to the average, so the tail (users belonging to
    lots of groups) is not so large.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

commit fdd75407afdde4db6177a981c2d91f2b113e2a19
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Tue Oct 16 10:57:29 2018 +1300

    traffic_replay: Change user distribution to use Pareto Distribution
    
    The probability we were previously assigning to users roughly
    approximated the Pareto Distribution (with shape=1.0). This means the
    code now uses a documented algorithm (i.e. one with an explanation on
    Wikipedia). It also allows us to vary the distribution by changing the
    shape parameter.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
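A hedged sketch of this weighting approach, using the stdlib Pareto generator (random.paretovariate). The function name and shape value here are illustrative, not the exact traffic_replay code:

```python
# Sketch: one Pareto-distributed weight per user. A few users get very
# large weights (so they end up in many groups), while the bulk of
# users get weights near the minimum (so they end up in few groups).
# A larger shape parameter shrinks the heavy tail.
import random

def pareto_weights(num_users, shape=1.75):
    """Return a Pareto-distributed weight for each user."""
    return [random.paretovariate(shape) for _ in range(num_users)]

random.seed(1)
weights = pareto_weights(10000, shape=1.75)
# paretovariate samples are always >= 1.0
print(min(weights) >= 1.0)  # True
```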

commit 898e6b4332e4641ed8377ff2db398a295c37cebf
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Mon Oct 15 16:24:00 2018 +1300

    traffic_replay: Improve assign_groups() performance with large domains
    
    When assigning 10,000 users to 15 groups each (on average),
    assign_groups() would take over 30 seconds. This did not include any DB
    operations whatsoever. This patch improves things, so that it takes less
    than a second in the same situation.
    
    The problem was the code was looping ~23 million times where the
    'random.random() < probability * 10000' condition was not met. This is
    because individual group/user probabilities get lower as the number of
    groups/users increases, so with large numbers of users, most of the
    time the calculated probability was very small and didn't meet the
    threshold.
    
    This patch changes it so we can select a user/group in one go, avoiding
    the need to loop multiple times.
    
    Basically we distribute the users (or groups) between 0.0 and 1.0, so
    that each user has their own 'slice', and this slice is proportional to
    their weighted probability. random.random() generates a value between
    0.0 and 1.0, so we can use this to pick a 'slice' (or rather, as an
    index into the list, via bisect.bisect()). Users/groups with
    larger probabilities end up with larger slices, so are more likely to
    get picked.
    
    The end result is roughly the same distribution as before, although the
    first 10 or so users/groups seem to get picked more frequently, so the
    weighted-probability calculations may need tweaking some more.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>
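The one-shot weighted pick described above can be sketched like this (a minimal standalone illustration; names are not the exact traffic_replay implementation):

```python
# Sketch: build a cumulative distribution over 0.0..1.0 from per-item
# weights, then use random.random() plus bisect to pick a weighted
# random index in one step, with no rejection loop.
import bisect
import random

def cumulative_distribution(weights):
    """Convert weights into cumulative values spread over 0.0..1.0."""
    total = sum(weights)
    dist = []
    cumulative = 0.0
    for w in weights:
        cumulative += w
        dist.append(cumulative / total)
    return dist

def pick(dist):
    """Weighted random index: items with bigger slices win more often."""
    return bisect.bisect(dist, random.random())

weights = [4.0, 2.0, 1.0, 1.0]   # item 0 holds half the total weight
dist = cumulative_distribution(weights)
print(dist)  # [0.5, 0.75, 0.875, 1.0]
idx = pick(dist)
print(0 <= idx < len(weights))  # True
```

Every call to pick() returns a valid index, so the caller never has to retry, which is where the speedup over the old rejection-sampling loop comes from.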

commit 18740ec0dd5c0ed59fa03b2d9d0d34ea11436b00
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Wed Oct 17 12:54:03 2018 +1300

    traffic_replay: Split out random group membership generation logic
    
    This doesn't change functionality at all. It just moves the probability
    calculations out into separate functions.
    
    We want to tweak the logic/implementation behind this code, but the
    rest of assign_groups() doesn't really care how the underlying
    probabilities are worked out, so long as it gets a suitably random
    user/group membership each time round the loop.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

commit e3e84b0f6dc4964f0ac9576051cc88422e548020
Author: Tim Beale <timbeale at catalyst.net.nz>
Date:   Thu Oct 18 16:36:44 2018 +1300

    traffic_replay: Add helper class for group assignments
    
    Wrap up the group assignment calculations in a helper class. We're going
    to tweak the internals a bit in subsequent patches, but the rest of the
    code doesn't really need to know about these changes.
    
    Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
    Reviewed-by: Douglas Bagnall <douglas.bagnall at catalyst.net.nz>

-----------------------------------------------------------------------

Summary of changes:
 python/samba/emulate/traffic.py | 308 +++++++++++++++++++++++++++-------------
 1 file changed, 212 insertions(+), 96 deletions(-)


Changeset truncated at 500 lines:

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 677ad82..af05163 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -45,13 +45,15 @@ from samba.auth import system_session
 from samba.dsdb import (
     UF_NORMAL_ACCOUNT,
     UF_SERVER_TRUST_ACCOUNT,
-    UF_TRUSTED_FOR_DELEGATION
+    UF_TRUSTED_FOR_DELEGATION,
+    UF_WORKSTATION_TRUST_ACCOUNT
 )
 from samba.dcerpc.misc import SEC_CHAN_BDC
 from samba import gensec
 from samba import sd_utils
 from samba.compat import get_string
 from samba.logger import get_samba_logger
+import bisect
 
 SLEEP_OVERHEAD = 3e-4
 
@@ -1630,6 +1632,8 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             netbios_name = "STGM-%d-%d" % (instance_id, i)
             create_machine_account(ldb, instance_id, netbios_name, password)
             added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u machine accounts" % (added, number))
         except LdbError as e:
             (status, _) = e.args
             if status == 68:
@@ -1637,8 +1641,7 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             else:
                 raise
     if added > 0:
-        print("Added %d new machine accounts" % added,
-              file=sys.stderr)
+        LOGGER.info("Added %d new machine accounts" % added)
 
     added = 0
     for i in range(number, 0, -1):
@@ -1646,6 +1649,9 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             username = "STGU-%d-%d" % (instance_id, i)
             create_user_account(ldb, instance_id, username, password)
             added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u users" % (added, number))
+
         except LdbError as e:
             (status, _) = e.args
             if status == 68:
@@ -1654,28 +1660,32 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
                 raise
 
     if added > 0:
-        print("Added %d new user accounts" % added,
-              file=sys.stderr)
+        LOGGER.info("Added %d new user accounts" % added)
 
 
-def create_machine_account(ldb, instance_id, netbios_name, machinepass):
+def create_machine_account(ldb, instance_id, netbios_name, machinepass,
+                           traffic_account=True):
     """Create a machine account via ldap."""
 
     ou = ou_name(ldb, instance_id)
     dn = "cn=%s,%s" % (netbios_name, ou)
     utf16pw = ('"%s"' % get_string(machinepass)).encode('utf-16-le')
 
-    start = time.time()
+    if traffic_account:
+        # we set these bits for the machine account otherwise the replayed
+        # traffic throws up NT_STATUS_NO_TRUST_SAM_ACCOUNT errors
+        account_controls = str(UF_TRUSTED_FOR_DELEGATION |
+                               UF_SERVER_TRUST_ACCOUNT)
+
+    else:
+        account_controls = str(UF_WORKSTATION_TRUST_ACCOUNT)
+
     ldb.add({
         "dn": dn,
         "objectclass": "computer",
         "sAMAccountName": "%s$" % netbios_name,
-        "userAccountControl":
-            str(UF_TRUSTED_FOR_DELEGATION | UF_SERVER_TRUST_ACCOUNT),
+        "userAccountControl": account_controls,
         "unicodePwd": utf16pw})
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tmachine\t%f\tTrue\t" % (end, duration))
 
 
 def create_user_account(ldb, instance_id, username, userpass):
@@ -1683,7 +1693,6 @@ def create_user_account(ldb, instance_id, username, userpass):
     ou = ou_name(ldb, instance_id)
     user_dn = "cn=%s,%s" % (username, ou)
     utf16pw = ('"%s"' % get_string(userpass)).encode('utf-16-le')
-    start = time.time()
     ldb.add({
         "dn": user_dn,
         "objectclass": "user",
@@ -1696,25 +1705,17 @@ def create_user_account(ldb, instance_id, username, userpass):
     sdutils = sd_utils.SDUtils(ldb)
     sdutils.dacl_add_ace(user_dn, "(A;;WP;;;PS)")
 
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tuser\t%f\tTrue\t" % (end, duration))
-
 
 def create_group(ldb, instance_id, name):
     """Create a group via ldap."""
 
     ou = ou_name(ldb, instance_id)
     dn = "cn=%s,%s" % (name, ou)
-    start = time.time()
     ldb.add({
         "dn": dn,
         "objectclass": "group",
         "sAMAccountName": name,
     })
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tgroup\t%f\tTrue\t" % (end, duration))
 
 
 def user_name(instance_id, i):
@@ -1740,10 +1741,29 @@ def generate_users(ldb, instance_id, number, password):
         if name not in existing_objects:
             create_user_account(ldb, instance_id, name, password)
             users += 1
+            if users % 50 == 0:
+                LOGGER.info("Created %u/%u users" % (users, number))
 
     return users
 
 
+def generate_machine_accounts(ldb, instance_id, number, password):
+    """Add machine accounts to the server"""
+    existing_objects = search_objectclass(ldb, objectclass='computer')
+    added = 0
+    for i in range(number, 0, -1):
+        name = "STGM-%d-%d$" % (instance_id, i)
+        if name not in existing_objects:
+            name = "STGM-%d-%d" % (instance_id, i)
+            create_machine_account(ldb, instance_id, name, password,
+                                   traffic_account=False)
+            added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u machine accounts" % (added, number))
+
+    return added
+
+
 def group_name(instance_id, i):
     """Generate a group name from instance id."""
     return "STGG-%d-%d" % (instance_id, i)
@@ -1758,6 +1778,8 @@ def generate_groups(ldb, instance_id, number):
         if name not in existing_objects:
             create_group(ldb, instance_id, name)
             groups += 1
+            if groups % 1000 == 0:
+                LOGGER.info("Created %u/%u groups" % (groups, number))
 
     return groups
 
@@ -1779,120 +1801,214 @@ def generate_users_and_groups(ldb, instance_id, password,
                               group_memberships):
     """Generate the required users and groups, allocating the users to
        those groups."""
-    assignments = []
+    memberships_added = 0
     groups_added  = 0
 
     create_ou(ldb, instance_id)
 
-    print("Generating dummy user accounts", file=sys.stderr)
+    LOGGER.info("Generating dummy user accounts")
     users_added = generate_users(ldb, instance_id, number_of_users, password)
 
+    # assume there will be some overhang with more computer accounts than users
+    computer_accounts = int(1.25 * number_of_users)
+    LOGGER.info("Generating dummy machine accounts")
+    computers_added = generate_machine_accounts(ldb, instance_id,
+                                                computer_accounts, password)
+
     if number_of_groups > 0:
-        print("Generating dummy groups", file=sys.stderr)
+        LOGGER.info("Generating dummy groups")
         groups_added = generate_groups(ldb, instance_id, number_of_groups)
 
     if group_memberships > 0:
-        print("Assigning users to groups", file=sys.stderr)
-        assignments = assign_groups(number_of_groups,
-                                    groups_added,
-                                    number_of_users,
-                                    users_added,
-                                    group_memberships)
-        print("Adding users to groups", file=sys.stderr)
+        LOGGER.info("Assigning users to groups")
+        assignments = GroupAssignments(number_of_groups,
+                                       groups_added,
+                                       number_of_users,
+                                       users_added,
+                                       group_memberships)
+        LOGGER.info("Adding users to groups")
         add_users_to_groups(ldb, instance_id, assignments)
+        memberships_added = assignments.total()
 
     if (groups_added > 0 and users_added == 0 and
        number_of_groups != groups_added):
-        print("Warning: the added groups will contain no members",
-              file=sys.stderr)
+        LOGGER.warning("The added groups will contain no members")
+
+    LOGGER.info("Added %d users (%d machines), %d groups and %d memberships" %
+                (users_added, computers_added, groups_added,
+                 memberships_added))
+
+
+class GroupAssignments(object):
+    def __init__(self, number_of_groups, groups_added, number_of_users,
+                 users_added, group_memberships):
+
+        self.count = 0
+        self.generate_group_distribution(number_of_groups)
+        self.generate_user_distribution(number_of_users, group_memberships)
+        self.assignments = self.assign_groups(number_of_groups,
+                                              groups_added,
+                                              number_of_users,
+                                              users_added,
+                                              group_memberships)
+
+    def cumulative_distribution(self, weights):
+        # make sure the probabilities conform to a cumulative distribution
+        # spread between 0.0 and 1.0. Dividing by the weighted total gives each
+        # probability a proportional share of 1.0. Higher probabilities get a
+        # bigger share, so are more likely to be picked. We use the cumulative
+        # value, so we can use random.random() as a simple index into the list
+        dist = []
+        total = sum(weights)
+        cumulative = 0.0
+        for probability in weights:
+            cumulative += probability
+            dist.append(cumulative / total)
+        return dist
 
-    print(("Added %d users, %d groups and %d group memberships" %
-           (users_added, groups_added, len(assignments))),
-          file=sys.stderr)
+    def generate_user_distribution(self, num_users, num_memberships):
+        """Probability distribution of a user belonging to a group.
+        """
+        # Assign a weighted probability to each user. Use the Pareto
+        # Distribution so that some users are in a lot of groups, and the
+        # bulk of users are in only a few groups. If we're assigning a large
+        # number of group memberships, use a higher shape. This means slightly
+        # fewer outlying users that are in large numbers of groups. The aim is
+        # to have no users belonging to more than ~500 groups.
+        if num_memberships > 5000000:
+            shape = 3.0
+        elif num_memberships > 2000000:
+            shape = 2.5
+        elif num_memberships > 300000:
+            shape = 2.25
+        else:
+            shape = 1.75
 
+        weights = []
+        for x in range(1, num_users + 1):
+            p = random.paretovariate(shape)
+            weights.append(p)
 
-def assign_groups(number_of_groups,
-                  groups_added,
-                  number_of_users,
-                  users_added,
-                  group_memberships):
-    """Allocate users to groups.
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.user_dist = self.cumulative_distribution(weights)
 
-    The intention is to have a few users that belong to most groups, while
-    the majority of users belong to a few groups.
+    def generate_group_distribution(self, n):
+        """Probability distribution of a group containing a user."""
 
-    A few groups will contain most users, with the remaining only having a
-    few users.
-    """
+        # Assign a weighted probability to each user. Probability decreases
+        # as the group-ID increases
+        weights = []
+        for x in range(1, n + 1):
+            p = 1 / (x**1.3)
+            weights.append(p)
 
-    def generate_user_distribution(n):
-        """Probability distribution of a user belonging to a group.
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.group_dist = self.cumulative_distribution(weights)
+
+    def generate_random_membership(self):
+        """Returns a randomly generated user-group membership"""
+
+        # the list items are cumulative distribution values between 0.0 and
+        # 1.0, which makes random() a handy way to index the list to get a
+        # weighted random user/group. (Here the user/group returned are
+        # zero-based array indexes)
+        user = bisect.bisect(self.user_dist, random.random())
+        group = bisect.bisect(self.group_dist, random.random())
+
+        return user, group
+
+    def users_in_group(self, group):
+        return self.assignments[group]
+
+    def get_groups(self):
+        return self.assignments.keys()
+
+    def assign_groups(self, number_of_groups, groups_added,
+                      number_of_users, users_added, group_memberships):
+        """Allocate users to groups.
+
+        The intention is to have a few users that belong to most groups, while
+        the majority of users belong to a few groups.
+
+        A few groups will contain most users, with the remaining only having a
+        few users.
         """
-        dist = []
-        for x in range(1, n + 1):
-            p = 1 / (x + 0.001)
-            dist.append(p)
-        return dist
 
-    def generate_group_distribution(n):
-        """Probability distribution of a group containing a user."""
-        dist = []
-        for x in range(1, n + 1):
-            p = 1 / (x**1.3)
-            dist.append(p)
-        return dist
+        assignments = set()
+        if group_memberships <= 0:
+            return {}
 
-    assignments = set()
-    if group_memberships <= 0:
-        return assignments
+        # Calculate the number of group menberships required
+        group_memberships = math.ceil(
+            float(group_memberships) *
+            (float(users_added) / float(number_of_users)))
 
-    group_dist = generate_group_distribution(number_of_groups)
-    user_dist  = generate_user_distribution(number_of_users)
+        existing_users  = number_of_users  - users_added  - 1
+        existing_groups = number_of_groups - groups_added - 1
+        while len(assignments) < group_memberships:
+            user, group = self.generate_random_membership()
 
-    # Calculate the number of group menberships required
-    group_memberships = math.ceil(
-        float(group_memberships) *
-        (float(users_added) / float(number_of_users)))
+            if group > existing_groups or user > existing_users:
+                # the + 1 converts the array index to the corresponding
+                # group or user number
+                assignments.add(((user + 1), (group + 1)))
 
-    existing_users  = number_of_users  - users_added  - 1
-    existing_groups = number_of_groups - groups_added - 1
-    while len(assignments) < group_memberships:
-        user        = random.randint(0, number_of_users - 1)
-        group       = random.randint(0, number_of_groups - 1)
-        probability = group_dist[group] * user_dist[user]
+        # convert the set into a dictionary, where key=group, value=list-of-
+        # users-in-group (indexing by group-ID allows us to optimize for
+        # DB membership writes)
+        assignment_dict = defaultdict(list)
+        for (user, group) in assignments:
+            assignment_dict[group].append(user)
+            self.count += 1
 
-        if ((random.random() < probability * 10000) and
-           (group > existing_groups or user > existing_users)):
-            # the + 1 converts the array index to the corresponding
-            # group or user number
-            assignments.add(((user + 1), (group + 1)))
+        return assignment_dict
 
-    return assignments
+    def total(self):
+        return self.count
 
 
 def add_users_to_groups(db, instance_id, assignments):
-    """Add users to their assigned groups.
+    """Takes the assignments of users to groups and applies them to the DB."""
 
-    Takes the list of (group,user) tuples generated by assign_groups and
-    assign the users to their specified groups."""
+    total = assignments.total()
+    count = 0
+    added = 0
+
+    for group in assignments.get_groups():
+        users_in_group = assignments.users_in_group(group)
+        if len(users_in_group) == 0:
+            continue
+
+        # Split up the users into chunks, so we write no more than 1K at a
+        # time. (Minimizing the DB modifies is more efficient, but writing
+        # 10K+ users to a single group becomes inefficient memory-wise)
+        for chunk in range(0, len(users_in_group), 1000):
+            chunk_of_users = users_in_group[chunk:chunk + 1000]
+            add_group_members(db, instance_id, group, chunk_of_users)
+
+            added += len(chunk_of_users)
+            count += 1
+            if count % 50 == 0:
+                LOGGER.info("Added %u/%u memberships" % (added, total))
+
+def add_group_members(db, instance_id, group, users_in_group):
+    """Adds the given users to group specified."""
 
     ou = ou_name(db, instance_id)
 
     def build_dn(name):
         return("cn=%s,%s" % (name, ou))
 
-    for (user, group) in assignments:
-        user_dn  = build_dn(user_name(instance_id, user))
-        group_dn = build_dn(group_name(instance_id, group))
+    group_dn = build_dn(group_name(instance_id, group))
+    m = ldb.Message()
+    m.dn = ldb.Dn(db, group_dn)
 
-        m = ldb.Message()
-        m.dn = ldb.Dn(db, group_dn)
-        m["member"] = ldb.MessageElement(user_dn, ldb.FLAG_MOD_ADD, "member")
-        start = time.time()
-        db.modify(m)
-        end = time.time()
-        duration = end - start
-        LOGGER.info("%f\t0\tadd\tuser\t%f\tTrue\t" % (end, duration))
+    for user in users_in_group:
+        user_dn = build_dn(user_name(instance_id, user))
+        idx = "member-" + str(user)
+        m[idx] = ldb.MessageElement(user_dn, ldb.FLAG_MOD_ADD, "member")
+
+    db.modify(m)
 
 
 def generate_stats(statsdir, timing_file):


-- 
Samba Shared Repository



More information about the samba-cvs mailing list