[PATCH] Improvements to traffic_replay user generation

Mon Oct 29 20:27:59 UTC 2018

Reworked patch #8 to only write 1K group memberships at a time. (Trying
to add 50K users to a group in a single DB operation exploded the memory
usage a wee bit...)

CI pass: https://gitlab.com/catalyst-samba/samba/pipelines/34615505

On 25/10/18 10:43 AM, Tim Beale wrote:
> Resending. First 8 patches are the same as last time. I added some new
> patches (#9-#11) that try to improve the traffic_replay debug for user
> generation. Patch #12 generates machine accounts as well as user
> accounts for the --generate-users-only option.
>
> CI link: https://gitlab.com/catalyst-samba/samba/pipelines/34100460
>
> Review appreciated. Thanks.
>
> On 18/10/18 4:54 PM, Tim Beale via samba-technical wrote:
>> Updated patch-set. I've split out most of the logging changes and will
>> resend those separately.
>>
>> CI link: https://gitlab.com/catalyst-samba/samba/pipelines/33374613
>>
>> On 18/10/18 8:49 AM, Tim Beale via samba-technical wrote:
>>> These patches make some improvements to the traffic_replay tool, when
>>> using the --generate-users-only to populate a large DB (e.g. 10,000+ users).
>>>
>>> The changes allow the tool to generate the users a lot faster, and tries
>>> to improve how users were randomly being assigned to groups.
>>>
>>> CI pass: https://gitlab.com/catalyst-samba/samba/pipelines/33218214
>>>
>>> Review appreciated. Thanks.
>>>
-------------- next part --------------
From 2ca588e5fb7e04cd6e1ecc618b4e4a7345eae53d Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Thu, 11 Oct 2018 14:47:28 +1300
Subject: [PATCH 01/12] traffic_replay: Generate users faster by writing to
 local DB

We can create user accounts much faster if the LDB connection talks
directly to the local sam.ldb file rather than going via LDAP. This
patch allows the 'host' argument to the tool to be a .ldb file (e.g.
"/usr/local/samba/private/sam.ldb") instead of a server name/IP.

In most cases, the traffic_replay tool wants to run on a remote device
(because the point of it is to send traffic to the DC). However, the
--generate-users-only is one case where the tool can be run locally,
directly on the test DC. (The traffic_replay user generation is handy
for standalone testing, because it also handles assigning group
memberships to the generated user accounts).

Note that you also need to use '--option="ldb:nosync = true"' to get
the improvement in performance.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 script/traffic_replay | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/script/traffic_replay b/script/traffic_replay
index 6578c84..3f4ff9b 100755
--- a/script/traffic_replay
+++ b/script/traffic_replay
@@ -30,6 +30,8 @@ from samba import gensec, get_debug_level
 from samba.emulate import traffic
 import samba.getopt as options
 from samba.logger import get_samba_logger
+from samba.samdb import SamDB
+from samba.auth import system_session
 
 
 def print_err(*args, **kwargs):
@@ -306,8 +308,16 @@ def main():
                          opts.number_of_groups)))
         sys.exit(1)
 
+    # Get an LDB connection.
     try:
-        ldb = traffic.openLdb(host, creds, lp)
+        # if we're only adding users, then it's OK to pass a sam.ldb filepath
+        # as the host, which creates the users much faster. In all other cases
+        # we should be connecting to a remote DC
+        if opts.generate_users_only and os.path.isfile(host):
+            ldb = SamDB(url="ldb://{0}".format(host),
+                        session_info=system_session(), lp=lp)
+        else:
+            ldb = traffic.openLdb(host, creds, lp)
     except:
         logger.error(("\nInitial LDAP connection failed! Did you supply "
                       "a DNS host name and the correct credentials?"))
-- 
2.7.4


From 2f2f5c93e8387f5226256560897ff529482d9728 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Wed, 17 Oct 2018 11:26:28 +1300
Subject: [PATCH 02/12] traffic_replay: Change print() to use logger()

This reduces noise, so the messages only come out if you specify
--debug.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 35100ca..677ad82 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1892,7 +1892,7 @@ def add_users_to_groups(db, instance_id, assignments):
         db.modify(m)
         end = time.time()
         duration = end - start
-        print("%f\t0\tadd\tuser\t%f\tTrue\t" % (end, duration))
+        LOGGER.info("%f\t0\tadd\tuser\t%f\tTrue\t" % (end, duration))
 
 
 def generate_stats(statsdir, timing_file):
-- 
2.7.4


From 098d6c0aff9cc115dd10ab192c8500fac3e7cab1 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Thu, 18 Oct 2018 16:36:44 +1300
Subject: [PATCH 03/12] traffic_replay: Add helper class for group assignments

Wrap up the group assignment calculations in a helper class. We're going
to tweak the internals a bit in subsequent patches, but the rest of the
code doesn't really need to know about these changes.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 124 ++++++++++++++++++++++------------------
 1 file changed, 67 insertions(+), 57 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 677ad82..cdff76d 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1779,7 +1779,7 @@ def generate_users_and_groups(ldb, instance_id, password,
                               group_memberships):
     """Generate the required users and groups, allocating the users to
        those groups."""
-    assignments = []
+    memberships_added = 0
     groups_added  = 0
 
     create_ou(ldb, instance_id)
@@ -1793,13 +1793,14 @@ def generate_users_and_groups(ldb, instance_id, password,
 
     if group_memberships > 0:
         print("Assigning users to groups", file=sys.stderr)
-        assignments = assign_groups(number_of_groups,
-                                    groups_added,
-                                    number_of_users,
-                                    users_added,
-                                    group_memberships)
+        assignments = GroupAssignments(number_of_groups,
+                                       groups_added,
+                                       number_of_users,
+                                       users_added,
+                                       group_memberships)
         print("Adding users to groups", file=sys.stderr)
         add_users_to_groups(ldb, instance_id, assignments)
+        memberships_added = assignments.total()
 
     if (groups_added > 0 and users_added == 0 and
        number_of_groups != groups_added):
@@ -1807,67 +1808,76 @@ def generate_users_and_groups(ldb, instance_id, password,
               file=sys.stderr)
 
     print(("Added %d users, %d groups and %d group memberships" %
-           (users_added, groups_added, len(assignments))),
+           (users_added, groups_added, memberships_added)),
           file=sys.stderr)
 
 
-def assign_groups(number_of_groups,
-                  groups_added,
-                  number_of_users,
-                  users_added,
-                  group_memberships):
-    """Allocate users to groups.
+class GroupAssignments(object):
+    def __init__(self, number_of_groups, groups_added, number_of_users,
+                 users_added, group_memberships):
+        self.assignments = self.assign_groups(number_of_groups,
+                                              groups_added,
+                                              number_of_users,
+                                              users_added,
+                                              group_memberships)
 
-    The intention is to have a few users that belong to most groups, while
-    the majority of users belong to a few groups.
+    def assign_groups(self, number_of_groups, groups_added, number_of_users,
+                      users_added, group_memberships):
+        """Allocate users to groups.
 
-    A few groups will contain most users, with the remaining only having a
-    few users.
-    """
+        The intention is to have a few users that belong to most groups, while
+        the majority of users belong to a few groups.
 
-    def generate_user_distribution(n):
-        """Probability distribution of a user belonging to a group.
+        A few groups will contain most users, with the remaining only having a
+        few users.
         """
-        dist = []
-        for x in range(1, n + 1):
-            p = 1 / (x + 0.001)
-            dist.append(p)
-        return dist
-
-    def generate_group_distribution(n):
-        """Probability distribution of a group containing a user."""
-        dist = []
-        for x in range(1, n + 1):
-            p = 1 / (x**1.3)
-            dist.append(p)
-        return dist
-
-    assignments = set()
-    if group_memberships <= 0:
-        return assignments
-
-    group_dist = generate_group_distribution(number_of_groups)
-    user_dist  = generate_user_distribution(number_of_users)
 
-    # Calculate the number of group menberships required
-    group_memberships = math.ceil(
-        float(group_memberships) *
-        (float(users_added) / float(number_of_users)))
-
-    existing_users  = number_of_users  - users_added  - 1
-    existing_groups = number_of_groups - groups_added - 1
-    while len(assignments) < group_memberships:
-        user        = random.randint(0, number_of_users - 1)
-        group       = random.randint(0, number_of_groups - 1)
-        probability = group_dist[group] * user_dist[user]
+        def generate_user_distribution(n):
+            """Probability distribution of a user belonging to a group.
+            """
+            dist = []
+            for x in range(1, n + 1):
+                p = 1 / (x + 0.001)
+                dist.append(p)
+            return dist
+
+        def generate_group_distribution(n):
+            """Probability distribution of a group containing a user."""
+            dist = []
+            for x in range(1, n + 1):
+                p = 1 / (x**1.3)
+                dist.append(p)
+            return dist
+
+        assignments = set()
+        if group_memberships <= 0:
+            return assignments
+
+        group_dist = generate_group_distribution(number_of_groups)
+        user_dist  = generate_user_distribution(number_of_users)
+
+        # Calculate the number of group menberships required
+        group_memberships = math.ceil(
+            float(group_memberships) *
+            (float(users_added) / float(number_of_users)))
+
+        existing_users  = number_of_users  - users_added  - 1
+        existing_groups = number_of_groups - groups_added - 1
+        while len(assignments) < group_memberships:
+            user        = random.randint(0, number_of_users - 1)
+            group       = random.randint(0, number_of_groups - 1)
+            probability = group_dist[group] * user_dist[user]
+
+            if ((random.random() < probability * 10000) and
+               (group > existing_groups or user > existing_users)):
+                # the + 1 converts the array index to the corresponding
+                # group or user number
+                assignments.add(((user + 1), (group + 1)))
 
-        if ((random.random() < probability * 10000) and
-           (group > existing_groups or user > existing_users)):
-            # the + 1 converts the array index to the corresponding
-            # group or user number
-            assignments.add(((user + 1), (group + 1)))
+        return assignments
 
-    return assignments
+    def total(self):
+        return len(self.assignments)
 
 
 def add_users_to_groups(db, instance_id, assignments):
-- 
2.7.4


From 8a54f07459ae7dac974140b358c0df45ea6dd432 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Wed, 17 Oct 2018 12:54:03 +1300
Subject: [PATCH 04/12] traffic_replay: Split out random group membership 
 generation logic

This doesn't change functionality at all. It just moves the probability
calculations out into separate functions.

We want to tweak the logic/implementation behind this code, but the
rest of assign_groups() doesn't really care how the underlying
probabilities are worked out, so long as it gets a suitably random
user/group membership each time round the loop.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 63 +++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index cdff76d..55dadd3 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1815,14 +1815,46 @@ def generate_users_and_groups(ldb, instance_id, password,
 class GroupAssignments(object):
     def __init__(self, number_of_groups, groups_added, number_of_users,
                  users_added, group_memberships):
+
+        self.generate_group_distribution(number_of_groups)
+        self.generate_user_distribution(number_of_users)
         self.assignments = self.assign_groups(number_of_groups,
                                               groups_added,
                                               number_of_users,
                                               users_added,
                                               group_memberships)
 
-    def assign_groups(self, number_of_groups, groups_added, number_of_users,
-                      users_added, group_memberships):
+    def generate_user_distribution(self, n):
+        """Probability distribution of a user belonging to a group.
+        """
+        self.user_dist = []
+        for x in range(1, n + 1):
+            p = 1 / (x + 0.001)
+            self.user_dist.append(p)
+
+        self.num_users = n
+
+    def generate_group_distribution(self, n):
+        """Probability distribution of a group containing a user."""
+        self.group_dist = []
+        for x in range(1, n + 1):
+            p = 1 / (x**1.3)
+            self.group_dist.append(p)
+
+        self.num_groups = n
+
+    def generate_random_membership(self):
+        """Returns a randomly generated user-group membership"""
+        while True:
+            user        = random.randint(0, self.num_users - 1)
+            group       = random.randint(0, self.num_groups - 1)
+            probability = self.group_dist[group] * self.user_dist[user]
+
+            if random.random() < probability * 10000:
+                return user, group
+
+    def assign_groups(self, number_of_groups, groups_added,
+                      number_of_users, users_added, group_memberships):
         """Allocate users to groups.
 
         The intention is to have a few users that belong to most groups, while
@@ -1832,30 +1864,10 @@ class GroupAssignments(object):
         few users.
         """
 
-        def generate_user_distribution(n):
-            """Probability distribution of a user belonging to a group.
-            """
-            dist = []
-            for x in range(1, n + 1):
-                p = 1 / (x + 0.001)
-                dist.append(p)
-            return dist
-
-        def generate_group_distribution(n):
-            """Probability distribution of a group containing a user."""
-            dist = []
-            for x in range(1, n + 1):
-                p = 1 / (x**1.3)
-                dist.append(p)
-            return dist
-
         assignments = set()
         if group_memberships <= 0:
             return assignments
 
-        group_dist = generate_group_distribution(number_of_groups)
-        user_dist  = generate_user_distribution(number_of_users)
-
         # Calculate the number of group menberships required
         group_memberships = math.ceil(
             float(group_memberships) *
@@ -1864,12 +1876,9 @@ class GroupAssignments(object):
         existing_users  = number_of_users  - users_added  - 1
         existing_groups = number_of_groups - groups_added - 1
         while len(assignments) < group_memberships:
-            user        = random.randint(0, number_of_users - 1)
-            group       = random.randint(0, number_of_groups - 1)
-            probability = group_dist[group] * user_dist[user]
+            user, group = self.generate_random_membership()
 
-            if ((random.random() < probability * 10000) and
-               (group > existing_groups or user > existing_users)):
+            if group > existing_groups or user > existing_users:
                 # the + 1 converts the array index to the corresponding
                 # group or user number
                 assignments.add(((user + 1), (group + 1)))
-- 
2.7.4


From 7a8fdd8e38ebf3608a5b8ebac213a5935d54d620 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Mon, 15 Oct 2018 16:24:00 +1300
Subject: [PATCH 05/12] traffic_replay: Improve assign_groups() performance
 with large domains

When assigning 10,000 users to 15 groups each (on average),
assign_groups() would take over 30 seconds. This did not include any DB
operations whatsoever. This patch improves things, so that it takes less
than a second in the same situation.

The problem was the code was looping ~23 million times where the
'random.random() < probability * 10000' condition was not met. The
problem is individual group/user probabilities get lower as the number
of groups/users increases. And so with large numbers of users, most of
the time the calculated probability was very small and didn't meet the
threshold.

This patch changes it so we can select a user/group in one go, avoiding
the need to loop multiple times.

Basically we distribute the users (or groups) between 0.0 and 1.0, so
that each user has their own 'slice', and this slice is proporational to
their weighted probability. random.random() generates a value between
0.0 and 1.0, so we can use this to pick a 'slice' (or rather, we use
this as an index into the list, using .bisect()). Users/groups with
larger probabilities end up with larger slices, so are more likely to
get picked.

The end result is roughly the same distribution as before, although the
first 10 or so user/groups seem to get picked more frequently, so the
weighted-probability calculations may need tweaking some more.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 47 ++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 55dadd3..b79fc28 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -52,6 +52,7 @@ from samba import gensec
 from samba import sd_utils
 from samba.compat import get_string
 from samba.logger import get_samba_logger
+import bisect
 
 SLEEP_OVERHEAD = 3e-4
 
@@ -1824,34 +1825,56 @@ class GroupAssignments(object):
                                               users_added,
                                               group_memberships)
 
+    def cumulative_distribution(self, weights):
+        # make sure the probabilities conform to a cumulative distribution
+        # spread between 0.0 and 1.0. Dividing by the weighted total gives each
+        # probability a proportional share of 1.0. Higher probabilities get a
+        # bigger share, so are more likely to be picked. We use the cumulative
+        # value, so we can use random.random() as a simple index into the list
+        dist = []
+        total = sum(weights)
+        cumulative = 0.0
+        for probability in weights:
+            cumulative += probability
+            dist.append(cumulative / total)
+        return dist
+
     def generate_user_distribution(self, n):
         """Probability distribution of a user belonging to a group.
         """
-        self.user_dist = []
+        # Assign a weighted probability to each user. Probability decreases
+        # as the user-ID increases
+        weights = []
         for x in range(1, n + 1):
             p = 1 / (x + 0.001)
-            self.user_dist.append(p)
+            weights.append(p)
 
-        self.num_users = n
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.user_dist = self.cumulative_distribution(weights)
 
     def generate_group_distribution(self, n):
         """Probability distribution of a group containing a user."""
-        self.group_dist = []
+
+        # Assign a weighted probability to each user. Probability decreases
+        # as the group-ID increases
+        weights = []
         for x in range(1, n + 1):
             p = 1 / (x**1.3)
-            self.group_dist.append(p)
+            weights.append(p)
 
-        self.num_groups = n
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.group_dist = self.cumulative_distribution(weights)
 
     def generate_random_membership(self):
         """Returns a randomly generated user-group membership"""
-        while True:
-            user        = random.randint(0, self.num_users - 1)
-            group       = random.randint(0, self.num_groups - 1)
-            probability = self.group_dist[group] * self.user_dist[user]
 
-            if random.random() < probability * 10000:
-                return user, group
+        # the list items are cumulative distribution values between 0.0 and
+        # 1.0, which makes random() a handy way to index the list to get a
+        # weighted random user/group
+        user = bisect.bisect(self.user_dist, random.random())
+        group = bisect.bisect(self.group_dist, random.random())
+
+        return user, group
 
     def assign_groups(self, number_of_groups, groups_added,
                       number_of_users, users_added, group_memberships):
-- 
2.7.4


From f5914365ef2df2cd0c50cc497f02aee32121d4c3 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 16 Oct 2018 10:57:29 +1300
Subject: [PATCH 06/12] traffic_replay: Change user distribution to use Pareto
 Distribution

The current probability we were assigning to users roughly approximates
the Pareto Distribution (with shape=1.0). This means the code now uses a
documented algorithm (i.e. explanation on Wikipedia). It also allows us
to vary the distribution by changing the shape parameter.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index b79fc28..5ab56b2 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1842,11 +1842,12 @@ class GroupAssignments(object):
     def generate_user_distribution(self, n):
         """Probability distribution of a user belonging to a group.
         """
-        # Assign a weighted probability to each user. Probability decreases
-        # as the user-ID increases
+        # Assign a weighted probability to each user. Use the Pareto
+        # Distribution so that some users are in a lot of groups, and the
+        # bulk of users are in only a few groups
         weights = []
         for x in range(1, n + 1):
-            p = 1 / (x + 0.001)
+            p = random.paretovariate(1.0)
             weights.append(p)
 
         # convert the weights to a cumulative distribution between 0.0 and 1.0
-- 
2.7.4


From de8ac8b5d7f7f7ac0dfac8689217450fd19ebb96 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 16 Oct 2018 16:01:25 +1300
Subject: [PATCH 07/12] traffic_replay: Prevent users having 1000+ memberOf
 links

When adding 10,000 users, one user would end up in over 1000 groups.
With 100,000 users, it would be more like 10,000 groups. While it makes
sense to have groups with large numbers of users, having a single user
in 1000s of groups is probably less realistic.

This patch changes the shape of the Pareto distribution that we use to
assign users to groups. The aim is to cap users at belonging to at most
~500 groups. Increasing the shape of the Pareto distribution pushes the
user assignments so they're closer to the average, and the tail (with
users in lots of groups) is not so large).

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 5ab56b2..04a5cbe 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1818,7 +1818,7 @@ class GroupAssignments(object):
                  users_added, group_memberships):
 
         self.generate_group_distribution(number_of_groups)
-        self.generate_user_distribution(number_of_users)
+        self.generate_user_distribution(number_of_users, group_memberships)
         self.assignments = self.assign_groups(number_of_groups,
                                               groups_added,
                                               number_of_users,
@@ -1839,15 +1839,27 @@ class GroupAssignments(object):
             dist.append(cumulative / total)
         return dist
 
-    def generate_user_distribution(self, n):
+    def generate_user_distribution(self, num_users, num_memberships):
         """Probability distribution of a user belonging to a group.
         """
         # Assign a weighted probability to each user. Use the Pareto
         # Distribution so that some users are in a lot of groups, and the
-        # bulk of users are in only a few groups
+        # bulk of users are in only a few groups. If we're assigning a large
+        # number of group memberships, use a higher shape. This means slightly
+        # fewer outlying users that are in large numbers of groups. The aim is
+        # to have no users belonging to more than ~500 groups.
+        if num_memberships > 5000000:
+            shape = 3.0
+        elif num_memberships > 2000000:
+            shape = 2.5
+        elif num_memberships > 300000:
+            shape = 2.25
+        else:
+            shape = 1.75
+
         weights = []
-        for x in range(1, n + 1):
-            p = random.paretovariate(1.0)
+        for x in range(1, num_users + 1):
+            p = random.paretovariate(shape)
             weights.append(p)
 
         # convert the weights to a cumulative distribution between 0.0 and 1.0
-- 
2.7.4


From c6a98b5fda939b4381eabed303e699773da0b786 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Wed, 17 Oct 2018 13:22:59 +1300
Subject: [PATCH 08/12] traffic_replay: Write group memberships once per group

Each user-group membership was being written to the DB in a single
operation. With large numbers of users (e.g. 10,000 in average 15 groups
each), this becomes a lot of operations (e.g. 150,000). This patch
reworks the code so that we write *all* the memberships for a group in
one operation. E.g. instead of 150,000 DB operations, we might make
1,500. This makes writing the group memberships several times
faster.

To do this, we reorganize the assignments so instead of being a set of
tuples, it's a dictionary where key=group and
value=list-of-users-in-group.

Note that although this makes performance better, when we hit 10,000+
members in a group, memory-usage in the underlying DB modify operation
becomes very inefficient/costly. So we avoid potential memory usage
problems by writing no more than 1,000 users to a group at once.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 79 +++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 26 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 04a5cbe..e6e90af 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1817,13 +1817,17 @@ class GroupAssignments(object):
     def __init__(self, number_of_groups, groups_added, number_of_users,
                  users_added, group_memberships):
 
+        # populate a dictionary of empty sets (group is the dictionary key,
+        # because that optimizes for DB membership writes)
+        self.count = 0
+        self.assignments = {}
+        for i in range(number_of_groups, 0, -1):
+            self.assignments[i] = set()
+
         self.generate_group_distribution(number_of_groups)
         self.generate_user_distribution(number_of_users, group_memberships)
-        self.assignments = self.assign_groups(number_of_groups,
-                                              groups_added,
-                                              number_of_users,
-                                              users_added,
-                                              group_memberships)
+        self.assign_groups(number_of_groups, groups_added, number_of_users,
+                           users_added, group_memberships)
 
     def cumulative_distribution(self, weights):
         # make sure the probabilities conform to a cumulative distribution
@@ -1889,6 +1893,17 @@ class GroupAssignments(object):
 
         return user, group
 
+    def add_assignment(self, user, group):
+        if user not in self.assignments[group]:
+            self.assignments[group].add(user)
+            self.count += 1
+
+    def users_in_group(self, group):
+        return list(self.assignments[group])
+
+    def get_groups(self):
+        return self.assignments.keys()
+
     def assign_groups(self, number_of_groups, groups_added,
                       number_of_users, users_added, group_memberships):
         """Allocate users to groups.
@@ -1900,9 +1915,8 @@ class GroupAssignments(object):
         few users.
         """
 
-        assignments = set()
         if group_memberships <= 0:
-            return assignments
+            return
 
         # Calculate the number of group menberships required
         group_memberships = math.ceil(
@@ -1911,43 +1925,56 @@ class GroupAssignments(object):
 
         existing_users  = number_of_users  - users_added  - 1
         existing_groups = number_of_groups - groups_added - 1
-        while len(assignments) < group_memberships:
+        while self.total() < group_memberships:
             user, group = self.generate_random_membership()
 
             if group > existing_groups or user > existing_users:
                 # the + 1 converts the array index to the corresponding
                 # group or user number
-                assignments.add(((user + 1), (group + 1)))
-
-        return assignments
+                self.add_assignment((user + 1), (group + 1))
 
     def total(self):
-        return len(self.assignments)
+        return self.count
 
 
 def add_users_to_groups(db, instance_id, assignments):
-    """Add users to their assigned groups.
+    """Takes the assignments of users to groups and applies them to the DB."""
+
+    for group in assignments.get_groups():
+        users_in_group = assignments.users_in_group(group)
+        if len(users_in_group) == 0:
+            continue
 
-    Takes the list of (group,user) tuples generated by assign_groups and
-    assign the users to their specified groups."""
+        # Split up the users into chunks, so we write no more than 1K at a
+        # time. (Minimizing the DB modifies is more efficient, but writing
+        # 10K+ users to a single group becomes inefficient memory-wise)
+        for chunk in range(0, len(users_in_group), 1000):
+            chunk_of_users = users_in_group[chunk:chunk + 1000]
+            add_group_members(db, instance_id, group, chunk_of_users)
 
+
+def add_group_members(db, instance_id, group, users_in_group):
+    """Adds the given users to group specified."""
+
+    start = time.time()
     ou = ou_name(db, instance_id)
 
     def build_dn(name):
         return("cn=%s,%s" % (name, ou))
 
-    for (user, group) in assignments:
-        user_dn  = build_dn(user_name(instance_id, user))
-        group_dn = build_dn(group_name(instance_id, group))
+    group_dn = build_dn(group_name(instance_id, group))
+    m = ldb.Message()
+    m.dn = ldb.Dn(db, group_dn)
 
-        m = ldb.Message()
-        m.dn = ldb.Dn(db, group_dn)
-        m["member"] = ldb.MessageElement(user_dn, ldb.FLAG_MOD_ADD, "member")
-        start = time.time()
-        db.modify(m)
-        end = time.time()
-        duration = end - start
-        LOGGER.info("%f\t0\tadd\tuser\t%f\tTrue\t" % (end, duration))
+    for user in users_in_group:
+        user_dn = build_dn(user_name(instance_id, user))
+        idx = "member-" + str(user)
+        m[idx] = ldb.MessageElement(user_dn, ldb.FLAG_MOD_ADD, "member")
+
+    db.modify(m)
+    end = time.time()
+    duration = end - start
+    LOGGER.info("%f\t0\tadd\tuser(s)\t%f\tTrue\t" % (end, duration))
 
 
 def generate_stats(statsdir, timing_file):
-- 
2.7.4


From c14f87c56e284cacca8359ea312b258acda70c3d Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 23 Oct 2018 10:19:38 +1300
Subject: [PATCH 09/12] traffic_replay: logger was ignoring smb.conf log-level

We were trying to access the debug-level (in python C bindings) before
the smb.conf had been loaded and actually set the debug-level. So it
would default to zero, regardless of what was in the smb.conf.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 script/traffic_replay | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/script/traffic_replay b/script/traffic_replay
index 3f4ff9b..617a746 100755
--- a/script/traffic_replay
+++ b/script/traffic_replay
@@ -137,6 +137,7 @@ def main():
         parser.print_usage()
         return
 
+    lp = sambaopts.get_loadparm()
     debuglevel = get_debug_level()
     logger = get_samba_logger(name=__name__,
                               verbose=debuglevel > 3,
@@ -173,7 +174,6 @@ def main():
     if opts.random_seed:
         random.seed(opts.random_seed)
 
-    lp = sambaopts.get_loadparm()
     creds = credopts.get_credentials(lp)
     creds.set_gensec_features(creds.get_gensec_features() | gensec.FEATURE_SEAL)
 
-- 
2.7.4


From 3725e8203c1879c4418f8c39221bee3fbb0580fa Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 23 Oct 2018 10:24:51 +1300
Subject: [PATCH 10/12] traffic_replay: Convert print() to logger.info()

Using logger is more helpful here because it includes timestamps, so we
can see how long things are taking. It's also more consistent with the
rest of the traffic_replay logging.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index e6e90af..0bd881c 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1638,8 +1638,7 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             else:
                 raise
     if added > 0:
-        print("Added %d new machine accounts" % added,
-              file=sys.stderr)
+        LOGGER.info("Added %d new machine accounts" % added)
 
     added = 0
     for i in range(number, 0, -1):
@@ -1655,8 +1654,7 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
                 raise
 
     if added > 0:
-        print("Added %d new user accounts" % added,
-              file=sys.stderr)
+        LOGGER.info("Added %d new user accounts" % added)
 
 
 def create_machine_account(ldb, instance_id, netbios_name, machinepass):
@@ -1785,32 +1783,30 @@ def generate_users_and_groups(ldb, instance_id, password,
 
     create_ou(ldb, instance_id)
 
-    print("Generating dummy user accounts", file=sys.stderr)
+    LOGGER.info("Generating dummy user accounts")
     users_added = generate_users(ldb, instance_id, number_of_users, password)
 
     if number_of_groups > 0:
-        print("Generating dummy groups", file=sys.stderr)
+        LOGGER.info("Generating dummy groups")
         groups_added = generate_groups(ldb, instance_id, number_of_groups)
 
     if group_memberships > 0:
-        print("Assigning users to groups", file=sys.stderr)
+        LOGGER.info("Assigning users to groups")
         assignments = GroupAssignments(number_of_groups,
                                        groups_added,
                                        number_of_users,
                                        users_added,
                                        group_memberships)
-        print("Adding users to groups", file=sys.stderr)
+        LOGGER.info("Adding users to groups")
         add_users_to_groups(ldb, instance_id, assignments)
         memberships_added = assignments.total()
 
     if (groups_added > 0 and users_added == 0 and
        number_of_groups != groups_added):
-        print("Warning: the added groups will contain no members",
-              file=sys.stderr)
+        LOGGER.warning("The added groups will contain no members")
 
-    print(("Added %d users, %d groups and %d group memberships" %
-           (users_added, groups_added, memberships_added)),
-          file=sys.stderr)
+    LOGGER.info("Added %d users, %d groups and %d group memberships" %
+                (users_added, groups_added, memberships_added))
 
 
 class GroupAssignments(object):
-- 
2.7.4


From d50bbbcc6c6ada4e4ad6ce80e3003b5614923c47 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 23 Oct 2018 10:46:17 +1300
Subject: [PATCH 11/12] traffic_replay: Improve user generation debug

When creating 1000s of users you currently get a lot of debug, but at
the same time you have no idea how far through creating the users you
actually are.

Instead of logging every single user account that's created, log every
50th (as well as how far through the overall generation we are).

Logger already includes timestamps, so we can remove generating the
timestamp diff manually. User creation is the slowest operation - adding
groups/memberships is much faster, so we don't need to log as
frequently.

Note that there is a usability trade-off on how frequently we log
depending on whether the user is using the slower (but more common)
method of going via LDAP, vs the much faster (but more obscure) method
of writing directly to sam.ldb with ldb:nosync=true. In my tests, we end
up logging every ~30-ish secs with LDAP, and every ~3 seconds with
direct file writes.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 0bd881c..3283d17 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1631,6 +1631,8 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             netbios_name = "STGM-%d-%d" % (instance_id, i)
             create_machine_account(ldb, instance_id, netbios_name, password)
             added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u machine accounts" % (added, number))
         except LdbError as e:
             (status, _) = e.args
             if status == 68:
@@ -1646,6 +1648,9 @@ def generate_traffic_accounts(ldb, instance_id, number, password):
             username = "STGU-%d-%d" % (instance_id, i)
             create_user_account(ldb, instance_id, username, password)
             added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u users" % (added, number))
+
         except LdbError as e:
             (status, _) = e.args
             if status == 68:
@@ -1664,7 +1669,6 @@ def create_machine_account(ldb, instance_id, netbios_name, machinepass):
     dn = "cn=%s,%s" % (netbios_name, ou)
     utf16pw = ('"%s"' % get_string(machinepass)).encode('utf-16-le')
 
-    start = time.time()
     ldb.add({
         "dn": dn,
         "objectclass": "computer",
@@ -1672,9 +1676,6 @@ def create_machine_account(ldb, instance_id, netbios_name, machinepass):
         "userAccountControl":
             str(UF_TRUSTED_FOR_DELEGATION | UF_SERVER_TRUST_ACCOUNT),
         "unicodePwd": utf16pw})
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tmachine\t%f\tTrue\t" % (end, duration))
 
 
 def create_user_account(ldb, instance_id, username, userpass):
@@ -1682,7 +1683,6 @@ def create_user_account(ldb, instance_id, username, userpass):
     ou = ou_name(ldb, instance_id)
     user_dn = "cn=%s,%s" % (username, ou)
     utf16pw = ('"%s"' % get_string(userpass)).encode('utf-16-le')
-    start = time.time()
     ldb.add({
         "dn": user_dn,
         "objectclass": "user",
@@ -1695,25 +1695,17 @@ def create_user_account(ldb, instance_id, username, userpass):
     sdutils = sd_utils.SDUtils(ldb)
     sdutils.dacl_add_ace(user_dn, "(A;;WP;;;PS)")
 
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tuser\t%f\tTrue\t" % (end, duration))
-
 
 def create_group(ldb, instance_id, name):
     """Create a group via ldap."""
 
     ou = ou_name(ldb, instance_id)
     dn = "cn=%s,%s" % (name, ou)
-    start = time.time()
     ldb.add({
         "dn": dn,
         "objectclass": "group",
         "sAMAccountName": name,
     })
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tcreate\tgroup\t%f\tTrue\t" % (end, duration))
 
 
 def user_name(instance_id, i):
@@ -1739,6 +1731,8 @@ def generate_users(ldb, instance_id, number, password):
         if name not in existing_objects:
             create_user_account(ldb, instance_id, name, password)
             users += 1
+            if users % 50 == 0:
+                LOGGER.info("Created %u/%u users" % (users, number))
 
     return users
 
@@ -1757,6 +1751,8 @@ def generate_groups(ldb, instance_id, number):
         if name not in existing_objects:
             create_group(ldb, instance_id, name)
             groups += 1
+            if groups % 1000 == 0:
+                LOGGER.info("Created %u/%u groups" % (groups, number))
 
     return groups
 
@@ -1936,6 +1932,10 @@ class GroupAssignments(object):
 def add_users_to_groups(db, instance_id, assignments):
     """Takes the assignments of users to groups and applies them to the DB."""
 
+    total = assignments.total()
+    count = 0
+    added = 0
+
     for group in assignments.get_groups():
         users_in_group = assignments.users_in_group(group)
         if len(users_in_group) == 0:
@@ -1948,11 +1948,14 @@ def add_users_to_groups(db, instance_id, assignments):
             chunk_of_users = users_in_group[chunk:chunk + 1000]
             add_group_members(db, instance_id, group, chunk_of_users)
 
+            added += len(chunk_of_users)
+            count += 1
+            if count % 50 == 0:
+                LOGGER.info("Added %u/%u memberships" % (added, total))
 
 def add_group_members(db, instance_id, group, users_in_group):
     """Adds the given users to group specified."""
 
-    start = time.time()
     ou = ou_name(db, instance_id)
 
     def build_dn(name):
@@ -1968,9 +1971,6 @@ def add_group_members(db, instance_id, group, users_in_group):
         m[idx] = ldb.MessageElement(user_dn, ldb.FLAG_MOD_ADD, "member")
 
     db.modify(m)
-    end = time.time()
-    duration = end - start
-    LOGGER.info("%f\t0\tadd\tuser(s)\t%f\tTrue\t" % (end, duration))
 
 
 def generate_stats(statsdir, timing_file):
-- 
2.7.4


From 9f737257e752c2259cdd7a2e58c1319a139b5666 Mon Sep 17 00:00:00 2001
From: Tim Beale <timbeale at catalyst.net.nz>
Date: Tue, 23 Oct 2018 11:16:31 +1300
Subject: [PATCH 12/12] traffic_replay: Generate machine accounts as well as
 users

Currently the tool only generates the machine accounts needed for
traffic generation. However, this isn't realistic if we're trying to use
the tool to generate users to simulate a large network.

This patch generates machine accoutns along with the user accounts.
Note we assume there will be more computer accounts than users in a real
network (e.g. work laptops, servers, etc), so generate slightly more
computer accounts.

Signed-off-by: Tim Beale <timbeale at catalyst.net.nz>
---
 python/samba/emulate/traffic.py | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 3283d17..b45d853 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -1737,6 +1737,22 @@ def generate_users(ldb, instance_id, number, password):
     return users
 
 
+def generate_machine_accounts(ldb, instance_id, number, password):
+    """Add machine accounts to the server"""
+    existing_objects = search_objectclass(ldb, objectclass='computer')
+    added = 0
+    for i in range(number, 0, -1):
+        name = "STGM-%d-%d$" % (instance_id, i)
+        if name not in existing_objects:
+            name = "STGM-%d-%d" % (instance_id, i)
+            create_machine_account(ldb, instance_id, name, password)
+            added += 1
+            if added % 50 == 0:
+                LOGGER.info("Created %u/%u machine accounts" % (added, number))
+
+    return added
+
+
 def group_name(instance_id, i):
     """Generate a group name from instance id."""
     return "STGG-%d-%d" % (instance_id, i)
@@ -1782,6 +1798,12 @@ def generate_users_and_groups(ldb, instance_id, password,
     LOGGER.info("Generating dummy user accounts")
     users_added = generate_users(ldb, instance_id, number_of_users, password)
 
+    # assume there will be some overhang with more computer accounts than users
+    computer_accounts = int(1.25 * number_of_users)
+    LOGGER.info("Generating dummy machine accounts")
+    computers_added = generate_machine_accounts(ldb, instance_id,
+                                                computer_accounts, password)
+
     if number_of_groups > 0:
         LOGGER.info("Generating dummy groups")
         groups_added = generate_groups(ldb, instance_id, number_of_groups)
@@ -1801,8 +1823,9 @@ def generate_users_and_groups(ldb, instance_id, password,
        number_of_groups != groups_added):
         LOGGER.warning("The added groups will contain no members")
 
-    LOGGER.info("Added %d users, %d groups and %d group memberships" %
-                (users_added, groups_added, memberships_added))
+    LOGGER.info("Added %d users (%d machines), %d groups and %d memberships" %
+                (users_added, computers_added, groups_added,
+                 memberships_added))
 
 
 class GroupAssignments(object):
-- 
2.7.4