Fwd: Regression: ldb performance with indexes

Fri Mar 22 16:55:23 UTC 2024

Hi Andreas and Andrew,

 >>>> > Hi,my colleagues discovered a performance issue in libldb:
 >>>> > https://bugzilla.samba.org/show_bug.cgi?id=15590
 >>>> > > > > As soon as you use indexes, ldbadd will be magnitudes
 >> slower than
 >>>> > itwas before.Could some ldb expert please look into it?
 >>>> > > Your subject says a regression. What version is this a
 >>>> regressionagainst?
 >>>> Isn't that obvious from the bug report?
 >>> Here is the short summary:
 >>> $ bash repro.sh 20000 indexesAdded 2 records successfullyAdded
 >> 20000
 >>> records successfully
 >>> On Samba 4.10: 0m01.231sOn Samba 4.19: 1m30.924s (that's 90 times
 >>> slower)
 >>>> > The very nature of a DB index is that it will take time to
 >>>> create,possibly a lot of time, but should make reads faster.
 >>>> Either the DB index doesn't work at all in Samba 4.10 or there
 >> is a
 >>> huge performance problem in Samba 4.19. What is it?
 >>
 >> Thanks, that wasn't written as obviously on the bug, thanks for the
 >> clarification.
 >
 > I used our CentOS 8 Stream CI image for bisecting. You can't bisect
 > easily on a modern Linux Distribution, as the included waf would not
 > have support for newer Python versions like 3.12.
 >
 > In case you want to reproduce it, here is my run:

I'm Andréas from Tranquil IT dev team. Denis and Yohannès asked me this 
week to take a look at the performance issues on large domains, which 
include this issue in the current thread along the mdb large transaction 
issues.

The attached patchset goes through all the tdb and ldb make test.

* LMDB : increase MDB_IDL_LOGN from 16 to 18 to accomodate large nested 
transactions
* tdb : fail-fast when record hash doesn't match expected hash to avoid 
to read/copy the entire record
* ldb : increase DEFAULT_INDEX_CACHE_SIZE from 491 to 8089 to increase 
the number of bucket to have smaller bucket to have faster iteration in 
each buckets in tdb_find

With this patchset we can upgrade large domains (>200k objects) to 
FL2k16 level in approximatly 1 hour instead of 3 days :-)

[root at srvads1-bl1cw ~]# bash repro.sh 20000 indexes Added 2 records 
successfully Added 20000 records successfully real 0m0.536s user 
0m0.798s sys 0m0.105s

Tranquil IT team is expert at deploying Samba-AD in large domains, but 
we are not core devs, so I may have missed something during my debugging 
/ patching session. Don't hesitate to comment and tell me what you think 
about this patchset, if there are some pitfalls that I missed or if the 
style can be improved.


Cheers,

Andréas
-------------- next part --------------
diff -up a/libraries/liblmdb/midl.h b/libraries/liblmdb/midl.h

--- a/libraries/liblmdb/midl.h	2018-03-22 16:23:05.000000000 +0100
+++ b/libraries/liblmdb/midl.h	2024-03-21 11:46:54.616000000 +0100
@@ -56,7 +56,7 @@ typedef MDB_ID *MDB_IDL;
 /* IDL sizes - likely should be even bigger
  *   limiting factors: sizeof(ID), thread stack size
  */
-#define	MDB_IDL_LOGN	16	/* DB_SIZE is 2^16, UM_SIZE is 2^17 */
+#define	MDB_IDL_LOGN	18	/* DB_SIZE is 2^18, UM_SIZE is 2^19 */
 #define MDB_IDL_DB_SIZE		(1<<MDB_IDL_LOGN)
 #define MDB_IDL_UM_SIZE		(1<<(MDB_IDL_LOGN+1))
-------------- next part --------------
diff -ruN a/common/io.c b/common/io.c
--- a/common/io.c	2023-07-18 10:14:54.490091000 +0200
+++ b/common/io.c	2024-03-21 15:49:38.780634784 +0100
@@ -741,6 +741,26 @@
 	return result;
 }
 
+
+/* Check if the record hash match the expected one. If not, move the offset to the next record */
+
+int tdb_rec_ishash(struct tdb_context *tdb, tdb_off_t *off, uint32_t hash) {
+	uint32_t read_hash;
+
+	// Ensure the record is not oob
+	if (tdb_oob(tdb, *off, sizeof(struct tdb_record), 0) != 0)
+		 return -1;
+
+	// Read only full_hash value and compare it to expected hash
+	if (tdb->methods->tdb_read(tdb, *off + offsetof(struct tdb_record, full_hash), &read_hash, sizeof(hash), DOCONV()) == -1)
+		 return -1;
+	if (read_hash == hash)
+		 return 1;
+
+	// Read the next pointer value
+	return tdb->methods->tdb_read(tdb, *off, off, sizeof(tdb_off_t),DOCONV());
+}
+
 /* read/write a record */
 int tdb_rec_read(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec)
 {
diff -ruN a/common/tdb.c b/common/tdb.c
--- a/common/tdb.c	2023-07-18 10:14:54.494091000 +0200
+++ b/common/tdb.c	2024-03-22 09:44:07.060633846 +0100
@@ -133,7 +133,19 @@
 
 	/* keep looking until we find the right record */
 	while (rec_ptr) {
-		bool ok;
+		int is_hash_result = tdb_rec_ishash(tdb, &rec_ptr, hash);
+
+		/* Failed to read the hash/next pointer of this record */
+		if (is_hash_result == -1)
+			return 0;
+		/* Record hash doesn't match expected hash so we fail fast instead of copying the entire record.
+		   rec_ptr already moved to next record by tdb_rec_ishash.
+		   Still need to check the chainwalk to avoid circular search */
+		else if (is_hash_result == 0) {
+			if (!tdb_chainwalk_check(tdb, &chainwalk, rec_ptr))
+				return 0;
+			continue;
+		}
 
 		if (tdb_rec_read(tdb, rec_ptr, r) == -1)
 			return 0;
@@ -147,10 +159,8 @@
		}
		rec_ptr = r->next;
 
-		ok = tdb_chainwalk_check(tdb, &chainwalk, rec_ptr);
-		if (!ok) {
+		if (!tdb_chainwalk_check(tdb, &chainwalk, rec_ptr))
 			return 0;
-		}
 	}
 	tdb->ecode = TDB_ERR_NOEXIST;
 	return 0;
diff -ruN a/common/tdb_private.h b/common/tdb_private.h
--- a/common/tdb_private.h	2023-07-18 10:14:54.494091000 +0200
+++ b/common/tdb_private.h	2024-03-21 15:50:39.244634141 +0100
@@ -322,6 +322,7 @@
 int tdb_lock_record(struct tdb_context *tdb, tdb_off_t off);
 int tdb_unlock_record(struct tdb_context *tdb, tdb_off_t off);
 bool tdb_needs_recovery(struct tdb_context *tdb);
+int tdb_rec_ishash(struct tdb_context *tdb, tdb_off_t *off, uint32_t hash);
 int tdb_rec_read(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec);
 int tdb_rec_write(struct tdb_context *tdb, tdb_off_t offset, struct tdb_record *rec);
 unsigned char *tdb_alloc_read(struct tdb_context *tdb, tdb_off_t offset, tdb_len_t len);

-------------- next part --------------
diff -ruN a/ldb_key_value/ldb_kv.h b/ldb_key_value/ldb_kv.h
--- a/ldb_key_value/ldb_kv.h	2023-07-18 10:14:54.438090800 +0200
+++ b/ldb_key_value/ldb_kv.h	2024-03-21 15:48:57.312635225 +0100
@@ -249,7 +249,7 @@
  * The value chosen gives a prime modulo for the hash table and keeps the
  * tdb memory overhead under 4 kB
  */
-#define DEFAULT_INDEX_CACHE_SIZE 491
+#define DEFAULT_INDEX_CACHE_SIZE 8089 
 
 struct ldb_parse_tree;