talloc: Other minor issues/queries

Thu Oct 22 17:02:08 UTC 2020

> We always have a the parent pointer, but it's left as NULL
> for all but the first child.

Yes, that's how I understood it to work.

> Currently the chunk header is 88 bytes (padded to a 16 byte boundary => 96 bytes)
> on 64bit architectures, which is more than the typical cacheline size of 64 bytes.

Max talloc chunk size is 268435456, so it'd fit comfortably in a uint32_t.
Changing the size field to be a uint32_t and moving to after the flags
field gives an immediate saving of 8 bytes.

With clang 10.0.1

sizeof(struct talloc_chunk) = 88
+ size_t -> uint32_t and re-arranging  = 80
+ removing refs and limits = 64

There's no guarantee the system malloc would return cache line aligned 
blocks of memory, that'd be where the 'aligned' variants of the allocation
functions would be useful.

Current struct talloc_chunk
------------------------------------
talloc: 6827970 ops/sec
talloc_pool: 14650240 ops/sec
malloc: 10992723 ops/sec

talloc: 6966185 ops/sec
talloc_pool: 14945391 ops/sec
malloc: 11628633 ops/sec

talloc: 7064589 ops/sec
talloc_pool: 14970238 ops/sec
malloc: 11319955 ops/sec

talloc-avg: 6952914
talloc_pool-avg: 14855290

32bit size field
-------------------
talloc: 6751118 ops/sec
talloc_pool: 14736652 ops/sec
malloc: 11301125 ops/sec

talloc: 6952915 ops/sec
talloc_pool: 14979351 ops/sec
malloc: 10980855 ops/sec

talloc: 6789687 ops/sec
talloc_pool: 14853362 ops/sec
malloc: 11576574 ops/sec

talloc-avg: 6788595
talloc_pool-avg: 14814459

So there's possibly 1.8% penalty for using a 32bit size field with talloc, 
and no real impact for talloc pools.

This is just on my local machine, so there'll be a lot of random noise.
I wouldn't really expect unaligned accesses to have a huge penalty 
on x86-64.

I can re-run the tests somewhere quieter if there's interest in making
this change.

> Removing the memlimit pointer would shrink it to 80 bytes, but I guess it's to hard to
> make it fit into 64 bytes...
> 
> There's also an alignment gad of 4 bytes after 'unsigned flags',
> so we're currently wasting 12 bytes for each chunk.
> 
> It means that we could change flags to uint64_t or split of the magic into
> a separate uint32_t. Both would allow expanding the flags mask from 0xF to
> more, so we could have flags regarding the a const name, or it a chunk
> was realloc'ed.

Is there any real issue with only having 24bits of magic if 32bit
size fields were implemented?

> We could start with filling the parent pointer on each child first
> on the first realloc of the parent we'll mark the parent with FLAG_REALLOCED
> and the children with FLAG_INDIRECT_PARENT.

Yes, it's still O(n) for realloc, but I'm thinking that reallocing chunks
with large numbers of children is something that's unlikely to be
a common pattern in either Samba, FreeRADIUS or other projects.

> Maybe we could have a FLAG_FIXED_POOL, which avoids the fallback to head memory.
> 
> But I guess we'd need something implemented a bit differently,
> 
> Currently we're very bad at reusing free memory within pools, we don't maintain a free list.

IIRC it's basically when the chunk with the chunk with the highest address
in the pool gets freed, the memory gets reclaimed?

Yes, it's not great. 

> But that's ok for our two main use cases talloc_stackframe_pool()
> and talloc_pooled_object() for both our main goal is to avoid calling malloc/free more than needed.
> But they don't really implement a SLAB allocator.
> https://en.wikipedia.org/wiki/Slab_allocation
> 
> The Linux kernel has a separation of 'struct kmem_cache' and 'mempool_t', see:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/slab.h
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/slab_def.h
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mempool.h
> 
> So there's 'mempool_t *mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)'
> and 'mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)', but that's all
> for fixed size objects.
> 
> Maybe we could abstract the low level allocation (malloc,realloc,free) in talloc and just
> implement the other allocation schemes below the main talloc logic.

Yes, definitely, if the primary goal is speed.  The slab allocation code needn't 
even be in talloc.  Talloc just needs to allow swapping out the default malloc, 
realloc, free functions. It's maybe a 30-40 LoC change. It doesn't really add any 
complexity, or impact performance.

Unfortunately this doesn't solve our immediate need. I'll give it some more thought.

> These are some ideas, but it will be a long road to get there, so I guess we need to
> find incremental mini steps...

Indeed.  

-Arran