CTDB Segfault in a container-based env - Looking for pointers
phlogistonjohn at asynchrono.us
Fri Jul 16 13:35:35 UTC 2021
On Friday, July 16, 2021 3:47:52 AM EDT Michael Adam wrote:
> > On 16. Jul 2021, at 04:43, Amitay Isaacs <amitay at gmail.com> wrote:
> > On Fri, Jul 16, 2021 at 2:13 AM Michael Adam via samba-technical
> > <samba-technical at lists.samba.org> wrote:
> >>> On 15. Jul 2021, at 15:16, John Mulligan via samba-technical
> >>> <samba-technical at lists.samba.org> wrote:
> >>> On Wednesday, July 14, 2021 10:12:46 PM EDT Amitay Isaacs via
> >>> samba-technical wrote:
> >>>> Hi John,
> >>>> CTDB makes certain assumptions that it is running on a normal
> >>>> system. When running CTDB in a container, those assumptions no
> >>>> longer hold and you may get unexpected behaviour.
> >>> First, thanks for replying!
> >>> Sure, I fully expect that. It was similar for smbd/winbind but in those
> >>> cases I was able to tune the environment sufficiently - for example
> >>> they need to run within the same pid namespace to function properly.
> >>> The issue I'm having now is that the segfault isn't mapping to anything
> >>> obvious (yet) that I can change in the environment.
> >>>> One such assumption is that init (in some form) has PID 1 and
> >>>> that the CTDB daemon will never have PID 1. Obviously this is not
> >>>> true in your case. From the logs you can see that the CTDB daemon
> >>>> is started as PID 1. In general, CTDB relies on init (in some
> >>>> form) to start/stop various file services (smb, nfs, etc.) via
> >>>> the event scripts. So, a working init is a requirement for normal
> >>>> operation of CTDB.
> >>> Good point. I'll experiment with giving ctdb a parent process.
> >> Right, if we want to avoid systemd or other beefier systems that are not
> >> made for containers, we can consider “tini”; e.g., Rook is using this.
> >>>> What are you trying to do exactly? You cannot put CTDB in a container
> >>>> on its own without Samba daemons.
> >> Hmm, at least last I checked you can even run ctdb in a “traditional”
> >> non-containerized cluster without any samba daemons. :-)
> > Of course you can. But that doesn't serve any useful purpose. :-)
> That’s a different topic. You wrote that “you cannot”, and I said yes, you
> can. :-)
> >> Maybe you are saying that if you want to run smbd/winbindd on top of
> >> ctdb, then they must run in the same container? I don’t think this is
> >> true either:
> >> We usually have multiple containers in one pod, and the containers within
> >> the pod can communicate just as normal. At least that’s what we did with
> >> the smbd and winbindd daemons: separate containers in one pod.
> > My understanding of containers is limited here, so I don't understand
> > how you can run ctdb and smbd in different containers. Does mutex
> > locking on shared databases work across containers (or different
> > namespaces)? How about unix datagram messaging using pids?
> You are right, in that normally containers do not share these kinds of
> things automatically, but as mentioned by me and in Alexander’s mail, the
> “pod” is the smallest unit that is deployed together in kubernetes. A pod
> is a collection of one or more containers that are seen as a unit on one
> host system. (podman can also directly work on pods, even without
> Kubernetes.) The containers within a pod share the same pid-space, and can
> easily access common files, devices, sockets, etc.
> > If mutex locking on shared databases works across containers, then
> > obviously you can run ctdb and smbd in different containers.
> > If unix datagram messaging works across containers, then obviously you
> > can run smbd and winbindd in different containers.
> Yes, as John and I have demonstrated in our sambaXP presentation
> (samba_operator.pdf, https://www.youtube.com/watch?v=mG-Jxaf8_gw),
> this was already working. The next step was to add ctdb into the picture,
> where John hit additional problems.
Indeed. FWIW when running smbd plus winbind we need to enable a shared PID
namespace. Similarly, when CTDB is added to the mix the PID namespace will
also be shared among smbd, winbind, and ctdbd containers.
Unix domain sockets work by sharing parts of the file system across multiple
containers. For example, when using winbind we share /run/samba/winbindd.
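To make this concrete, a pod that shares the PID namespace and the winbindd
socket directory between containers could be sketched roughly like this (a
minimal illustration; the names, images, and mount paths here are hypothetical,
not our actual manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: samba-example            # hypothetical pod name
spec:
  shareProcessNamespace: true    # one PID namespace for all containers
  volumes:
    - name: samba-run
      emptyDir: {}               # shared directory for winbindd's socket
  containers:
    - name: smbd
      image: example/samba-server   # hypothetical image
      volumeMounts:
        - name: samba-run
          mountPath: /run/samba
    - name: winbind
      image: example/samba-server
      volumeMounts:
        - name: samba-run
          mountPath: /run/samba
```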
> (Spoiler alert: following my suggestion of using tini as an init process
> in the container, John has been able to make ctdb + samba work in the
> container yesterday. He will follow up with details.)
Yes, I was planning on updating the list today, but you scooped me. Using a
minimal init process such as "tini" avoids triggering this problem. I wish I
had tried it sooner, but I guess that's what these threads are for -- pointing
out the things one overlooked. :-)
I also experimented with sharing a PID namespace between a "pause container"
running as pid 1, and ctdb. That also appeared to avoid the segfault, but I did
see zombie processes after terminating ctdbd. Thus I'm planning on using tini
for now and doing additional investigation of the "optimal" way to start ctdbd
later, perhaps gathering additional feedback from this thread.
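For context, the zombies appear because whatever runs as PID 1 inherits
orphaned children and is responsible for reaping them, and a bare "pause"
process doesn't. The core of what an init like tini does can be sketched as
follows (an illustrative sketch, not tini's actual code):

```python
import os
import time

def reap_zombies() -> int:
    """Reap all exited children, as an init process (PID 1) must.

    Returns the number of children reaped."""
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no child processes remain at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped += 1
    return reaped

if __name__ == "__main__":
    # Spawn a child that exits immediately; until someone calls wait()
    # on it, the kernel keeps its entry around as a zombie.
    if os.fork() == 0:
        os._exit(0)
    time.sleep(0.2)  # give the child time to exit
    print(reap_zombies())
```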
> >>> I'm not clear on what you mean by that. My longer term goal is to
> >>> investigate CTDB as part of the HA story for samba in containers (see
> >>> our general effort here). Short term, I just want to run ctdb on
> >>> its own with very few (or no) event scripts just to get tdb
> >>> replication working across multiple nodes in a container based
> >>> environment. Based on my reading of the docs and a tiny bit of the
> >>> code, bringing up smbd/etc is the responsibility of the event scripts
> >> This is not quite true:
> >> Ctdb logically consists of two layers:
> >> (1) the mandatory core: the distributed tdb database and messaging
> >> channel for smbd;
> >> (2) the optional upper layer: resource management (public IPs,
> >> services like smbd, winbindd, etc.)
> >> Ctdb and samba can run together perfectly without #2 as long as someone
> >> takes care of the service management. E.g. it has been done with
> >> pacemaker. In our case, Kubernetes / operators, etc., would provide this
> >> role and we would run ctdb without “CTDB_MANAGES_SAMBA=yes” etc...
> >>> so I think it should be possible to run ctdb on its own like that.
> >>> Any thoughts on adding code to specifically handle the case where the
> >>> callback has already been called, but tevent calls it again?
> >> Right, the crux here seems to be the question whether the tevent-using
> >> code in ctdb is not prepared for the situation that EPOLLHUP is issued,
> >> and if it would be appropriate to just catch that condition (of being
> >> called again).
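(As an aside, the EPOLLHUP behaviour itself is easy to reproduce outside of
ctdb: on Linux the kernel reports EPOLLHUP for a registered descriptor whose
peer has gone away, even when only read events were requested, so a handler
that isn't prepared for this can be invoked again after it already handled
EOF. A minimal illustration, not CTDB's tevent code:)

```python
import os
import select

# A pipe whose write end is closed delivers EPOLLHUP on the read end.
rfd, wfd = os.pipe()

ep = select.epoll()
ep.register(rfd, select.EPOLLIN)  # we only ask for read events

os.close(wfd)  # the "peer" goes away

for fd, mask in ep.poll(timeout=1):
    # EPOLLHUP shows up in the mask even though it was never requested.
    print(fd == rfd, bool(mask & select.EPOLLHUP))

ep.close()
os.close(rfd)
```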
> > Well that's not really the crux here. I know what the real issue is
> > (I did write that code), but I still don't understand the motivation
> > behind running ctdb and smbd in different containers.
> As Alexander explained in his mail, the “microservice” approach is a
> paradigm by which each container should be as simple as possible, ideally
> just encapsulating a single process, and containers interacting with each
> other over network and other interfaces, possibly multiple containers
> bundled into a pod if needed and appropriate. One fundamental idea behind
> this is: if the application / service is comprised of multiple
> components, and each component is isolated in a container image that is as
> minimal as possible, the whole management of the application like scaling
> of individual components etc can all be done by a generic container
> orchestration system, in our case kubernetes. This just works best if each
> container is minimal. Rescheduling a more complex, heavyweight container
> with multiple server components in it is potentially much more disruptive
> and problematic.
> Since we are looking at managing samba as much as possible in a
> kubernetes/container-native way here (see the sambaXP preso), it is natural
> to aim at as much a micro-service approach as possible. We will certainly
> have to do some modifications to Samba / ctdb itself at some point to go
> the last mile, and I am convinced that this will be beneficial to the
> software as such, but of course the first approach is to see how far we can
> get without any modifications.
We mentioned that in the presentation but didn't elaborate on it much as we
were already pressed for time. :-)
In addition to the benefits that isolated-by-default containers provide, we've
been trying to treat the container image like a single application with various
subcommands (think "net" or "samba-tool" or "git" subcommand style). Leaving
aside the long running daemons, we have functionality like populating the samba
config registry from a JSON file, updating AD DNS from a JSON file, and so on.
At the risk of repeating myself we're striving to keep the containers
independent of the orchestration layer so that they can be reused by others.
Yet, at a very high level, this is all driven by the desire to use container
orchestration as a general substrate controlling clusters of physical/virtual
machines. Within that orchestration layer, specific use-cases are deployed: your
webservers, or your nosql db, or a distributed file system, or - as in our case -
your NAS protocol heads.
> But since you know what the real issue is, would you please enlighten us?
> Even if you don’t see a real benefit of this containerized layout just yet,
> it might still be beneficial for the code to consider some modifications to
> make ctdb more “container-ready”...
> Cheers - Michael
> >> But it is of course good and correct to weed out any higher level config
> >> issues before diving into this.
> >> Cheers - Michael
> > Amitay.