vfs_glusterfs changes to get rid of mutex use.

Ira Cooper ira at samba.org
Wed Jan 21 16:31:06 MST 2015


On Wed, Jan 21, 2015 at 3:06 PM, Jeremy Allison <jra at samba.org> wrote:

> On Wed, Jan 21, 2015 at 11:04:22AM -0800, Jeremy Allison wrote:
> > On Mon, Jan 19, 2015 at 11:17:08PM -0500, Ira Cooper wrote:
> > > This patch removes all of the posix mutexes used in the previous
> GlusterFS
> > > AIO patch in favor of using a standard POSIX pipe.  This also
> simplifies
> > > the actual code, and makes our fork safety issues more apparent.
> > >
> > > As part of this changeset, there is a basic set of "rules" around fork
> use
> > > and VFS modules.  I expect they will be wordsmithed, but the basic
> ideas
> > > are sound.
> > >
> > > 1. A child process should never interfere with a parent process.
> > > 2. If you call into the VFS after it has been initialized in a child
> > > process, the behavior is "undefined" and could be as mean as someone
> > > putting "abort" in every single call.  Calling it in the parent is
> totally
> > > legal.
> > >
> > > These rules also echo the similar conventions we use around "tevent"
> which
> > > also apply so VFS modules can use tevent.   You don't reuse tevent
> contexts
> > > in the child process :).
> > >
> > > Feel free to correct me, this patch is based on those assumptions, as
> were
> > > previous patches.
> >
> > OK, mostly LGTM but I'd suggest one (minor) change.
> >
> > Right now you're creating a pipe, then having the
> > gluster threads call:
> >
> >         sts = sys_write (write_fd, &req, sizeof(struct tevent_req *));
> >         if (sts < 0) {
> >                 DEBUG(0,("\nWrite to pipe failed (%s)",
> strerror(errno)));
> >         }
> >
> > to write (on a 64-bit box) 8 bytes of pointer
> > data into it. And on the read side:
> >
> >         sts = sys_read(read_fd, &req, sizeof(struct tevent_req *));
> >         if (sts < 0) {
> >                 DEBUG(0,("\nRead from pipe failed (%s)",
> strerror(errno)));
> >         }
> >
> >         if (req) {
> >                 tevent_req_done(req);
> >         }
> >
> > However sys_read()/sys_write() are signal-safe, but don't
> > guarentee the entire data being written in one chunk.
> >
> > For that you should use write_data()/read_data() which
> > are blocking calls that will not return unless theres
> > an IO error, or the complete amount of data is written/read.
> >
> > Now I know this is under the catagory of "this can't happen"
> > bugs :-), but in order to ensure complete paranoia safety
> > I'd feel happier if the sys_read/sys_write calls were
> > replaced by read_data/write_data calls.
>
> Oh, my review crossed in the post with the push :-).
>
> Feel free to add this as an additional fix !
>

In my reading:

write(2) is guaranteed to be atomic as long as you keep it under
_POSIX_PIPE_BUF in size, which is defined to be at least 512 bytes.  If you
have pointers that large... I really want to know the architecture ;).
(Note: If write is NOT atomic, this patch fails TOTALLY.)

Because write(2) is guaranteed atomic, if you read the read(2) man page:

"Upon successful completion, where nbyte is greater than 0, read() will
mark for update the st_atime field of the file, and return the number of
bytes read. This number will never be greater than nbyte. The value
returned may be less than nbyte if the number of bytes left in the file is
less than nbyte, if the read() request was interrupted by a signal, or if
the file is a pipe or FIFO or special file and has fewer than nbyte bytes
immediately available for reading. For example, a read() from a file
associated with a terminal may return one typed line of data."

The write being atomic, guarantees a proper size hunk of data always
arrives, thus the read can't be partial.  It will always have a pointer of
data made available to it.

I'd rather not make the code any more complicated than needed ;).  I even
considered not wrapping read/write to make the point that the guarantees
these functions provide in this case is critical to success.

All of this said, I'll be submitting a nice patch adding documentation to
the calls, explaining the guarantees, and why using sys_read/sys_write is
correct ;).

I'll have that patch ready tomorrow, along with some white space
corrections Michael Adam found, and I thought I applied... and somehow I
lost *again*.  (I think I've lost them twice now!)

Sigh....

-Ira


More information about the samba-technical mailing list