[linux-cifs-client] ENOSPC without O_SYNC

Jeff Layton jlayton at redhat.com
Sun Jul 20 18:48:28 GMT 2008


On Sun, 20 Jul 2008 16:40:04 +0200
Oliver Martin <oliver.martin at student.tuwien.ac.at> wrote:

> On Fri, 18 Jul 2008 17:12:53 -0400, Jeff Layton wrote:
> 
> > > > could mount another share. It's supposed to stop when it runs out
> > > > of space, but it kept copying data right into nowhere.
> > > 
> > > What's supposed to stop? To quote the write(2) manpage:
> 
> I meant dar, the backup program.
> 

Right, but all "dar" can do is make system calls. So it really comes
down to the fact that you're expecting a system call (namely write())
to return an error in this situation. The POSIX spec does not require
that the write complete successfully before returning. This is a good
thing -- it allows us to batch up writes in ways that are efficient
for the underlying storage.

> > > A  successful return from write() does not make any guarantee that
> > > data has been committed to disk.  In fact, on some buggy
> > > implementations, it does  not  even guarantee that space has
> > > successfully been reserved for the data.  The only way to be sure
> > > is to call fsync(2)  after  you  are done writing all your data.
> 
> So it explicitly calls implementations which do not fail if not enough
> space is available buggy?
> 

That's the Linux manpage, and I believe it was written with local
storage in mind. I'd probably disagree with the "buggy" comment there.
There is no guarantee that write does *anything* but write the data to
local memory.

> > > ...if you're not calling fsync() then there are no consistency
> > > guarantees here.
> 
> The write manpage also says:
> POSIX  requires  that  a  read(2)  which can be proved to occur after a
> write() has returned returns the new data.  Note that not all file
> systems are POSIX conforming.
> 
> Now the questions is, does cifs strive to be POSIX compliant in this
> respect? If it does, then the current behaviour for async writes is a
> problem, because if writes succeed when there is no space available, at
> some point data has to be dropped. The section of POSIX referred to
> explicitly mentions caching schemes for networked file systems as a
> particularly tricky situation.
> 
> [...]
> 

Certainly we strive for POSIX compliance as best we can. CIFS is not a
POSIX standard, however, and we can never be 100% POSIX compliant (it's
also worth noting that NFS is not strictly POSIX compliant either).

> > > fusesmb is probably doing synchronous writes.
> 
> Could be. Throughput is about the same with fusesmb and cifs with
> directio, slightly less than 7 MB/s. With buffering it is around 10
> MB/s.
> 
> > > 
> [...]
> > > 
> > > What might be nice is at some point to implement the solution that
> > > nfs did a year or so ago and make it so that when writes to a file
> > > start failing cifs would flip to using synchronous writes. That's
> > > really just a "nice to have" though.
> 
> I think that would be equally dangerous for a program that assumes a
> POSIX compliant file system: Suppose I have a program that writes data
> to a file until it receives ENOSPC, and then opens another file on a
> different file system. There would be some data I thought was written
> successfully, but in reality was not, which would therefore be lost.
> POSIX guarantees that, in absence of a system crash before it was
> actually written to disk, data can later be read after write succeeded.
> So I think it should be okay to do this without calling fsync.
> 
> How expensive would it be to synchronously reserve space on the server
> during write?

It would be extremely expensive. What system call are you recommending
that we use to allocate space without doing the write? If you want this
behavior, you can have it today via the directio mount option. It does
obviously slow things down significantly, but you can be more certain
that your I/Os have completed.
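For illustration, a hypothetical mount invocation -- the server, share,
and mount point here are placeholders, not anything from this thread:

```shell
# directio disables client-side caching: each write() goes to the
# server before returning, so errors such as ENOSPC surface at
# write() time rather than at fsync()/close().
# //server/share and /mnt/backup are placeholders.
mount -t cifs //server/share /mnt/backup -o user=oliver,directio
```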

It's possible we could try to implement fallocate() at some point in
the future for CIFS. It's not clear to me whether there is already a
"reserve space" call in the CIFS protocol (I think there may be).
Regardless, though, you'll need apps that use it. I'm not sure whether
"dar" does or not.

> I guess it would slow down applications doing lots of
> small writes, while it wouldn't matter much for those writing large
> amounts of data, because after the buffer is full, they are limited by
> network throughput.
> Of course, maybe the best option is just to document that cifs is only
> POSIX compliant when mounted with -o directio.
> 

The bottom line is that we have two options for completing a write:

1) we can buffer the write and flush it to the server later and allow
write() to return

2) we can flush it now and wait for the write to complete

...there is no third option. If you allow #1 (which we do) then you're
stuck -- how do you report an error for a write() that has already
returned? Currently, we report it at close() or fsync().

The ideal situation would be to implement what NFS has done for this:
have CIFS flip to doing synchronous writes to a file when a write
returns an error (and flip back to async writes when writes start
succeeding). This doesn't obviate the need to check fsync() or close()
return values for errors, but it would make it more likely that
write() itself returns an error when there are multiple write failures.

Patches for this are welcome, of course...

-- 
Jeff Layton <jlayton at redhat.com>
