[linux-cifs-client] Re: fsx-linux failing with latest cifs-2.6 git tree

Wed Nov 26 13:02:39 GMT 2008

Guys, sorry for the late reply. Had a lot of travelling to do...

I'll comment here because you raise most of your questions here, but at
the other end of the thread, it looks like you have a pretty nice patch
so I'll check if I can make any more comments on that.

On Fri, Nov 21, 2008 at 08:51:51PM -0500, Jeff Layton wrote:
> On Sat, 22 Nov 2008 00:53:46 +0100
> Nick Piggin <npiggin at suse.de> wrote:
> 
> > On Fri, Nov 21, 2008 at 05:50:17PM -0500, Jeff Layton wrote:
> > > On Fri, 21 Nov 2008 14:38:18 -0600
> > > "Steve French" <smfrench at gmail.com> wrote:
> > > 
> > > > Fix attached.
> > > > 
> > > > Shaggy/Jeff/Nick etc. do you want to review/ack it since it is late in the rc?
> > > > 
> > > 
> > > Talking with Steve on IRC, we thought it might be better to optimize
> > > away the read when possible. I think this patch should do it. We skip
> > > the read if the write starts past the current end of the file, or if
> > > the offset into the page of the beginning of the write is 0 and we're
> > > writing past the current end of the file. In those situations we just
> > > zero out the rest of the page.
> > > 
> > > Combined patch inlined below. I also took the liberty of adding a page
> > > pointer to make the code look a little cleaner.
> > > 
> > > Thoughts?
> > 
> > You just have to be very careful when marking a page uptodate. This
> > is why I removed that earlier hunk.
> > 
> > 1) the actual write may not cover as much space as we were told here.
> 
> Under what circumstances would that occur? The places where write_begin
> and write_end get called look like they use the same lengths...

mm/filemap.c:generic_perform_write().

They both get the same lengths in their 4th arguments "I want to copy
this much data into the filesystem's pagecache", but write_end also
has a "how much was I actually able to copy this time" as its 5th
argument.

As to why it can happen, because copy_from_user could take a page fault
on the app's source address (eg. to write(2)). In order to complete the
copy, we must handle the page fault, but that has to take mmap_sem,
which gives an ABBA deadlock with the page lock (which we're holding).
And handling the page fault also might have to take another page lock
(another ABBA), or if we're really tricky, it might have to take the
same page lock (lock recursion deadlock).

So it does a copy_from_user_atomic, which can return a short copy.

> > 2) the page no wlooks like this I think?
> > 
> > 0                       offset+len          PAGE_CACHE_SIZE      
> > |---- uninitialized data ---|---- zeroes ---|
> > 
> > Then if you SetPageUptodate, if you are using the generic_mapping_read,
> > it can come and read the page even without locking it. If you are
> > not using the generic pagecache operations at all, then you could do
> > something like this if you are careful, but it still seems a bit
> > risky.
> > 
> > Or am I wrong about the data being uninitialized?
> 
> Bear with me here -- page flag handling makes me dizzy...
> 
> The latest patch that I sent only skips the read if:
> 
> 1) the page lies beyond the current end of the file
> 
> 2) if the page sits on top of the end of the file and the write would
> overwrite all of the data that we would read in.
> 
> The first part should be OK I think. A pagecache read should not be
> able to go beyond EOF, should it?

You _might_ be OK there, but it's not a great idea to SetPageUptodate
first ;) Aside from the problem of the short-copy, SetPageUptodate
actually has a memory barrier in it to ensure the data stored into the
page to bring it uptodate is actually visible before the PageUptodate
flag is. Again, if you are doing DMAs rather than cache coherent stores
to initialise the page, maybe you can get away without that barrier...
But it's just bad practice.

> The second one is a problem. Ideally
> we'd like to be able to get away w/o doing a read in that case, but I
> guess we'd have to delay setting the page uptodate until after the
> write data has been copied to it. cifs_write_end seems to expect
> that the page is uptodate when it gets it though.
> 
> For now, we can probably do the safe thing and perform a read in that
> case, but it would certainly be nice to avoid it. Is there some way that
> we can?

I can't remember the CIFS code very well, but in several of the new
aops conversions I did, I added something like a BUG_ON(!PageUptodate())
in the write_end methods to ensure I wasn't missing some key part of
the logic. It's entirely possible that cifs is almost ready to handle
a !uptodate page in write_end...