[LSF/MM/BPF TOPIC] Enhancing Linux Copy Performance and Function and improving backup scenarios

Sat Feb 1 23:16:11 UTC 2020

On Feb 1, 2020, at 12:54 PM, Steve French <smfrench at gmail.com> wrote:
> 
> On Wed, Jan 29, 2020 at 7:54 PM Darrick J. Wong <darrick.wong at oracle.com> wrote:
>> 
>> On Wed, Jan 22, 2020 at 05:13:53PM -0600, Steve French wrote:
>>> As discussed last year:
>>> 
>>> Current Linux copy tools have various problems compared to other
>>> platforms - small I/O sizes (and most don't allow it to be
>>> configured), lack of parallel I/O for multi-file copies, inability to
>>> reduce metadata updates by setting file size first, lack of cross
>> 
>> ...and yet weirdly we tell everyone on xfs not to do that or to use
>> fallocate, so that delayed speculative allocation can do its thing.
>> We also tell them not to create deep directory trees because xfs isn't
>> ext4.
> 
> Delayed speculative allocation may help xfs but changing file size
> thousands of times for network and cluster fs for a single file copy
> can be a disaster for other file systems (due to the excessive cost
> it adds to metadata sync time) - so there are file systems where
> setting the file size first can help

Sometimes I think it is worthwhile to bite the bullet and just submit
patches to the important upstream tools to make them work well.  I've
sone that in the past for cp, tar, rsync, ls, etc. so that they work
better.  If you've ever straced those tools, you will see they do a
lot of needless filesystem operations (repeated stat() in particular)
that could be optimized - no syscall is better than a fast syscall.

For cp it was changed to not allocate the st_blksize buffer on the
stack, which choked when Lustre reported st_blksize=8MB.  I'm starting
to think that it makes sense for all filesystems to use multi-MB buffers
when reading/copying file data, rather than 4KB or 32KB as it does today.
It might also be good for cp to use O_DIRECT for large file copies rather
than buffered IO to avoid polluting the cache?  Having it use AIO/DIO
would likely be a huge improvement as well.

That probably holds true for many other tools that still use st_blksize.
Maybe filesystems like ext4/xfs/btrfs should start reporting a larger
st_blksize as well?

As for parallel file copying, we've been working on MPIFileUtils, which
has parallel tree/file operations (also multi-node), but has the drawback
that it depends on MPI for remote thread startup, and isn't for everyone.
It should be possible to change it to run in parallel on a single node if
MPI wasn't installed, which would make the tools more generally usable.

>>> And copy tools rely less on
>>> the kernel file system (vs. code in the user space tool) in Linux than
>>> would be expected, in order to determine which optimizations to use.
>> 
>> What kernel interfaces would we expect userspace to use to figure out
>> the confusing mess of optimizations? :)
> 
> copy_file_range and clone_file_range are a good start ... few tools
> use them ...

One area that is really lacking a parallel interface is for directory
and namespace operations.  We still need to do serialized readdir()
and stat for operations in a directory.  There are now parallel VFS
lookups, but it would be useful to allow parallel create and unlink
for regular files, and possibly renames of files within a directory.

For ext4 at least, it would be possible to have parallel readdir()
by generating synthetic telldir() cookies to divide up the directory
into several chunks that can be read in parallel.  Something like:

     seek(dir_fd[0], 0, SEEK_END)
     pos_max = telldir(dir_fd[0])
     pos_inc = pos_max / num_threads
     for (i = 0; i < num_threads; i++)
         seekdir(dir_fd[i], i * pos_inc)

but I don't know if that would be portable to other filesystems.

XFS has a "bulkstat" interface which would likely be useful for
directory traversal tools.

>> There's a whole bunch of xfs ioctls like dioinfo and the like that we
>> ought to push to statx too.  Is that an example of what you mean?
> 
> That is a good example.   And then getting tools to use these,
> even if there are some file system dependent cases.

I've seen that copy to/from userspace is a bottleneck if the storage is
fast.  Since the cross-filesystem copy_file_range() patches have landed,
getting those into userspace tools would be a big performance win.

Dave talked a few times about adding better info than st_blksize for
different IO-related parameters (alignment, etc).  It was not included
in the initial statx() landing because of excessive bikeshedding, but
makes sense to re-examine what could be used there.  Since statx() is
flexible, applications could be patched immediately to check for the
new fields, without having to wait for a new syscall to propagate out.

That said, if data copies are done in the kernel, this may be moot for
some tools, but would still be useful for others.

>>> But some progress has been made since last year's summit, with new
>>> copy tools being released and improvements to some of the kernel file
>>> systems, and also some additional feedback on lwn and on the mailing
>>> lists.

I think if the tools are named anything other than cp, dd, tar, find
it is much less likely that anyone will use them, so focussing developer
efforts on the common GNU tools is more likely to be a win than making
another new copy tool that nobody will use, IMHO.

>>> In addition these discussions have prompted additional
>>> feedback on how to improve file backup/restore scenarios (e.g. to
>>> mounts to the cloud from local Linux systems) which require preserving
>>> more timestamps, ACLs and metadata, and preserving them efficiently.
>> 
>> I suppose it would be useful to think a little more about cross-device
>> fs copies considering that the "devices" can be VM block devs backed by
>> files on a filesystem that supports reflink.  I have no idea how you
>> manage that sanely though.
> 
> I trust XFS and BTRFS and SMB3 and cluster fs etc. to solve this better
> than the block level (better locking, leases/delegation, state management,
> etc.) though.

Getting RichACLs into the kernel would definitely help here.  Non-Linux
filesystems have some variant of NFSv4 ACLs, and having only POSIX ACLs
on Linux is a real hassle here.  Either the outside ACLs are stored as an
xattr blob, which leads to different semantics depending on the access
method (CIFS, NFS, etc) or they are shoe-horned into the POSIX ACL and
lose information.

Cheers, Andreas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 873 bytes
Desc: Message signed with OpenPGP
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20200201/980404c2/signature.sig>