Filename encodings

Wayne Davison wayned at samba.org
Fri Jul 29 21:26:28 GMT 2005


On Fri, Jul 29, 2005 at 10:36:32PM +0200, David Ayers wrote:
> I've checked up a bit (see below) but if you have a pointer to a
> specific thread where you explain what your requirements are, that
> would be really great.

The mail you're replying to is the most detail I've mentioned on what
I'd like to see -- in the past I've simply mentioned why I didn't like
the fname-convert solution in the patches dir -- it's mainly that it is
inefficient (since it tends to fork a program for every filename in the
transfer) but I also don't like how the filenames don't print right
depending on which side of the transfer is outputting the name.

> But did you really mean to use the terms "client/server" or rather
> "local/remote"

In relation to rsync, client==local, and server==remote, so either set
of terms is fine.  I don't want the option to be specified in terms of
source & destination, though, since that makes the options change
between pulling and pushing to the same system.

I'm currently imagining a single option that lists the local and remote
character sets separated by a comma:

    --charsets=iso8859-8,utf-8

This lets that identical option be passed to the server, which will
notice that it got the --server option and use the character set after
the comma instead of before it.
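
As a rough illustration of that selection rule (not rsync code; Python is
used only for brevity, and pick_charsets is a hypothetical helper name),
each side could split the proposed option value and swap the pair when it
is running as the server:

```python
def pick_charsets(option_value, is_server):
    """Return (own_charset, peer_charset) for a 'LOCAL,REMOTE' option value.

    Hypothetical helper: the option format follows the proposal above.
    """
    local, sep, remote = option_value.partition(",")
    if not sep or not local or not remote or "," in remote:
        raise ValueError("expected LOCAL,REMOTE, got %r" % (option_value,))
    # The server receives the identical option string, so its own
    # character set is the one after the comma.
    return (remote, local) if is_server else (local, remote)
```

With the example value above, the client side would get
("iso8859-8", "utf-8") and the server side ("utf-8", "iso8859-8"), so the
option never has to be rewritten before it is passed along.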

> OK... I'm not in the code yet but I suppose I could convert from the
> specified encoding to UTF-8 just before sending it over the wire and
> convert back to the specified encoding upon reception.

One very important thing is that the internal representation of the
names must be identical in both the client and the server or the sort
algorithm could mis-sort some of the names (which cannot be allowed to
happen).  So, the easiest thing may well be to convert the names early
into UTF-8, always deal with them in that format internally, and run the
names through a conversion routine anytime a name is going to be passed
to a filesystem-affecting function, or anytime a string is going to be
output to the terminal or the daemon logfile.  This would allow the
current rprint() callers to remain unchanged: all info/error messages
would be passed around in UTF-8 format and only converted when they
need to be transformed into an externally-visible format.
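
A minimal sketch of the convert-at-the-boundary idea, in Python for brevity.
The helper names to_internal and to_external are hypothetical, the KOI8-R
sample names are only an illustration, and the sort here is a plain
byte-wise comparison, which is an assumption and may not match rsync's
actual ordering:

```python
def to_internal(raw_name, charset):
    """Convert a name as read from the filesystem into internal UTF-8 bytes."""
    return raw_name.decode(charset).encode("utf-8")


def to_external(internal_name, charset):
    """Convert an internal UTF-8 name for a filesystem call or for output."""
    return internal_name.decode("utf-8").encode(charset)


# Illustrative data: the same two names as stored on a KOI8-R side and
# on a UTF-8 side of a transfer.
names = ["юг", "ад"]
koi8_side = [n.encode("koi8-r") for n in names]
utf8_side = [n.encode("utf-8") for n in names]

# Sorting the raw, side-specific bytes yields a different order per side.
raw_orders_differ = (
    [b.decode("koi8-r") for b in sorted(koi8_side)]
    != [b.decode("utf-8") for b in sorted(utf8_side)]
)

# Sorting the shared internal UTF-8 form yields the same order on both.
internal_koi8 = sorted(to_internal(b, "koi8-r") for b in koi8_side)
internal_utf8 = sorted(to_internal(b, "utf-8") for b in utf8_side)
```

The point of the demo is only that the two sides agree once both sort the
same internal bytes; to_external would then be applied just before a
filesystem call or a message is printed.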

> Still it isn't clear how failure to encode/decode names should be
> handled.

Perhaps the best thing would be to fail to process that name, just as if
one of the filesystem calls had failed for that item.  I'm not sure
whether that's better than your idea of just warning and using the name
untranslated (though some filesystems will reject the untransformed
name anyway).
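
The two policies can be sketched side by side, again in Python for brevity.
This is only an illustration under assumptions: convert_for_fs and its
on_failure parameter are hypothetical names, and the messages are not
rsync's:

```python
import sys


def convert_for_fs(internal_name, charset, on_failure="skip"):
    """Convert an internal UTF-8 name to `charset` for a filesystem call.

    Returns the converted bytes, or None when the item should be skipped.
    """
    try:
        return internal_name.decode("utf-8").encode(charset)
    except UnicodeError as err:
        if on_failure == "skip":
            # Policy 1: treat it like a failed filesystem call for this item.
            print("cannot convert name %r: %s" % (internal_name, err),
                  file=sys.stderr)
            return None
        # Policy 2: warn and fall back to the untranslated bytes.
        print("warning: using untranslated name %r" % (internal_name,),
              file=sys.stderr)
        return internal_name
```

Under the first policy the caller skips the item when it gets None back;
under the second the untranslated bytes are still handed to the
filesystem, which may reject them.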

..wayne..

