another iconv question

Matt McCutchen matt at mattmccutchen.net
Tue Apr 1 19:01:39 GMT 2008


On Mon, 2008-03-31 at 09:54 -0400, Robert DuToit wrote: 
> I am trying to help my friend set up his rsync with iconv. Presently
> it works fine but re-copies every file with an umlaut in the
> filename. I saw a recent post about this and the fix but...
> 
> he ran "locale" (both source and dest are on the same machine) and is
> running on a German-Swiss locale

Please be more specific about what your friend is trying to accomplish!
If he's trying to perform a nontrivial conversion, he definitely won't
accomplish it with "--iconv=.".  That option tells rsync to convert from
the sending machine's default charset to the receiving machine's
default charset, but in his case, source and destination are on the
same machine, so these charsets are one and the same!  To convert from a
charset A to a charset B on a local run, he should pass "--iconv=A,B".

> xserve-backup-02:/Volumes/Backup RAID 8TB teleclub$ locale
> LANG=
> LC_COLLATE="C"
> LC_CTYPE="C"
> LC_MESSAGES="C"
> LC_MONETARY="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_ALL="C"
> 
> we tried adding option iconv=C and iconv=C,C but no luck - it still
> re-copies every file with an umlaut.
> 
> I checked "iconv --list" on my Mac and see no "C" listed. I am not  
> sure if "C" is correct either.

"C" is a "standard" locale whose associated charset is ASCII.  Based on
the log output (below), your friend should probably be using
de_CH.UTF-8 .

> Example:
> 
> The correct file name would be "Action des Monats für vertonung.mov"  
> and not "Action des Monats f\#303\#274r vertonung.mov"
> 
> but the log shows it not translated:
> 
> /Volumes/SAN_Video/Final Cut Pro Documents/Capture Scratch/Action des  
> Monats/! Render/Action des Monats f\#303\#274r vertonung.mov
>         32768   0%  344.09kB/s    0:54:11
>      42205184   3%   40.25MB/s    0:00:26

All that's happening here is that the source filename is in UTF-8, but
rsync is escaping the two high bytes in its log output because they are
invalid in ASCII, the charset implied by the specified locale C.  If
your friend switches to a *.UTF-8 locale, rsync will show him the umlaut
as-is.
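
As a rough illustration of why the log shows \#303\#274, here is a small
Python sketch that reproduces the escaping seen above.  The function
name escape_high_bytes is made up for this example, and the code models
only the observed output for non-ASCII bytes; it is not rsync's actual
implementation.

```python
def escape_high_bytes(name: str) -> str:
    """Escape each non-ASCII byte of the UTF-8 encoding as \\#ooo (octal).

    Hypothetical helper: it only mimics the escaping visible in the log.
    """
    out = []
    for byte in name.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))          # plain ASCII passes through
        else:
            out.append("\\#%03o" % byte)   # high byte, three octal digits
    return "".join(out)


# The composed u-umlaut (U+00FC) is the two UTF-8 bytes 0xc3, 0xbc.
print(escape_high_bytes("f\u00fcr"))  # prints f\#303\#274r
```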

> Though the actual file name gets copied correctly to dest, obviously  
> the mapping (if that is what it is called) is different causing rsync  
> to update the file every time.

The output escaping issue won't cause recopying, but from what you say,
I can guess what the real problem is.  The source filename is in
composed UTF-8 (the umlaut-u is the two bytes 0xc3, 0xbc, which the log
shows in octal as \#303\#274), and the Mac OS X HFS+ filesystem has an
annoying behavior of silently decomposing UTF-8 characters in filenames.
Suppose the destination is on HFS+ and your friend is using --delete.
Rsync will copy the file, but the destination filesystem will store its
name with a decomposed umlaut-u (three bytes 0x75, 0xcc, 0x88).  Rsync
compares binary filenames without regard for charset-specific
conventions, so on the next run, it will fail to recognize the
decomposed destination file as corresponding to the source file, delete
the destination file, and transfer the file again.  Essentially, rsync
tries and tries again to create a destination file with the same
(binary) name as the source file, but the filesystem keeps foiling it.
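
The byte-level mismatch can be checked with a short Python sketch.  It
uses Unicode NFD normalization as an approximation of what HFS+ does to
filenames (an assumption; the filesystem's decomposition differs from
NFD for some character ranges, though not for this umlaut).

```python
import unicodedata

composed = "f\u00fcr"  # name as stored on the source (composed u-umlaut)
# Assumption: NFD stands in for the decomposition HFS+ applies.
decomposed = unicodedata.normalize("NFD", composed)

src_bytes = composed.encode("utf-8")
dst_bytes = decomposed.encode("utf-8")

print(src_bytes)               # b'f\xc3\xbcr'
print(dst_bytes)               # b'fu\xcc\x88r'
print(src_bytes == dst_bytes)  # False: a binary comparison sees two names
```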

You can avoid this problem by passing --iconv=UTF-8,UTF8-MAC .  UTF8-MAC
is a pseudo-charset recognized by Mac OS X iconv in which all characters
are decomposed.  This way, rsync will decompose the source filename and
recognize it as matching the destination filename.
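
To sketch the effect of that conversion, the Python snippet below
decomposes the source name before comparing it with the name the
destination would hold.  The helper to_decomposed is hypothetical and
uses NFD as a stand-in for the UTF-8 to UTF8-MAC conversion, so treat it
as an approximation rather than what iconv does byte for byte.

```python
import unicodedata


def to_decomposed(name_bytes: bytes) -> bytes:
    """Decode UTF-8, apply NFD, re-encode (stand-in for UTF-8 -> UTF8-MAC)."""
    text = name_bytes.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")


# Source name with a composed u-umlaut, as in the log above.
source_name = "Action des Monats f\u00fcr vertonung.mov".encode("utf-8")
# Name as a decomposing filesystem would store it (u + combining diaeresis).
dest_name = b"Action des Monats fu\xcc\x88r vertonung.mov"

print(source_name == dest_name)                 # False: raw names differ
print(to_decomposed(source_name) == dest_name)  # True: names now match
```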

Wayne, please consider adding this material to the "copies every file"
entry on http://rsync.samba.org/FAQ.html .

Matt


