[clug] Debian/GNU 'find . -ls' oddity: outputs UTF-8 chars as quasi-octal strings - \314\201, not \0314\0201

steve jenkin sjenkin at canb.auug.org.au
Wed Sep 28 06:01:02 UTC 2016


I have a long list of files generated by ‘find … -ls’ which I’d like to match against a list of directories that was not created the same way.

‘find -ls’ always prints non-ASCII bytes as backslash-quoted octal strings. ‘ls’ doesn’t do this, and neither does ‘find’ in its usual mode, which pushes filenames out ‘as is’.

The find man page, under ‘UNUSUAL FILENAMES’, says it uses octal escapes.
> Other unusual  characters are printed using an octal escape.

bash’s built-in ‘echo -e’ doesn’t recognise escaped octal without a leading \0, while '/bin/echo -e' does.

The bash builtin echo will decode the output properly if I edit it and add a ‘0’ after each backslash (\314 -> \0314).

I’ve looked at ‘convmv’ (a perl script), but it won’t read from STDIN; it will only search a directory tree.
Even then, I don’t think it does what I need.

I have a workaround using 'xargs ls -dlis', but I’d prefer to be able to use the supplied ‘find’ argument.
Right now I have a 1.5M line file sprinkled with ‘octal’ strings that I need to convert back.

===================================

Two questions:

 1. Is there a standard Linux tool I can use in a pipeline that understands the ‘\314’ quasi-octal encoding produced by ‘find -ls’?
     It’s pretty slow using a shell loop ‘while read x’ to repeatedly call ‘/bin/echo’.

 2. Is there any way to force ‘find -ls’ to print filenames ‘as-is’, despite what the manual says?


I’m not interested in filing a bug-report / change-request with Debian or GNU.
Life’s Too Short.

regards
steve

===================================

## Versions etc.

steve at bc:~/mac$ cat /etc/debian_version 
8.6

steve at bc:~/mac$ echo $LANG
en_AU.UTF-8

steve at bc:~/mac$ find --version
find (GNU findutils) 4.4.2

steve at bc:~/mac$ echo $SHELL
/bin/bash

steve at bc:~/mac$ bash --version
GNU bash, version 4.3.30(1)-release (x86_64-pc-linux-gnu)


## This is find’s default behaviour: it prints names ‘as is’.

steve at bc:~/mac$ (cd Japanese.lproj/; find . -type d)
.
./ÉwÉãÉv
./ÉwÉãÉv/Contents
./ÉwÉãÉv/Images

## more as-is filenames, piping to xargs

steve at bc:~/mac$ (cd Japanese.lproj/;  find . -type d -print0|xargs -0 ls -ldis)
 115846682 0 drwxr-xr-x 3 steve steve  106 Jun 26  2013 .
1081078970 0 drwxr-xr-x 4 steve steve   51 Jun 26  2013 ./ÉwÉãÉv
2151147295 4 drwxr-xr-x 2 steve steve 4096 Jun 26  2013 ./ÉwÉãÉv/Contents
3238766855 4 drwxr-xr-x 2 steve steve 4096 Jun 26  2013 ./ÉwÉãÉv/Images


## /bin/echo -e understands the output of ‘find -ls’

steve at bc:~/mac$ /bin/echo -e './E\314\201wE\314\201a\314\203E\314\201v'
./ÉwÉãÉv


## ‘echo -e’, the bash built-in, doesn't, unless a ‘0’ is added after each backslash

steve at bc:~/mac$ echo -e './E\314\201wE\314\201a\314\203E\314\201v'
./E\314\201wE\314\201a\314\203E\314\201v

steve at bc:~/mac$ echo -e './E\0314\0201wE\0314\0201a\0314\0203E\0314\0201v'
./ÉwÉãÉv

## A useful regex to find strings with ‘non-ascii’ chars

steve at bc:~/mac$ (cd Japanese.lproj/; find . -type d |perl -ane '{ if(m/[[:^ascii:]]/) { print } }')
./ÉwÉãÉv
./ÉwÉãÉv/Contents
./ÉwÉãÉv/Images


## Showing that it isn’t the LANG env variable

steve at bc:~/mac$ (cd Japanese.lproj/; LANG=en_AU.UTF-8 find . -type d -ls)
115846682    0 drwxr-xr-x   3 steve    steve         106 Jun 26  2013 .
1081078970    0 drwxr-xr-x   4 steve    steve          51 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images

steve at bc:~/mac$ (cd Japanese.lproj/; LANG=UTF-8 find . -type d -ls)
115846682    0 drwxr-xr-x   3 steve    steve         106 Jun 26  2013 .
1081078970    0 drwxr-xr-x   4 steve    steve          51 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images

steve at bc:~/mac$ (cd Japanese.lproj/; LANG=C find . -type d -ls)
115846682    0 drwxr-xr-x   3 steve    steve         106 Jun 26  2013 .
1081078970    0 drwxr-xr-x   4 steve    steve          51 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855    4 drwxr-xr-x   2 steve    steve        4096 Jun 26  2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images


man find:
  no option is documented to turn off this behaviour for -ls

>   -ls    True; list current file in ls -dils format on standard output.  The block counts are of 1K blocks, unless the environment variable POSIXLY_CORRECT is set, in which  case  512-byte blocks are used.  See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
> 
>   UNUSUAL FILENAMES
>        Many of the actions of find result in the printing of data which is under the control of other users.  This includes file names, sizes, modification times and so forth.  File names are a potential problem since they can contain any character except `\0' and `/'.  Unusual characters in file names can do unexpected and often undesirable things to your terminal (for example, changing the settings of your function keys on some terminals).  Unusual characters are handled differently by various actions, as described below.
> 
>        -print0, -fprint0
>               Always print the exact filename, unchanged, even if the output is going to a terminal.
> 
>        -ls, -fls
>               Unusual characters are always escaped.  White space, backslash, and double quote characters are printed using C-style escaping (for example `\f', `\"').  Other unusual  characters are printed using an octal escape.  Other printable characters (for -ls and -fls these are the characters between octal 041 and 0176) are printed as-is.
> 
>        -printf, -fprintf
>               If  the  output is not going to a terminal, it is printed as-is.  Otherwise, the result depends on which directive is in use.  The directives %D, %F, %g, %G, %H, %Y, and %y expand to values which are not under control of files' owners, and so are printed as-is.  The directives %a, %b, %c, %d, %i, %k, %m, %M, %n, %s, %t, %u and %U have values which are under the  control  of  files'  owners but which cannot be used to send arbitrary data to the terminal, and so these are printed as-is.  The directives %f, %h, %l, %p and %P are quoted.
>               This quoting is performed in the same way as for GNU ls.  This is not the same quoting mechanism as the one used for -ls and -fls.  If you are able to decide what  format  to  use for  the  output  of find then it is normally better to use `\0' as a terminator than to use newline, as file names can contain white space and newline characters.  The setting of the `LC_CTYPE' environment variable is used to determine which characters need to be quoted.
> 
>        -print, -fprint
>               Quoting is handled in the same way as for -printf and -fprintf.  If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider using -print0 instead of -print.
> 
>        The -ok and -okdir actions print the current filename as-is.  This may change in a future release.
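
Going by the -printf paragraph quoted above, one possible way around -ls is to rebuild a similar listing with -printf, whose output is passed through as-is when it is not going to a terminal. A sketch, assuming GNU find: ls_like is a made-up name, the column widths are a guess at the -ls layout, the date is printed in a fixed numeric form and not the way -ls prints it, and symlink targets are not shown.

```shell
# ls_like: hypothetical helper name.
# Prints inode, 1K blocks, mode, link count, owner, group, size,
# modification time and path, roughly in the order 'find -ls' uses.
# Arguments are passed to find ahead of the -printf action.
ls_like() {
    find "$@" -printf '%9i %6k %M %3n %-8u %-8g %8s %TY-%Tm-%Td %TH:%TM %p\n'
}

# Usage: send the output to a pipe or a file so stdout is not a terminal,
# for example (output file name is a placeholder):
#   (cd Japanese.lproj/ && ls_like . -type d > /tmp/listing.txt)
```

Filenames containing newlines will still split across lines in this format, which is the case the man page’s advice about `\0' terminators is aimed at.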



--
Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin



