[clug] Debian/GNU 'find . -ls' oddity: outputs UTF-8 chars as quasi-octal strings - \314\201, not \0314\0201
steve jenkin
sjenkin at canb.auug.org.au
Wed Sep 28 06:01:02 UTC 2016
I have a long list of files generated by ‘find … -ls’ which I’d like to match against a list of directories that was not created the same way.
‘find -ls’ always prints non-ASCII characters as octal-escaped strings. Neither ‘ls’ nor ‘find’ in its usual mode does this; both push filenames out ‘as is’.
The find man page, under ‘UNUSUAL FILENAMES’ claims it uses octal escapes.
> Other unusual characters are printed using an octal escape.
bash’s built-in ‘echo -e’ doesn’t recognise an octal escape without a leading 0 after the backslash (it wants \0314, not \314), while '/bin/echo -e' accepts both.
The bash builtin echo will decode properly if I edit the output and add a ‘0’ (\314 -> \0314).
I’ve looked at ‘convmv’ (a perl script), but it won’t read from STDIN; it will only search a directory tree.
Even then, I don’t think it does what I need.
I have a workaround using 'xargs ls -dlis', but I’d prefer to be able to use the supplied ‘find’ argument.
Right now I have a 1.5M line file sprinkled with ‘octal’ strings that I need to convert back.
===================================
Two questions:
1. Is there a standard Linux tool I can use in a pipeline that understands the ‘\314’ quasi-octal encoding produced by ‘find -ls’?
It’s pretty slow using a shell loop ‘while read x’ to repeatedly call ‘/bin/echo’.
2. Is there any way to force ‘find -ls’ to print filenames ‘as-is’, despite what the manual says?
I’m not interested in filing a bug-report / change-request with Debian or GNU.
Life’s Too Short.
regards
steve
===================================
## Versions etc.
steve at bc:~/mac$ cat /etc/debian_version
8.6
steve at bc:~/mac$ echo $LANG
en_AU.UTF-8
steve at bc:~/mac$ find --version
find (GNU findutils) 4.4.2
steve at bc:~/mac$ echo $SHELL
/bin/bash
steve at bc:~/mac$ bash --version
GNU bash, version 4.3.30(1)-release (x86_64-pc-linux-gnu)
## This is find’s default behaviour: it prints names ‘as is’.
steve at bc:~/mac$ (cd Japanese.lproj/; find . -type d)
.
./ÉwÉãÉv
./ÉwÉãÉv/Contents
./ÉwÉãÉv/Images
## more as-is filenames, piping to xargs
steve at bc:~/mac$ (cd Japanese.lproj/; find . -type d -print0|xargs -0 ls -ldis)
115846682 0 drwxr-xr-x 3 steve steve 106 Jun 26 2013 .
1081078970 0 drwxr-xr-x 4 steve steve 51 Jun 26 2013 ./ÉwÉãÉv
2151147295 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./ÉwÉãÉv/Contents
3238766855 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./ÉwÉãÉv/Images
## /bin/echo -e understands the output of ‘find -ls’
steve at bc:~/mac$ /bin/echo -e './E\314\201wE\314\201a\314\203E\314\201v'
./ÉwÉãÉv
## ‘echo -e’, the bash built-in, doesn't, unless a ‘0’ is added after each backslash
steve at bc:~/mac$ echo -e './E\314\201wE\314\201a\314\203E\314\201v'
./E\314\201wE\314\201a\314\203E\314\201v
steve at bc:~/mac$ echo -e './E\0314\0201wE\0314\0201a\0314\0203E\0314\0201v'
./ÉwÉãÉv
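The same decoding appears to be possible without editing the string, using the shell's printf with the %b conversion. As far as I know, %b in bash (and in dash) accepts the bare \NNN form as well as \0NNN. This is a minimal sketch with a hypothetical function name. Note that %b also acts on other escapes such as \c, so it is only safe for names whose only escapes are octal ones.

```shell
# decode_name: hypothetical helper; expands backslash escapes in one argument.
decode_name() {
    printf '%b\n' "$1"
}

decode_name './E\314\201wE\314\201a\314\203E\314\201v'
```

Because printf is a builtin, a 'while read -r' loop around it avoids forking /bin/echo for every line.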
## useful regex to find strings with ‘non-ascii’ chars
steve at bc:~/mac$ (cd Japanese.lproj/; find . -type d |perl -ane '{ if(m/[[:^ascii:]]/) { print } }')
./ÉwÉãÉv
./ÉwÉãÉv/Contents
./ÉwÉãÉv/Images
## Showing that the LANG environment variable makes no difference
steve at bc:~/mac$ (cd Japanese.lproj/; LANG=en_AU.UTF-8 find . -type d -ls)
115846682 0 drwxr-xr-x 3 steve steve 106 Jun 26 2013 .
1081078970 0 drwxr-xr-x 4 steve steve 51 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images
steve at bc:~/mac$ (cd Japanese.lproj/; LANG=UTF-8 find . -type d -ls)
115846682 0 drwxr-xr-x 3 steve steve 106 Jun 26 2013 .
1081078970 0 drwxr-xr-x 4 steve steve 51 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images
steve at bc:~/mac$ (cd Japanese.lproj/; LANG=C find . -type d -ls)
115846682 0 drwxr-xr-x 3 steve steve 106 Jun 26 2013 .
1081078970 0 drwxr-xr-x 4 steve steve 51 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v
2151147295 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Contents
3238766855 4 drwxr-xr-x 2 steve steve 4096 Jun 26 2013 ./E\314\201wE\314\201a\314\203E\314\201v/Images
## man find: no documented way to turn off this behaviour
> -ls True; list current file in ls -dils format on standard output. The block counts are of 1K blocks, unless the environment variable POSIXLY_CORRECT is set, in which case 512-byte blocks are used. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
>
> UNUSUAL FILENAMES
> Many of the actions of find result in the printing of data which is under the control of other users. This includes file names, sizes, modification times and so forth. File names are a potential problem since they can contain any character except `\0' and `/'. Unusual characters in file names can do unexpected and often undesirable things to your terminal (for example, changing the settings of your function keys on some terminals). Unusual characters are handled differently by various actions, as described below.
>
> -print0, -fprint0
> Always print the exact filename, unchanged, even if the output is going to a terminal.
>
> -ls, -fls
> Unusual characters are always escaped. White space, backslash, and double quote characters are printed using C-style escaping (for example `\f', `\"'). Other unusual characters are printed using an octal escape. Other printable characters (for -ls and -fls these are the characters between octal 041 and 0176) are printed as-is.
>
> -printf, -fprintf
> If the output is not going to a terminal, it is printed as-is. Otherwise, the result depends on which directive is in use. The directives %D, %F, %g, %G, %H, %Y, and %y expand to values which are not under control of files' owners, and so are printed as-is. The directives %a, %b, %c, %d, %i, %k, %m, %M, %n, %s, %t, %u and %U have values which are under the control of files' owners but which cannot be used to send arbitrary data to the terminal, and so these are printed as-is. The directives %f, %h, %l, %p and %P are quoted.
> This quoting is performed in the same way as for GNU ls. This is not the same quoting mechanism as the one used for -ls and -fls. If you are able to decide what format to use for the output of find then it is normally better to use `\0' as a terminator than to use newline, as file names can contain white space and newline characters. The setting of the `LC_CTYPE' environment variable is used to determine which characters need to be quoted.
>
> -print, -fprint
> Quoting is handled in the same way as for -printf and -fprintf. If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider using -print0 instead of -print.
>
> The -ok and -okdir actions print the current filename as-is. This may change in a future release.
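
Going by the -printf paragraph quoted above, names are printed as-is whenever the output is not a terminal, so a -printf format could stand in for -ls. This is a sketch that assumes GNU find. The function name ls_like and the choice of columns are mine, and the column widths and date format will not match -ls exactly.

```shell
# ls_like: hypothetical stand-in for 'find ... -ls' with unescaped names.
# Columns: inode, 1K blocks, mode, links, user, group, size,
#          mtime (month day year), name.
# Names are only printed as-is when stdout is a pipe or a file.
ls_like() {
    find "$@" -printf '%i %k %M %n %u %g %s %Tb %Td %TY %p\n'
}
```

For example, '(cd Japanese.lproj/; ls_like . -type d) > dirs.txt' should give one line per directory with the raw filename; dirs.txt is a placeholder name.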
--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA
mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin