[clug] fun with locale

eyal at eyal.emu.id.au eyal at eyal.emu.id.au
Thu Apr 9 02:04:21 UTC 2020


I assume many of us have run out of things to do on the keyboard these days, so here is
something I actually encountered.

Not being in a hurry these days, I paid more attention to the happenings on my server.
I noticed that a large log file from my IoT system has unusually broken lines.

Here's the thing. The server (now replaced) used to crash with some regularity, and after
a crash this log file ends with a line of (binary) zeros. This file is permanently open and
appended to while the system is alive.

So I added to my rc.local a line like this:
	$ sed -i '/^[^ -~]/s/^[^ -~]*//' "$log"
The range of blank (0x20) to tilde (0x7E) is good for this file.
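
As a small illustration of what that rule is meant to do, here is a sketch that is not part of the original rc.local. The function name and the sample log line are made up, GNU sed is assumed, and LC_ALL=C is set for the sed call so that the range is taken by byte value:

```shell
# Hypothetical helper (illustrative name): for each line on stdin,
# strip a leading run of bytes outside 0x20..0x7E.
strip_leading_junk() {
    # LC_ALL=C applies to this one sed invocation only.
    LC_ALL=C sed '/^[^ -~]/s/^[^ -~]*//'
}

# A made-up log line that starts with three NUL bytes, as after a crash.
printf '\0\0\0temp=21.5 ok\n' | strip_leading_junk
```

Lines that already start with a printable character do not match the address, so they pass through unchanged.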

It worked well; I tested it extensively from the command line before activating the rule.
That was a few months ago. Investigating the broken lines was frustrating, as I could
not reproduce the problem from the command line.

After much head scratching I realized that some rc.d scripts specifically set 'LC_ALL=C'
(also seen in 'set|grep LC_ALL'), so I tried the command line test after 'unset LC_ALL'
and the problem now reproduced.

I suspect that the selection [^ -~], which is supposed to find unexpected characters in the
log, did not treat the range as expected. It actually removed the first word of all lines
starting with '^[0-9a-z]*' and touched no others.

Luckily the important data was still in the log.

So now I wanted to see what the ascii map (man ascii) is for different locales.
I could not find a standard command to show the charset for a setting of LC_ALL.
I can see the charmap of the locale though:

$ locale -c charmap
LC_CTYPE
ANSI_X3.4-1968
$ (export LC_ALL=C ; locale -c charmap)
LC_CTYPE
ANSI_X3.4-1968
$ (unset LC_ALL ; locale -c charmap)
LC_CTYPE
UTF-8

Running this (grabbed from the 'net) with 'LC_ALL=C' or 'unset LC_ALL' shows the same value:

$ awk 'BEGIN{for(n=0;n<256;n++)ord[sprintf("%c",n)]=n}{printf("%x\n", ord[$0])}'
      [this character is a space]
20
~
7e
^C

So the encoding is the same. I now looked at character collating order by examining:

$ awk 'BEGIN{for (n=0;n<127;n++) printf("%c\n",n)}' | sort | less

And this is very different for LC_ALL=C or not. 'sed' clearly uses collating order in
character ranges, not character values. Good to know, though I should have remembered
this from other contexts.
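
One way to avoid the surprise is to pin the locale for that one command rather than rely on whatever the calling environment has. This is a sketch only: it assumes GNU sed and that the cleanup should always work on byte values, and the function name clean_log is made up for illustration.

```shell
# Hypothetical wrapper: clean_log is an illustrative name, not from rc.local.
# LC_ALL=C applies only to this sed invocation, so [^ -~] means
# "any byte outside 0x20..0x7E" regardless of the caller's locale.
clean_log() {
    LC_ALL=C sed -i '/^[^ -~]/s/^[^ -~]*//' "$1"
}
```

It would be called as 'clean_log "$log"'. The same per-command prefix makes the collation difference easy to see: 'printf "b\nA\na\nB\n" | LC_ALL=C sort' gives A, B, a, b (byte order), while many UTF-8 locales interleave upper and lower case instead.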

cheers

-- 
Eyal at Home (eyal at eyal.emu.id.au)


