[PATCH] provision: use ASCII quotes
Douglas Bagnall
douglas.bagnall at catalyst.net.nz
Wed Mar 13 10:44:24 UTC 2019
Philipp Gesang via samba-technical wrote:
>>> - input_file = open(input_file_name, "r")
>>> + input_file = io.open(input_file_name, "rt", encoding='utf8')
>> I had that first actually but then I tested all ldif files in the
>> tree and it turned out that only these two codepoints in a single
>> file were affected.
>>
>> io.open() and open() are the same btw. and "t" mode is redundant.
They are *now*. Last week they weren't! That snippet is from those ancient
times when we theoretically supported 2.6.
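For anyone who wants to check, here is a minimal sketch, assuming it is run under Python 3. Under Python 2 the comparison would come out False, because only io.open() took an encoding argument and returned unicode text there. The variable name is just for illustration.

```python
import io
import sys

# On Python 3 the builtin open() and io.open() are the very same object,
# so switching between them changes nothing.
same_function = io.open is open

print("python major version:", sys.version_info[0])
print("io.open is open:", same_function)
```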
But anyway, I am OK with editing extended-rights.ldif, because as you say
it has already been edited and is the only funny one there.
I had a look elsewhere with uchardet:
$ for x in $(git ls-files | grep -v /heimdal | grep -v third_party/ | grep -vP '\.tdb(\.dump)?$' | grep -vP '\.(reg|png|gpg|po|gz|keytab|SAMBABACKUP|dat)$' | grep -v CA-samba.example.com | grep -v examples ); do [ -f $x ] && [ $(uchardet $x) != ASCII ] && printf '%20s %s\n' $(uchardet $x); done | sort | uniq -c
40 ISO-8859-1
1 ISO-8859-15
2 ISO-8859-9
7 unknown
70 UTF-8
6 WINDOWS-1250
13 WINDOWS-1252
33 WINDOWS-1258
The non-UTF-8 detections are mostly false positives in C files where people
spell their names correctly in copyright lines. Those files are really UTF-8,
but (e.g.) ö is stored as two bytes, each of which reads as a separate
ISO-8859 character if you look at it wrong, and uchardet does.
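A small sketch of that misreading, using nothing but the standard codecs (the variable names are only for illustration):

```python
# "ö" (U+00F6) takes two bytes in UTF-8.
text = u"\u00f6"
utf8_bytes = text.encode("utf-8")

# Decoding the same bytes as ISO-8859-1 never fails, because every
# byte value maps to some character there; it just yields two
# unrelated characters instead of one.
misread = utf8_bytes.decode("iso-8859-1")

print(repr(utf8_bytes))
print(len(text), len(misread))
```

This is why a byte-level detector can plausibly report ISO-8859-1 for a file that is really UTF-8: both readings are valid, and only one of them is what the author meant.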
The non-ASCII we parse is probably these:
ISO-8859-1 source4/setup/ad-schema/Attributes_for_AD_DS__Windows_Server_2008_R2.ldf
ISO-8859-1 source4/setup/ad-schema/Attributes_for_AD_DS__Windows_Server_2012.ldf
ISO-8859-1 source4/setup/ad-schema/Classes_for_AD_DS__Windows_Server_2008_R2.ldf
ISO-8859-1 source4/setup/ad-schema/Classes_for_AD_DS__Windows_Server_2012.ldf
UTF-8 libcli/util/ntstatus_err_table.txt
UTF-8 libcli/util/werror_err_table.txt
UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_Attributes.txt
UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_Classes.txt
UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_R2_Attributes.txt
UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_R2_Classes.txt
UTF-8 source4/setup/extended-rights.ldif
WINDOWS-1252 source4/setup/ad-schema/AD_DS_Attributes__Windows_Server_2012_R2.ldf
WINDOWS-1252 source4/setup/ad-schema/AD_DS_Attributes__Windows_Server_2016.ldf
WINDOWS-1252 source4/setup/ad-schema/AD_DS_Classes__Windows_Server_2012_R2.ldf
WINDOWS-1252 source4/setup/ad-schema/AD_DS_Classes__Windows_Server_2016.ldf
WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k0.txt
WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k3R2.txt
WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k3.txt
WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k8R2.txt
WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k8.txt
and at least some of the detections are correct. The non-UTF-8 ones must already
have special handling. And meanwhile...
>> read_and_sub_file() is used in other contexts as well so I
>> triggered a CI run; let’s see what breaks ;)
>> https://gitlab.com/samba-team/devel/samba/pipelines/51577845
it passed!
I think we want that, because these files are data and we don't want to leave their
interpretation up to environment variables.
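To illustrate the point, a rough sketch of reading a data file with a pinned encoding. The helper name read_utf8 and the sample line are hypothetical; this is not the actual read_and_sub_file() code, only the same io.open() call as in the quoted diff.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a text file as UTF-8 regardless of LANG/LC_ALL (hypothetical helper)."""
    with io.open(path, "r", encoding="utf8") as f:
        return f.read()


# Made-up sample line containing non-ASCII quotes.
sample = u"displayName: \u2018Send As\u2019\n"

fd, path = tempfile.mkstemp(suffix=".ldif")
try:
    # Write raw bytes so no locale-dependent encoding is involved.
    with os.fdopen(fd, "wb") as f:
        f.write(sample.encode("utf8"))
    roundtrip = read_utf8(path)
finally:
    os.remove(path)

print(roundtrip == sample)
```

Without the encoding argument, Python 3 picks the encoding from the locale, so the same file could decode differently (or fail) depending on the environment it is run in.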
>>> If it does, I would prefer that.
>> Works for me.
Noel Power wrote:
> lgtm - on a side note Douglas we already talked about this before when
> we came across some similar issue and you did some code analysis on
> existing open vs io.open (under python2).
Yes. Perhaps I did. Obviously io.open defaults to unicode,
and there is this:
https://gitlab.com/samba-team/devel/samba/commit/2e231541b48fc97ca013079585ef556efda6cb95
but all those subtleties... who cares any more if py2 is gone?
cheers,
Douglas