[PATCH] provision: use ASCII quotes

Wed Mar 13 10:44:24 UTC 2019

Philipp Gesang via samba-technical wrote:
>>> -    input_file = open(input_file_name, "r")
>>> +    input_file = io.open(input_file_name, "rt", encoding='utf8')
>> I had that first actually but then I tested all ldif files in the
>> tree and it turned out that only these two codepoints in a single
>> file were affected.
>>
>> io.open() and open() are the same btw. and "t" mode is redundant.

They are *now*. Last week they weren't! That snippet is from those ancient
times when we theoretically supported 2.6.
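
A quick way to check both points, as a minimal sketch assuming a Python 3
interpreter (the variable names and the temporary file are only for
illustration, nothing here comes from the Samba tree):

```python
import io
import os
import tempfile

# On Python 3, io.open is the very same object as the builtin open.
same_function = io.open is open

# Text mode is the default, so "r" and "rt" yield the same kind of file object.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "r") as f_r, open(path, "rt") as f_rt:
        same_file_type = type(f_r) is type(f_rt)
finally:
    os.remove(path)

print(same_function, same_file_type)
```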

But anyway, I am OK with editing extended-rights.ldif, because as you say 
it has already been edited and is the only funny one there.

I had a look elsewhere with uchardet:

$ for x in $(git ls-files | grep -v /heimdal | grep -v third_party/ | grep -vP '\.tdb(\.dump)?$' | grep -vP '\.(reg|png|gpg|po|gz|keytab|SAMBABACKUP|dat)$' | grep -v CA-samba.example.com | grep -v examples ); do [ -f $x ] && [ $(uchardet $x) != ASCII ] && printf '%20s %s\n' $(uchardet $x); done  | sort | uniq -c
     40           ISO-8859-1 
      1          ISO-8859-15 
      2           ISO-8859-9 
      7              unknown 
     70                UTF-8 
      6         WINDOWS-1250 
     13         WINDOWS-1252 
     33         WINDOWS-1258

The non-UTF-8s are mostly false positives in C files where people spell
their names correctly in copyright lines. They are really UTF-8, but
(e.g.) ö is two bytes in UTF-8, which read as two ISO-8859 chars if you
look at it wrong, and uchardet does.
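
To illustrate the misreading, here is a small sketch in Python 3. It only
shows how the UTF-8 bytes of ö look when decoded as ISO-8859-1; it makes no
claim about how uchardet reaches its guess.

```python
text = "\u00f6"  # ö

# UTF-8 encodes ö as two bytes: 0xC3 0xB6.
utf8_bytes = text.encode("utf-8")

# Decoded as ISO-8859-1, each byte becomes its own character.
misread = utf8_bytes.decode("iso-8859-1")

# Decoding with the right codec gives the original character back.
recovered = utf8_bytes.decode("utf-8")

print(len(utf8_bytes), len(misread), recovered == text)
```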

The non-ASCII we parse is probably these:

          ISO-8859-1 source4/setup/ad-schema/Attributes_for_AD_DS__Windows_Server_2008_R2.ldf 
          ISO-8859-1 source4/setup/ad-schema/Attributes_for_AD_DS__Windows_Server_2012.ldf 
          ISO-8859-1 source4/setup/ad-schema/Classes_for_AD_DS__Windows_Server_2008_R2.ldf 
          ISO-8859-1 source4/setup/ad-schema/Classes_for_AD_DS__Windows_Server_2012.ldf 
               UTF-8 libcli/util/ntstatus_err_table.txt 
               UTF-8 libcli/util/werror_err_table.txt 
               UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_Attributes.txt 
               UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_Classes.txt 
               UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_R2_Attributes.txt 
               UTF-8 source4/setup/ad-schema/MS-AD_Schema_2K8_R2_Classes.txt 
               UTF-8 source4/setup/extended-rights.ldif 
        WINDOWS-1252 source4/setup/ad-schema/AD_DS_Attributes__Windows_Server_2012_R2.ldf 
        WINDOWS-1252 source4/setup/ad-schema/AD_DS_Attributes__Windows_Server_2016.ldf 
        WINDOWS-1252 source4/setup/ad-schema/AD_DS_Classes__Windows_Server_2012_R2.ldf 
        WINDOWS-1252 source4/setup/ad-schema/AD_DS_Classes__Windows_Server_2016.ldf 
        WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k0.txt 
        WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k3R2.txt 
        WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k3.txt 
        WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k8R2.txt 
        WINDOWS-1252 source4/setup/display-specifiers/DisplaySpecifiers-Win2k8.txt 

and at least some of the detections are correct. The non-UTF-8 ones must already
have special handling. And meanwhile...

>> read_and_sub_file() is used in other contexts as well so I
>> triggered a CI run; let’s see what breaks ;)
>> https://gitlab.com/samba-team/devel/samba/pipelines/51577845

it passed!

I think we want that, because these files are data and we don't want to leave their
interpretation up to environment variables.
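
As a sketch of what that means in practice: without an encoding argument,
Python 3's open() falls back on the locale's preferred encoding, which
variables such as LANG or LC_ALL can change. The helper below is
hypothetical (it is not read_and_sub_file(), and the sample string and
temporary file are made up for the demo); it only shows pinning the
encoding.

```python
import os
import tempfile


def read_data_file(path):
    """Hypothetical helper: read a data file as UTF-8, whatever the locale."""
    with open(path, "r", encoding="utf8") as f:
        return f.read()


# Made-up sample containing the two non-ASCII quote codepoints.
sample = "displayName: \u201cquoted\u201d\n"

# Write the bytes directly so the on-disk encoding is known to be UTF-8.
fd, demo_path = tempfile.mkstemp(suffix=".ldif")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

try:
    round_tripped = read_data_file(demo_path)
finally:
    os.remove(demo_path)

print(round_tripped == sample)
```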

>>> If it does, I would prefer that.
>> Works for me.

Noel Power wrote:

> lgtm - on a side note Douglas we already talked about this before when
> we came across some similar issue and you did some code analysis on
> existing open vs io.open (under python2).

Yes. Perhaps I did. Obviously io.open defaults to unicode,
and there is this:

https://gitlab.com/samba-team/devel/samba/commit/2e231541b48fc97ca013079585ef556efda6cb95

but all those subtleties... who cares any more if py2 is gone?

cheers,
Douglas