[clug] python unicode question

Mon Dec 8 01:19:29 GMT 2003

On  8 Dec 2003, Michael Cohen <michael.cohen at netspeed.com.au> wrote:
> Hi everyone,
>    I'm currently trying to learn python and can't get my head around the unicode 
> problems with python. I am trying to do a regex substitution on a string 
> which I get from the mime module. The string is mostly 7bit safe, but 
> occasionally there is a high bit value in there, so I keep getting 
> exceptions like:
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 16633: 
> ordinal not in range(128)
> 
>  So I tried to delete the high bit chars by doing:
>  tmp = payload.encode('ascii', 'replace')
> 
> or even
>  tmp = payload.encode('latin_1', 'replace')
> 
> Despite the 'replace' there, it keeps generating the same kind of exceptions, 
> which contradicts the manual.

No, you're just using it backwards.

The key to working safely with non-ascii strings in Python (or really
any language) is to be clear in your own mind about just what encoding
any given string is in.

Python has two string types: a byte buffer (type 'str') and a unicode
buffer (type 'unicode').  A unicode object can natively represent every
character and isn't in any particular encoding.  Normally you want to
do all your internal processing on unicode objects and translate
to/from an encoding at the point of i/o.

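Remember: decode is for input (bytes -> unicode) and encode is for
output (unicode -> bytes).  When you call encode on a byte string,
Python first has to decode it to unicode, implicitly using the default
'ascii' codec, and that implicit decode is what raises your
UnicodeDecodeError.  The 'replace' argument never gets a chance to
apply.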
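
A minimal sketch of the two directions, not code from the original
question.  It is written with the b'' and u'' literal prefixes so that
it also runs on Python 3, where the byte type is called bytes and the
unicode type is called str; the variable names are only illustrative.

```python
# Bytes as they might arrive from a file or a mail body (latin-1 here).
raw = b'hello w\xf8rld'

# Input: decode bytes -> unicode, naming the encoding explicitly.
text = raw.decode('latin-1')

# Output: encode unicode -> bytes.  'replace' substitutes '?' for any
# character the target encoding cannot represent.
ascii_out = text.encode('ascii', 'replace')
utf8_out = text.encode('utf-8')
```

Note that 'replace' only does something useful here because it is
applied to a unicode object on the way out, not to raw bytes.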

So what you ought to do is find the encoding by looking at the mime
headers, and then do

  u = payload.decode(e)

Assuming the mime headers are accurate and Python knows the codec, that
will work.  Well-formed mime messages containing non-ascii text
identify their charset in the Content-Type header.

If the headers are missing, or if that operation raises an exception,
then you might want to assume that the mail is in latin-1.  This will
accept every byte value, but if the mail is not actually latin-1 it
will mangle the non-ascii characters.  So

  u = payload.decode('latin_1')

For example

>>> import re
>>> a = 'hello w\xf8rld'
>>> a.decode('latin-1')
u'hello w\xf8rld'
>>> a.decode('latin-1').upper()
u'HELLO W\xd8RLD'
>>> au = a.decode('latin-1')
>>> re.findall('[aeoiu]', au)
[u'e', u'o']

See how the o-slash character \xf8 got capitalized to \xd8?
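
Putting the header charset and the latin-1 fallback together, one
possible shape is the helper below.  This is only a sketch:
decode_payload and its parameters are hypothetical names, and it
assumes you have already pulled the charset (or None) out of the mime
headers yourself.

```python
def decode_payload(payload, charset=None, fallback='latin-1'):
    """Return the byte string payload as unicode text.

    charset is whatever the mime headers declared, or None if absent.
    """
    if charset:
        try:
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in it.
            pass
    # latin-1 maps every byte to a character, so this cannot fail,
    # but it may mangle text that was really in another encoding.
    return payload.decode(fallback)
```

Whether silently falling back is acceptable depends on the
application; logging the failure before falling back may be the safer
choice.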

> Also I don't really want to lose the non-ascii chars anyway, I wish
> the re module could handle non-ascii chars like the manual claims.

It can if you use it properly.
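
As a rough illustration of a substitution that keeps the non-ascii
characters, assuming the payload turned out to be latin-1 (byte 0xa9 is
the copyright sign there).  The sample text and names are made up, and
the b''/u'' prefixes are used so the sketch also runs on Python 3.

```python
import re

raw = b'Copyright \xa9 2003 Example'
text = raw.decode('latin-1')          # bytes -> unicode first

# Substitute on the unicode object; the non-ascii character survives.
result = re.sub(u'[0-9]{4}', u'YEAR', text)

# Non-ascii characters can also be matched directly in the pattern.
stripped = re.sub(u'\xa9', u'(c)', text)
```

Encode the result back to bytes only when you write it out.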

> I finally worked out a workable solution - hack the site.py file and set 
> the default encoding to 'ISO-8859-1' which works fine. Apparently this is the 
> default in most other languages - it seems really odd that python chooses 
> ascii as the default, I wonder why that is. This fix seems a little extreme to 
> me, however, since it's a site wide change, is this really the best way to fix 
> this?

The default is 'ascii' because that's a safe subset of nearly every
encoding in common use.  If you want to use something else you need to
explicitly say so.  It would be dangerous for Python to assume that you
want to treat things as 8859-1, because that would silently corrupt
data the first time you hit, say, a Japanese message.  It's better to
explicitly trap the error and make you think about it.
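
To make the danger concrete, here is a small sketch.  It assumes the
Japanese message is in Shift_JIS, which is just one of several
encodings such mail might use; the text and names are illustrative.

```python
# Two kanji, encoded the way a Shift_JIS mail body would carry them.
original = u'\u65e5\u672c'
raw = original.encode('shift_jis')

# Decoding as latin-1 raises nothing but yields the wrong characters.
wrong = raw.decode('latin-1')

# Decoding as ascii fails loudly, so the problem cannot go unnoticed.
try:
    raw.decode('ascii')
    trapped = False
except UnicodeDecodeError:
    trapped = True

# Only the right codec gets the text back.
right = raw.decode('shift_jis')
```

The latin-1 result is the kind of corruption a site-wide 8859-1 default
would produce without any warning.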

> I don't have any experience with unicode until now but it seems like a big 
> pain, is there a way to turn all this off, like in perl's export LC_ALL or 
> somesuch?

Trying to pretend Unicode doesn't exist just causes data corruption.
The Python way is very good once you get used to it.

-- 
Martin 
                               linux.conf.au -- Adelaide, January 2004