[clug] python unicode question

Sun Dec 7 13:35:40 GMT 2003

Hi everyone,
   I'm currently trying to learn Python and can't get my head around its 
Unicode handling. I am trying to do a regex substitution on a string 
which I get from the mime module. The string is mostly 7-bit safe, but 
occasionally there is a high-bit value in there, so I keep getting 
exceptions like:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 16633: 
ordinal not in range(128)

 So I tried to delete the high-bit chars by doing:
 tmp = payload.encode('ascii', 'replace')

or even
 tmp = payload.encode('latin_1', 'replace')

Despite the 'replace' there, it keeps generating the same kind of exception, 
which seems to contradict the manual. Also, I don't really want to lose the 
non-ASCII chars anyway; I wish the re module could handle non-ASCII chars 
like the manual claims.
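
A minimal sketch of what appears to be going on, assuming payload is a byte 
string rather than a unicode object (the sample bytes below are made up). In 
Python 2, calling .encode() on a byte string first decodes it implicitly with 
the default 'ascii' codec, and the 'replace' argument applies only to the 
encode step, so the decode still raises. The sketch spells that implicit 
decode out explicitly, so it also runs on Python 3, where byte strings have 
no .encode() method at all.

```python
# Hypothetical sample: mostly 7-bit text with one high-bit byte (0xa9).
payload = b"Copyright \xa9 2003"

# Roughly what Python 2 does for payload.encode('ascii', 'replace') on a
# byte string: a strict ASCII decode first, then the requested encode.
error = None
try:
    payload.decode("ascii").encode("ascii", "replace")
except UnicodeDecodeError as exc:
    error = exc  # raised by the decode step; 'replace' is never reached

# Giving the error handler to the decode step avoids the exception,
# at the cost of losing the character (it becomes U+FFFD).
lossy = payload.decode("ascii", "replace")

# Decoding as Latin-1 keeps the character: every byte maps to a code point.
text = payload.decode("latin_1")
```

If that reading is right, the manual is not being contradicted: the error 
handler is simply attached to the wrong half of the conversion.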

I finally worked out a workable solution: hack the site.py file and set 
the default encoding to 'ISO-8859-1', which works fine. Apparently this is the 
default in most other languages. It seems really odd that Python chooses 
ASCII as the default; I wonder why that is. This fix seems a little extreme to 
me, however, since it's a site-wide change. Is this really the best way to fix 
this?
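
A possibly less invasive alternative to editing site.py, sketched under the 
assumption that the payload really is ISO-8859-1 (the charset declared in the 
message headers would be the proper source; the sample bytes and the pattern 
here are invented for illustration): decode to unicode at the boundary, run 
the regex on the unicode string, and encode back only when bytes are needed.

```python
import re

# Hypothetical payload bytes; 0xa9 is the copyright sign in ISO-8859-1.
payload = b"Copyright \xa9 2003 Example Corp"

# Decode explicitly instead of relying on the interpreter-wide default.
text = payload.decode("iso-8859-1")

# The pattern is illustrative only: replace the four-digit year.
result = re.sub(r"\d{4}", "2004", text)

# Encode back to bytes when handing the data on; the non-ASCII
# character survives the round trip.
output = result.encode("iso-8859-1")
```

This keeps the encoding choice local to the code that handles the mail 
payload, so nothing else on the machine is affected.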

I haven't had any experience with Unicode until now, but it seems like a big 
pain. Is there a way to turn all this off, like with Perl's export LC_ALL or 
somesuch?

Michael.