[clug] python unicode question
Michael Cohen
michael.cohen at netspeed.com.au
Sun Dec 7 13:35:40 GMT 2003
Hi everyone,
I'm currently trying to learn Python and can't get my head around the Unicode
problems with Python. I am trying to do a regex substitution on a string
which I get from the mime module. The string is mostly 7-bit safe, but
occasionally there is a high-bit value in there, so I keep getting
exceptions like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 16633:
ordinal not in range(128)
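A minimal sketch that reproduces the same class of error, assuming the payload is a byte string containing the byte 0xa9. The sample bytes and the name `payload` are made up for illustration and are not the real message:

```python
# Hypothetical sample: a mostly 7-bit byte string with one high-bit byte (0xa9).
payload = b"Copyright \xa9 2003"

try:
    # The ascii codec rejects any byte >= 0x80, so this decode raises.
    payload.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Prints the codec's complaint about the offending byte and its position.
print(error)
```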
So I tried to delete the high bit chars by doing:
tmp = payload.encode('ascii', 'replace')
or even
tmp = payload.encode('latin_1', 'replace')
Despite the 'replace' there, it keeps generating the same kind of exceptions,
which contradicts the manual. Also, I don't really want to lose the non-ASCII
chars anyway; I wish the re module could handle non-ASCII chars like the manual
claims.
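One likely explanation, offered as a sketch rather than a diagnosis of the real code: in Python 2, calling encode() on a byte string first decodes it implicitly with the default ascii codec, and the 'replace' argument applies only to the encode step, so the decode step still raises. Decoding explicitly avoids that. The example below assumes the payload is really Latin-1; the sample bytes and the year-replacing regex are hypothetical.

```python
import re

# Hypothetical sample payload; latin_1 is an assumption about its real encoding.
payload = b"Copyright \xa9 2003 Example"

# Decode explicitly instead of relying on the default ascii codec.
# latin_1 maps every byte 0x00-0xff to a code point, so this cannot fail.
text = payload.decode("latin_1")

# The regex substitution now runs on text, high-bit characters included.
text = re.sub(r"\d{4}", "2004", text)

# Encode back to bytes; 'replace' only matters for characters latin_1 lacks.
tmp = text.encode("latin_1", "replace")

# If dropping the high-bit characters is acceptable, 'replace' belongs on decode.
lossy = payload.decode("ascii", "replace")
```

This keeps the change local to the one string and leaves the interpreter-wide default encoding alone.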
I finally worked out a workable solution - hack the site.py file and set
the default encoding to 'ISO-8859-1', which works fine. Apparently this is the
default in most other languages - it seems really odd that Python chooses
ASCII as the default, I wonder why that is. This fix seems a little extreme to
me, however, since it's a site-wide change. Is this really the best way to fix
this?
I didn't have any experience with Unicode until now, but it seems like a big
pain. Is there a way to turn all this off, like in Perl's export LC_ALL or
somesuch?
Michael.