拒绝懒惰的家: Python Cookbook 1.20 Handling Text That Contains Unicode Characters

2007-06-26

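Problem:

You need to process text that contains non-ASCII characters.

Discussion:

Python provides the unicode type as an alternative to str for text that may contain non-ASCII characters. You build a unicode object from a byte string plus the name of the encoding those bytes are in: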

>>> german_ae = unicode('\xc3\xa4', 'utf8')

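Here german_ae holds a special German character (ä), and '\xc3\xa4' is its UTF-8 encoding. There are many encodings, but UTF-8 is widely used internationally: it can represent every Unicode character, and it encodes ASCII characters as the same single bytes they already have in ASCII.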
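
As a small illustration of the two properties just mentioned, the sketch below decodes a UTF-8 byte string and checks that pure ASCII bytes are unchanged by a UTF-8 round trip. It is not from the recipe itself; it uses bytes.decode instead of the unicode() constructor so that it also runs on Python 3, where the unicode type is simply called str. The variable names are only for illustration.

```python
# Two bytes in UTF-8 become one character once decoded.
utf8_bytes = b'\xc3\xa4'
german_ae = utf8_bytes.decode('utf-8')

# Pure ASCII data is already valid UTF-8, so it survives a
# decode/encode round trip byte for byte.
ascii_bytes = b'plain ASCII'
ascii_text = ascii_bytes.decode('utf-8')
round_trip = ascii_text.encode('utf-8')
```
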
In Python, unicode and str objects also work well together, as long as the str contains only ASCII characters.

>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])

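Whenever an expression involves a unicode object, the result is also a unicode object. So although the code above looks exactly like ordinary str code, para ends up as a unicode object.
If a str in the expression contains non-ASCII bytes, however, the implicit conversion raises an exception: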

>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position 0: ordinal not in range(128)
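
The failure above comes from an implicit ASCII decode of the byte string. The sketch below performs the same decode explicitly, which makes the cause visible and also runs on Python 3 (where mixing bytes and text raises a TypeError instead of decoding implicitly). The variable names are illustrative only.

```python
bytestring = b'\xc3\xa4'  # UTF-8 bytes, not valid ASCII

try:
    # This is the step Python 2 performs behind the scenes when a
    # str is combined with a unicode object.
    bytestring.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the real encoding avoids the error.
german_ae = bytestring.decode('utf-8')
```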

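Once you know a few basic techniques, working with Unicode in Python is fairly convenient, although that does not mean the processing is always efficient.
As the examples above show, str and unicode do differ. The first thing to remember is that a unicode string must be constructed from a byte sequence together with an encoding.
The second is that encoding conversion problems come up often, because Python implicitly converts str to unicode whenever the two are mixed in an expression. Python assumes that the str is ASCII, and raises a UnicodeDecodeError if it contains any non-ASCII byte.
Developers who have worked on large Python projects sum up the solution in one sentence: do the conversion at the I/O boundaries. In detail, this means: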

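When the program receives data from outside (a file, the network, or user input), determine the correct encoding and construct unicode objects immediately. When handling an HTTP message, for example, the encoding can be determined from the headers.
When the program sends data to the outside (a file, the network, or output to the user), determine the correct encoding and convert the text to a byte string in that encoding. Otherwise Python tries to convert it to an ASCII byte string, and any non-ASCII character raises a UnicodeEncodeError.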
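
The two rules above can be sketched as follows. This is a minimal illustration, not code from the recipe: the helper names charset_from_content_type, decode_input and encode_output are hypothetical, the header parsing is deliberately simplistic, and the default of UTF-8 is an assumption.

```python
import re

def charset_from_content_type(header_value, default='utf-8'):
    """Hypothetical helper: extract the charset from a Content-Type value."""
    match = re.search(r'charset="?([\w.:-]+)', header_value, re.IGNORECASE)
    return match.group(1) if match else default

def decode_input(raw_bytes, content_type):
    """Rule 1: turn incoming bytes into text as soon as they arrive."""
    return raw_bytes.decode(charset_from_content_type(content_type))

def encode_output(text, encoding='utf-8'):
    """Rule 2: turn text back into bytes only when it leaves the program."""
    return text.encode(encoding)

incoming = b'B\xe4r'  # the word "Bär" encoded as ISO-8859-1
text = decode_input(incoming, 'text/plain; charset=ISO-8859-1')
outgoing = encode_output(text)  # the same word, now as UTF-8 bytes
```

Between the two calls the program only ever handles text, so no implicit ASCII conversion can occur.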

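With these two rules you can handle most Unicode problems. If a UnicodeDecodeError or UnicodeEncodeError still appears, check whether you forgot to construct a unicode object somewhere, or whether an encoding conversion is wrong or missing.
To convert a unicode object into a byte string in a particular encoding, use code like this: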

>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'

Now bytestring holds the German character encoded in latin1. It represents the same character as the earlier '\xc3\xa4'; only the encoding differs.
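
To check that claim, the sketch below encodes one character in both encodings and decodes each result again. It is an illustration with made-up variable names, written with bytes.decode so that it also runs on Python 3.

```python
german_ae = b'\xc3\xa4'.decode('utf-8')

latin1_bytes = german_ae.encode('latin-1')  # one byte
utf8_bytes = german_ae.encode('utf-8')      # two bytes

# Different bytes, same character once each is decoded
# with its own encoding.
same_character = (latin1_bytes.decode('latin-1')
                  == utf8_bytes.decode('utf-8'))
```
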
Typing import this at the Python console prints more useful guidance (the Zen of Python, reproduced below).

Related notes:

decode(...)
    S.decode([encoding[,errors]]) -> string or unicode

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
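
The errors argument described above can be tried out as follows. This is a small sketch with illustrative variable names; it decodes UTF-8 bytes as ASCII on purpose so that each handler has something to react to.

```python
raw = b'caf\xc3\xa9'  # "café" in UTF-8; the last two bytes are not ASCII

# 'strict' (the default) raises on the first undecodable byte.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable bytes.
ignored = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode('ascii', 'replace')
```

Both lenient handlers lose information, so they are best kept for cases where damaged output is acceptable.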

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Tags: Python

# posted by tinylee @ 10:43 AM
