2007-06-28

 

Python Cookbook 1.23: Encoding Unicode Data for XML and HTML

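Problem:

You need to encode Unicode text for output as XML or HTML, typically in a limited encoding such as ASCII or Latin-1 that cannot represent every character.

Discussion:

Python provides an encoding error handler named xmlcharrefreplace. When the encoder meets a character that the target encoding cannot represent, this handler replaces the character with an XML numeric character reference.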

def encode_for_xml(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'xmlcharrefreplace')
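
The function above is Python 2 code. As a rough illustration of the same idea in Python 3 (an assumption on my part; the recipe itself targets Python 2), the sketch below shows that only the characters missing from the chosen encoding are turned into numeric references. The sample string is made up for the demonstration.

```python
# Python 3 sketch: str.encode returns bytes rather than a str.
def encode_for_xml(text, encoding='ascii'):
    # Characters that `encoding` cannot represent become &#NNNN; references.
    return text.encode(encoding, 'xmlcharrefreplace')

sample = 'caf\xe9 costs 3\u20ac'          # hypothetical sample text

ascii_bytes = encode_for_xml(sample)              # neither é nor € fits in ASCII
latin1_bytes = encode_for_xml(sample, 'latin-1')  # é fits in Latin-1, € does not

print(ascii_bytes)    # b'caf&#233; costs 3&#8364;'
print(latin1_bytes)   # b'caf\xe9 costs 3&#8364;'
```

Choosing a wider encoding simply means fewer characters fall through to the handler.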

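You can use this approach for HTML as well, but you may prefer HTML's symbolic entity references. To get them, you have to register your own encoding error handler that implements the required interface. The standard library module htmlentitydefs, which contains the HTML entity definitions, does most of the work: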

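import codecs
from htmlentitydefs import codepoint2name

def html_replace(exc):
    # Only encoding and translation errors carry the offending text span.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        # Map each unencodable character to its named entity, e.g. &eacute;
        s = [ u'&%s;' % codepoint2name[ord(c)]
              for c in exc.object[exc.start:exc.end] ]
        # Return the replacement text and the position at which to resume.
        return ''.join(s), exc.end
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

# Make the handler available under the name 'html_replace'.
codecs.register_error('html_replace', html_replace)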


Once the error handler is registered, you can wrap it in a small convenience function:

def encode_for_html(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'html_replace')
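
For readers on Python 3, the handler might look roughly like the sketch below. It assumes the html.entities module (the Python 3 home of codepoint2name). It also falls back to a numeric reference for characters that have no named entity, which is my own addition and not part of the recipe.

```python
import codecs
from html.entities import codepoint2name

def html_replace(exc):
    # Only encoding/translation errors carry the offending text span.
    if not isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError("can't handle %s" % type(exc).__name__)
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            parts.append('&%s;' % name)       # named entity, e.g. &eacute;
        else:
            parts.append('&#%d;' % ord(ch))   # fallback (not in the recipe)
    # Replacement text plus the index at which encoding should resume.
    return ''.join(parts), exc.end

codecs.register_error('html_replace', html_replace)

def encode_for_html(text, encoding='ascii'):
    return text.encode(encoding, 'html_replace')

print(encode_for_html('\xe9 and \u221e'))   # b'&eacute; and &infin;'
print(encode_for_html('\u4e2d'))            # b'&#20013;' (no named entity)
```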

A well-written Python module usually includes a usage example, for instance in an if __name__ == '__main__' block:

if __name__ == '__main__':
    # demo
    data = u'''\
<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
    print encode_for_xml(data)
    print encode_for_html(data)


Run as a standalone script, the module produces output like this (from encode_for_xml):

<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
   ...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)


and like this (from encode_for_html):

<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
   ...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)

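encode_for_xml is the more general of the two, because it works for any XML application, not just HTML. encode_for_html produces more readable output. If you view either output in a browser, the rendered result is the same.

Remember that Unicode data must be encoded before it is written to a file. UTF-8 is an ideal choice, because it can represent any Unicode character. Many users in Europe and America nevertheless prefer ASCII or Latin-1, and these encodings cannot represent characters outside their range. Python's built-in error handler xmlcharrefreplace replaces each unencodable character with an XML numeric character reference, such as &#8734; for the infinity sign.

This recipe also shows how to write a custom error handler, html_replace, for characters outside the target encoding. It represents them with the more readable HTML symbolic entities, such as &infin; for the same infinity sign. html_replace is less general than xmlcharrefreplace: it is of no use outside HTML applications, and as written it raises a KeyError for any character that has no named HTML entity.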

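Neither method helps when the output is in some other format. TeX and other markup languages, for example, do not recognize XML numeric character references. If you know how to build a character reference that the target language does recognize, however, you can follow the approach in this recipe and register a suitable encoding error handler.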
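
As a sketch of that idea, the Python 3 handler below emits references in a made-up <U+XXXX> notation. The notation and the handler name codepoint_tag are hypothetical and stand in for whatever the target language really accepts; this is not TeX syntax.

```python
import codecs

def codepoint_tag_replace(exc):
    # Hypothetical handler: writes each unencodable character as <U+XXXX>.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    replacement = ''.join('<U+%04X>' % ord(ch) for ch in bad)
    # Resume encoding right after the span that failed.
    return replacement, exc.end

codecs.register_error('codepoint_tag', codepoint_tag_replace)

tagged = 'x \u2192 \u221e'.encode('ascii', 'codepoint_tag')
print(tagged)   # b'x <U+2192> <U+221E>'
```
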
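Another approach, and an efficient one, is to open the output file through the codecs module of the Python standard library and name the registered handler there: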

outfile = codecs.open('out.html', mode='w', encoding='ascii',
                       errors='html_replace')

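Now you can call outfile.write(unicode_data), and every Unicode character outside the specified encoding is passed to the error handler. When you are finished, do not forget to call outfile.close().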
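
In Python 3 the built-in open() accepts the same encoding and errors arguments, so a similar effect can be sketched as follows. This is an assumption about how one might port the idea, not something the recipe shows. The sketch uses the built-in xmlcharrefreplace handler to stay self-contained and writes to a temporary directory; a with block closes the file automatically.

```python
import os
import tempfile

# Temporary location so the sketch leaves no file in the working directory.
path = os.path.join(tempfile.mkdtemp(), 'out.html')

# Any character ASCII cannot hold is routed through the named error handler.
with open(path, 'w', encoding='ascii', errors='xmlcharrefreplace') as outfile:
    outfile.write('<li>\u221e (infinity)')

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as infile:
    written = infile.read()

print(written)   # b'<li>&#8734; (infinity)'
```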

Related documentation:

codecs.register_error(...)
    register_error(errors, handler)
   
    Register the specified error handler under the name
    errors. handler must be a callable object, that
    will be called with an exception instance containing
    information about the location of the encoding/decoding
    error and must return a (replacement, new position) tuple.

codecs.open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
    Open an encoded file using the given mode and return
    a wrapped version providing transparent encoding/decoding.

    Note: The wrapped version will only accept the object format
    defined by the codecs, i.e. Unicode objects for most builtin
    codecs. Output is also codec dependent and will usually be
    Unicode as well.

    Files are always opened in binary mode, even if no binary mode
    was specified. This is done to avoid data loss due to encodings
    using 8-bit values. The default file mode is 'rb' meaning to
    open the file in binary read mode.

    encoding specifies the encoding which is to be used for the
    file.

    errors may be given to define the error handling. It defaults
    to 'strict' which causes ValueErrors to be raised in case an
    encoding error occurs.

    buffering has the same meaning as for the builtin open() API.
    It defaults to line buffered.

    The returned wrapped file object provides an extra attribute
    .encoding which allows querying the used encoding. This
    attribute is only available if an encoding was specified as
    parameter.

unicode.encode(...)
    S.encode([encoding[,errors]]) -> string or unicode

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that can handle UnicodeEncodeErrors.
