拒绝懒惰的家: Python Cookbook 2.6, Processing Every Word in a File

2007-07-10

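Python Cookbook 2.6: Processing Every Word in a File

Problem:

You need to process every word in a file.

Discussion:

This can be done with two nested loops: the outer one iterates over the lines of the file, and the inner one over the words in each line: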

for line in open(thefilepath):
    for word in line.split( ):
        dosomethingwith(word)

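The inner for statement defines a word as a sequence of characters delimited by whitespace (the Unix wc command does the same). To define words differently, you can use a regular expression: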

import re
re_word = re.compile(r"[\w'-]+")
for line in open(thefilepath):
    for word in re_word.finditer(line):
        dosomethingwith(word.group(0))

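Here, a word is defined as a maximal sequence of alphanumeric characters (letters, digits, and underscores, as matched by \w), hyphens, and apostrophes.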
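As a rough illustration of how the two definitions differ, the sketch below applies both to the same line. The sample sentence is made up for this example, and re.findall is used here only as a shortcut for collecting the matched strings that finditer would produce one by one.

```python
import re

# A made-up sample line, for illustration only.
line = "Don't panic: it's a well-known, 2-step fix!"

# Whitespace definition: punctuation stays attached to the words.
by_whitespace = line.split()

# Regular-expression definition: only \w characters, apostrophes and hyphens.
by_regex = re.findall(r"[\w'-]+", line)

print(by_whitespace)
print(by_regex)
```

With the whitespace definition, tokens such as "panic:" and "fix!" keep their punctuation; with the regular expression they do not, while "Don't" and "well-known" stay intact in both cases.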

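To change the definition of a word, you only need to change the regular expression; the outer loop stays the same.
Whenever you have an iteration to perform, you can wrap it in an iterator object. This kind of encapsulation simplifies both the code and its use (here, with a generator built on yield):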

def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        # line_to_words turns one line into an iterable of words
        for word in line_to_words(line):
            yield word
    # reached only once every line has been consumed
    the_file.close()

for word in words_of_file(thefilepath):
    dosomethingwith(word)

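This approach separates the problem into two simpler parts: how to iterate over the items (here, the words in a file), and how to process each item that the iteration produces. Once the iteration is cleanly encapsulated (here, in a generator), what remains is a plain for loop. You can reuse the iterator many times throughout a program, and during maintenance you can change the iterator alone rather than every loop that uses it. The benefit is the same as writing functions instead of copying and pasting code everywhere; in Python, you can apply the same technique to encapsulate looping structures.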
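To make the reuse argument concrete, here is a minimal sketch in which two unrelated consumers share the same generator. The temporary file and its contents are hypothetical, created only for this demonstration, and words_of_file is repeated so that the example runs on its own.

```python
import os
import tempfile
from collections import Counter

def words_of_file(thefilepath, line_to_words=str.split):
    # Same generator as in the recipe, repeated so this sketch is self-contained.
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()

# Hypothetical sample file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("the quick brown fox\njumps over the lazy dog\n")

# Two different consumers, one shared piece of iteration logic.
counts = Counter(words_of_file(path))        # word frequencies
longest = max(words_of_file(path), key=len)  # first word of maximal length

print(counts["the"])
print(longest)

os.remove(path)
```

Neither consumer knows how words are obtained; if the definition of a word changes, only words_of_file (or its line_to_words argument) needs to change.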
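In words_of_file, we make sure the file is closed once it has been fully processed, which is good practice, and we turn each line into words with the split method. If you would rather define words with a regular expression, you can use the following 'hook' code: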

import re
def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    wre = re.compile(repattern)
    def line_to_words(line):
        # yield (not return) so that every match on the line is produced
        for mo in wre.finditer(line):
            yield mo.group(0)
    return words_of_file(thefilepath, line_to_words)

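The code above also supplies a default regular expression while allowing the caller to override it. In general, making the most common setting the default value makes code easier to use and more general.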
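The following sketch shows the default pattern next to a caller-supplied one. It assumes the inner helper is written as a generator (using yield), and both functions are repeated so the example is self-contained. The sample file and the letters-only pattern are hypothetical choices made for illustration.

```python
import os
import re
import tempfile

def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()

def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    wre = re.compile(repattern)
    def line_to_words(line):
        # A generator, so every match on the line is produced.
        for mo in wre.finditer(line):
            yield mo.group(0)
    return words_of_file(thefilepath, line_to_words)

# Hypothetical sample file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("It's a state-of-the-art parser, v2.\n")

default_words = list(words_by_re(path))               # default pattern
letters_only = list(words_by_re(path, r"[A-Za-z]+"))  # overridden pattern

print(default_words)
print(letters_only)

os.remove(path)
```

With the default pattern, "It's" and "state-of-the-art" stay whole; with the letters-only pattern they are broken up at every apostrophe, hyphen, and digit.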

Related notes:

re.finditer(pattern, string, flags=0)
    Return an iterator over all non-overlapping matches in the
    string. For each match, the iterator returns a match object.

    Empty matches are included in the result.
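A short sketch of what this means in practice: each item produced by re.finditer is a match object rather than a string, and a pattern that can match the empty string will also produce empty matches. The input strings and the \d* pattern are chosen only for illustration.

```python
import re

text = "a-b c"

# Each item is a match object: group(0) is the matched text,
# span() gives its (start, end) position in the string.
matches = [(mo.group(0), mo.span()) for mo in re.finditer(r"[\w'-]+", text)]
print(matches)

# A pattern that is allowed to match nothing (note the *) also
# reports empty matches, which show up as empty strings.
empty_ok = [mo.group(0) for mo in re.finditer(r"\d*", "a1")]
print(empty_ok)
```

The recipe's pattern uses + rather than *, so it never matches the empty string and never hands an empty word to the processing code.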

# posted by tinylee @ 10:22 AM