2007-06-18

 

Python Cookbook 1.13: Accessing Substrings

Problem:

You want to access parts of a string. For example, you have read a fixed-width record and want to extract its fields.

Discussion:

The simplest approach is slicing:

afield = theline[3:8]

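However, a slice can extract only one field at a time.
If you think in terms of field lengths instead, struct.unpack can extract several fields in one call: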

import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)

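If you want to skip the remainder rather than keep it, unpack only the initial part of the string, cut to exactly the length the format needs: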

l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
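As a rough illustration of the two struct.unpack snippets above, here is a minimal sketch run against a made-up 32-byte record. The sample data and the names base_size, fmt, l2, s1b and s2b are assumptions added for illustration. It is written for Python 3, where struct.unpack expects a bytes object rather than a text string.

```python
import struct

# Hypothetical fixed-width record: a 5-byte id, 3 filler bytes,
# two 8-byte fields, then a free-form tail.
theline = b"AB123---20070618PYTHON  trailing"

baseformat = "5s 3x 8s 8s"
base_size = struct.calcsize(baseformat)  # 5 + 3 + 8 + 8 bytes

# Variant 1: keep the remainder as one extra 's' field.
numremain = len(theline) - base_size
fmt = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(fmt, theline)

# Variant 2: ignore the remainder by unpacking only the leading bytes.
l2, s1b, s2b = struct.unpack(baseformat, theline[:base_size])

print(l, s1, s2, t)
```

Note that struct.unpack requires the data to be exactly as long as the format implies, which is why the second variant slices the line before unpacking.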

If you want to split the string into 5-byte chunks, use a list comprehension of slices:

fivers = [theline[k:k+5] for k in range(0, len(theline), 5)]
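To see what happens when the length is not a multiple of 5, here is a small sketch with a made-up 13-character string (the sample value is an assumption for illustration). The last chunk simply comes out shorter.

```python
# Hypothetical sample: 13 characters, so the last chunk is short.
theline = "abcdefghijklm"

fivers = [theline[k:k+5] for k in range(0, len(theline), 5)]
print(fivers)  # ['abcde', 'fghij', 'klm']
```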

Splitting a string into individual characters is simpler still:

chars = list(theline)

If you want to cut the string at a given sequence of positions, again use a list comprehension of slices:

cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]

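The call to zip inside the list comprehension returns a list of pairs of the form (cuts[k], cuts[k+1]), except that the first pair is (0, cuts[0]) and the last pair is (cuts[len(cuts)-1], None). In other words, each pair gives the start and end position of one slice. With the code above, the first piece covers indices 0-7, followed by 8-13, 14-19, 20-25 and 26-29, and the last piece runs from index 30 to the end of the string.
This approach gives you flexible control over the width of each piece.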
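The pairing described above can be checked directly. This is a minimal sketch with a made-up 40-character string; the names pairs and lengths are assumptions added for illustration. list() is wrapped around zip so the sketch also works on Python 3, where zip returns an iterator.

```python
# Hypothetical sample: 40 characters.
theline = "0123456789" * 4
cuts = [8, 14, 20, 26, 30]

# Each pair is (start, end) for one slice; None means "to the end".
pairs = list(zip([0] + cuts, cuts + [None]))
print(pairs)
# [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]

pieces = [theline[i:j] for i, j in pairs]
lengths = [len(p) for p in pieces]
print(lengths)  # [8, 6, 6, 6, 4, 10]
```

Because the slices are adjacent and the last one is open-ended, joining the pieces gives back the original string.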

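struct.unpack is used a lot in Python, and it is sometimes a good choice for splitting strings too. We can wrap the code above in a function that takes an extra parameter, lastfield, saying whether to keep the remainder of the string as a final field.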

def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

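The point to note here is how the lastfield parameter is handled: the expression lastfield and "s" or "x" picks "s" to keep the remainder of the string as a field, or "x" to skip it.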
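The and/or idiom only works as a selector when the first alternative is truthy. That holds here because "s" is a non-empty string. Since Python 2.5 the conditional expression says the same thing without that caveat. The sketch below compares the two; the helper names pick_andor and pick_ifelse are hypothetical and exist only for this comparison.

```python
def pick_andor(flag, yes, no):
    # The idiom used in fields(): relies on `yes` being truthy.
    return flag and yes or no

def pick_ifelse(flag, yes, no):
    # Conditional expression (Python 2.5 and later).
    return yes if flag else no

# Both agree when the first alternative is truthy, as with "s".
print(pick_andor(True, "s", "x"), pick_ifelse(True, "s", "x"))    # s s
print(pick_andor(False, "s", "x"), pick_ifelse(False, "s", "x"))  # x x

# They differ when the first alternative is falsy (an empty string).
print(repr(pick_andor(True, "", "x")))   # 'x'
print(repr(pick_ifelse(True, "", "x")))  # ''
```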

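Because fields may be called in a loop, it is more efficient to cache the format string it builds, using the tuple (baseformat, len(theline), lastfield) as the key. Here is a version that caches automatically: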

def fields(baseformat, theline, lastfield=False, _cache={ }):
    # build the key and try getting the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # no format string was cached, build and cache it
        numremain = len(theline)-struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)

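This version stores each precomputed format string in the _cache dictionary to save work on later calls. In testing, this approach improved speed by 30% to 40%. If fields is not a performance bottleneck in your program, you do not need it.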
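The caching relies on the fact that a default argument value is created once, when the function is defined, so the same dictionary is shared by every call. The sketch below is a Python 3 restatement of the cached version, with bytes input, the local renamed to fmt, and a conditional expression instead of and/or. It then inspects the shared dictionary through fields.__defaults__. The sample records are made up for illustration.

```python
import struct

def fields(baseformat, theline, lastfield=False, _cache={}):
    # The key covers everything the format string depends on.
    key = baseformat, len(theline), lastfield
    fmt = _cache.get(key)
    if fmt is None:
        # Not cached yet: build the full format string and remember it.
        numremain = len(theline) - struct.calcsize(baseformat)
        _cache[key] = fmt = "%s %d%s" % (
            baseformat, numremain, "s" if lastfield else "x")
    return struct.unpack(fmt, theline)

# The default dict lives on the function object and persists across calls.
cache = fields.__defaults__[1]

first = fields("5s 3x", b"AB123---rest", lastfield=True)
second = fields("5s 3x", b"CD456---tail", lastfield=True)

print(first, second)
print(cache)  # one entry, reused by the second call
```

To check whether the cache pays off in a particular program, timing both versions with the standard timeit module on representative data is a reasonable next step.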

Here are functions for the two related tasks shown earlier:

Splitting a string into pieces of n bytes each:

def split_by(theline, n, lastfield=False):
    # cut up all the needed pieces
    pieces = [theline[k:k+n] for k in range(0, len(theline), n)]
    # drop the last piece if too short and not required
    if not lastfield and len(pieces[-1]) < n:
        pieces.pop( )
    return pieces

Splitting a string at the positions listed in cuts:

def split_at(theline, cuts, lastfield=False):
    # cut up all the needed pieces
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
    # drop the last piece if not required
    if not lastfield:
        pieces.pop( )
    return pieces
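The effect of lastfield in the two list-based functions is easiest to see on a small input. The sketch below repeats both functions so it runs on its own, then applies them to a made-up 13-character string. The sample value and the result names are assumptions for illustration.

```python
def split_by(theline, n, lastfield=False):
    pieces = [theline[k:k+n] for k in range(0, len(theline), n)]
    # drop the last piece if too short and not required
    if not lastfield and len(pieces[-1]) < n:
        pieces.pop()
    return pieces

def split_at(theline, cuts, lastfield=False):
    pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]
    # drop the open-ended last piece if not required
    if not lastfield:
        pieces.pop()
    return pieces

line = "abcdefghijklm"  # hypothetical sample, 13 characters

by_default = split_by(line, 5)        # short tail 'klm' is dropped
by_keep = split_by(line, 5, True)     # short tail is kept
at_default = split_at(line, [3, 7])   # everything after index 7 is dropped
at_keep = split_at(line, [3, 7], True)

print(by_default, by_keep)
print(at_default, at_keep)
```

As written, split_by indexes pieces[-1], so it assumes a non-empty input string.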

Another version uses generators:

def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    return split_at(the_line, range(n, len(the_line), n), lastfield)

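The generator-based versions suit cases where you only loop over the resulting fields, either directly or indirectly (for example with ' '.join). If you use them and need an actual list, call them like this: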

list_of_fields = list(split_by(the_line, 5))
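The sketch below exercises the generator versions on made-up strings, both through ' '.join and through list. The sample values and result names are assumptions for illustration. It also shows one behavior worth checking before swapping the generator version in for the list-based one: when the length is an exact multiple of n, the last full-size chunk is only produced with lastfield=True, because the cut positions stop before len(the_line).

```python
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    return split_at(the_line, range(n, len(the_line), n), lastfield)

line = "abcdefghijklm"  # hypothetical sample, 13 characters

joined = " ".join(split_by(line, 5))      # join consumes the generator
as_list = list(split_by(line, 5, True))   # materialize, keeping the tail

# Length 10 is an exact multiple of 5: the cuts are just [5].
exact = list(split_by("abcdefghij", 5))
exact_keep = list(split_by("abcdefghij", 5, True))

print(joined)             # abcde fghij
print(as_list)            # ['abcde', 'fghij', 'klm']
print(exact, exact_keep)  # ['abcde'] ['abcde', 'fghij']
```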

Related notes:

zip(...)
    zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
   
    Return a list of tuples, where each tuple contains the i-th element
    from each of the argument sequences.  The returned list is truncated
    in length to the length of the shortest argument sequence.

calcsize(fmt)
    Return size of C struct described by format string fmt.
    See struct.__doc__ for more on format strings.

