
Counterintuitive: search is slower when the starting character is given

performance regex python

Gili Nachum · 15 years ago

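I am trying to speed up a search by giving the regex engine extra information about the pattern. For example, I am not just looking for gold; I also require that the line start with an underscore, so I use ^_.*gold instead of gold.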
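To make the two patterns concrete, here is a minimal sketch in Python 3 (the thread's own snippets are Python 2). The sample lines are invented for illustration and do not come from the original data.

```python
import re

# Hypothetical sample lines, made up for this illustration.
lines = [
    "_inventory: 3 bars of gold",
    "inventory: 3 bars of gold",
    "_inventory: 3 bars of silver",
]

plain = re.compile(r"gold")         # "gold" anywhere in the line
anchored = re.compile(r"^_.*gold")  # "gold", and the line must start with "_"

plain_hits = [bool(plain.search(s)) for s in lines]
anchored_hits = [bool(anchored.search(s)) for s in lines]

print(plain_hits)     # [True, True, False]
print(anchored_hits)  # [True, False, False]
```

The anchored pattern accepts a strict subset of what the plain one accepts, which is why it looks like it should let the engine give up earlier.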

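Since 99% of the lines do not start with an underscore, I expected a big performance payoff, because the regex engine could stop reading the line after a single character. I was surprised to find the opposite.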

The following program illustrates the problem:

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

I looked through sre_compile.py for an explanation, but its code is too complicated for me.

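With that in mind, I tried making the line 8 times longer, expecting the start-of-line search to shine, but the gap only grew wider (22 seconds versus 6 seconds).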
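The length experiment can be reproduced with a small harness like the one below. This is a sketch in Python 3 using timeit; the helper names time_search and compare and the repeat count are assumptions made for illustration, and the numbers it prints depend on the interpreter version and machine, so they need not match the figures quoted above.

```python
import re
import timeit

BASE_LINE = "I do not start with an underscore 123456789012345678901234567890"


def time_search(pattern_text, line, repeats=10000):
    """Return the seconds taken by `repeats` calls to compiled.search(line)."""
    compiled = re.compile(pattern_text)
    return timeit.timeit(lambda: compiled.search(line), number=repeats)


def compare(scale):
    """Time both patterns from the question on a line `scale` times as long."""
    line = BASE_LINE * scale
    return {
        "anchored": time_search(r"^_", line),
        "literal": time_search(r"abcdefghijklmnopqrstuvwxyz", line),
    }


if __name__ == "__main__":
    for scale in (1, 8):
        timings = compare(scale)
        print("x%d: anchored=%.4fs literal=%.4fs"
              % (scale, timings["anchored"], timings["literal"]))
```

Neither pattern matches the test line at any scale, so both timings measure a failed search.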

I am puzzled: am I missing something?

4 answers | latest 12 years ago

Jochen Ritzel 15 years ago

Actually, you are doing two things wrong: if you want to look at the start of the string, use match, not search. Also, don't use re.match(pattern, line); compile the pattern and use pattern.match(line).

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = p.match(line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

You will see that you now get the expected behavior: both patterns take exactly the same time.
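For readers unsure how match and search differ, a minimal Python 3 sketch follows. The strings are made up for illustration; only re.compile, Pattern.match, Pattern.search and re.match from the standard library are used.

```python
import re

p = re.compile(r"_")

underscored = "_gold rush"  # hypothetical line with a leading underscore
embedded = "a_b"            # underscore present, but not at position 0

# match() only tries the pattern at the start of the string.
match_hit = p.match(underscored)   # matches at index 0
match_miss = p.match(embedded)     # None: no underscore at index 0

# search() tries successive positions until the pattern fits.
search_hit = p.search(embedded)    # matches at index 1

# The module-level function also accepts a compiled pattern; it returns
# the same result but goes through an extra layer of function call.
module_hit = re.match(p, underscored)

print(match_hit.start(), match_miss, search_hit.start(), module_hit.group())
```

Note that with match the ^ anchor becomes unnecessary, which is why the pattern in the answer is just "_".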

Robus 15 years ago

How about

if line[0] == "_" and "gold" in line:
   print "Yup, it starts with an underscore"
else:
   print "Nope it doesn't"

Seriously, don't overuse regular expressions.
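A slightly more defensive variant of the same idea, sketched in Python 3: str.startswith avoids the IndexError that line[0] raises on an empty line. The function name and sample strings are hypothetical, chosen only for this example.

```python
def starts_with_underscore_and_has_gold(line):
    """Check the question's condition using plain string operations."""
    # startswith() is safe on an empty string, unlike line[0].
    return line.startswith("_") and "gold" in line


samples = ["_pure gold", "pure gold", "_silver", ""]
results = [starts_with_underscore_and_has_gold(s) for s in samples]
print(results)  # [True, False, False, False]
```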

Ivo van der Wijk 15 years ago

If you use re.match instead of re.search for the underscore pattern, both seem to be equally fast, i.e.:

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    # time match() with the underscore pattern
    start = time()
    for i in xrange(1000*1000):
        match = re.match(p1, line)
    end = time()
    print 'Elapsed: ' + str(end-start)
    # time search() with the literal pattern
    start = time()
    for i in xrange(1000*1000):
        match = re.search(p2, line)
    end = time()
    print 'Elapsed: ' + str(end-start)

main()

for p in patterns:
    start = time()
    for i in xrange(1000*1000):
        match = p.search(line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

But the speed difference still remains...