
Parsing a string with nested patterns

  •  3
  • JV.  · Tech community  · 17 years ago

    What is the best way to do this?

    The input string is

    <133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>
    

    The expected output is

    {'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
    using either camera now they are just sitting and collecting dust.':[133, 135],
    
    'The other system worked for about 1 month': [116],
    
    'on it then it started doing the same thing as the first one':[137]
    
    }
    

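    This looks like a job for a recursive regexp search, but I can't work out exactly how to write it.

    I can think of a lengthy recursive function for now, but I feel there should be a better way.

    Related question: Can regular expressions be used to match nested patterns?

    6 answers  |  as of 17 years ago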
        1
  •  4
  •   Daniel Naab    17 years ago

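    Use expat or another XML parser; since you are dealing with XML-like data, that is more explicit than anything else.

    Note, however, that XML element names cannot start with a digit, as the ones in your example do.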
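
    To make the naming rule concrete, here is a minimal sketch of one possible workaround: prefix every numeric tag with a letter before handing the string to a parser. The helper name `prefix_numeric_tags` and the prefix `e` are assumptions made for illustration, and the regex assumes tags look exactly like `<133_3>` and `</133_3>`, with no attributes.

```python
import re
import xml.etree.ElementTree as ET

def prefix_numeric_tags(text, prefix="e"):
    """Turn <133_3>...</133_3> into <e133_3>...</e133_3> (hypothetical helper)."""
    # Group 1 keeps the optional '/', group 2 is a tag name starting with a digit.
    return re.sub(
        r"<(/?)(\d\w*)>",
        lambda m: "<%s%s%s>" % (m.group(1), prefix, m.group(2)),
        text,
    )

raw = "<133_3><116_2>worked for about 1 month</116_2> got some good images</133_3>"

try:
    ET.fromstring(raw)
    raw_is_well_formed = True
except ET.ParseError:
    # Names such as "133_3" are not legal XML element names.
    raw_is_well_formed = False

fixed = prefix_numeric_tags(raw)
root = ET.fromstring(fixed)  # parses once the names start with a letter
```

    The text between the tags is left untouched, because the pattern only fires on a `<` that is immediately followed by a digit or by `/` and a digit.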

    Here is a parser that does what you need, although you will have to adjust it to combine elements with identical text into a single dict key:

    from xml.parsers.expat import ParserCreate

    open_elements = {}  # names of the elements that are currently open
    result_dict = {}    # element name -> all text seen inside that element

    def start_element(name, attrs):
        open_elements[name] = True

    def end_element(name):
        del open_elements[name]

    def char_data(data):
        # Append the text to every element that is currently open.
        for element in open_elements:
            result_dict[element] = result_dict.get(element, '') + data

    if __name__ == '__main__':
        p = ParserCreate()

        p.StartElementHandler = start_element
        p.EndElementHandler = end_element
        p.CharacterDataHandler = char_data

        # The final argument marks this as the last chunk of input.
        p.Parse('<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', True)

        print(result_dict)
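
    One possible way to make that adjustment, offered as a sketch and not as the answerer's own code: keep a stack of the open elements, collect the text pieces per element, and group the ids by text once parsing is done. It assumes the `_<id>_<suffix>` tag naming used in the call above; the names `tag_id` and `parse_nested` are made up for this example.

```python
from xml.parsers.expat import ParserCreate

def tag_id(name):
    """'_133_3' -> 133 (assumes the '_<id>_<suffix>' naming)."""
    return int(name.strip("_").split("_")[0])

def parse_nested(xml_text):
    elements = []  # (id, text pieces) for every element, in opening order
    stack = []     # the entries of `elements` that are currently open

    def start_element(name, attrs):
        entry = (tag_id(name), [])
        elements.append(entry)
        stack.append(entry)

    def end_element(name):
        stack.pop()

    def char_data(data):
        # Text belongs to every element that is open right now.
        for _, pieces in stack:
            pieces.append(data)

    parser = ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)

    # Group ids by the full text of their element.
    result = {}
    for ident, pieces in elements:
        result.setdefault("".join(pieces), []).append(ident)
    return result

sample = "<_1_3><_2_3><_3_2>a</_3_2> b</_2_3></_1_3>"
combined = parse_nested(sample)
```

    Because the ids are grouped in opening order, two elements that wrap exactly the same text end up listed outermost first.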
    
        2
  •  4
  •   Aaron Digulla    17 years ago

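    Use an XML parser to build a DOM (Document Object Model), then write a recursive algorithm that walks all nodes, calls "text()" on each node (which should give you the text of the current node and all of its children), and stores that text as a key in a dictionary.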
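
    The DOM walk described above could be sketched as follows with the standard library's `xml.dom.minidom`. minidom has no `text()` method, so the helper `node_text` stands in for it; that helper, `collect`, and the `e`-prefixed tag names are assumptions made for this example.

```python
from xml.dom import minidom

def node_text(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)

def collect(node, result):
    """Walk the tree depth-first, recording text -> [tag ids]."""
    if node.nodeType == node.ELEMENT_NODE:
        ident = int(node.tagName[1:].split("_")[0])  # 'e133_3' -> 133
        result.setdefault(node_text(node), []).append(ident)
    for child in node.childNodes:
        collect(child, result)
    return result

doc = minidom.parseString("<e1_3><e2_3><e3_2>a</e3_2> b</e2_3></e1_3>")
by_text = collect(doc.documentElement, {})
```

    The walk visits parents before children, so elements that share the same text are listed outermost first.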

        3
  •  2
  •   jfs    17 years ago
    from io import BytesIO
    from collections import defaultdict
    # from xml.etree import ElementTree as etree  # standard-library alternative
    from lxml import etree

    xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

    d = defaultdict(list)
    # iterparse yields each element when its closing tag is read, so inner elements come first.
    for event, elem in etree.iterparse(BytesIO(xml.encode('utf-8'))):
        d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

    print(dict(d.items()))
    

    Output:

    {'The other system worked for about 1 month': [116],
    'on it then it started doing the same thing as the first one': [137],
    'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
    either camera now they are just sitting and collecting dust. ': [135, 133]}
    
        4
  •  1
  •   Gonzalo Quero    17 years ago

    I think a grammar is the best choice here. I found a link with the relevant information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html
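
    To show what a grammar-based approach looks like without depending on pyparsing, here is a hand-written recursive-descent sketch that uses only the standard library. It is not the pyparsing API from the linked article. The grammar it assumes is `content := (TEXT | element)*` and `element := '<' NAME '>' content '</' NAME '>'`; the names `TOKEN` and `parse` are made up for this example, and the tokenizer assumes the text contains no stray `<`.

```python
import re

# A token is an opening tag, a closing tag, or a run of plain text.
TOKEN = re.compile(r"<(/?)([^<>/]+)>|([^<]+)")

def parse(text):
    """Return {text: [ids]} for input such as '<133_3>...</133_3>'."""
    tokens = TOKEN.findall(text)  # tuples of (slash, name, chars)
    found = []                    # [id, text] per element, in opening order
    pos = 0

    def content(closing_name):
        # content := (TEXT | element)*, ending at the enclosing closing tag
        nonlocal pos
        pieces = []
        while pos < len(tokens):
            slash, name, chars = tokens[pos]
            if chars:
                pieces.append(chars)
                pos += 1
            elif slash:
                if name != closing_name:
                    raise ValueError("unexpected </%s>" % name)
                return "".join(pieces)
            else:
                pieces.append(element())
        if closing_name is not None:
            raise ValueError("missing </%s>" % closing_name)
        return "".join(pieces)

    def element():
        # element := '<' NAME '>' content '</' NAME '>'
        nonlocal pos
        name = tokens[pos][1]
        pos += 1                             # consume the opening tag
        entry = [int(name.split("_")[0]), None]
        found.append(entry)                  # registered in opening order
        entry[1] = content(name)
        pos += 1                             # consume the closing tag
        return entry[1]

    content(None)
    result = {}
    for ident, inner in found:
        result.setdefault(inner, []).append(ident)
    return result

parsed = parse("<1_3><2_3><3_2>a</3_2> b</2_3></1_3>")
```

    Unlike a single regular expression, the recursion keeps track of which closing tag it is waiting for, so mismatched or missing tags raise a `ValueError` here.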

        5
  •  1
  •   harms    17 years ago

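    Note that you can't actually solve this with regular expressions, because they lack the expressive power to enforce proper nesting.

    Take the following mini-language:

    Some number of "(" followed by the same number of ")", whatever that number is.

    You can very easily write a regular expression for a superset of this mini-language (one where you don't enforce that the numbers of opening and closing parentheses are equal). You can just as easily write a regular expression for any finite sublanguage (one where you limit yourself to some maximum nesting depth). But you can never represent this exact language with a regular expression.

    So you have to use a grammar, yes.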
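
    The superset and the finite sublanguage mentioned above can be illustrated with a short sketch. The names `superset`, `up_to_depth` and `bounded` are made up for this example, and the depth limit of 3 is arbitrary.

```python
import re

# Superset: any run of "(" followed by any run of ")"; the counts are not compared.
superset = re.compile(r"\(*\)*")

def up_to_depth(max_depth):
    """Regex for exactly n "(" followed by n ")", for 1 <= n <= max_depth."""
    alternatives = [r"\(" * n + r"\)" * n for n in range(1, max_depth + 1)]
    return re.compile("|".join(alternatives))

bounded = up_to_depth(3)

accepts_unbalanced = superset.fullmatch("((()") is not None      # superset is too permissive
accepts_depth_3 = bounded.fullmatch("((()))") is not None        # within the depth limit
rejects_depth_4 = bounded.fullmatch("(((())))") is None          # beyond the depth limit
rejects_unbalanced = bounded.fullmatch("(()") is None            # counts must match
```

    The bounded pattern simply enumerates one alternative per depth, so it grows with the limit and can never cover every depth at once.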

        6
  •  0
  •   jfs    17 years ago

    Here is an unreliable, inefficient recursive regexp solution:

    import re
    
    re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)
    
    def iterparse(text, tag=None):
        # Yield (tag, content) for this element, then recurse into its children.
        if tag is not None:
            yield tag, text
        for m in re_tag.finditer(text):
            yield from iterparse(m.group('content'), m.group('tag'))
    
    def strip_tags(content):
        nested = lambda m: re_tag.sub(nested, m.group('content'))
        return re_tag.sub(nested, content)
    
    
    txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
    d = {}
    for tag, text in iterparse(txt):
        d.setdefault(strip_tags(text), []).append(int(tag[:-2]))
    
    print(d)
    

    Output:

    {'on it then it started doing the same thing as the first one': [137], 
     'The other system worked for about 1 month': [116], 
     'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
     either camera now they are just sitting and collecting dust. ': [133, 135]}