
Comparing XML snippets?

  •  32
  • Gregg Lind  · Tech community  · 15 years ago

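    Building on another SO question, how can I check whether two well-formed XML snippets are semantically equal? All I need is "equal" or "not equal", since I'm using this in unit tests.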

    In the system I want, these two would be equal (note the order of 'start' and 'end'):

    <?xml version='1.0' encoding='utf-8' standalone='yes'?>
    <Stats start="1275955200" end="1276041599">
    </Stats>
    
    # Reordered start and end
    
    <?xml version='1.0' encoding='utf-8' standalone='yes'?>
    <Stats end="1276041599" start="1275955200" >
    </Stats>
    

    I have lxml and other tools at my disposal, and a simple function that only allows for reordered attributes would work fine too!


    Working snippet based on IanB's answer:

    from formencode.doctest_xml_compare import xml_compare
    # have to strip these or fromstring carps
    xml1 = """    <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200" end="1276041599"></Stats>"""
    xml2 = """     <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats end="1276041599" start="1275955200"></Stats>"""
    xml3 = """ <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200"></Stats>"""
    
    from lxml import etree
    tree1 = etree.fromstring(xml1.strip())
    tree2 = etree.fromstring(xml2.strip())
    tree3 = etree.fromstring(xml3.strip())
    
    import sys
    reporter = lambda x: sys.stdout.write(x + "\n")
    
    assert xml_compare(tree1,tree2,reporter)
    assert xml_compare(tree1,tree3,reporter) is False
    
    10 answers  |  last activity 8 years ago
        1
  •  24
  •   Thomas Grainger    11 years ago

    You can use formencode.doctest_xml_compare: its xml_compare function compares two ElementTree or lxml trees.

        2
  •  14
  •   Anentropic    11 years ago

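    The order of elements can be significant in XML, which may be why most of the other suggested methods compare unequal when the order differs, even if the elements have the same attributes and text content.

    But I also wanted an order-insensitive comparison, so I came up with this: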

    from lxml import etree
    import xmltodict  # pip install xmltodict
    
    
    def normalise_dict(d):
        """
        Recursively convert dict-like object (eg OrderedDict) into plain dict.
        Sorts list values.
        """
        out = {}
        for k, v in dict(d).iteritems():
            if hasattr(v, 'iteritems'):
                out[k] = normalise_dict(v)
            elif isinstance(v, list):
                out[k] = []
                for item in sorted(v):
                    if hasattr(item, 'iteritems'):
                        out[k].append(normalise_dict(item))
                    else:
                        out[k].append(item)
            else:
                out[k] = v
        return out
    
    
    def xml_compare(a, b):
        """
        Compares two XML documents (as string or etree)
    
        Does not care about element order
        """
        if not isinstance(a, basestring):
            a = etree.tostring(a)
        if not isinstance(b, basestring):
            b = etree.tostring(b)
        a = normalise_dict(xmltodict.parse(a))
        b = normalise_dict(xmltodict.parse(b))
        return a == b
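
    For anyone who would rather avoid the xmltodict dependency, the same order-insensitive idea can be sketched with the standard library alone. This is only a sketch under my own assumptions: element_key and xml_equal_unordered are made-up names, tail text is ignored, and text is compared after stripping whitespace.

```python
import xml.etree.ElementTree as ET


def element_key(elem):
    """Build a nested tuple that is identical for any ordering of
    attributes or child elements."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),        # attribute order ignored
        (elem.text or "").strip(),                 # whitespace-only text ignored
        tuple(sorted(element_key(child) for child in elem)),  # child order ignored
    )


def xml_equal_unordered(xml_a, xml_b):
    """Compare two XML strings, ignoring attribute and element order."""
    return element_key(ET.fromstring(xml_a)) == element_key(ET.fromstring(xml_b))
```

    Because every key is built from strings and tuples only, the keys can be sorted in Python 3 without a custom comparison function.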
    
        3
  •  5
  •   Mark E. Haase    12 years ago

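    I had the same problem: I wanted to compare two documents that had the same attributes but in a different order.

    XML canonicalization (C14N) in lxml seems to handle this well, but I'm definitely not an XML expert. I'm curious whether anyone can point out drawbacks to this approach.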

    from io import BytesIO
    
    from lxml import etree
    
    # xml_string1 and xml_string2 hold the two documents to compare (as bytes)
    parser = etree.XMLParser(remove_blank_text=True)
    
    xml1 = etree.fromstring(xml_string1, parser)
    xml2 = etree.fromstring(xml_string2, parser)
    
    # Element objects compare by identity, so this is False for two parses
    print("xml1 == xml2: " + str(xml1 == xml2))
    
    ppxml1 = etree.tostring(xml1, pretty_print=True)
    ppxml2 = etree.tostring(xml2, pretty_print=True)
    
    # Pretty-printing keeps the original attribute order
    print("pretty(xml1) == pretty(xml2): " + str(ppxml1 == ppxml2))
    
    # write_c14n emits bytes, so collect the output in a BytesIO buffer
    xml_string_io1 = BytesIO()
    xml1.getroottree().write_c14n(xml_string_io1)
    cxml1 = xml_string_io1.getvalue()
    
    xml_string_io2 = BytesIO()
    xml2.getroottree().write_c14n(xml_string_io2)
    cxml2 = xml_string_io2.getvalue()
    
    print("canonicalize(xml1) == canonicalize(xml2): " + str(cxml1 == cxml2))
    

    Running this gives me:

    $ python test.py 
    xml1 == xml2: False
    pretty(xml1) == pretty(xml2): False
    canonicalize(xml1) == canonicalize(xml2): True
    
        4
  •  5
  •   Guillaume Vincent    10 years ago

    Here is a simple solution: convert the XML into dictionaries (using xmltodict) and compare the dictionaries.

    import json
    import xmltodict
    
    class XmlDiff(object):
        def __init__(self, xml1, xml2):
            self.dict1 = json.loads(json.dumps((xmltodict.parse(xml1))))
            self.dict2 = json.loads(json.dumps((xmltodict.parse(xml2))))
    
        def equal(self):
            return self.dict1 == self.dict2
    

    Unit test

    import unittest
    
    class XMLDiffTestCase(unittest.TestCase):
    
        def test_xml_equal(self):
            xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
            <Stats start="1275955200" end="1276041599">
            </Stats>"""
            xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
            <Stats end="1276041599" start="1275955200" >
            </Stats>"""
            self.assertTrue(XmlDiff(xml1, xml2).equal())
    
        def test_xml_not_equal(self):
            xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
            <Stats start="1275955200">
            </Stats>"""
            xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
            <Stats end="1276041599" start="1275955200" >
            </Stats>"""
            self.assertFalse(XmlDiff(xml1, xml2).equal())
    

    Or as a simple Python function:

    import json
    import xmltodict
    
    def xml_equal(a, b):
        """
        Compares two XML documents (as strings)
    
        Ignores attribute order; the order of repeated sibling elements still matters
        """
        return json.loads(json.dumps((xmltodict.parse(a)))) == json.loads(json.dumps((xmltodict.parse(b))))
    
        5
  •  2
  •   Jeremy Brown    15 years ago

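    If you take a DOM approach, you can traverse both trees simultaneously while comparing the nodes (node type, text, attributes) as you go.

    A recursive solution will be the most elegant: just short-circuit further comparison as soon as a pair of nodes is not "equal", or when you detect a leaf in one tree where the other tree has a branch, and so on.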
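
    The recursive traversal described above could look roughly like this. It is a minimal sketch using xml.etree.ElementTree rather than a full DOM API; elements_equal is a made-up name, and stripping whitespace from text and tail is my own assumption about what "equal" should mean in a unit test.

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b):
    """Return True if two elements match in tag, attributes, text and children."""
    if a.tag != b.tag:
        return False
    # attrib is a dict, so attribute order does not matter here.
    if a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    # Covers the "leaf in one tree, branch in the other" case.
    if len(a) != len(b):
        return False
    # all() stops at the first pair of children that differs.
    return all(elements_equal(x, y) for x, y in zip(a, b))


tree1 = ET.fromstring('<Stats start="1275955200" end="1276041599">\n</Stats>')
tree2 = ET.fromstring('<Stats end="1276041599" start="1275955200" >\n</Stats>')
tree3 = ET.fromstring('<Stats start="1275955200"></Stats>')
```

    With these trees, elements_equal(tree1, tree2) holds and elements_equal(tree1, tree3) does not. Note that this version is sensitive to the order of child elements.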

        6
  •  2
  •   user3116268    11 years ago

    Thinking about this problem, I came up with the following solution, which makes XML elements comparable and sortable:

    import xml.etree.ElementTree as ET
    def cmpElement(x, y):
        # compare type
        r = cmp(type(x), type(y))
        if r: return r 
        # compare tag
        r = cmp(x.tag, y.tag)
        if r: return r
        # compare tag attributes
        r = cmp(x.attrib, y.attrib)
        if r: return r
        # compare stripped text content
        xtext = (x.text and x.text.strip()) or None
        ytext = (y.text and y.text.strip()) or None
        r = cmp(xtext, ytext)
        if r: return r
        # compare sorted children
        if len(x) or len(y):
            return cmp(sorted(x.getchildren()), sorted(y.getchildren()))
        return 0
    
    ET._ElementInterface.__lt__ = lambda self, other: cmpElement(self, other) == -1
    ET._ElementInterface.__gt__ = lambda self, other: cmpElement(self, other) == 1
    ET._ElementInterface.__le__ = lambda self, other: cmpElement(self, other) <= 0
    ET._ElementInterface.__ge__ = lambda self, other: cmpElement(self, other) >= 0
    ET._ElementInterface.__eq__ = lambda self, other: cmpElement(self, other) == 0
    ET._ElementInterface.__ne__ = lambda self, other: cmpElement(self, other) != 0
    
        7
  •  0
  •   Community CDub    8 years ago

    Adapting Anentropic's great answer to Python 3 (basically, change iteritems() to items(), and basestring to str):

    import json
    from lxml import etree
    import xmltodict  # pip install xmltodict
    
    def normalise_dict(d):
        """
        Recursively convert dict-like object (eg OrderedDict) into plain dict.
        Sorts list values.
        """
        out = {}
        for k, v in dict(d).items():
            # Python 3 dicts have items(), not iteritems()
            if hasattr(v, 'items'):
                out[k] = normalise_dict(v)
            elif isinstance(v, list):
                # Normalise each item first, then sort on a JSON dump,
                # because Python 3 cannot order dicts directly.
                items = [normalise_dict(item) if hasattr(item, 'items') else item
                         for item in v]
                out[k] = sorted(items, key=lambda i: json.dumps(i, sort_keys=True))
            else:
                out[k] = v
        return out
    
    
    def xml_compare(a, b):
        """
        Compares two XML documents (as string or etree)
    
        Does not care about element order
        """
        # etree.tostring returns bytes, so accept bytes input as well as str
        if not isinstance(a, (str, bytes)):
            a = etree.tostring(a)
        if not isinstance(b, (str, bytes)):
            b = etree.tostring(b)
        a = normalise_dict(xmltodict.parse(a))
        b = normalise_dict(xmltodict.parse(b))
        return a == b
    
        8
  •  0
  •   maxschlepzig    10 years ago

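    Since the order of attributes is not significant in XML, you want to ignore differences that come only from attribute ordering. XML canonicalization (C14N) orders attributes deterministically, so you can use it to test for equality: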

    xml1 = b'''    <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200" end="1276041599"></Stats>'''
    xml2 = b'''     <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats end="1276041599" start="1275955200"></Stats>'''
    xml3 = b''' <?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200"></Stats>'''
    
    import lxml.etree
    
    tree1 = lxml.etree.fromstring(xml1.strip())
    tree2 = lxml.etree.fromstring(xml2.strip())
    tree3 = lxml.etree.fromstring(xml3.strip())
    
    import io
    
    b1 = io.BytesIO()
    b2 = io.BytesIO()
    b3 = io.BytesIO()
    
    tree1.getroottree().write_c14n(b1)
    tree2.getroottree().write_c14n(b2)
    tree3.getroottree().write_c14n(b3)
    
    assert b1.getvalue() == b2.getvalue()
    assert b1.getvalue() != b3.getvalue()
    

    Note that this example assumes Python 3. With Python 3, the use of b'''...''' strings and io.BytesIO is mandatory, while with Python 2 this method also works with normal strings and io.StringIO.
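
    If lxml is not available, the same C14N idea can be sketched with the standard library, which has offered xml.etree.ElementTree.canonicalize since Python 3.8. The helper name c14n_equal is made up for illustration, and passing strip_text=True is my own choice so that indentation differences between the documents are ignored.

```python
from xml.etree.ElementTree import canonicalize  # available since Python 3.8


def c14n_equal(xml_a, xml_b):
    """Compare two XML strings by their canonical (C14N 2.0) form."""
    # strip_text=True removes whitespace around text content, so a newline
    # between the opening and closing tag does not count as a difference.
    return canonicalize(xml_a, strip_text=True) == canonicalize(xml_b, strip_text=True)


xml1 = '<Stats start="1275955200" end="1276041599">\n</Stats>'
xml2 = '<Stats end="1276041599" start="1275955200" ></Stats>'
xml3 = '<Stats start="1275955200"></Stats>'
```

    Here c14n_equal(xml1, xml2) is true and c14n_equal(xml1, xml3) is false. Canonicalization only normalises attribute order, not the order of child elements.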

        9
  •  0
  •   janbrohl    9 years ago

    SimpleTAL uses a custom xml.sax handler to compare XML documents https://github.com/janbrohl/SimpleTAL/blob/python2/tests/TALTests/XMLTests/TALAttributeTestCases.py#L47-L112 (the results of getXMLChecksum are compared), but I prefer generating a list instead of an MD5 hash.
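
    To illustrate the "list instead of a hash" idea, here is a rough sketch of a SAX handler that records events in a list. It is not SimpleTAL's code: EventCollector and xml_events are hypothetical names, and sorting attributes and dropping whitespace-only text are my own assumptions.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Record parse events as a list of comparable tuples."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def _flush_text(self):
        # characters() may be called several times for one text node,
        # so buffer the pieces and emit them as a single event.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush_text()
        # Sort attributes so their order in the source does not matter.
        self.events.append(("start", name, tuple(sorted(attrs.items()))))

    def endElement(self, name):
        self._flush_text()
        self.events.append(("end", name))

    def characters(self, content):
        self._text.append(content)


def xml_events(xml_text):
    """Parse an XML string and return its list of events."""
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

    Two documents can then be compared with xml_events(a) == xml_events(b), and when a test fails the two lists show where the documents diverge, which a checksum cannot do.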

        10
  •  0
  •   Pankaj Raheja    8 years ago

    How about the following snippet? It can easily be enhanced to include attributes as well:

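    import xml.etree.ElementTree as ET
    
    # The functions below are methods of a helper class (hence the self parameter).
    def separator(self):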
        return "!@#$%^&*" # Very ugly separator
    
    def _traverseXML(self, xmlElem, tags, xpaths):
        tags.append(xmlElem.tag)
        for e in xmlElem:
            self._traverseXML(e, tags, xpaths)
    
        text = ''
        if (xmlElem.text):
            text = xmlElem.text.strip()
    
        xpaths.add("/".join(tags) + self.separator() + text)
        tags.pop()
    
    def _xmlToSet(self, xml):
        xpaths = set() # output
        tags = list()
        root = ET.fromstring(xml)
        self._traverseXML(root, tags, xpaths)
    
        return xpaths
    
    def _areXMLsAlike(self, xml1, xml2):
        xpaths1 = self._xmlToSet(xml1)
        xpaths2 = self._xmlToSet(xml2)
    
        return xpaths1 == xpaths2
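
    One possible way to extend the snippet to attributes, written as plain functions instead of methods. This is only a sketch: xml_to_set and are_xmls_alike are made-up names, and serialising the sorted attributes into the path string is my own assumption about how the enhancement might be done.

```python
import xml.etree.ElementTree as ET

SEPARATOR = "!@#$%^&*"  # same ugly separator as in the snippet above


def xml_to_set(xml_text):
    """Return a set of 'path + attributes + text' strings for every element."""
    paths = set()

    def walk(elem, tags):
        tags.append(elem.tag)
        for child in elem:
            walk(child, tags)
        text = (elem.text or "").strip()
        # Sort attributes so that their order in the source is irrelevant.
        attrs = ",".join("%s=%r" % (k, v) for k, v in sorted(elem.attrib.items()))
        paths.add("/".join(tags) + SEPARATOR + attrs + SEPARATOR + text)
        tags.pop()

    walk(ET.fromstring(xml_text), [])
    return paths


def are_xmls_alike(xml1, xml2):
    return xml_to_set(xml1) == xml_to_set(xml2)
```

    Because the result is a set, repeated identical elements collapse into one entry, so this check treats a document with two identical siblings the same as a document with one.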