
xml.etree.ElementTree vs. lxml.etree: different internal node representation?

elementtree lxml python-3.x xml python

Bram Vanroy · 7 years ago

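I have been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily, the two have a lot in common. However, I did stumble upon some strange behaviour that I cannot find written down in any documentation. It concerns the internal representation of descendant nodes.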

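In ET, iter() is used to iterate over all descendants of an element, optionally filtered by tag name. Because I could not find any details about this in the documentation, I expected lxmlET to behave the same way. The thing is that, from testing, I conclude that lxmlET uses a different internal representation of the tree.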
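For reference, a minimal sketch of what iter() does in the standard library. The sample document and its ids are made up purely for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, only for illustration.
root = ET.fromstring('<root><node id="1"><node id="2"/><other/></node></root>')

# Without an argument, iter() walks the element itself and all of its
# descendants in document order.
all_tags = [element.tag for element in root.iter()]
print(all_tags)  # ['root', 'node', 'node', 'other']

# With a tag name, only matching elements are yielded.
node_ids = [element.get("id") for element in root.iter("node")]
print(node_ids)  # ['1', '2']
```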

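In the example below, I iterate over the nodes in a tree and print each node's children. In addition, I create all the different combinations of those children and print those as well. This means that if an element has children ('A', 'B', 'C'), I create variations of it, namely the trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].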

# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))    

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

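The expected output here is the ids of each node's children joined by hyphens, followed by all combinations of those children (see above), in a top-down, breadth-first fashion.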

2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

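However, when you use the lxml module instead of xml (uncomment the import for lxmlET and comment out the import for ET) and run the code, you will see that the output is: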

2-3
2
3
node 2 has no children

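So the deeper nodes are never visited. This can be circumvented in either of the following ways:

using deepcopy (comment/uncomment the relevant lines in get_combination_trees())
using for subnode in node.xpath('.//node') in parse_xml() instead of iter()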

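So I know there is a way around this, but I am mainly wondering: what is happening?! It took me ages to debug this, and I cannot find any documentation on it. What is going on, and what is the actual underlying difference between the two modules? And what is the most efficient workaround when working with very large trees?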

3 answers | 7 years ago

supersam654 7 years ago

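While Louis's answer is correct, and I completely agree that modifying a data structure while you traverse it is generally a Bad Idea(TM), you also asked why the code works with xml.etree.ElementTree and not with lxml.etree, and there is a very reasonable explanation for that.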

Implementation of `.append` in `xml.etree.ElementTree`

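This library is implemented directly in Python and could vary depending on which Python runtime you are using. Assuming you are using CPython, the implementation you are looking for is written in vanilla Python: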

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we are concerned with. As it turns out, self._children is initialized towards the top of that file as:

self._children = []

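So adding a child to a tree just appends the element to a list. Intuitively, that is exactly what you are looking for (in this case), and the implementation behaves in a completely unsurprising way.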
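A minimal sketch of the consequence, using only the standard library. The element names are made up for illustration. Because append() only stores another reference, the same element object can end up in the child lists of two parents:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, only for illustration.
root = ET.fromstring('<root><node id="1"/></root>')
child = root[0]

combo = ET.Element("combo")
combo.append(child)  # stores another reference to `child`, nothing is moved

print(len(root), len(combo))  # 1 1
print(root[0] is combo[0])  # True

# A change made through one parent is visible through the other one.
combo[0].set("id", "changed")
print(root[0].get("id"))  # changed
```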

Implementation of `.append` in `lxml.etree`

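lxml is implemented as a mix of Python, non-trivial Cython, and C code, so spelunking through it was significantly harder than reading the pure-Python implementation. First off, .append is implemented as: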

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented in apihelper.pxi:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0

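There is definitely a bit more going on here. In particular, lxml explicitly removes the node from its tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you could probably do with the xml.etree version).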
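The difference can be observed from the outside with a small sketch. The helper name append_to_new_parent and the sample document are made up for illustration, and the lxml part only runs if lxml happens to be installed:

```python
import xml.etree.ElementTree as ET

try:
    import lxml.etree as LET  # optional third-party dependency
except ImportError:
    LET = None


def append_to_new_parent(etree):
    """Append the first child of a small tree to a fresh parent element.

    Returns (number of children left in the old parent,
             number of children in the new parent).
    """
    root = etree.fromstring("<root><a/><b/></root>")
    new_parent = etree.Element("combo")
    new_parent.append(root[0])
    return len(root), len(new_parent)


# xml.etree only adds a reference, so the old parent keeps both children.
print(append_to_new_parent(ET))  # (2, 1)

if LET is not None:
    # lxml unlinks the node first, so the old parent loses one child.
    print(append_to_new_parent(LET))  # (1, 1)
```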

Workarounds for `lxml`

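Now that we know that xml.etree effectively shares nodes when appending (the same element object ends up in both child lists, which looks like a copy) while lxml.etree moves them, why do those workarounds work? Based on the tree.xmlUnlinkNode function (which is actually defined in C inside libxml2), unlinking just messes with a bunch of pointers. Therefore, anything that copies the node metadata will do the trick. Because all of the metadata we care about are direct fields of the xmlNode struct, anything that shallow-copies nodes will do: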

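copy.deepcopy() definitely works
node.xpath returns nodes wrapped in proxy elements, which happens to shallow-copy the tree metadata
copy.copy() also does the trick
If you do not need your combinations to actually live in an official tree, setting new_combo_tree = [] also gives you list appending, just like xml.etree.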
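A sketch of the first workaround, adapted from the question's get_combination_trees(). It is written against the standard library so that it runs anywhere; replacing the first import with `import lxml.etree as ET` is the case the workaround is actually meant for. The sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET  # or: import lxml.etree as ET
from copy import deepcopy
from itertools import combinations


def get_combination_trees(tree):
    """Yield one new tree per proper, non-empty combination of children."""
    children = list(tree)
    for size in range(1, len(children)):
        for combination in combinations(children, size):
            combo_tree = ET.Element(tree.tag, dict(tree.attrib))
            for child in combination:
                # Append a copy so the original tree is never modified.
                combo_tree.append(deepcopy(child))
            yield combo_tree


# Hypothetical sample data, only for illustration.
root = ET.fromstring('<node id="1"><node id="2"/><node id="3"/></node>')
combos = [[child.get("id") for child in tree]
          for tree in get_combination_trees(root)]

print(combos)  # [['2'], ['3']]
print(len(root))  # 2, the original children are still in place
```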

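If you are really concerned about performance and large trees, I would probably start with shallow copying via copy.copy(), although you should absolutely profile a few different options and see which one works best for you.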
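A rough sketch of how such a comparison could be set up with timeit. The tree shape, the sizes, and the helper name build_wide_tree are arbitrary assumptions, and it measures xml.etree only (where copy.copy() creates a new element that shares the original child objects); the same harness should be pointed at your own data and library before drawing conclusions:

```python
import copy
import timeit
import xml.etree.ElementTree as ET


def build_wide_tree(n_children):
    """Build a hypothetical test tree: one root, n children, one grandchild each."""
    root = ET.Element("node", {"id": "0"})
    for i in range(1, n_children + 1):
        child = ET.SubElement(root, "node", {"id": str(i)})
        ET.SubElement(child, "node", {"id": f"{i}.1"})
    return root


tree = build_wide_tree(200)

# Time 50 copies of the whole tree with each strategy.
timings = {
    name: timeit.timeit(lambda fn=fn: fn(tree), number=50)
    for name, fn in (("copy.copy", copy.copy), ("copy.deepcopy", copy.deepcopy))
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for 50 runs")
```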

Louis 7 years ago

The copying problem

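Generally, when you are manipulating an XML tree and want to copy information to multiple places in the tree (as opposed to moving it from one place to another), the safe thing to do is to perform a deep copy of those elements rather than just adding them to their new location. The vast majority of XML parsing libraries that produce trees require you to perform a deep copy if you want to copy structures around. They simply will not give you the results you want if you do not deep copy. lxml is one such library: it requires you to deep copy the structures you want to copy.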

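The fact that xml.etree.ElementTree works in such a way that .append effectively allows you to have the same element in two places in the tree is definitely unusual in my experience.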

The walk-while-modifying problem

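You mentioned that for subnode in node.xpath('.//node') also solves your problem. Note that if you use for subnode in list(node.iter('node')), you will get the same result. What is going on here is that using list(node.iter('node')) or node.xpath('.//node'), or using deepcopy to copy the nodes instead of moving them, protects you against another problem in your code: you are walking a structure while modifying it.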

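node.iter('node') creates an iterator that walks the XML structure as you iterate over it. If you wrap it in list(), the structure is walked immediately and the result is put in a list. So you have effectively taken a snapshot of the structure before you walk it, which prevents your walk from being affected by changes to the tree. If you use node.xpath('.//node'), you also obtain a snapshot of the tree before you walk it, because that method returns a list of nodes. And if you deepcopy the nodes and append the copy instead of the original node, then you are not modifying the tree you are walking while you are walking it.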
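The snapshot idea can be sketched with the standard library alone. The sample document is made up for illustration, and the loop deliberately mutates the tree to show that a snapshot taken up front is unaffected:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, only for illustration.
root = ET.fromstring(
    '<root><node id="1"><node id="2"/></node><node id="3"/></root>'
)

lazy_walk = root.iter("node")       # nothing has been walked yet
snapshot = list(root.iter("node"))  # walked right now and stored in a list

print(isinstance(lazy_walk, list), isinstance(snapshot, list))  # False True

visited = []
for node in snapshot:
    visited.append(node.get("id"))
    for child in list(node):
        node.remove(child)  # mutate the tree while looping over the snapshot

# Every node is still visited, although the tree changed along the way.
print(visited)  # ['1', '2', '3']
print(len(root[0]))  # 0
```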

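Whether you can get away with using XPath, that is node.xpath('.//node'), instead of using deepcopy depends on what you plan to do with your combinations. The code you show in your question prints the combinations to the screen as soon as you create them. They look fine when you print them, but if you do not use deepcopy to create them, then as soon as you create a new combination, the old one gets messed up, because any node that appeared in the old combination and needs to appear in the new one will be moved instead of copied.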

And what is the most efficient workaround when working with very large trees?

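It depends on the specifics of your application and the data you need to parse. You gave one example, which is a small document, but you ask about "large trees". What applies to small documents does not necessarily transfer to large documents. You can optimize for case X, but if case X occurs only extremely rarely in real data, then your optimization may not pan out. In some cases, it may actually be harmful.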

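In one application of mine, I had to replace references to some structures with the structures themselves. A simplified illustration would be a document that contains elements like <define id="...">...</define> and references like <ref idref="..."/>. Every instance of ref has to be replaced with the define it points to. In general, this may mean copying a single define multiple times, but sometimes a define is referred to by only one ref, so one optimization was to detect this and skip the deep copy in those cases where there was only one reference. I got this optimization "for free" because the application already needed to record every instance of ref and define for other purposes. If I had had to add bookkeeping just for this optimization, it is not clear it would have been worth it.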
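A simplified sketch of that idea, not the actual application code. The document layout (define elements directly under a doc root, ref elements directly under a body element) and all variable names are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from copy import deepcopy

# Hypothetical document layout, only for illustration.
doc = ET.fromstring(
    "<doc>"
    '<define id="a"><x/></define>'
    '<define id="b"><y/></define>'
    '<body><ref idref="a"/><ref idref="a"/><ref idref="b"/></body>'
    "</doc>"
)
body = doc.find("body")

# Bookkeeping: where each define lives and how often it is referenced.
defines = {define.get("id"): define for define in doc.findall("define")}
ref_counts = Counter(ref.get("idref") for ref in body.findall("ref"))

deep_copies = 0
for index, ref in enumerate(list(body)):  # iterate over a snapshot
    target = defines[ref.get("idref")]
    if ref_counts[ref.get("idref")] == 1:
        # Only one reference: move the original, no deep copy needed.
        doc.remove(target)
        replacement = target
    else:
        replacement = deepcopy(target)
        deep_copies += 1
    body[index] = replacement

print([child.get("id") for child in body])  # ['a', 'a', 'b']
print(deep_copies)  # 2
print(body[2] is defines["b"])  # True
```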

CristiFati 7 years ago

At the beginning I did not think there was such a difference (nor did I check), but both @supersam654's and @Louis's answers pinpointed it very clearly.

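But code that depends on the internal representation (rather than the interface) of the things it uses does not seem right to me from a design point of view. Also, as I asked in my comment, the intermediate combo tree (new_combo_tree in get_combination_trees()) seems totally useless:

Get the child nodes combo (as a list)
Append each node from the list as a child to the combo tree
Return the combo tree
Get the combo tree's children (as a list)
Use the list (combo)

when things could be done much more easily:

Get the child nodes combo (as a list)
Return the list
Use the list (combo)

Apparently, the combo tree approach is also what exposed the behavioral difference between the modules.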

code_orig_lxml.py:

import lxml.etree as ET
#import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            #new_combo_tree = ET.Element(tree.tag, tree.attrib)
            #for recombined_child in combination:
                #new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            #yield new_combo_tree
            yield combination

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            #for combo_tree in get_combination_trees(subnode):
            for combo_children in get_combination_trees(subnode):
                #combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))

    return None


s = """
<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

parse_xml(s)

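Notes:

This is your code with the changes described above
I did not remove anything, I only commented things out (which gives the smallest diff between the old and new versions)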

Output:

(py36x86_test) e:\Work\Dev\StackOverflow\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code_orig_lxml.py
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

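While I was investigating, I modified your code further in order to:

Fix the issue
Improve the printing
Make it modular
Use both parsing methods, to make the differences between them clearer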

xml_data.py:

DATA = """
<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

Code:

import sys
import xml.etree.ElementTree as xml_etree_et
import lxml.etree as lxml_etree
from itertools import combinations
from xml_data import DATA


MAIN_NODE_NAME = "node_main"


def get_children_combinations(tree):
    children = list(tree)
    for i in range(1, len(children)):
        yield from combinations(children, i)


def get_tree(xml_str, parse_func, tag=None):
    root_node = parse_func(xml_str)
    if tag:
        return [item for item in root_node if item.tag == tag]
    return [root_node]


def process_xml(xml_node):
    for node in xml_node.iter("node"):
        print(f"\nNode ({node.tag}, {node.attrib['id']})")
        children = list(node)
        if children:
            print("    Children: " + " - ".join([child.attrib["id"] for child in children]))

        for children_combo in get_children_combinations(node):
            if children_combo:
                print("    Combo: " + " - ".join([child.attrib["id"] for child in children_combo]))


def main():
    parse_funcs = (xml_etree_et.fromstring, lxml_etree.fromstring)
    for func in parse_funcs:
        print(f"\nParsing xml using: {func.__module__} {func.__name__}")
        nodes = get_tree(DATA, func, tag=MAIN_NODE_NAME)
        for node in nodes:
            print(f"\nProcessing node: {node.tag}")
            process_xml(node)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()