
A good Python XML parser for namespace-heavy documents

  •  8
  • fmalina  · Tech Community  · 14 years ago

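    Python's ElementTree does not seem usable with namespaces. What are my alternatives? BeautifulSoup is pretty rubbish with namespaces too. I don't want to strip them out.

    Examples of how a particular Python library gets namespaced elements and their collections all earn a +1.

    Edit: Could you provide code that handles this real-world use case with the library of your choice?

    How would you get the strings 'Line Break' and '2.6', and the list ['PYTHON', 'XML', 'XML-NAMESPACES']?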

    <?xml version="1.0" encoding="UTF-8"?>
    <zs:searchRetrieveResponse
        xmlns="http://unilexicon.com/vocabularies/"
        xmlns:zs="http://www.loc.gov/zing/srw/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
        <zs:records>
            <zs:record>
                <zs:recordData>
                    <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
                        <name>Line Break</name>
                        <dc:title>Processing XML namespaces using Python</dc:title>
                        <dc:description>How to get contents string from an element,
                            how to get a collection in a list...</dc:description>
                        <lom:metaMetadata>
                            <lom:identifier>
                                <lom:catalog>Python</lom:catalog>
                                <lom:entry>2.6</lom:entry>
                            </lom:identifier>
                        </lom:metaMetadata>
                        <lom:classification>
                            <lom:taxonPath>
                                <lom:taxon>
                                    <lom:id>PYTHON</lom:id>
                                </lom:taxon>
                            </lom:taxonPath>
                        </lom:classification>
                        <lom:classification>
                            <lom:taxonPath>
                                <lom:taxon>
                                    <lom:id>XML</lom:id>
                                </lom:taxon>
                            </lom:taxonPath>
                        </lom:classification>
                        <lom:classification>
                            <lom:taxonPath>
                                <lom:taxon>
                                    <lom:id>XML-NAMESPACES</lom:id>
                                </lom:taxon>
                            </lom:taxonPath>
                        </lom:classification>
                    </srw_dc:dc>
                </zs:recordData>
            </zs:record>
            <!-- ... more records ... -->
        </zs:records>
    </zs:searchRetrieveResponse>
    
    3 answers  |  last activity 10 years ago
        1
  •  13
  •   Serrano shaheenery    12 years ago

    lxml is namespace-aware.

    >>> from lxml import etree
    >>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""")
    >>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
    '<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
    >>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
    [<Element {foo}bar at ...>]
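    The same lookup can also be written with find, either with a prefix mapping or with the {uri}tag form (Clark notation). The following is only a minimal sketch, not part of the original answer: the prefixes "f" and "s" are arbitrary names chosen here, and the fallback import assumes that the standard library's ElementTree accepts the same find(path, namespaces=...) call, so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # full XPath support via .xpath()
except ImportError:
    # Assumption: the stdlib module offers the same find/findall interface.
    import xml.etree.ElementTree as etree

doc = etree.fromstring(
    '<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
)

# Prefixes in this mapping are local to the query; only the URIs have to
# match the document. "f" and "s" are arbitrary names picked for this sketch.
ns = {"f": "foo", "s": "bar"}

bar = doc.find("f:bar", namespaces=ns)
baz = doc.find("f:bar/s:baz", namespaces=ns)

# Clark notation spells out the namespace URI and needs no mapping at all.
same_baz = doc.find("{foo}bar/{bar}baz")

print(bar.tag, baz.tag, same_baz.tag)
```

    Element tags are always stored in the {uri}tag form, which is why the prefix used in the source document never has to match the prefix used in the query.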
    

    Edit: applied to your example:

    from lxml import etree
    
    # The b prefix is needed in Python 3, where lxml rejects str input that
    # carries an encoding declaration ("Unicode strings with encoding
    # declaration are not supported."). Drop the prefix in Python 2.
    et = etree.XML(b"""...""")
    
    ns = {
        'lom': 'http://ltsc.ieee.org/xsd/LOM',
        'zs': 'http://www.loc.gov/zing/srw/',
        'dc': 'http://purl.org/dc/elements/1.1/',
        # the document's default namespace; the prefix name is our own choice
        'voc': 'http://unilexicon.com/vocabularies/',
        'srw_dc': 'info:srw/schema/1/dc-schema'
    }
    
    # According to the docs, .xpath always returns a list when querying for elements.
    # .find returns a single element, but only supports a subset of XPath.
    record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
    # In this example we know there is only one record;
    # otherwise, apply the following to every element returned above.
    
    # The leading "." keeps the search inside this record; a bare "//" would
    # search the whole document.
    name = record.xpath(".//voc:name", namespaces=ns)[0].text
    print("name:", name)
    
    lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                             "lom:metaMetadata/lom:identifier/"
                             "lom:entry",
                             namespaces=ns)[0].text
    
    print('lom_entry:', lom_entry)
    
    lom_ids = [id.text for id in
               record.xpath("zs:recordData/srw_dc:dc/"
                            "lom:classification/lom:taxonPath/"
                            "lom:taxon/lom:id",
                            namespaces=ns)]
    
    print("lom_ids:", lom_ids)
    

    Output:

    name: Line Break
    lom_entry: 2.6
    lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
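    When a response holds several records, the same queries can be applied to each one in a loop. The following is a rough sketch under stated assumptions rather than part of the original answer: SAMPLE is a trimmed copy of the question's document, its second record ("Second Item", "3.0", "LXML") is made-up illustration data, summarize is a hypothetical helper name, and the fallback import assumes that the standard library's ElementTree accepts the same find/findall/findtext calls.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers the same find/findall/findtext API.
    import xml.etree.ElementTree as etree

# Trimmed copy of the question's document. The second record is invented
# here purely to show the loop handling more than one record.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
  <zs:records>
    <zs:record><zs:recordData>
      <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
        <name>Line Break</name>
        <lom:metaMetadata><lom:identifier>
          <lom:entry>2.6</lom:entry>
        </lom:identifier></lom:metaMetadata>
        <lom:classification><lom:taxonPath><lom:taxon>
          <lom:id>PYTHON</lom:id>
        </lom:taxon></lom:taxonPath></lom:classification>
        <lom:classification><lom:taxonPath><lom:taxon>
          <lom:id>XML</lom:id>
        </lom:taxon></lom:taxonPath></lom:classification>
      </srw_dc:dc>
    </zs:recordData></zs:record>
    <zs:record><zs:recordData>
      <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
        <name>Second Item</name>
        <lom:metaMetadata><lom:identifier>
          <lom:entry>3.0</lom:entry>
        </lom:identifier></lom:metaMetadata>
        <lom:classification><lom:taxonPath><lom:taxon>
          <lom:id>LXML</lom:id>
        </lom:taxon></lom:taxonPath></lom:classification>
      </srw_dc:dc>
    </zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "lom": "http://ltsc.ieee.org/xsd/LOM",
    "voc": "http://unilexicon.com/vocabularies/",  # the default namespace
    "srw_dc": "info:srw/schema/1/dc-schema",
}


def summarize(record):
    """Collect the name, LOM entry and taxon ids of one zs:record element."""
    dc = record.find("zs:recordData/srw_dc:dc", namespaces=NS)
    return {
        "name": dc.findtext("voc:name", namespaces=NS),
        "entry": dc.findtext(
            "lom:metaMetadata/lom:identifier/lom:entry", namespaces=NS
        ),
        "ids": [
            el.text
            for el in dc.findall(
                "lom:classification/lom:taxonPath/lom:taxon/lom:id",
                namespaces=NS,
            )
        ],
    }


root = etree.fromstring(SAMPLE)
records = [
    summarize(rec) for rec in root.findall("zs:records/zs:record", namespaces=NS)
]
for rec in records:
    print(rec)
```

    All paths here are relative to the element they are called on, so each record only ever yields its own values.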
    
        2
  •  1
  •   pyfunc    14 years ago
        3
  •  0
  •   iscarface    14 years ago

    libxml (http://xmlsoft.org/) is the best and fastest library for XML parsing. Python bindings are available for it.