代码之家  ›  专栏  ›  技术社区  ›  Fandango68

需要帮助提取此XML节点-Python中的Excel连接字符串

  •  -1
  • Fandango68  · 技术社区  · 10 月前

    我有一个Python程序打开Excel(XLSX)文件,并试图找到 <connection> 节点。

    这是来自 connections.xml 文件。

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <connections 
        xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
        mc:Ignorable="xr16" 
        xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
        <connection 
            id="1" xr16:uid="{#####}" keepAlive="1" 
            name="Query - CargoData_small" 
            description="Connection to the 'CargoData_small' query in the workbook." 
            type="5" refreshedVersion="7" background="1">
            <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
                command="SELECT * FROM [CargoData_small]"/>
        </connection>
    </connections>
    

    我试图找到 <dbPr> 节点。但我被困在代码的子节点上,如下所示:

    def checkfile(filename):
        if zipfile.is_zipfile(filename):
            zf = zipfile.ZipFile(filename, 'r')
            if "xl/connections.xml" in zf.namelist():
                print(filename)
                xml = zf.read('xl/connections.xml')
                root = parseString(xml)
                connections = root.getElementsByTagName('connection')
                try:
                    for con in connections:
                        for child in con.childNodes:
    
                            # there are no 'children'
                            for children in child.childNodes:
                                dsn = children.attributes.values()[0].nodeValue
                                sql = children.attributes.values()[1].nodeValue
                                writeoutput(filename, dsn, sql )
                except:
                    pass
    

    因此,我得到了“child”值,但找不到dbPr部分。

    这就是我得到的错误:

    {TypeError}TypeError("'dict_values' object is not subscriptable")

    我使用Pycharm作为IDE。

    谢谢

    3 回复  |  直到 10 月前
        1
  •  2
  •   AKX Bryan Oakley    10 月前

    您缺少命名空间。这个 xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 在根节点中意味着它及其没有附加另一个命名空间的后代具有该命名空间,因此您需要用该命名空间限定您的查询。

    一个使用内置 xml.etree.ElementTree ,并且为了方便起见对数据进行了硬编码。。。

    data = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <connections 
        xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
        mc:Ignorable="xr16" 
        xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
        <connection 
            id="1" xr16:uid="{#####}" keepAlive="1" 
            name="Query - CargoData_small" 
            description="Connection to the 'CargoData_small' query in the workbook." 
            type="5" refreshedVersion="7" background="1">
            <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
                command="SELECT * FROM [CargoData_small]"/>
        </connection>
    </connections>
    """
    
    from xml.etree import ElementTree as ET
    
    root = ET.fromstring(data)
    for connection in root.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}connection'):
        for dbpr in connection.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}dbPr'):
            print(connection.attrib['name'], dbpr.attrib['connection'], dbpr.attrib['command'])
    

    打印出来

    Query - CargoData_small Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="" SELECT * FROM [CargoData_small]
    
        2
  •  1
  •   folen gateis    10 月前

    beautifulsoup的一个简单例子

    from bs4 import BeautifulSoup
    with open('xl/connections.xml') as f:
        data = f.read()
    soup = BeautifulSoup(data, "xml")
    dbPr = soup.find('dbPr')
    print(dbPr.attrs['connection'], dbPr.attrs['command'])
    
        3
  •  1
  •   Hermann12    10 月前

    我建议使用 lxml xpath :

    ( lxml 可以从根节点检测命名空间映射- nsmap 为了简单 find(xpath, ns) .)

    from lxml import etree as et
    from io import BytesIO
    
    excel_=b"""\
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <connections 
        xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
        mc:Ignorable="xr16" 
        xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
        <connection 
            id="1" xr16:uid="{#####}" keepAlive="1" 
            name="Query - CargoData_small" 
            description="Connection to the 'CargoData_small' query in the workbook." 
            type="5" refreshedVersion="7" background="1">
            <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
                command="SELECT * FROM [CargoData_small]"/>
        </connection>
    </connections>"""
    file = BytesIO(excel_)
    
    root = et.parse(file).getroot()
    ns = root.nsmap
    
    # with find()
    dbPr = root.find(".//dbPr", ns)
    
    for attr in dbPr.keys():
        print(f"{attr}: {dbPr.get(attr)}")
    
    ###############################
    # Alternativ via xpath directly
    #
    # set proxy for empty prefix
    ns_map ={}
    ns_map['root_ns'] = root.nsmap.get(None)
    
    # use xpath direct to the attribute value
    dbPr_con = root.xpath(".//root_ns:dbPr/@connection", namespaces=ns_map)
    print("Connection:", *dbPr_con)
    

    输出:

    connection: Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=""
    command: SELECT * FROM [CargoData_small]
    
    Connection: Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=""