代码之家 › 专栏 › 技术社区 › Fandango68

需要帮助提取此XML节点-Python中的Excel连接字符串

xml excel python

-1

Fandango68 · 技术社区 · 10 月前

我有一个Python程序打开Excel(XLSX)文件,并试图找到 <connection> 节点。

这是来自 connections.xml 文件。

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections 
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    mc:Ignorable="xr16" 
    xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
    <connection 
        id="1" xr16:uid="{#####}" keepAlive="1" 
        name="Query - CargoData_small" 
        description="Connection to the 'CargoData_small' query in the workbook." 
        type="5" refreshedVersion="7" background="1">
        <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
            command="SELECT * FROM [CargoData_small]"/>
    </connection>
</connections>

我试图找到 <dbPr> 节点。但我被困在代码的子节点上,如下所示:

def checkfile(filename):
    if zipfile.is_zipfile(filename):
        zf = zipfile.ZipFile(filename, 'r')
        if "xl/connections.xml" in zf.namelist():
            print(filename)
            xml = zf.read('xl/connections.xml')
            root = parseString(xml)
            connections = root.getElementsByTagName('connection')
            try:
                for con in connections:
                    for child in con.childNodes:

                        # there are no 'children'
                        for children in child.childNodes:
                            dsn = children.attributes.values()[0].nodeValue
                            sql = children.attributes.values()[1].nodeValue
                            writeoutput(filename, dsn, sql )
            except:
                pass

因此,我得到了“child”值,但找不到dbPr部分。

这就是我得到的错误:

${TypeError}TypeError("'dict_values' object is not subscriptable")$

我使用Pycharm作为IDE。

谢谢

3 回复 | 直到 10 月前

AKX Bryan Oakley 10 月前

您缺少命名空间。这个 xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 在根节点中意味着它及其没有附加另一个命名空间的后代具有该命名空间,因此您需要用该命名空间限定您的查询。

一个使用内置 xml.etree.ElementTree ,并且为了方便起见对数据进行了硬编码。。。

data = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections 
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    mc:Ignorable="xr16" 
    xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
    <connection 
        id="1" xr16:uid="{#####}" keepAlive="1" 
        name="Query - CargoData_small" 
        description="Connection to the 'CargoData_small' query in the workbook." 
        type="5" refreshedVersion="7" background="1">
        <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
            command="SELECT * FROM [CargoData_small]"/>
    </connection>
</connections>
"""

from xml.etree import ElementTree as ET

root = ET.fromstring(data)
for connection in root.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}connection'):
    for dbpr in connection.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}dbPr'):
        print(connection.attrib['name'], dbpr.attrib['connection'], dbpr.attrib['command'])

打印出来

Query - CargoData_small Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="" SELECT * FROM [CargoData_small]

folen gateis 10 月前

beautifulsoup的一个简单例子

from bs4 import BeautifulSoup
with open('xl/connections.xml') as f:
    data = f.read()
soup = BeautifulSoup(data, "xml")
dbPr = soup.find('dbPr')
print(dbPr.attrs['connection'], dbPr.attrs['command'])

Hermann12 10 月前

我建议使用 lxml 和 xpath :

( lxml 可以从根节点检测命名空间映射- nsmap 为了简单 find(xpath, ns) .)

from lxml import etree as et
from io import BytesIO

excel_=b"""\
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections 
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    mc:Ignorable="xr16" 
    xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
    <connection 
        id="1" xr16:uid="{#####}" keepAlive="1" 
        name="Query - CargoData_small" 
        description="Connection to the 'CargoData_small' query in the workbook." 
        type="5" refreshedVersion="7" background="1">
        <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
            command="SELECT * FROM [CargoData_small]"/>
    </connection>
</connections>"""
file = BytesIO(excel_)

root = et.parse(file).getroot()
ns = root.nsmap

# with find()
dbPr = root.find(".//dbPr", ns)

for attr in dbPr.keys():
    print(f"{attr}: {dbPr.get(attr)}")

###############################
# Alternativ via xpath directly
#
# set proxy for empty prefix
ns_map ={}
ns_map['root_ns'] = root.nsmap.get(None)

# use xpath direct to the attribute value
dbPr_con = root.xpath(".//root_ns:dbPr/@connection", namespaces=ns_map)
print("Connection:", *dbPr_con)

输出:

connection: Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=""
command: SELECT * FROM [CargoData_small]

Connection: Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=""