代码之家  ›  专栏  ›  技术社区  ›  user9238790

将XML转换为数据帧

  •  0
  • user9238790  · 技术社区  · 7 年前

    我正在尝试将[网页][1]中的表格读入熊猫数据帧。 pandas.read_html 返回空表列表,因为HTML中的表确实是空的。它们可能是动态填充的。

    某人 suggested 数据源可以是如下所示的[XML][3]:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Items xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Item>
            <ID>Content ID
    Unique identifier for use of tool</ID>
            <Type>Product Type
    1. sample name
    2. sample name
    3. sample name
    4. sample name
    5. sample name
    6. sample name
    7. Accessories</Type>
            <Name>Sub Category
    Name of checkbox item and announcement subhead</Name>
            <PubDate>Published Date
    Text type - not Date
    
    Spell out month name completely</PubDate>
            <Desc>Description
    Enter a full description</Desc>
            <Notes>Special Notes
    To appear under recommendation table</Notes>
            <Image>Product Image
    Enter entire URL path to image or provide the  image through email.
    </Image>
            <LinkA>Announcement URL
    Enter file name, no spaces</LinkA>
            <LinkB>Product URL
    Enter full URL</LinkB>
            <TableA>Product SKU
    Enter product number being discontinued</TableA>
            <TableB>Product Description
    Enter product description for original product</TableB>
            <TableC>Replacement Product SKU
    Enter product number to replace discontinued product
    
    Only use this column when multiple country skus are not needed</TableC>
            <TableD>Replacement Product Description
    Enter product description for replacement product
    
    Only use this column when multiple country skus are not needed</TableD>
            <TableE>Custom Header 1
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableE>
            <TableF>Custom Header 2
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableF>
            <TableG>Custom Header 3
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableG>
            <TableH>Custom Header 4
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableH>
            <TableI>Custom Header 5
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableI>
            <TableJ>Custom Header 6
    
    Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableJ>
        </Item>
        <Item>
            <ID>1</ID>
            <Type>1</Type>
            <Name>xx sample namexx</Name>
            <PubDate>June 1, 2011</PubDate>
            <Desc>xx Sample Description xx.</Desc>
            <Image>a3100-24.png</Image>
            <LinkA>HP-A3100SI-ES-Announcement.pdf</LinkA>
        </Item>
        <Item>
            <TableA>JD298A</TableA>
            <TableB>xx Sample Table Name xx</TableB>
            <TableC>N/A</TableC>
        </Item>
        <!-- other Item nodes -->
    </Items>
    

    如何将此XML转换为数据帧?

    1 回复  |  直到 7 年前
        1
  •  2
  •   Parfait    7 年前

    任何时候,如果您使用复杂的XML并需要更简单的结构,例如具有二维逐列行的扁平数据帧,那么您应该考虑 XSLT ,这是一种专用语言,旨在将XML文件转换为其他XML、HTML,甚至文本文件,如下所示!Python的 lxml 可以运行XSLT 1.0脚本。

    下面,XSLT生成一个以管道分隔的文本文件,其中包含命名列,然后将其导入Pandas。然而,这种XML的挑战在于 身份证件 -已识别 项目 不是它的孩子,而是它的兄弟姐妹。因此,必须为 Muenchian Grouping ,一种借鉴自 XSLT Grouping Sibling @Tomalak's answer

    XSLT (另存为.xsl文件,一个特殊的.xml文件,要导入)

    由于需要映射空单元格,脚本将耗尽所有可能的列,从而耗尽其长度。

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
       <xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
    
       <xsl:key name="item_key" match="Item" use="generate-id(preceding-sibling::Item[count(ID) &gt; 0][1])"/>
    
       <xsl:template match ="/Items">  
            <!-- COLUMN HEADERS -->
            <xsl:text>ID|Type|Name|PubDate|Desc|Notes|Image|LinkA|LinkB|TableA|TableB|TableC|TableD|TableE|TableF|TableG|TableH|TableI|TableJ&#xa;</xsl:text>
            <xsl:apply-templates select="Item[count(ID) > 0 and not(contains(ID, 'Content'))]"/>      
       </xsl:template>
    
       <xsl:template match ="Item">   
        <!-- INDICATORS TO REPEAT ACROSS RELATED ROWS -->
        <xsl:variable name="ID" select="normalize-space(ID)"/>
        <xsl:variable name="Type" select="normalize-space(Type)"/>
        <xsl:variable name="Name" select="normalize-space(Name)"/>
        <xsl:variable name="PubDate" select="normalize-space(PubDate)"/>
        <xsl:variable name="Desc" select="normalize-space(Desc)"/>
        <xsl:variable name="Notes" select="normalize-space(Notes)"/>
        <xsl:variable name="Image" select="normalize-space(Image)"/>
        <xsl:variable name="LinkA" select="normalize-space(LinkA)"/>
        <xsl:variable name="LinkB" select="normalize-space(LinkB)"/>
    
            <!-- ITEM ID NODES -->
            <xsl:value-of select="$ID"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Type"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Name"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$PubDate"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Desc"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Notes"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Image"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$LinkA"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$LinkB"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableA)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableB)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableC)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableD)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableE)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableF)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableG)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableH)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableI)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableJ)"/><xsl:text>|</xsl:text>
            <xsl:text>&#xa;</xsl:text>             <!-- LINE BREAK -->      
    
            <!-- ALL RELATED NODES TO ITEM ID -->
            <xsl:for-each select="key('item_key', generate-id())[position() != last()]" >
                <xsl:value-of select="$ID"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$Type"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$Name"/><xsl:text>|</xsl:text>    
                <xsl:value-of select="$PubDate"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$Desc"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$Notes"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$Image"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$LinkA"/><xsl:text>|</xsl:text>
                <xsl:value-of select="$LinkB"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableA)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableB)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableC)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableD)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableE)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableF)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableG)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableH)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableI)"/><xsl:text>|</xsl:text>
                <xsl:value-of select="normalize-space(TableJ)"/><xsl:text>|</xsl:text>
                <xsl:text>&#xa;</xsl:text>             <!-- LINE BREAK -->      
            </xsl:for-each>
    
       </xsl:template>
    
    </xsl:stylesheet>
    

    python (运行XSLT,保存CSV,导入Pandas)

    import pandas as pd
    from lxml import etree
    
    url = "http://h17007.www1.hpe.com/data/xml/eos/eos.xml?a=0.9317168944148095.xml"
    
    # LOAD XML AND XSL
    xml = etree.parse(url)
    xsl = etree.parse("XSLT_Script.xsl")
    
    # TRANSFORM SOURCE
    transformer = etree.XSLT(xsl)    
    result = transformer(xml)
    
    # SAVE PIPE-DELIMITED FILE
    with open("Output.txt", 'wb') as f:
        f.write(result)
    
    # IMPORT PIPE-DELIMITED FILE
    hp_df = pd.read_table("Output.txt", sep="|", index_col=False)
    
    # ALTERNATIVE: IMPORT DIRECTLY (BYPASS .TXT SAVE)
    from io import StringIO
    
    hp_df = pd.read_table(StringIO(str(result)), sep="|", index_col=False)
    

    输出 (熊猫进口前)

    ID|Type|Name|PubDate|Desc|Notes|Image|LinkA|LinkB|TableA|TableB|TableC|TableD|TableE|TableF|TableG|TableH|TableI|TableJ
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||||||||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD298A|HP 1 Port Gig-T 3100 SI Module|N/A||||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD299A|HP 1 Port Gig-LX SC 3100 SI Module|N/A||||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD300A|HP 1 Port Gig-SX SC 3100 SI Module|N/A||||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD301A|HP 1-Port 10/100Base-T POE 3100 SI Module|N/A||||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD308A|HP A3100-16 SI Switch with 2 Module Slots|JD305A|HP A3100-16 SI Switch|||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD309A|HP A3100-24 SI Switch with 2 Slots|JD306A|HP A3100-24 SI Switch|||||||
    1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JF444A|3100 series module|N/A||||||||
    2|2|HP V10ag Wireless Access Point (NA only)|July 26, 2010|The HP V10ag Wireless Access Point has provided secure, reliable 802.11a and 802.11b/g wireless connectivity for small business networks since 2007. Due to the availability of the next generation 802.11n technology and the introduction of the HP V-M200 802.11n Access Point, HP networking is announcing the End of Sale of the HP V10ag Wireless Access Point (J9140A). For specific product rollover details see the announcement.||WAP10ag-1.png|10agAnnouncement-AM-only.pdf|http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/12883-12883-1137927-3836040-4172284-3637595.html?jumpid=reg_R1002_USEN|||||||||||
    3|2|HP ProCurve Mobility Access Point Series - M110|September 2, 2009|<b>MAC Address Schema Change:</b> We are finalizing the integration of Colubris (previous acquisition) products by transitioning MAC Address assignments to HP ProCurve MAC address assignments. HP will be doing a Product Roll to support this requirement. <b>HP ProCurve Statement on New DFS EU Standards</b> As of July 1st 2010, all wireless devices sold in the EU countries and any country that participates in the EU free market, must meet stringent Dynamic Frequency Selection (DFS) requirements for radar detection and avoidance. HP will be doing a Product Roll to support this requirement. For specific product roll details see our MAC Address A-to-B Roll and DFS Disablement Announcement.||M110_100x100.png|A-to-BRollforVariousHPProCurveAccessPoints.pdf|http://h20195.www2.hp.com/v2/GetDocument.aspx?docname=4AA0-8273ENW&cc=en&lc=en|||||||||||
    

    要筛选出此主数据帧,请使用pandas方法:

    # SPECIFIC PRODUCT WITH [...]
    filtered_df = hp_df[hp_df['Name'] == 'HPE 1410 Fast Ethernet Switches']
    
    # SPECIFIC PRODUCT WITH .query()
    filtered_df = hp_df.query("Name == 'HPE FlexNetwork 5940 Switch Series'")
    
    # PASS A LIST WITH .isin()
    filtered_df = hp_df[hp_df['Name'].isin(['HPE FlexNetwork 5120 SI Switch Series',
                                            'HPE 1410 Fast Ethernet Switches',
                                            'HPE OfficeConnect 1910 Switch Series'])]