Parsing XML with pandas

  • Hariom Singh  · Tech Community  · 7 years ago

    I am trying to parse an XML file and then represent it as a pandas DataFrame.

    <?xml version="1.0"?><results>
    <header>
      <cloc_url>github.com/AlDanial/cloc</cloc_url>
      <cloc_version>1.74</cloc_version>
      <elapsed_seconds>0.940369129180908</elapsed_seconds>
      <n_files>124</n_files>
      <n_lines>8440</n_lines>
      <files_per_second>131.863112209998</files_per_second>
      <lines_per_second>8975.19892784178</lines_per_second>
      <report_file>/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem_cloc.xml</report_file>
    </header>
    <files>
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-parent/pom.xml" blank="13" comment="23" code="491"  language="Maven" />
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-core/aem-core-bundle/src/test/resources/assets.json" blank="0" comment="0" code="357"  language="JSON" />
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-core/aem-core-bundle/src/main/java/com/chute/aem/core/api/impl/UserServiceImpl.java" blank="26" comment="21" code="202"  language="Java" />
    

    The output should look something like this:

    file name                                 blank  comment language code
    Repo/ignite-chute-aem/aem-parent/pom.xml"  "13"   "23"     Maven   491
    <fullpath>/assets.json"                     "12"   "3"      c       432
    

    I have only managed to write a few lines so far:

    import pandas as pd
    from xml.etree import ElementTree
    tree = ElementTree.parse('/Users/hariomsingh/Desktop/individualxml/ignite-chute-aem_cloc.xml')
    root = tree.getroot()
    
    print(root)
    print(tree.iter())
    
    csv_data = []
    fields =  ['file name','blank','comment', 'language', 'code']
    
    1 reply  |  last activity 7 years ago

  •   TSeymour    7 years ago

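    Assuming you are fine with installing beautifulsoup4 (i.e., pip3 install beautifulsoup4 ) and pandas (i.e., pip3 install pandas ), this should do it. Note that the 'lxml' parser used below also needs the lxml package (i.e., pip3 install lxml ):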

    from bs4 import BeautifulSoup as Soup
    import pandas
    
    xml = """
    <?xml version="1.0"?><results>
    <header>
      <cloc_url>github.com/AlDanial/cloc</cloc_url>
      <cloc_version>1.74</cloc_version>
      <elapsed_seconds>0.940369129180908</elapsed_seconds>
      <n_files>124</n_files>
      <n_lines>8440</n_lines>
      <files_per_second>131.863112209998</files_per_second>
      <lines_per_second>8975.19892784178</lines_per_second>
      <report_file>/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem_cloc.xml</report_file>
    </header>
    <files>
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-parent/pom.xml" blank="13" comment="23" code="491"  language="Maven" />
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-core/aem-core-bundle/src/test/resources/assets.json" blank="0" comment="0" code="357"  language="JSON" />
      <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-core/aem-core-bundle/src/main/java/com/chute/aem/core/api/impl/UserServiceImpl.java" blank="26" comment="21" code="202"  language="Java" />
    """
    
    soup = Soup(xml, 'lxml')
    
    records = []
    
    for file in soup.find_all('file'):
        records.append(file.attrs)
    
    data_table = pandas.DataFrame(records)
    
    # this prints the table without the long file name to ease seeing all other fields
    print(data_table.drop('name', axis=1))
    
    # this prints just the names (or at least the bit that pandas prints by default)
    print(data_table['name'])
    
    # saving them to disk so you can see the entire table in excel or similar
    data_table.to_csv('output.csv', index=False)
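
    Since the question already starts from xml.etree.ElementTree, here is a minimal sketch of the same idea using only the standard-library parser plus pandas. The function name cloc_files_to_frame is hypothetical, and the sample XML is a shortened copy of the report from the question with the closing </files> and </results> tags assumed (the excerpt above is cut off before them). It also converts the count columns to integers, which the BeautifulSoup version leaves as strings.

```python
import io
from xml.etree import ElementTree

import pandas as pd

# Shortened copy of the cloc report from the question. The closing
# </files> and </results> tags are an assumption: the excerpt in the
# question stops before them, and ElementTree needs well-formed XML.
XML = """<?xml version="1.0"?><results>
<files>
  <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-parent/pom.xml" blank="13" comment="23" code="491"  language="Maven" />
  <file name="/Users/hariomsingh/Desktop/ignitechute/Repo/ignite-chute-aem/aem-core/aem-core-bundle/src/test/resources/assets.json" blank="0" comment="0" code="357"  language="JSON" />
</files>
</results>
"""


def cloc_files_to_frame(source):
    """Build a DataFrame with one row per <file> element.

    `source` may be a file path or a file-like object, as accepted by
    ElementTree.parse. (Hypothetical helper, not part of any library.)
    """
    tree = ElementTree.parse(source)
    # Each <file> element keeps its data in attributes, so .attrib is
    # already the row we want.
    records = [dict(element.attrib) for element in tree.getroot().iter("file")]
    frame = pd.DataFrame(
        records, columns=["name", "blank", "comment", "language", "code"]
    )
    # XML attributes are always strings; convert the counts to integers.
    count_columns = ["blank", "comment", "code"]
    frame[count_columns] = frame[count_columns].astype(int)
    return frame


df = cloc_files_to_frame(io.StringIO(XML))

# Print the table without the long file names so the other fields are visible.
print(df.drop(columns="name"))
```

    To read the real report, pass its path instead of the StringIO object, e.g. cloc_files_to_frame('ignite-chute-aem_cloc.xml'), and call df.to_csv('output.csv', index=False) as in the answer above if you want a CSV file.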