
How can I extract xml_attr and xml_text at different levels using xml2 and purrr?

  •  2
  • kabr hrbrmstr  · Tech Community  · 8 years ago

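    I want to extract information from an XML file and convert it into a data frame.

    The information is stored in nested nodes, both as XML text and as XML attributes.

    Example structure: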

    <xmlnode node-id = "Text about xmlnode">
        <xmlsubnode subnode-id = "123">
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
        </xmlsubnode>
        <xmlsubnode subnode-id = "456">
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
        </xmlsubnode>
    </xmlnode>
    <xmlnode node-id = "Text about xmlnode">
        <xmlsubnode subnode-id = "123">
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
        </xmlsubnode>
        <xmlsubnode subnode-id = "456">
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
            <xmlsubsubnode>
                I want to extract this text
            </xmlsubsubnode>    
        </xmlsubnode>
    </xmlnode>
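
    The fragment above has two top-level xmlnode elements, while an XML document needs exactly one root element. To experiment without a file, a shortened copy can be wrapped in a single root and parsed from a string. This is only a sketch: the `<root>` wrapper, the shortened content, and the distinct node-id values are assumptions made for illustration.

```r
library(xml2)

# Shortened, hypothetical copy of the fragment (two top-level elements)
fragment <- '
<xmlnode node-id="Text about xmlnode 1">
  <xmlsubnode subnode-id="123">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
</xmlnode>
<xmlnode node-id="Text about xmlnode 2">
  <xmlsubnode subnode-id="456">
    <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
  </xmlsubnode>
</xmlnode>'

# Wrap the fragment in a single, assumed <root> element before parsing
doc <- read_xml(paste0("<root>", fragment, "</root>"))

# xml_attr() is vectorised over a nodeset: one value per xmlnode
node_ids <- xml_attr(xml_find_all(doc, "//xmlnode"), "node-id")
print(node_ids)
```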
    

    I would like to obtain the following information:

    * node-id (attribute)
    * subnode-id (attribute)
    * text in `xmlsubsubnode` (text)
    

    I need a data frame in long format like this:

    node-id                 subnode-id  text
    Text about xmlnode 1    123         I want to extract this text
    Text about xmlnode 1    123         I want to extract this text
    Text about xmlnode 1    123         I want to extract this text
    Text about xmlnode 1    123         I want to extract this text
    Text about xmlnode 1    456         I want to extract this text
    Text about xmlnode 1    456         I want to extract this text
    Text about xmlnode 1    456         I want to extract this text
    Text about xmlnode 1    456         I want to extract this text
    Text about xmlnode 2    123         I want to extract this text
    Text about xmlnode 2    123         I want to extract this text
    Text about xmlnode 2    123         I want to extract this text
    Text about xmlnode 2    123         I want to extract this text
    Text about xmlnode 2    456         I want to extract this text
    Text about xmlnode 2    456         I want to extract this text
    Text about xmlnode 2    456         I want to extract this text
    Text about xmlnode 2    456         I want to extract this text
    

    I tried to follow Jenny Bryan's approach, "How to tame XML with nested data frames and purrr", but it only works at the first level.

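    library(xml2)
    library(dplyr)
    library(purrr)

    xml <- read_xml("input/example.xml")
    rows <- xml %>% xml_find_all("//xmlnode")
    # tibble() replaces the deprecated data_frame(); the nodes are stored
    # as a plain list-column, one xmlnode per row
    rows_df <- tibble(row = seq_along(rows), nodeset = map(row, ~ rows[[.x]]))
    rows_df %>%
      # map_chr() returns a character column, so no unnest() is needed
      mutate(node_id = map_chr(nodeset, ~ xml_attr(.x, "node-id"))) %>%
      select(row, node_id)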
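
    A quick way to see why an attempt that starts from `//xmlnode` stops at the first level is to count the matches at each level: the number of rows in the result follows the nodeset the pipeline starts from, and the desired long format needs one row per xmlsubsubnode. The sketch below uses a shortened, hypothetical copy of the example wrapped in an assumed `<root>` element.

```r
library(xml2)

# Hypothetical, shortened copy of the example, wrapped in an assumed <root>
doc <- read_xml('
<root>
  <xmlnode node-id="Text about xmlnode 1">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
      <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
    </xmlsubnode>
    <xmlsubnode subnode-id="456">
      <xmlsubsubnode>I want to extract this text</xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
</root>')

# Number of matches per level: a data frame built from a nodeset gets
# one row per matched node at that level
level_counts <- c(
  xmlnode       = length(xml_find_all(doc, "//xmlnode")),
  xmlsubnode    = length(xml_find_all(doc, "//xmlsubnode")),
  xmlsubsubnode = length(xml_find_all(doc, "//xmlsubsubnode"))
)
print(level_counts)
```

    Starting from the deepest level and looking upwards for the attributes therefore gives the right number of rows without any unnesting.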
    

    Do you have any ideas on how to get this information with purrr?

    1 answer  |  as of 8 years ago
        1
  •  5
  •   Felix Ebert    8 years ago

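    Here is an approach that does not require unnesting or adding rows to another data frame: create a data frame with one row per xmlsubsubnode, then use purrr together with xml2 to select the xmlsubnode parent and the xmlnode ancestor of each row and extract their attributes.

    Working example: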

    library(dplyr)
    library(xml2)
    library(purrr)

    xml <- read_xml("input/example.xml")
    rows <- xml %>% xml_find_all("//xmlsubsubnode")
    # One row per xmlsubsubnode; the nodes are kept in a plain list-column
    rows_df <- tibble(node = map(seq_along(rows), ~ rows[[.x]])) %>%
      mutate(
        # attribute of the xmlnode ancestor
        node_id = map_chr(node, ~ xml_attr(xml_find_first(.x, "ancestor::xmlnode"), "node-id")),
        # attribute of the direct xmlsubnode parent
        subnode_id = map_chr(node, ~ xml_attr(xml_parent(.x), "subnode-id")),
        # text content, trimmed of surrounding whitespace
        text = map_chr(node, ~ xml_text(.x, trim = TRUE))
      ) %>%
      select(-node)
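
    The same bottom-up idea can also be sketched without purrr, because `xml_attr()`, `xml_text()` and `xml_find_first()` accept a whole nodeset and return one result per input node. This is a swapped-in variant, not the code above: it uses a base `data.frame`, a shortened inline document, and an assumed `<root>` wrapper, all chosen for illustration.

```r
library(xml2)

# Hypothetical, shortened copy of the example, wrapped in an assumed <root>
doc <- read_xml('
<root>
  <xmlnode node-id="Text about xmlnode 1">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
    </xmlsubnode>
    <xmlsubnode subnode-id="456">
      <xmlsubsubnode>
        I want to extract this text
      </xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
</root>')

# Start from the deepest level: one entry per xmlsubsubnode
leaves <- xml_find_all(doc, "//xmlsubsubnode")

# xml_find_first() returns one node per input node, so all columns
# have the same length and line up row by row
result <- data.frame(
  node_id    = xml_attr(xml_find_first(leaves, "ancestor::xmlnode"), "node-id"),
  subnode_id = xml_attr(xml_find_first(leaves, "parent::xmlsubnode"), "subnode-id"),
  text       = xml_text(leaves, trim = TRUE),
  stringsAsFactors = FALSE
)
print(result)
```

    `xml_find_first()` with the `parent::` axis is used here rather than `xml_parent()` on the nodeset, so that every leaf keeps its own row even when several leaves share a parent.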
    