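I want to extract information from an XML file and convert it into a data frame.
The information is stored in nested nodes, both as XML text and as XML attributes.
Example structure: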
<xmlnode node-id = "Text about xmlnode">
<xmlsubnode subnode-id = "123">
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
</xmlsubnode>
<xmlsubnode subnode-id = "456">
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
</xmlsubnode>
</xmlnode>
<xmlnode node-id = "Text about xmlnode">
<xmlsubnode subnode-id = "123">
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
</xmlsubnode>
<xmlsubnode subnode-id = "456">
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
<xmlsubsubnode>
I want to extract this text
</xmlsubsubnode>
</xmlsubnode>
</xmlnode>
I want to extract the following information:
* node-id (attribute)
* subnode-id (attribute)
* text in `xmlsubsubnode` (text)
I need a long-format data frame like this:
node-id               subnode-id  text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  123         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 1  456         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  123         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
Text about xmlnode 2  456         I want to extract this text
I tried to follow Jenny Bryan's approach from "How to tame XML with nested data frames and purrr", but it only works at the first level (the `xmlnode` attributes):
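library(xml2)
library(dplyr)
library(purrr)
library(tidyr)

xml <- read_xml("input/example.xml")

# One row per <xmlnode> element
rows <- xml %>% xml_find_all("//xmlnode")
rows_df <- tibble(row = seq_along(rows), nodeset = rows)

# Pulls out the node-id attribute only; the nested levels are not reached
rows_df %>%
  mutate(node_id = nodeset %>% map(~ xml_attr(., "node-id"))) %>%
  select(row, node_id) %>%
  unnest(node_id)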
Do you have any ideas on how to get this information with purrr?
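To make the goal concrete, here is a minimal sketch of one direction that might work: start from the deepest level (`xmlsubsubnode`) and walk up to the parents to pick up their attributes, so that each leaf becomes one row. It rests on assumptions not shown above: the real file wraps all `xmlnode` elements in a single root element (called `root` here, a made-up name), and the inline sample texts are placeholders. It uses base `lapply()` rather than purrr just to keep the dependencies small.

```r
library(xml2)

# Assumption: a single root element (hypothetically named <root>) wraps
# the <xmlnode> elements; this small inline document stands in for the file.
xml <- read_xml('<root>
  <xmlnode node-id="Text about xmlnode 1">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>first text</xmlsubsubnode>
      <xmlsubsubnode>second text</xmlsubsubnode>
    </xmlsubnode>
    <xmlsubnode subnode-id="456">
      <xmlsubsubnode>third text</xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
  <xmlnode node-id="Text about xmlnode 2">
    <xmlsubnode subnode-id="123">
      <xmlsubsubnode>fourth text</xmlsubsubnode>
    </xmlsubnode>
  </xmlnode>
</root>')

# Every leaf element corresponds to exactly one row of the long data frame
leaves <- xml_find_all(xml, "//xmlsubsubnode")

rows <- lapply(leaves, function(leaf) {
  subnode <- xml_parent(leaf)     # enclosing <xmlsubnode>
  node    <- xml_parent(subnode)  # enclosing <xmlnode>
  data.frame(
    node_id    = xml_attr(node, "node-id"),
    subnode_id = xml_attr(subnode, "subnode-id"),
    text       = xml_text(leaf, trim = TRUE),
    stringsAsFactors = FALSE
  )
})

# Stack the one-row data frames into the long format
long_df <- do.call(rbind, rows)
print(long_df)
```

Going bottom-up avoids nested unnesting altogether, but it does assume the nesting depth is fixed at exactly three levels; whether the same idea translates cleanly into a purrr pipeline is the part I am unsure about.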