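This puts the class attributes and text into a long-format data frame of name-value pairs. A book identifier (book) is added to the output data frame to make grouped operations easier (for example, converting to wide format):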
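library(rvest)
library(purrr)

# `books` is assumed to be the already-parsed HTML document (e.g. from read_html())
book <- html_elements(books, xpath = "//div[@class='book']")

data <- map_dfr(seq_along(book), \(i) {
  b <- book[[i]]
  children <- html_children(b)
  data.frame(book = i,
             # each child's attribute value (its class) becomes the name
             name = children |> html_attrs() |> unlist(use.names = FALSE),
             value = html_text2(children))
})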
#    book      name    value
# 1     1 booktitle   Book 1
# 2     1      year     1999
# 3     1    author Author 1
# 4     1    author Author 2
# 5     1    author Author 3
# 6     2 booktitle   Book 2
# 7     2      year     2022
# 8     2    author Author 4
# 9     3 booktitle   Book 3
# 10    3      year     1845
# 11    3    author Author 5
# 12    3    author Author 6
# 13    3    author Author 7
# 14    3    author Author 8
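For example, to convert to wide format with one row per book, collapsing repeated names (such as multiple authors) into a single comma-separated string: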
library(tidyr)
pivot_wider(data, id_cols = book, values_fn = toString)
#   book booktitle year  author
# 1    1 Book 1    1999  Author 1, Author 2, Author 3
# 2    2 Book 2    2022  Author 4
# 3    3 Book 3    1845  Author 5, Author 6, Author 7, Author 8
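
To try the whole pipeline without the original page, here is a minimal self-contained sketch. The HTML is hypothetical: it is reconstructed from the output shown above (a div with class "book" whose children carry the classes booktitle, year and author), and the real markup may differ. It also swaps in html_attr(children, "class") for html_attrs() |> unlist(), which I think is a bit safer if a child happens to carry more than one attribute (or none), since it always returns exactly one value per child.

```r
library(rvest)
library(purrr)
library(tidyr)

# Hypothetical input: markup guessed from the output above, not the asker's real page.
html <- '
<div class="book">
  <span class="booktitle">Book 1</span>
  <span class="year">1999</span>
  <span class="author">Author 1</span>
  <span class="author">Author 2</span>
</div>
<div class="book">
  <span class="booktitle">Book 2</span>
  <span class="year">2022</span>
  <span class="author">Author 4</span>
</div>'
books <- minimal_html(html)

# One node per book
book <- html_elements(books, xpath = "//div[@class='book']")

# Long format: one row per child element, tagged with the book index
long <- map_dfr(seq_along(book), \(i) {
  children <- html_children(book[[i]])
  data.frame(book  = i,
             name  = html_attr(children, "class"),  # one value per child, NA if missing
             value = html_text2(children))
})

# Wide format: one row per book, repeated names joined with ", "
wide <- pivot_wider(long, id_cols = book, values_fn = toString)
```

With this sample input, long should have seven rows and wide$author should be "Author 1, Author 2" for the first book and "Author 4" for the second.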