
How to create a data frame from a scraped website while preserving the nested structure of the data

read-html-live rvest tidyverse web-scraping r

bill999 · 10 months ago

Say I use read_html_live() from the rvest package to extract some markup that looks like this:

books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>    
  </div>')

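I want to use the rvest package to build a data frame (a tibble is fine too) from the information above. I would like it organized at the author level, so that each row contains one author, the book title, and the year.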

If I only cared about the first author, this would be easy. Something like:

data0 <- books %>% html_elements(".book")
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()
author1 <- data0 %>% html_element(".author") %>% html_text2()
data <- data.frame(title, year, author1)

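However, I actually want to extract all of the authors, which are "children" of each book. The data frame would then have eight rows, one per author. For example, row 8 would contain Book 3, 1845, and Author 8. How can I do this?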

Here is a rough idea, but I am looking for a simpler solution:

data0 <- books %>% html_elements(".book") 
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()

# html_elements() (plural) inside lapply() keeps every author, grouped by book
authors <- lapply(data0, html_elements, ".author")

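The idea would then be to loop over the three elements of authors (one per book), save each one to a data frame, associate each of those author data frames with the relevant title and year, and somehow combine them into one long data frame.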

2 answers

stefan · 10 months ago

Here is one approach that uses lapply to loop over the book nodes:

library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>
  </div>')

data0 <- books %>%
  html_elements(".book") |>
  lapply(\(x) {
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2(),
    )
  }) |>
  bind_rows()

data0
#> # A tibble: 8 × 3
#>   title  year  authors 
#>   <chr>  <chr> <chr>   
#> 1 Book 1 1999  Author 1
#> 2 Book 1 1999  Author 2
#> 3 Book 1 1999  Author 3
#> 4 Book 2 2022  Author 4
#> 5 Book 3 1845  Author 5
#> 6 Book 3 1845  Author 6
#> 7 Book 3 1845  Author 7
#> 8 Book 3 1845  Author 8
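
Since the question title asks about preserving the nested structure, a variant of the same idea is to keep one row per book with the authors in a list-column, and expand it only when author-level rows are needed. The following is a minimal sketch, not part of the original answer: it assumes tidyr is installed, uses a shortened two-book stand-in for books, and the names books_small, nested, and by_author are made up for illustration.

```r
library(rvest)
library(tibble)
library(tidyr)

# Shortened stand-in for the `books` document above (two books only).
books_small <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
  </div>')

book_nodes <- html_elements(books_small, ".book")

# One row per book; `author` is a list-column holding a character
# vector of all authors for that book.
nested <- tibble(
  title  = book_nodes |> html_element(".booktitle") |> html_text2(),
  year   = book_nodes |> html_element(".year") |> html_text2(),
  author = lapply(book_nodes, \(b) b |> html_elements(".author") |> html_text2())
)

# Expand the list-column so that each author gets its own row.
by_author <- unnest_longer(nested, author)
```

Keeping nested around lets book-level operations (for example, counting authors per book with lengths(nested$author)) work on one row per book, while by_author gives the author-level layout asked for in the question.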

LMc · 10 months ago

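This puts the class attributes and the text into a long-format data set of name-value pairs. A book identifier (book) is added to the output data frame to make grouped operations easier (for example, converting to wide format):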

library(rvest)
library(purrr)

book <- html_elements(books, xpath = "//div[@class='book']") 

data <- map_dfr(seq_along(book), \(i) {
  b <- book[[i]]
  children <- html_children(b)
  data.frame(book = i,
             name = children |> html_attrs() |> unlist(use.names = FALSE),
             value = html_text2(children))
})
#    book      name    value
# 1     1 booktitle   Book 1
# 2     1      year     1999
# 3     1    author Author 1
# 4     1    author Author 2
# 5     1    author Author 3
# 6     2 booktitle   Book 2
# 7     2      year     2022
# 8     2    author Author 4
# 9     3 booktitle   Book 3
# 10    3      year     1845
# 11    3    author Author 5
# 12    3    author Author 6
# 13    3    author Author 7
# 14    3    author Author 8

For example, to convert to wide format with one row per book:

library(tidyr)

pivot_wider(data, id_cols = book, values_fn = toString)
#    book booktitle year  author                             
# 1     1 Book 1    1999  Author 1, Author 2, Author 3          
# 2     2 Book 2    2022  Author 4                              
# 3     3 Book 3    1845  Author 5, Author 6, Author 7, Author 8
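
If the goal is the author-level layout from the question rather than one row per book, the long name-value data can be split into book-level fields and author rows, then joined back on book. This is only a sketch under stated assumptions: the data frame long is a hand-typed stand-in for the first two books of data above, and the names long, meta, authors, and by_author are hypothetical.

```r
library(tidyr)

# Hand-typed stand-in for the first two books of the long `data` above.
long <- data.frame(
  book  = c(1, 1, 1, 1, 2, 2, 2),
  name  = c("booktitle", "year", "author", "author",
            "booktitle", "year", "author"),
  value = c("Book 1", "1999", "Author 1", "Author 2",
            "Book 2", "2022", "Author 4")
)

is_author <- long$name == "author"

# Book-level fields (title and year): one row per book, in wide format.
meta <- pivot_wider(long[!is_author, ], id_cols = book,
                    names_from = name, values_from = value)

# Author rows: keep the book id and store the value as `author`.
authors <- data.frame(book   = long$book[is_author],
                      author = long$value[is_author])

# Join on `book` so that each author row carries its title and year.
by_author <- merge(meta, authors, by = "book")
```

The join repeats the title and year for every author of a book, which is the eight-row shape described in the question when applied to the full data.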