
Separating a column in R using large runs of whitespace in a string

  •  2
  • antecessor  · 7 years ago

    example <- "4.6             (19 ratings)                                                         Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.                                                                                                            151 students enrolled                                                                    "
    df <- data.frame(example)
    

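    As you can see, the first observation consists of a single string containing 4 different parts: the rating (4.6), the number of ratings (19 ratings), a sentence (Course ... accurately.), and the students enrolled (151).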

    I used the separate() function to split this column into 4 columns:

    df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep = "     ")
    

    However, this does not work as expected.

    Any ideas?

    Update

    This is what I get after following nicola's comment:

    > df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep=" {4,}")
    Warning message:
    Expected 4 pieces. Additional pieces discarded in 1 rows [1].
    
    3 Answers  |  7 years ago
        1
  •  1
  •   Roman    7 years ago

    How about this:

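    library(stringr)
    library(tibble)
    library(magrittr)  # provides %>%
    # split on every pair of spaces, then drop the empty pieces
    x <- str_split(example, "  ") %>%
        unlist()
    x <- x[x != ""]
    df <- tibble("a", "b", "c", "d")
    # as.list() assigns one value per column
    df[1, ] <- as.list(x)
    colnames(df) <- c("Rating", "Number of rating", "Sentence", "Students")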
    
    > str(df)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  4 variables:
     $ Rating          : chr "4.6"
     $ Number of rating: chr " (19 ratings)"
     $ Sentence        : chr " Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of ra"| __truncated__
     $ Students        : chr "151 students enrolled"
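
    Some fields in the output above keep a leading space, because splitting on exactly two spaces leaves one space behind whenever a run has an odd length. A minimal base R sketch of that effect and one way to clean it up, using a short made-up string rather than the real data:

```r
# Hypothetical short input: "4.6" and "(19 ratings)" separated by 5 spaces
s <- "4.6     (19 ratings)"

# Splitting on exactly two spaces leaves an empty piece and a stray leading space
pieces <- strsplit(s, "  ", fixed = TRUE)[[1]]

# Drop the empty pieces, then trim the leftover whitespace from each field
clean <- trimws(pieces[pieces != ""])
print(clean)
```

    Calling trimws() on x before filling the tibble would remove the leading spaces in the same way.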
    
        2
  •  0
  •   JBGruber    7 years ago

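    There are two keys to the answer. The first is to use the right regular expression as the separator: sep = "[[:space:]]{2,}", which matches two or more whitespace characters ("\\s{2,}" would be the more common alternative). The second is that your example actually has a lot of trailing whitespace, which separate() tries to put into another column. Just remove it with trimws(). The solution therefore looks like this: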

    library(tidyr)
    library(dplyr)
    
    example <- "4.6             (19 ratings)                                                         Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.                                                                                                            151 students enrolled                                                                    "
    df <- data.frame(example)
    
    df_new <- df %>%
      mutate(example = trimws(example)) %>% 
      separate(col = "example", 
               into = c("rating", "number_of_ratings", "sentence", "students_enrolled"), 
               sep = "[[:space:]]{2,}")
    
    
    
    
    as_tibble(df_new)
        # A tibble: 1 x 4
          rating number_of_ratings sentence                                                                    students_enrolled
          <chr>  <chr>             <chr>                                                                       <chr>            
        1 4.6    (19 ratings)      Course Ratings are calculated from individual students’ ratings and a vari~ 151 students enr~
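
    For comparison, the same two steps (trim, then split on two or more whitespace characters) can be sketched in base R without tidyr. This is only an illustration: the input is a shortened stand-in for the string in the question, and df_base is a name chosen for this example.

```r
# Shortened stand-in for the string from the question (hypothetical data)
example <- "4.6        (19 ratings)        Course Ratings are calculated.        151 students enrolled        "

# Drop leading/trailing whitespace, then split on runs of 2+ whitespace characters
parts <- strsplit(trimws(example), "[[:space:]]{2,}")[[1]]

# Assemble a one-row data frame with the column names used in the answer
df_base <- as.data.frame(
  as.list(setNames(parts, c("rating", "number_of_ratings", "sentence", "students_enrolled"))),
  stringsAsFactors = FALSE
)
print(df_base)
```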
    

        3
  •  0
  •   Roman    7 years ago

    This is certainly possible with the stringr package and some regular expressions:

      rating_mean n_ratings n_students                         descr
    1        4.65        19        151    "Course (...) accurately."
    

    Code

    library(stringr)
    
    # create result data frame
    result <- data.frame(cbind(rating_mean = 0, n_ratings = 0, n_students = 0, descr = 0))
    
    # loop through rows of example data frame
    for (i in seq_len(nrow(example))){
        # collapse runs of whitespace into a single space
        example[i, 1] <- gsub("\\s+", " ", example[i, 1])
        # match and extract mean rating
        result[i, 1] <- as.numeric(str_match(example[i, 1], "^[0-9]+\\.[0-9]+"))
        # match and extract number of ratings
        result[i, 2] <- as.numeric(str_match(str_match(example[i, 1], "\\(.+\\)"), "[0-9]+"))
        # match and extract number of enrolled students
        result[i, 3] <- as.numeric(str_match(str_match(example[i, 1], "\\s[0-9].+$"), "[0-9]+"))
        # match and extract sentence
        result[i, 4] <- str_match(example[i, 1], "[A-Z].+\\.")
    }
    

    Data

    example <- "4.65             (19 ratings)                                                         Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.                                                                                                            151 students enrolled                                                                    "
    example <- data.frame(example, stringsAsFactors = FALSE)
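
    The same extraction can also be sketched without a loop, since base R's sub() is vectorised over a character vector. This is a rough alternative rather than the code above: the patterns assume every row has the layout "rating (N ratings) sentence N students enrolled", and the input is a shortened, made-up string.

```r
# Hypothetical shortened input in the same layout as the question's data
x <- "4.65      (19 ratings)      Course Ratings are accurate.      151 students enrolled     "

# Collapse whitespace runs to one space and trim both ends
x <- trimws(gsub("\\s+", " ", x))

# Pull each field out with a capture group; sub() works on whole vectors
result <- data.frame(
  rating_mean = as.numeric(sub("^([0-9]+\\.[0-9]+).*$", "\\1", x)),
  n_ratings   = as.numeric(sub("^.*\\(([0-9]+) ratings\\).*$", "\\1", x)),
  n_students  = as.numeric(sub("^.* ([0-9]+) students enrolled$", "\\1", x)),
  descr       = sub("^[^A-Z]*([A-Z].+\\.).*$", "\\1", x),
  stringsAsFactors = FALSE
)
print(result)
```

    If a row does not match a pattern, sub() returns the input unchanged and as.numeric() yields NA with a warning, so irregular rows would need to be checked separately.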