Count how often the words in a vector occur in a string

stringr pattern-matching string r

Ben · 4 months ago

I have a string of text and a vector of words:

String: "Auch ein blindes Huhn findet einmal ein Korn."
Vector: "auch", "ein"

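I want to check how often each word in the vector occurs in the string and compute the sum of those frequencies. For this example, the correct result would be 3.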

I am already able to check which words occur in the string and sum up the result:

library(stringr)
deu <- c("\\bauch\\b", "\\bein\\b")
str_detect(tolower("Auch ein blindes Huhn findet einmal ein Korn."), deu)

[1] TRUE TRUE

sum(str_detect(tolower("Auch ein blindes Huhn findet einmal ein Korn."), deu))

[1] 2

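Unfortunately, str_detect does not return the number of occurrences (1, 2) but only whether each word occurs in the string (TRUE, TRUE), so the sum of the str_detect output does not equal the number of word occurrences.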

Is there a function in R similar to preg_match_all in PHP?

preg_match_all("/\bauch\b|\bein\b/i", "Auch ein blindes Huhn findet einmal ein Korn.", $matches);
print_r($matches);

Array
(
    [0] => Array
        (
            [0] => Auch
            [1] => ein
            [2] => ein
        )

)

echo preg_match_all("/\bauch\b|\bein\b/i", "Auch ein blindes Huhn findet einmal ein Korn.", $matches);

3

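I would like to avoid loops.

I have looked at many similar questions, but they either do not count the number of occurrences or do not use a vector of patterns for the search. I may have overlooked a question that answers mine, but before you mark this as a duplicate, please make sure the "duplicate" really asks exactly the same question. Thank you very much.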

3 answers | as of 4 months ago

Tim G · 4 months ago

You can use str_count, like this:

stringr::str_count(tolower("Auch ein blindes Huhn findet mal ein Korn"), paste0("\\b", tolower(c("ein","Huhn")), "\\b"))
[1] 2 1
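
Applied to the data from the question, the same idea might look like the sketch below. It assumes the stringr package is installed; the variable names (s, v, patterns, per_word, total, alternation, matches) are illustrative and not part of the original answer. The second half uses a single alternation pattern with str_extract_all, which is probably the closest stringr counterpart to preg_match_all because it returns the matched text rather than only the counts.

```r
library(stringr)

s <- "Auch ein blindes Huhn findet einmal ein Korn."
v <- c("auch", "ein")

# One pattern per word, wrapped in word boundaries so that
# "ein" does not match inside "einmal"
patterns <- paste0("\\b", v, "\\b")

# Per-word counts (str_count is vectorised over the patterns),
# then the total the question asks for
per_word <- setNames(str_count(tolower(s), patterns), v)
total <- sum(per_word)

# A single case-insensitive alternation pattern returns the matches themselves
alternation <- regex(paste0("\\b(", paste(v, collapse = "|"), ")\\b"),
                     ignore_case = TRUE)
matches <- str_extract_all(s, alternation)[[1]]

print(per_word)
print(total)
print(matches)
```

Here length(matches) equals total, so either form gives the overall count without an explicit loop.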

jay.sf · 4 months ago

You can build the patterns with sprintf by adding \\b for the word boundaries, and then use lengths on the result of gregexpr.

> vp <- v |> sprintf(fmt='\\b%s\\b') |> setNames(v) |> print()
        auch          ein 
"\\bauch\\b"  "\\bein\\b" 
> lapply(vp, gregexpr, text=tolower(string)) |> unlist(recursive=FALSE) |> lengths()
auch  ein 
   1    2

The |> print() is only there to assign and print at the same time and can be removed.

Data:

string <- "Auch ein blindes Huhn findet einmal ein Korn."
v <- c("auch", "ein")
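
One caveat with the lengths approach: gregexpr returns -1 for a pattern without any match, and that result still has length 1, so a missing word would be counted once. A minimal base R sketch that counts only real match positions is shown below; the extra word "katze" and the names v2, vp2, via_lengths and counts are made up for illustration, and vapply is used as an implicit loop in the same way lapply is above.

```r
string <- "Auch ein blindes Huhn findet einmal ein Korn."
v2 <- c("auch", "ein", "katze")  # "katze" does not occur in the string

# Same pattern construction as above
vp2 <- setNames(sprintf("\\b%s\\b", v2), v2)

# lengths() on the gregexpr results reports 1 for "katze",
# because a failed match is encoded as the single value -1
via_lengths <- lengths(unlist(lapply(vp2, gregexpr, text = tolower(string)),
                              recursive = FALSE))

# Counting only positive match positions gives 0 for "katze"
counts <- vapply(vp2, function(p) {
  m <- gregexpr(p, tolower(string), perl = TRUE)[[1]]
  sum(m > 0)
}, integer(1))

print(via_lengths)
print(counts)
print(sum(counts))
```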

ThomasIsCoding · 4 months ago

Given a string and patterns such as

s <- "Auch ein blindes Huhn findet einmal ein Korn."
p <- c("auch", "ein")

you can try strsplit + %in%:

Option 1 (get the total number of occurrences)

> sum(gsub("\\W", "", strsplit(tolower(s), " ")[[1]]) %in% p)
[1] 3

Option 2 (use table if you want a summary of the counts)

> table(gsub("\\W", "", strsplit(tolower(s), " ")[[1]]))[p]

auch  ein
   1    2
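
A word that never occurs has no entry in the table, so indexing by name does not yield a zero count for it. One possible variant, sketched here under the assumption that explicit zeros are wanted, converts the tokens to a factor whose levels are the search words; the extra word "katze" and the names p2, words and counts are illustrative only.

```r
s <- "Auch ein blindes Huhn findet einmal ein Korn."
p2 <- c("auch", "ein", "katze")  # "katze" does not occur in the string

# Split on spaces, lower-case, and strip non-word characters such as the final "."
words <- gsub("\\W", "", strsplit(tolower(s), " ")[[1]])

# Restricting the factor levels to the search words keeps zero counts
# and drops every other token from the table
counts <- table(factor(words, levels = p2))

print(counts)
print(sum(counts))
```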

Friede · 4 months ago

String processing

If the base R syntax is too complex, I would go for {stringi}

library(stringi)
String = 'Auch ein blindes Huhn findet einmal ein Korn.'
Vector = c('auch', 'ein')
stri_count_regex(tolower(String), sprintf('\\b%s\\b', Vector)) |> 
  setNames(Vector) # optional

auch  ein 
   1    2
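
To get the single total the question asks for, the counts can simply be summed. The sketch below also swaps tolower for stringi's case-insensitive regex option, which is a variation on the answer rather than part of it; the names per_word and total are illustrative, and the stringi package is assumed to be installed.

```r
library(stringi)

String <- "Auch ein blindes Huhn findet einmal ein Korn."
Vector <- c("auch", "ein")

# Match case-insensitively instead of lower-casing the input first
per_word <- stri_count_regex(
  String,
  sprintf("\\b%s\\b", Vector),
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)
names(per_word) <- Vector

# Total number of occurrences across all words
total <- sum(per_word)

print(per_word)
print(total)
```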