
Formatting for jsonlite

performance json r

Sweepy Dodo · Tech Community · 6 months ago

I have a column with hundreds of millions of rows. A sample row:

x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"

# use cat(x) to print

I want to prepare the column so that each row looks like this:

"{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}"

so that it can be passed to the following code:

y <- sprintf("[%s]"             # wrap in square brackets
               , toString(x)
               ) |>
          jsonlite::fromJSON()

Given the untidy (semi-structured) nature of the column, what is the simplest (and most performant) regex, assuming gsub is used, to achieve this?
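For context, here is a minimal sketch of what the sprintf/toString step does once the rows are valid JSON. toString() joins the row strings with ", " and the square brackets turn the result into a JSON array of objects, which jsonlite::fromJSON() simplifies to a data frame. The two strings in `rows` are made-up, already well-formed inputs used only for illustration.

```r
library(jsonlite)

# Hypothetical, already well-formed rows (not taken from the real column)
rows <- c(
  '{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}',
  '{"some field":"12 pears", "y":"a", "url":"https://go.s.com?id=8", "source":"single"}'
)

# toString() collapses the vector with ", "; the brackets make it a JSON array
json_array <- sprintf("[%s]", toString(rows))

# An array of objects is simplified to a data frame, one row per object
y <- fromJSON(json_array)
str(y)
```

So the regex only has to produce one valid JSON object per row; the wrapping and parsing are handled by the existing code.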

Edit:

Adding 2 more rows of real-world data:

library(data.table)

x <- data.table(id = c(1,2,3)
                , string = c('{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}'
                             , '{url=https://www.gov.uk/government/publications/damp-and-mould-understanding-and-addressing-the-health-risks-for-rented-housing-providers/understanding-and-addressing-the-health-risks-of-damp-and-mould-in-the-home--2#:~:text=protect%20tenant%20health.-,Respiratory%20effects,wheeze%20and%20shortness%20of%20breath&text=increased%20risk%20of%20airway%20infections,airways%20with%20the%20fungus%20Aspergillus)&text=development%20or%20worsening%20of%20allergic,obstructive%20pulmonary%20disease%20(%20COPD%20))}'
                             , '{"x": "292", "y": "1029", "url": "https://go.skimresources.com?id=76202X1528716&xs=1&url=https%3A%2F%2Fwww.marksandspencer.com%2Fbuckle-detail-faux-fur-jacket%2Fp%2Fclp60700989%23intid%3Dpid_pg1pip48g4r3c2&sref=https%3A%2F%2Fwww.liverpoolecho.co.uk%2Fwhats-on%2Fshopping%2Fmarks--spencers-best-fur-30733210"}'
                             )
                )

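I would like to end up with a new data frame that I can cbind. The number of columns in this new data frame will equal the number of distinct keys in the JSON column string.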
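A minimal sketch of that end goal, assuming the rows have already been turned into valid JSON: when the objects in the array have different keys, jsonlite::fromJSON() returns one column per distinct key and fills the gaps with NA. The names `ids`, `json_rows`, `parsed` and `result` are made up for this example.

```r
library(jsonlite)

# Hypothetical ids and already-valid JSON rows with partly different keys
ids <- data.frame(id = c(1, 2))
json_rows <- c('{"a":"1", "b":"2"}', '{"b":"3", "c":"4"}')

# One column per distinct key; keys missing from a row become NA
parsed <- fromJSON(sprintf("[%s]", toString(json_rows)))

# Bind the parsed columns next to the original id column
result <- cbind(ids, parsed)
print(result)
```

Collapsing hundreds of millions of rows into a single string may need a lot of memory, so processing the column in chunks is probably worth considering.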

1 reply | 6 months ago

Tim G · 6 months ago

gsub is fairly fast:

# one row
x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 1000000) # repeat for 1 mil rows for testing

reform <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"', 
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}

But stringr::str_replace_all is faster:

library(stringr)

reform_stringr <- function(x) {
  x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])', '\\1\\2')
  str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
}
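One detail worth checking on your own data: in a fragment like "33 , url" the first pattern matches the whitespace before the comma together with the comma, so the space after the comma is not removed, and because the key class allows spaces the key appears to come out as " url" rather than "url". The sketch below compares the two-pass gsub approach with a hypothetical variant, `reform_trim`, that strips whitespace on both sides of each delimiter. The variant is not part of the answer and has not been benchmarked.

```r
library(jsonlite)

# Copy of the two-pass gsub approach
reform <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}

# Hypothetical variant: drop whitespace on both sides of '{', ',' and '}'
reform_trim <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('\\s*([{,}])\\s*', '\\1', x))
}

x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"

# Compare the keys each version produces for the sample row
keys_orig <- names(fromJSON(reform(x)))
keys_trim <- names(fromJSON(reform_trim(x)))
print(keys_orig)
print(keys_trim)
```

Neither version trims whitespace directly after the "=", so the value of y keeps its leading space.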

Benchmark results for 1 million rows:

Unit: seconds
           expr      min       lq     mean   median       uq      max neval cld
         reform 6.168293 6.177294 6.236553 6.211335 6.265154 6.379521    10  a 
 reform_stringr 3.974893 3.990187 3.997749 3.994628 3.997775 4.040163    10   b
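The layout of this output suggests the microbenchmark package, although the answer does not say so. Under that assumption, a timing run could be sketched as follows; `x_small` and `bm` are made-up names, and the input is much smaller than 1 million rows so that it finishes quickly.

```r
library(stringr)
library(microbenchmark)

reform <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}

reform_stringr <- function(x) {
  x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])', '\\1\\2')
  str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
}

# Smaller test vector than in the answer, to keep the run short
x_small <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 10000)

# Time both versions; 'times' is the number of repetitions per expression
bm <- microbenchmark(
  reform = reform(x_small),
  reform_stringr = reform_stringr(x_small),
  times = 5
)
print(bm)
```

Absolute timings depend on the machine and on the R and stringr versions, so only the relative difference is meaningful.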

Use it like this:

y <- sprintf("[%s]", toString(reform_stringr(x))) |>
  jsonlite::fromJSON()

> head(y)

some field	y	url	source
249 apples	m s 33	https://go.s.com?id=7	multiC
249 apples	m s 33	https://go.s.com?id=7	multiC
249 apples	m s 33	https://go.s.com?id=7	multiC
249 apples	m s 33	https://go.s.com?id=7	multiC
249 apples	m s 33	https://go.s.com?id=7	multiC
249 apples	m s 33	https://go.s.com?id=7	multiC
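The extra rows in the question's edit include one that is already JSON. Because its URL contains "id=", the second pattern would rewrite that part of the URL and break the row. A rough sketch of one way around this is to leave rows that already start with '{"' untouched. `reform_guarded` is a hypothetical helper, the start-of-string test is only a heuristic, and the sample rows are shortened for illustration.

```r
reform <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])', '\\1\\2', x))
}

# Hypothetical wrapper: only reformat rows that do not already look like JSON.
# Assumes x contains no NA values.
reform_guarded <- function(x) {
  is_json <- startsWith(x, '{"')
  x[!is_json] <- reform(x[!is_json])
  x
}

rows <- c(
  '{some field=249 apples, source=multiC}',
  '{"x": "292", "url": "https://go.s.com?id=7"}'
)

out <- reform_guarded(rows)
print(out)
```

Values that themselves contain a comma or a closing brace, such as the long gov.uk URL in the edit, are still cut short by the `[^,}]+` part of the pattern and would need separate handling.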