
Formatting for jsonlite

  •  0
  • Sweepy Dodo  · 6 months ago

    I have a column with hundreds of millions of rows. An example row:

    x <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"
    
    # use cat(x) to print
    

    I would like to prepare the column so that each row looks like this:

    "{"some field":"249 apples", "y":" m s 33 ", "url":"https://go.s.com?id=7", "source":"multiC"}"
    

    so that it can be fed into the following code:

    y <- sprintf("[%s]"             # wrap in square brackets
                   , toString(x)
                   ) |>
              jsonlite::fromJSON()
    

    Given the untidy (semi-structured) nature of the column, what is the simplest (and most performant) regex, assuming gsub is used, to achieve this?

    Edit:

    Adding 2 more rows of real-world data:

    library(data.table)

    x <- data.table(id = c(1,2,3)
                    , string = c('{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}'
                                 , '{url=https://www.gov.uk/government/publications/damp-and-mould-understanding-and-addressing-the-health-risks-for-rented-housing-providers/understanding-and-addressing-the-health-risks-of-damp-and-mould-in-the-home--2#:~:text=protect%20tenant%20health.-,Respiratory%20effects,wheeze%20and%20shortness%20of%20breath&text=increased%20risk%20of%20airway%20infections,airways%20with%20the%20fungus%20Aspergillus)&text=development%20or%20worsening%20of%20allergic,obstructive%20pulmonary%20disease%20(%20COPD%20))}'
                                 , '{"x": "292", "y": "1029", "url": "https://go.skimresources.com?id=76202X1528716&xs=1&url=https%3A%2F%2Fwww.marksandspencer.com%2Fbuckle-detail-faux-fur-jacket%2Fp%2Fclp60700989%23intid%3Dpid_pg1pip48g4r3c2&sref=https%3A%2F%2Fwww.liverpoolecho.co.uk%2Fwhats-on%2Fshopping%2Fmarks--spencers-best-fur-30733210"}'
                                 )
                    )
    

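    I would like to end up with a new data frame that I can cbind. The number of columns in this new data frame will equal the number of distinct keys in the JSON column string.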

    1 answer  |  as of 6 months ago
  •  4
  •   Tim G    6 months ago

    gsub is fairly fast:

    # one row
    x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 1000000) # repeat for 1 mil rows for testing
    
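    reform <- function(x) {
      # Inner gsub: drop whitespace after "{" or "," and around "," or "}".
      # The trailing \\s* also removes the space that follows a " , " separator,
      # which would otherwise end up inside the next key (" url").
      # Outer gsub: quote each key=value pair as "key":"value".
      gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"', 
           gsub('([{,])\\s*|\\s*([,}])\\s*', '\\1\\2', x))
    }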
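
    A quick check on a single row can confirm that the rewritten string is what jsonlite expects. This is only a sketch: the helper name reform_check is hypothetical, and its cleanup pattern is a variant with a trailing \\s*, so that the space after a " , " separator is trimmed too instead of landing in the next key.

```r
# Hypothetical helper mirroring the two-pass gsub idea (base R only)
reform_check <- function(x) {
  # Pass 1: trim whitespace after "{" or "," and around "," or "}"
  x <- gsub('([{,])\\s*|\\s*([,}])\\s*', '\\1\\2', x)
  # Pass 2: wrap every key=value pair in double quotes
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"', x)
}

row <- "{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}"
out <- reform_check(row)
cat(out, "\n")
# {"some field":"249 apples","y":" m s 33","url":"https://go.s.com?id=7","source":"multiC"}
```

    Note that the space right after "=" is kept, so the y value still starts with a blank.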
    

    But stringr::str_replace_all is faster:

    library(stringr)
    
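    reform_stringr <- function(x) {
      # Pass 1: drop whitespace after "{" or "," and around "," or "}"
      # (the trailing \\s* keeps a leading space out of the next key)
      x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])\\s*', '\\1\\2')
      # Pass 2: quote each key=value pair as "key":"value"
      str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
    }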
    

    Benchmark results for 1 million rows:

    Unit: seconds
               expr      min       lq     mean   median       uq      max neval cld
             reform 6.168293 6.177294 6.236553 6.211335 6.265154 6.379521    10  a 
     reform_stringr 3.974893 3.990187 3.997749 3.994628 3.997775 4.040163    10   b
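
    The table above has the layout that the microbenchmark package prints, but the answer does not say which tool produced it, so the following is only a guess at how such a comparison could be run. Both functions are restated (with a trailing \\s* in the cleanup pattern) so the block runs on its own, and the input is cut to 10,000 rows, an arbitrary size chosen to keep the sketch quick.

```r
library(stringr)
library(microbenchmark)  # assumption: not named in the answer

reform <- function(x) {
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"',
       gsub('([{,])\\s*|\\s*([,}])\\s*', '\\1\\2', x))
}

reform_stringr <- function(x) {
  x <- str_replace_all(x, '([{,])\\s*|\\s*([,}])\\s*', '\\1\\2')
  str_replace_all(x, '([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"')
}

# Smaller input than the 1 million rows used in the answer
x <- rep("{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}", 10000)

# Both variants should return the same strings before timings are compared
same_output <- identical(reform(x), reform_stringr(x))

bench <- microbenchmark(
  reform         = reform(x),
  reform_stringr = reform_stringr(x),
  times = 10
)
print(bench)
```

    Absolute timings depend on the machine and the row count, so only the relative ordering is worth reading from such a run.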
    

    Use it like this:

    y <- sprintf("[%s]", toString(reform_stringr(x))) |>
      jsonlite::fromJSON()
    
    > head(y)
    
      some field       y                   url source
    1 249 apples  m s 33 https://go.s.com?id=7 multiC
    2 249 apples  m s 33 https://go.s.com?id=7 multiC
    3 249 apples  m s 33 https://go.s.com?id=7 multiC
    4 249 apples  m s 33 https://go.s.com?id=7 multiC
    5 249 apples  m s 33 https://go.s.com?id=7 multiC
    6 249 apples  m s 33 https://go.s.com?id=7 multiC
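
    The edit in the question adds rows that are already valid JSON and rows with different keys. A minimal sketch of one way to cope with that mix, under stated assumptions: a row starting with {" is treated as JSON and left untouched (a heuristic, not a validation), the helper name to_json_text is hypothetical, and the second row is a shortened stand-in for the real one. Values that contain commas or "}", such as the long gov.uk URL in the question, are not handled by this regex and would need a different approach.

```r
library(jsonlite)

# Hypothetical helper: two-pass rewrite of key=value rows into JSON text
to_json_text <- function(x) {
  x <- gsub('([{,])\\s*|\\s*([,}])\\s*', '\\1\\2', x)
  gsub('([a-zA-Z0-9_ ]+)=([^,}]+)', '"\\1":"\\2"', x)
}

rows <- c(
  '{some field=249 apples, y= m s 33 , url=https://go.s.com?id=7, source=multiC}',
  '{"x": "292", "y": "1029", "url": "https://example.com/a"}'  # shortened stand-in
)

# Assumption: rows that already start with {" are JSON and are skipped
is_json <- startsWith(rows, '{"')
rows[!is_json] <- to_json_text(rows[!is_json])

# Parse all rows as one JSON array; keys missing from a row become NA
y <- fromJSON(sprintf("[%s]", toString(rows)))
print(y)
```

    The result has one column per distinct key across all rows, which is the shape the question asks for before the cbind.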