Problem parsing adjacent blocks of tags with scalpel

  • wizzup  · 6 years ago

    I am having trouble using scalpel.

    Given the following HTML snippet, stored in testS :: String:

    <body>
      <h2>Apple</h2>
      <p>I Like Apple</p>
      <p>Do you like Apple?</p>
    
      <h2>Banana</h2>
      <p>I Like Banana</p>
      <p>Do you like Banana?</p>
    
      <h2>Carrot</h2>
      <p>I Like Carrot</p>
      <p>Do you like Carrot?</p>
    </body>
    
    

    I want to parse each group of an h2 followed by its p tags into a Block.

    {-# LANGUAGE OverloadedStrings #-}
    
    import Control.Monad
    import Text.HTML.Scalpel
    
    data Block = B String String String
      deriving Show
    
    block :: Scraper String Block
    block = do
      h  <- text $ "h2"
      pa <- text $ "p"
      pb <- text $ "p"
      return $ B h pa pb
    
    blocks :: Scraper String [Block]
    blocks = chroot "body" $ replicateM 3 block
    
    

    But the scraping result is not what I want. It seems to keep capturing the first block over and over and never consumes it.

    λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
    B "Apple" "I Like Apple" "I Like Apple"
    B "Apple" "I Like Apple" "I Like Apple"
    B "Apple" "I Like Apple" "I Like Apple"
    

    Expected output:

    B "Apple" "I Like Apple" "Do you like Apple?"
    B "Banana" "I Like Banana" "Do you like Banana?"
    B "Carrot" "I Like Carrot" "Do you like Carrot?"
    

    How can I make this work?

  •   trevor cook    6 years ago

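    First, I apologize for proposing a solution without testing it or knowing scalpel (how presumptuous). Let me make it up to you; here is my attempt at a complete rewrite.

    First, this monstrosity works.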

    blocks :: Scraper String [Block]
    blocks = chroot "body" $ do
      hs <- texts "h2"
      ps <- texts "p"
      return $ combine hs ps
      where
        combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
        combine _ _ = []
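
    To see the pairing logic in isolation, here is a minimal sketch that runs combine on plain lists, with no scalpel involved. The sample strings and the malformed second input are made up for illustration; the point is that combine relies purely on ordering, so one extra paragraph shifts every later block.

```haskell
-- Stand-alone sketch: the Block type from the question and the combine
-- helper from the answer, lifted to the top level so it can be run directly.
data Block = B String String String
  deriving (Show, Eq)

-- Pair each heading with the next two paragraphs; stop when either list
-- runs out.
combine :: [String] -> [String] -> [Block]
combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
combine _ _ = []

main :: IO ()
main = do
  -- Well-formed input: exactly two paragraphs per heading.
  mapM_ print (combine ["Apple", "Banana"] ["a1", "a2", "b1", "b2"])
  -- Hypothetical malformed input: "Apple" has three paragraphs, so "a3" is
  -- attributed to "Banana" and "b2" is silently dropped.
  mapM_ print (combine ["Apple", "Banana"] ["a1", "a2", "a3", "b1", "b2"])
```

    This is why the approach only holds up when every heading is known to have exactly two paragraphs.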
    

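    I call it a monstrosity because it flattens the document structure with two texts calls and then has to rebuild it through the ordering in combine. Things are simpler when each block is wrapped in a <div>: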

    testS' :: String
    testS'= unlines [ "<body>",
                  "<div>",
                  "  <h2>Apple</h2>",
                  "  <p>I Like Apple</p>",
                  "  <p>Do you like Apple?</p>",
                  "</div>",
                  "",
                  "<div>",
                  "  <h2>Banana</h2>",
                  "  <p>I Like Banana</p>",
                  "  <p>Do you like Banana?</p>",
                  "",
                  "</div>",
                  "<div>",
                  "  <h2>Carrot</h2>",
                  "  <p>I Like Carrot</p>",
                  "  <p>Do you like Carrot?</p>",
                  "</div>",
                  "</body>"
                  ]
    

    block' :: Scraper String Block
    block' = do
      h  <- text $ "h2"
      [pa,pb] <- texts $ "p"
      return $ B h pa pb
    
    blocks' :: Scraper String [Block]
    blocks' = chroots ("body" // "div") $ block'
    

    This yields:

    B "Apple" "I Like Apple" "Do you like Apple?"
    B "Banana" "I Like Banana" "Do you like Banana?"
    B "Carrot" "I Like Carrot" "Do you like Carrot?"
    

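    Regarding >>= and combine:

    My combine is local to the where clause of blocks. The one you may have seen in the definition of >>= is, by the way, also a locally defined function, with the slightly different name combined. Even if they had the same name it would not matter, because each is only in scope within its own function.

    As for >>=: going by the observed behavior, every scrape starts from the beginning of the currently selected tags. So in your definition of block, chroot "body" selects all the tags in the body, text "h2" matches the first <h2>, and the following two text "p" calls both match the first <p>. Note that in my <div>-based version I can use the pattern [pa,pb] <- texts "p" because I am expecting exactly two <p> tags in each <div>.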
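
    The "every scrape starts from the beginning" behavior can be illustrated with a toy model. This is only a sketch: Toy, toyText and toyTexts are hypothetical names, and the model assumes a scraper is just a function from the selected tags to a possible result. It is not scalpel's actual implementation.

```haskell
-- Toy model (hypothetical, not scalpel's real types): a scraper is a
-- function from the currently selected (tagName, text) pairs to a result.
newtype Toy a = Toy { runToy :: [(String, String)] -> Maybe a }

instance Functor Toy where
  fmap f (Toy g) = Toy (fmap f . g)

instance Applicative Toy where
  pure x = Toy (const (Just x))
  Toy f <*> Toy g = Toy $ \tags -> f tags <*> g tags

instance Monad Toy where
  Toy g >>= f = Toy $ \tags -> case g tags of
    Nothing -> Nothing
    -- The same, unconsumed tags are handed to the next scraper.
    Just a  -> runToy (f a) tags

-- Text of the first tag with the given name (modelled on `text`).
toyText :: String -> Toy String
toyText name = Toy (lookup name)

-- Texts of all tags with the given name (modelled on `texts`).
toyTexts :: String -> Toy [String]
toyTexts name = Toy $ \tags -> Just [t | (n, t) <- tags, n == name]

body :: [(String, String)]
body = [("h2", "Apple"), ("p", "I Like Apple"), ("p", "Do you like Apple?")]

-- Two sequential lookups of "p": both see the whole body again.
twice :: Toy (String, String)
twice = do
  a <- toyText "p"
  b <- toyText "p"
  return (a, b)

main :: IO ()
main = do
  print (runToy twice body)
  print (runToy (toyTexts "p") body)
```

    In this model nothing is ever consumed between binds, so the first paragraph comes back twice, which matches the output reported in the question. Only a multi-result scraper such as the texts analogue sees the second paragraph.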

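    Finally, it clicked for me when I saw that this behavior is built on TagSoup. (That is presumably also why it is called tag soup.) Every scrape is like dipping a spoon into an unordered soup of tags. The selector makes the soup, and the scraper is your spoon. Hope that helps.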

  •   fimad    6 years ago

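    As of version 0.6.0, scalpel supports this through SerialScrapers. SerialScrapers let you focus on one child of the current root at a time, and they expose an API for moving the focus and executing Scrapers on the currently focused node.

    Adapting the example code from the documentation to your HTML gives: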

    -- Copyright 2019 Google LLC.
    -- SPDX-License-Identifier: Apache-2.0
    
    -- Chroot to the body tag and start a SerialScraper context with inSerial.
    -- This will allow for focusing each child of body.
    --
    -- Many applies the subsequent logic repeatedly until it no longer matches 
    -- and returns the results as a list.
    chroot "body" $ inSerial $ many $ do
       -- Move the focus forward until text can be extracted from an h2 tag.
       title <- seekNext $ text "h2"
       -- Create a new SerialScraper context that contains just the tags between
       -- the current focus and the next h2 tag. Then until the end of this new 
       -- context, move the focus forward to the next p tag and extract its text.
       ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
       return (title, ps)
    

    This returns:

    [
      ("Apple", ["I Like Apple", "Do you like Apple?"]),
      ("Banana", ["I Like Banana", "Do you like Banana?"]),
      ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
    ]
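
    The (title, paragraphs) pairs above can be turned into the Block type from the question with a small pure helper. This is a sketch: toBlock is a hypothetical name, and the scraped list is hard-coded here instead of coming from scalpel. Any heading that does not have exactly two paragraphs is skipped.

```haskell
import Data.Maybe (mapMaybe)

-- The Block type from the question.
data Block = B String String String
  deriving (Show, Eq)

-- Hypothetical helper: accept only headings followed by exactly two
-- paragraphs, and reject anything else.
toBlock :: (String, [String]) -> Maybe Block
toBlock (h, [pa, pb]) = Just (B h pa pb)
toBlock _             = Nothing

-- Stand-in for the result of the SerialScraper, hard-coded for this sketch.
scraped :: [(String, [String])]
scraped =
  [ ("Apple",  ["I Like Apple", "Do you like Apple?"])
  , ("Banana", ["I Like Banana", "Do you like Banana?"])
  , ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
  ]

main :: IO ()
main = mapM_ print (mapMaybe toBlock scraped)
```

    Returning Maybe keeps a malformed group from silently shifting the later blocks, at the cost of dropping that group.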
    