代码之家  ›  专栏  ›  技术社区  ›  max

使用python在包含给定单词的标记之间提取文本

  •  0
  • max  · 技术社区  · 6 年前

    我有一些XML文档中的文本,我正试图从其中提取包含特定单词的标记中的文本。

    例如如下:

    search('adverse')
    

    应返回包含单词“不利”的所有标记的文本

    Out: 
      [
        "<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
      ]
    

    search('clinical')

    应该返回两个结果,因为两个标记包含这些单词。

    Out: 
      [
        "<title>6.1 Clinical Trials Experience</title>", 
        "<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
      ]
    

    我应该使用什么工具?正则表达式?BS4?任何建议都非常感谢。


    示例文本:

     </highlight>
     </excerpt>
     <component>
     <section id="ID40">
     <id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
     <title>6.1 Clinical Trials Experience</title>
     <text>
     <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
     <list id="ID42" listtype="unordered" stylecode="Disc">
     <item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>
    
    1 回复  |  直到 6 年前
        1
  •  1
  •   olinox14    6 年前

    您可以使用regex对其进行硬编码,也可以使用类似于 lxml

    使用一个regex,它将是:

    import re
    
    your_text = "(...)"
    
    def search(instr):
        return re.findall(r"<.+>.*{}.*<.+>".format(instr), your_text, re.MULTILINE)
    
    print(search("safety"))