
Extracting the text between tags that contain a given word, using Python

nlp xml python

max · Tech Community · 6 years ago

I have some text from an XML document, and I am trying to extract the text of every tag that contains a specific word.

For example:

search('adverse')

should return the text of all tags that contain the word "adverse":

Out: 
  [
    "<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
  ]

And search('clinical')

should return two results, because two tags contain that word.

Out: 
  [
    "<title>6.1 Clinical Trials Experience</title>", 
    "<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
  ]

Which tool should I use? A regular expression? BS4? Any suggestions are much appreciated.

Sample text:

 </highlight>
 </excerpt>
 <component>
 <section id="ID40">
 <id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
 <title>6.1 Clinical Trials Experience</title>
 <text>
 <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
 <list id="ID42" listtype="unordered" stylecode="Disc">
 <item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>

1 reply | as of 6 years ago

olinox14 6 years ago

You can hard-code it with a regex, or use a library such as lxml.

With a regex, it would be:

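import re

your_text = "(...)"  # the XML text to search

def search(instr):
    # Match one opening tag, tag-free text containing the word, and one closing tag.
    # re.escape keeps regex metacharacters in the word literal; re.IGNORECASE lets
    # "clinical" also match "Clinical".
    pattern = r"<[^<>/][^<>]*>[^<>]*{}[^<>]*</[^<>]+>".format(re.escape(instr))
    return re.findall(pattern, your_text, re.IGNORECASE)

print(search("safety"))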
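For the parser route, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than lxml (a swapped-in substitute so that nothing needs installing; lxml.etree offers a largely compatible API). It assumes the input is a complete, well-formed document. The fragment in the question starts with closing tags, so a strict parser would reject it as posted. SAMPLE below is a shortened, hypothetical stand-in for the real file, and the xml_text parameter is an illustrative addition.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the document in the question (assumption).
SAMPLE = """<section id="ID40">
  <title>6.1 Clinical Trials Experience</title>
  <text>
    <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin have been evaluated.</paragraph>
    <list id="ID42">
      <item>The most common adverse reactions were impotence and dizziness.</item>
    </list>
  </text>
</section>"""


def search(word, xml_text=SAMPLE):
    """Return the serialized form of every element whose own text contains word."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    matches = []
    for elem in root.iter():
        # elem.text holds only the text directly inside the tag,
        # up to its first child element (None if there is none).
        own_text = elem.text or ""
        if needle in own_text.lower():
            # Temporarily drop the tail so the whitespace that follows
            # the closing tag is not included in the serialized output.
            tail, elem.tail = elem.tail, None
            matches.append(ET.tostring(elem, encoding="unicode"))
            elem.tail = tail
    return matches


print(search("clinical"))
```

Because only each element's own text is checked, a container such as the section element is not returned just because one of its descendants mentions the word. The comparison is case-insensitive, so 'clinical' also finds the title.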
