
Selecting the entire text of following nodes that have child nodes, using an XPath query in Python

lxml html-parsing xpath python-3.x python

mehrdadep · 7 years ago

I want to extract the text that follows an a tag, using XPath with lxml.

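from lxml.html import fromstring

# root is assumed to be the parsed page, e.g. root = fromstring(page_source)
reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    # Returns only the text nodes that are direct siblings of <a>
    title = tree.xpath('a/following-sibling::text()')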

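This works for the first of the two HTML fragments below, but not for the second, where the trailing text contains a child element: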

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633                     
    </td>

</tr>

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow               
    </td>

</tr>

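For the second fragment, the text nodes returned amount to:

SECUNIA 27633 tomorrow

but the result I want is:

SECUNIA 27633 Release Date: tomorrow

Using node() instead of text() returns all of the sibling nodes, elements included, so I could build the final string from them in a for loop: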

title = tree.xpath('a/following-sibling::node()')
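
A minimal sketch of that loop, assuming the second HTML fragment from the question wrapped in a table with id vulnrefstable. The variable names `html`, `td`, and `parts` are illustrative and not from the original post. In lxml, text nodes returned by XPath are str subclasses, while elements need `text_content()` to yield their text:

```python
from lxml.html import fromstring

# Assumed sample input: the second fragment from the question
html = """<html><body><table id="vulnrefstable">
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
</tr>
</table></body></html>"""

root = fromstring(html)
td = root.xpath("//table[@id='vulnrefstable']/tr/td")[0]

parts = []
for node in td.xpath('a/following-sibling::node()'):
    # Text nodes are str subclasses; elements expose their text via text_content()
    text = node if isinstance(node, str) else node.text_content()
    if text.strip():
        parts.append(text.strip())

title = " ".join(parts)
print(title)
```

This keeps the text inside the i element, which `following-sibling::text()` skips because that text is not a direct sibling of the a tag.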

2 answers | latest 6 years ago

Andersson · 7 years ago

for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    title = " ".join([node.strip() for node in tree.xpath('.//text()[not(parent::a)]') if node.strip()])
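
As a rough check of the query above, the sketch below runs it against both fragments from the question. The surrounding table markup and the names `html` and `results` are assumptions added for illustration. The expression `.//text()` selects every descendant text node of the cell, and `not(parent::a)` drops the link text:

```python
from lxml.html import fromstring

# Assumed sample input: both fragments from the question in one table
html = """<html><body><table id="vulnrefstable">
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633
    </td>
</tr>
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
</tr>
</table></body></html>"""

root = fromstring(html)
results = []
for tree in root.xpath("//table[@id='vulnrefstable']/tr/td"):
    a_tag = tree.xpath('a/@href')[0]
    # All descendant text nodes except those directly inside <a>,
    # with whitespace-only nodes filtered out
    title = " ".join(
        node.strip()
        for node in tree.xpath('.//text()[not(parent::a)]')
        if node.strip()
    )
    results.append((a_tag, title))

for link, title in results:
    print(link, "|", title)
```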

user859652 · 6 years ago

reference_list = {'title': [], 'link': []}
reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
for tree in reference_titles:
    reference_list['link'].append(str(tree.xpath('a/@href')[0]))
    # Keep every text node except those directly inside <strong> or <a>
    texts = tree.xpath('.//text()[not(parent::strong) and not(parent::a)]')
    # Skip whitespace-only nodes so the joined title has no stray spaces
    reference_list['title'].append(
        " ".join(node.strip() for node in texts if node.strip()))
