
Selecting the entire text of following nodes that have child nodes, using an XPath query in Python

  •  2
  • mehrdadep  ·  7 years ago

    I want to extract the text that follows an a tag, using XPath with lxml:

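    from lxml.html import fromstring

    # page_source is assumed to hold the page's HTML as a string
    root = fromstring(page_source)

    reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
    for tree in reference_titles:
        a_tag = tree.xpath('a/@href')[0]  # link target
        title = tree.xpath('a/following-sibling::text()')  # text nodes after the link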
    

    This works for this HTML:

    <tr>
    
        <td class="r_average">
    
            <a href="http://somelink.com" target="_blank" title="External url">
                http://somelink.com
            </a>
            <br/> SECUNIA 27633                     
        </td>
    
    </tr>

    But not for this one, where the text after the link contains a child element:

    <tr>
    
        <td class="r_average">
    
            <a href="http://somelink.com" target="_blank" title="External url">
                http://somelink.com
            </a>
            <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow               
        </td>
    
    </tr>
    

    For the second row I get:

    SECUNIA 27633 tomorrow

    but I want:

    SECUNIA 27633 Release Date: tomorrow
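The gap between those two strings can be reproduced with a short, self-contained sketch. The markup below is only modelled on the second row of the question (wrapped in a table with the id the query expects), and the variable names are illustrative. It suggests why the text inside the i element is lost: `following-sibling::text()` matches only text nodes that are themselves siblings of the link, not text nested in a sibling element.

```python
from lxml.html import fromstring

# Sample markup modelled on the second row in the question (an assumption,
# not the real page).
HTML = """
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633 <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
"""

root = fromstring(HTML)
td = root.xpath("//table[@id='vulnrefstable']/tr/td")[0]

# Only text nodes that are direct siblings of <a> are matched; the text
# inside <i> is a child of <i>, so it is not part of the result.
siblings = td.xpath("a/following-sibling::text()")
joined = " ".join(s.strip() for s in siblings if s.strip())
print(joined)
```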


    Using node() instead of text() returns all of the nodes after the link, so I could build the final string from them with a for loop:

    title = tree.xpath('a/following-sibling::node()')
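As a rough illustration of that loop, here is a minimal sketch. It assumes the sample markup below (modelled on the question's second row) and relies on lxml returning text nodes as string objects and elements as element objects from `node()`; the names `parts` and `title` are illustrative.

```python
from lxml.html import fromstring

# Sample markup modelled on the second row in the question (an assumption).
HTML = """
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633 <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
"""

root = fromstring(HTML)
td = root.xpath("//table[@id='vulnrefstable']/tr/td")[0]

parts = []
for node in td.xpath("a/following-sibling::node()"):
    if isinstance(node, str):
        text = node                 # a bare text node
    else:
        text = node.text_content()  # an element such as <i>; <br/> gives ""
    text = text.strip()
    if text:
        parts.append(text)

title = " ".join(parts)
print(title)
```

This works, but it needs a type check per node, which is the kind of bookkeeping a single XPath expression over text nodes can avoid.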
    

    2 replies  |  last activity 6 years ago
        1
  •  1
  •   Andersson    7 years ago

    for tree in reference_titles:
        a_tag = tree.xpath('a/@href')[0]
        title = " ".join([node.strip() for node in tree.xpath('.//text()[not(parent::a)]') if node.strip()])
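To see what this expression selects, here is a self-contained sketch run against both rows from the question. The table wrapper and the `results` list are assumptions added for illustration. `.//text()[not(parent::a)]` collects every descendant text node of the cell except those whose parent is the link, so text nested in elements such as i is kept.

```python
from lxml.html import fromstring

# Both rows from the question, wrapped in the table the query expects
# (the wrapper is an assumption for this example).
HTML = """
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633
    </td>
  </tr>
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633 <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
"""

root = fromstring(HTML)

results = []
for td in root.xpath("//table[@id='vulnrefstable']/tr/td"):
    href = td.xpath("a/@href")[0]
    # All descendant text nodes of the cell, except the link's own text.
    texts = td.xpath(".//text()[not(parent::a)]")
    # Strip each node and drop whitespace-only ones before joining.
    title = " ".join(t.strip() for t in texts if t.strip())
    results.append((href, title))

for href, title in results:
    print(href, "|", title)
```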
    
        2
  •  1
  •   user859652    6 years ago

    reference_list = {'title': [], 'link': []}
    reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
    for tree in reference_titles:
        reference_list['link'].append(str(tree.xpath('a/@href')[0]))
        # Skip text inside <a> and <strong>; drop whitespace-only nodes
        texts = tree.xpath('.//text()[not(parent::strong) and not(parent::a)]')
        reference_list['title'].append(
            " ".join(node.strip() for node in texts if node.strip()))