
Python web scraping: extracting an attribute with multiple tags

  •  1
  • Duccio A  · Tech Community  · 8 years ago

    I am trying to scrape data from my account on an online bookmarking service. The page with the bookmarks is organized as follows:

    <!DOCTYPE html>
    <html lang="en">
    <body>
    <div id="item1" class="outer_block">
        <div class="title">Bookmark 1</div>
        <div class="link">
            <a href="https://bookmark1.com">https://bookmark1.com</a>
        </div>
        <div class="tags">
            <a href="http://mylink.com/tag1">tag1</a>
            <a href="http://mylink.com/tag2">tag2</a>
        </div>
    </div>
    <div id="item2" class="outer_block">
        <div class="title">Bookmark 2</div>
        <div class="link">
            <a href="https://bookmark2.com">https://bookmark2.com</a>
        </div>
        <div class="tags">
            <a href="http://mylink.com/tag1">tag1</a>
        </div>
    </div>
    <div id="item3" class="outer_block">
        <div class="title">Bookmark 3</div>
        <div class="link">
            <a href="https://bookmark3.com">https://bookmark3.com</a>
        </div>
        <div class="tags">
            <a href="http://mylink.com/tag3">tag3</a>
        </div>
    </div>
    </body>
    </html>
    

    # Import modules
    import requests
    from lxml import html
    
    # Read the html
    # url = 'mylink'
    # page = requests.get(url)
    # tree = html.fromstring(page.content)
    # This is the replicable example
    tree = html.fromstring('<!DOCTYPE html><html lang="en"><body><div id="item1" class="outer_block"> <div class="title">Bookmark 1</div> <div class="link"> <a href="https://bookmark1.com">https://bookmark1.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> <a href="http://mylink.com/tag2">tag2</a> </div></div><div id="item2" class="outer_block"> <div class="title">Bookmark 2</div> <div class="link"> <a href="https://bookmark2.com">https://bookmark2.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> </div></div><div id="item3" class="outer_block"> <div class="title">Bookmark 3</div> <div class="link"> <a href="https://bookmark3.com">https://bookmark3.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag3">tag3</a> </div></div></body></html>')
    

    I use xpath to extract the titles:

    titles = tree.xpath('//div[@class="title"]/text()')
    print(titles)
    

    ['Bookmark 1', 'Bookmark 2', 'Bookmark 3']

    To extract the tags, I used the same principle:

    tags = tree.xpath('//div[@class="tags"]//a/text()')
    print(tags)
    

    ['tag1', 'tag2', 'tag1', 'tag3']

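    The problem is that I can no longer match the array titles with the array tags, because the tags come back as one flat list. So I tried selecting the outer blocks first: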

    blocks = tree.xpath('//div[@class="outer_block"]')
    block1 = blocks[0]
    

    What I don't understand is that when I extract the tags from block1, I get the tags of every block:

    tags_block1 = block1.xpath('//div[@class="tags"]//a/text()')
    print(tags_block1)
    

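    How can I extract the titles together with their corresponding tags, what is the best output format, and are there other packages that make this easier?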

    2 answers  |  as of 8 years ago
        1
  •  1
  •   Mike R    8 years ago

    You should consider using BeautifulSoup. Consider the following code (source is the HTML as a string):

    from bs4 import BeautifulSoup 
    
    soup = BeautifulSoup(source, "html.parser")
    outer_blocks = soup.find_all("div", class_="outer_block")
    
    for block in outer_blocks:
        title = block.find("div", class_="title").contents[0]
        link = block.find("a").contents[0]
        tags = [x.contents[0] for x in block.find("div", class_="tags").find_all("a")]
        print([title, link, tags])
    

    ['Bookmark 1', 'https://bookmark1.com', ['tag1', 'tag2']]
    ['Bookmark 2', 'https://bookmark2.com', ['tag1']]
    ['Bookmark 3', 'https://bookmark3.com', ['tag3']]
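
    For comparison, the same grouping can probably be done with lxml alone. The sketch below assumes the page structure from the question, shortened to two bookmarks. The key change is the leading "." in each XPath expression, which keeps the query relative to the current block; a bare "//" starts again from the document root even when xpath() is called on a sub-element. The list of dicts is only one reasonable output format and is not taken from the answer above.

```python
from lxml import html

# Same structure as the page in the question, shortened to two bookmarks.
source = """
<html><body>
<div id="item1" class="outer_block">
    <div class="title">Bookmark 1</div>
    <div class="link"><a href="https://bookmark1.com">https://bookmark1.com</a></div>
    <div class="tags">
        <a href="http://mylink.com/tag1">tag1</a>
        <a href="http://mylink.com/tag2">tag2</a>
    </div>
</div>
<div id="item2" class="outer_block">
    <div class="title">Bookmark 2</div>
    <div class="link"><a href="https://bookmark2.com">https://bookmark2.com</a></div>
    <div class="tags"><a href="http://mylink.com/tag1">tag1</a></div>
</div>
</body></html>
"""

tree = html.fromstring(source)

bookmarks = []
for block in tree.xpath('//div[@class="outer_block"]'):
    # The leading "." restricts each query to descendants of this block.
    bookmarks.append({
        "title": block.xpath('string(.//div[@class="title"])').strip(),
        "link": block.xpath('string(.//div[@class="link"]/a/@href)'),
        "tags": block.xpath('.//div[@class="tags"]//a/text()'),
    })

for bookmark in bookmarks:
    print(bookmark)
```

    Each dict keeps a title next to its own tags, so the result can be passed to json.dumps or loaded into a table if needed.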
    
        2
  •  0
  •   Mohsen Fard    5 years ago

    description = tree.xpath("//div[@class='details-content'][@itemprop='description']/text()")
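
    The expression above chains two predicates, so a div is selected only if both attributes match. A minimal sketch of how that behaves follows; the markup is hypothetical, and only the class and itemprop values are taken from the line above.

```python
from lxml import html

# Hypothetical markup: three divs, only the first satisfies both predicates.
snippet = """
<html><body>
    <div class="details-content" itemprop="description">Wanted text</div>
    <div class="details-content" itemprop="name">Other text</div>
    <div class="summary" itemprop="description">Also other text</div>
</body></html>
"""

tree = html.fromstring(snippet)

# Two predicates in a row act as a logical AND on the same element.
description = tree.xpath(
    "//div[@class='details-content'][@itemprop='description']/text()"
)
print(description)
```

    For attribute tests like these, writing a single predicate with "and" ([@class='details-content' and @itemprop='description']) should select the same nodes.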