
Unable to remove duplicate results using set

web-scraping python-3.x python

SIM · 7 years ago

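I've written a script that scrapes the links listed under England on a webpage and then, using those links, scrapes the next-page link once it reaches each inner page. I know that if I fixed the XPath expressions used in the script, I might get unique next-page URLs.

However, what I want to understand here is why the script still produces duplicates even though I'm using set().

My script: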

import requests
from lxml.html import fromstring
from urllib.parse import urljoin

link = "http://tennishub.co.uk/"

processed_links = set()
processed_nextpage_links = set()

def get_links(url):
    response = requests.get(url)
    tree = fromstring(response.text)

    unprocessed_links = [urljoin(link,item.xpath('.//a/@href')[0]) for item in tree.xpath('//*[@class="countylist"]')]
    for nlink in unprocessed_links:
        if nlink not in processed_links:
            processed_links.add(nlink)
    get_nextpage_links(processed_links)

def get_nextpage_links(itemlinks):
    for ilink in itemlinks:
        response = requests.get(ilink)
        tree = fromstring(response.text)
        titles = [title.xpath('.//a/@href')[0] for title in tree.xpath('//div[@class="pagination"]') if title.xpath('.//a/@href')]
        for ititle in titles:
            if ititle not in processed_nextpage_links:
                processed_nextpage_links.add(ititle)

        for rlink in processed_nextpage_links:
            print(rlink)

if __name__ == '__main__':
    get_links(link)

The result I'm getting:

/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2
/tennis-clubs-by-county/Cheshire/2
/tennis-clubs-by-county/Derbyshire/2
/tennis-clubs-by-county/Durham/2

3 answers | last active 7 years ago

tripleee · 7 years ago

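You are printing all the links collected so far on every pass through the loop in get_nextpage_links.

You probably want to remove the print entirely and just print the collection once you are done, preferably outside of any def (this keeps your functions reusable and defers any external side effects to the calling code).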
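To make the effect concrete, here is a minimal offline sketch. The `fake_pages` dict and both helper functions are made up for illustration (they stand in for the HTTP requests and XPath lookups); only the three link strings come from the question's output. It shows that the set itself never holds duplicates, while printing it inside the loop repeats earlier entries.

```python
# Hypothetical stand-in for the scraped pages: each county maps to the
# next-page hrefs found on its page. No network access is needed.
fake_pages = {
    "Cheshire": ["/tennis-clubs-by-county/Cheshire/2"],
    "Derbyshire": ["/tennis-clubs-by-county/Derbyshire/2"],
    "Durham": ["/tennis-clubs-by-county/Durham/2"],
}

def collect_noisy(pages):
    """Mimics the question: emits the whole set on every iteration."""
    seen = set()
    emitted = []  # records what the inner print loop would have printed
    for hrefs in pages.values():
        seen.update(hrefs)
        emitted.extend(sorted(seen))
    return seen, emitted

def collect_quiet(pages):
    """Only collects; printing is left to the caller."""
    seen = set()
    for hrefs in pages.values():
        seen.update(hrefs)
    return seen

seen, emitted = collect_noisy(fake_pages)
print(len(seen), len(emitted))  # 3 6

# Printing once, after collection, shows each link exactly one time.
for href in sorted(collect_quiet(fake_pages)):
    print(href)
```

The set holds three links, but the in-loop printing emits 1 + 2 + 3 = 6 lines, which is the same pattern as the repeated output in the question.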

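A better solution without global variables is probably to have get_links collect a set and return it, pass a reference to that set to get_nextpage_links whenever you call it, and (obviously) have it add any new links.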
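One way this could look, as a rough sketch rather than a drop-in replacement: the `extract_hrefs` parameter, the `fake_site` dict, and the inner-page paths in it are assumptions introduced here so the example runs without network access. In the real script `extract_hrefs` would wrap requests.get and the pagination XPath.

```python
def get_nextpage_links(item_links, extract_hrefs, seen=None):
    """Add the next-page links of every item page to `seen` and return it.

    `extract_hrefs` is a hypothetical callable (url -> iterable of hrefs).
    `seen` is the set passed in by the caller; a new one is created if omitted.
    """
    if seen is None:
        seen = set()
    for url in item_links:
        seen.update(extract_hrefs(url))  # duplicates are dropped by the set
    return seen

# Made-up offline data standing in for the site's inner pages.
fake_site = {
    "/tennis-clubs-by-county/Cheshire": ["/tennis-clubs-by-county/Cheshire/2"],
    "/tennis-clubs-by-county/Durham": ["/tennis-clubs-by-county/Durham/2",
                                       "/tennis-clubs-by-county/Durham/2"],
}

if __name__ == "__main__":
    # No globals: the caller owns the set and decides when to print.
    links = get_nextpage_links(fake_site, fake_site.get)
    for href in sorted(links):
        print(href)
```

Because the set is passed in and returned, the same collection can be reused across several calls, and the only side effect (printing) lives in the calling code.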

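Because you are using a set, there is no need to check whether a link is already in the set before adding it. It is not possible to add a duplicate to this data type.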
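A quick illustration of that point, using link strings taken from the question's output (the variable name `links` is arbitrary):

```python
links = set()
links.add("/tennis-clubs-by-county/Durham/2")
links.add("/tennis-clubs-by-county/Durham/2")  # no error, and no second copy
print(len(links))  # 1

# update() adds several items at once and likewise ignores ones already present.
links.update([
    "/tennis-clubs-by-county/Durham/2",
    "/tennis-clubs-by-county/Cheshire/2",
])
print(sorted(links))
# ['/tennis-clubs-by-county/Cheshire/2', '/tennis-clubs-by-county/Durham/2']
```

So the `if nlink not in processed_links:` and `if ititle not in processed_nextpage_links:` guards in the question are harmless but redundant.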

SIM · 7 years ago

I should have used set() in a slightly different way, as in the script below. It should now produce unique links.

import requests
from lxml.html import fromstring
from urllib.parse import urljoin

link = "http://tennishub.co.uk/"

def get_links(url):
    response = requests.get(url)
    tree = fromstring(response.text)
    crude_links = set([urljoin(link,item) for item in tree.xpath('//*[@class="countylist"]//a/@href') if item])
    return crude_links

def get_nextpage(link):
    response = requests.get(link)
    tree = fromstring(response.text)
    titles = set([title for title in tree.xpath('//div[@class="pagination"]//a/@href') if title])
    return titles

if __name__ == '__main__':
    for next_page in get_links(link):
        for unique_link in get_nextpage(next_page):
            print(unique_link)

ycx · 7 years ago

        for rlink in processed_nextpage_links:
            print(rlink)
