Unable to remove duplicate results using set()

  •  1
  • SIM  · Tech Community  · 7 years ago

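    My script first collects the county links under England on the landing page, then follows those links; when it reaches an inner page, it scrapes the next-page link. I know that fixing the XPath expressions used in my script would probably give me unique next-page URLs.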

    However, I can't get rid of the duplicate results using set().

    My script:

    import requests
    from lxml.html import fromstring
    from urllib.parse import urljoin
    
    link = "http://tennishub.co.uk/"
    
    processed_links = set()
    processed_nextpage_links = set()
    
    def get_links(url):
        response = requests.get(url)
        tree = fromstring(response.text)
    
        unprocessed_links = [urljoin(link,item.xpath('.//a/@href')[0]) for item in tree.xpath('//*[@class="countylist"]')]
        for nlink in unprocessed_links:
            if nlink not in processed_links:
                processed_links.add(nlink)
        get_nextpage_links(processed_links)
    
    def get_nextpage_links(itemlinks):
        for ilink in itemlinks:
            response = requests.get(ilink)
            tree = fromstring(response.text)
            titles = [title.xpath('.//a/@href')[0] for title in tree.xpath('//div[@class="pagination"]') if title.xpath('.//a/@href')]
            for ititle in titles:
                if ititle not in processed_nextpage_links:
                    processed_nextpage_links.add(ititle)
    
            for rlink in processed_nextpage_links:
                print(rlink)
    
    if __name__ == '__main__':
        get_links(link)
    

    The result I'm getting:

    /tennis-clubs-by-county/Durham/2
    /tennis-clubs-by-county/Durham/2
    /tennis-clubs-by-county/Durham/2
    /tennis-clubs-by-county/Cheshire/2
    /tennis-clubs-by-county/Derbyshire/2
    /tennis-clubs-by-county/Durham/2
    /tennis-clubs-by-county/Cheshire/2
    /tennis-clubs-by-county/Derbyshire/2
    /tennis-clubs-by-county/Durham/2
    
    3 replies  |  last active 7 years ago
        1
  •  2
  •   tripleee    7 years ago

    You print all the links collected so far on every pass through the loop in get_nextpage_links.

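    You probably want to remove the print entirely and just print the list once you are done, preferably outside any def (make the functions reusable, and defer any external side effects to the calling code).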
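To see why the output repeats, here is a minimal offline sketch. The `pages` dict is made-up stand-in data (a hypothetical mapping from page name to the hrefs found on it), not the real site content, and `collect` is a name invented for this illustration. It compares dumping the whole set after every page with reporting once at the end.

```python
# Hypothetical stand-in data: page name -> next-page hrefs found on that page.
pages = {
    "Durham": ["/tennis-clubs-by-county/Durham/2"],
    "Cheshire": ["/tennis-clubs-by-county/Cheshire/2"],
    "Derbyshire": ["/tennis-clubs-by-county/Derbyshire/2"],
}

def collect(pages, print_inside_loop):
    """Return (unique links, lines that would have been printed)."""
    seen = set()
    printed = []
    for hrefs in pages.values():
        seen.update(hrefs)
        if print_inside_loop:
            # Mirrors the question: dump the whole set after every page.
            printed.extend(sorted(seen))
    if not print_inside_loop:
        # Report once, after all pages have been processed.
        printed.extend(sorted(seen))
    return seen, printed

seen, noisy = collect(pages, print_inside_loop=True)
_, clean = collect(pages, print_inside_loop=False)
print(len(seen), len(noisy), len(clean))  # 3 6 3
```

The set itself never holds a duplicate; the repeated lines come only from printing it more than once.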

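    A better solution without global variables is probably to have get_links collect a set and return it, pass a reference to that set to the next-page function whenever you call it, and (obviously) have that function add any new links.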
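A rough sketch of that shape, under some assumptions: the requests/lxml fetching from the question is left out so the example runs offline, the href lists are made-up stand-ins for the XPath results, and `add_nextpage_links` is a hypothetical name for the function that receives the caller's set.

```python
from urllib.parse import urljoin

BASE = "http://tennishub.co.uk/"

def get_links(hrefs, base=BASE):
    """Build and return the set of absolute county links (no global state)."""
    return {urljoin(base, href) for href in hrefs}

def add_nextpage_links(hrefs, collected, base=BASE):
    """Add the next-page links found on one page to the caller's set."""
    collected.update(urljoin(base, href) for href in hrefs)

# Made-up hrefs standing in for what the XPath queries would return.
county_links = get_links(["/tennis-clubs-by-county/Durham",
                          "/tennis-clubs-by-county/Durham"])

next_pages = set()
for _ in county_links:
    add_nextpage_links(["/tennis-clubs-by-county/Durham/2"], next_pages)

# Printing happens in the calling code, once, after collection is done.
for url in sorted(next_pages):
    print(url)
```

Because the set is created by the caller and passed in, the functions stay reusable and nothing depends on module-level state.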

    Because you are using a set, there is no need to check whether a link is already in the set before adding it. This data type cannot hold duplicates.
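A tiny illustration of that point, using made-up hrefs in the same form as the question's output:

```python
links = set()
for href in ["/tennis-clubs-by-county/Durham/2",
             "/tennis-clubs-by-county/Durham/2",
             "/tennis-clubs-by-county/Cheshire/2"]:
    links.add(href)  # no "if href not in links" guard needed

print(len(links))  # 2
```

Adding an element that is already present is simply a no-op, so the membership test in the question's loops is redundant rather than harmful.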

        2
  •  2
  •   SIM    7 years ago

    I used set() in a slightly different way in the script below. Now it should produce unique links.

    import requests
    from lxml.html import fromstring
    from urllib.parse import urljoin
    
    link = "http://tennishub.co.uk/"
    
    def get_links(url):
        response = requests.get(url)
        tree = fromstring(response.text)
        crude_links = {urljoin(link, item) for item in tree.xpath('//*[@class="countylist"]//a/@href') if item}
        return crude_links
    
    def get_nextpage(link):
        response = requests.get(link)
        tree = fromstring(response.text)
        titles = {title for title in tree.xpath('//div[@class="pagination"]//a/@href') if title}
        return titles
    
    if __name__ == '__main__':
        for next_page in get_links(link):
            for unique_link in get_nextpage(next_page):
                print(unique_link)
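One caveat, offered as a sketch rather than a definitive fix: each call to get_nextpage builds its own set, so a link that shows up on two different pages would still be printed twice. Merging the per-page results into a single set avoids that. The `per_page` list below is made-up data standing in for the return values of get_nextpage, so the example runs without network access.

```python
from urllib.parse import urljoin

base = "http://tennishub.co.uk/"

# Made-up per-page results standing in for what get_nextpage() returns.
per_page = [
    {"/tennis-clubs-by-county/Durham/2"},
    {"/tennis-clubs-by-county/Durham/2", "/tennis-clubs-by-county/Cheshire/2"},
]

all_links = set()
for titles in per_page:
    # |= merges in place, so repeats across pages collapse into one entry.
    all_links |= {urljoin(base, href) for href in titles}

# Print once, after every page has been merged.
for url in sorted(all_links):
    print(url)
```

Joining against the base URL here also turns the relative hrefs into absolute URLs, which the second script does for the county links but not for the next-page links.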
    
        3
  •  1
  •   ycx    7 years ago

            for rlink in processed_nextpage_links:
                print(rlink)