
How do I break out of the crawl when it is on the last page (requests, Python)?

  • ryy77  · Tech Community  · 8 years ago

    I made a crawler with requests, and I want it to stop when it is on the last page. Where should I put the break statement to break out of the loop on the last page? Right now it runs, but it does not stop at the last page. I have attached the program. I would appreciate your help.

    import requests
    from lxml import html
    from time import sleep
    import csv
    
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "en-US,en;q=0.8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }
    
    proxies = {
        'http': 'http://95.167.116.116:8080',
        'https': 'http://88.157.149.250:8080',
    }
    page_counter = 1
    links = []
    while True:
        try:
            url = "https://www.amazon.com/s/ref=sr_pg_{0}?fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011%2Cn%3A11444072011%2Cn%3A11444086011%2Cn%3A2632268011&page={0}&bbn=11444086011&ie=UTF8&qid=1517650207".format(
                page_counter)
            response = requests.get(url, headers=headers, proxies=proxies, stream=True)
            if response.status_code == 200:
                source = html.fromstring(response.content)
                links.extend(source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href'))
                page_counter += 1
            else:
                # meant to stop on the last page; never fires while the site keeps answering 200
                break
        except Exception:
            print("Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            sleep(5)
            print("Current page ", page_counter)
            print("Was a nice sleep, now let me continue...")
    
    csvfile = "products.csv"
    
    # links is a flat list of hrefs
    with open(csvfile, "w") as output:
        writer = csv.writer(output, lineterminator='\n')
        for val in links:
            writer.writerow([val])
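
    One way to make the loop stop by itself, assuming the site keeps answering 200 past the last page (which would explain why the status check above never breaks): bail out as soon as a page yields no result links. A minimal sketch of that variant of the loop, reusing the headers and proxies dicts from above:

    import requests
    from lxml import html

    # headers and proxies: the same dicts as defined in the question above
    page_counter = 1
    links = []
    while True:
        url = ("https://www.amazon.com/s/ref=sr_pg_{0}?fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011%2Cn%3A11444072011%2Cn%3A11444086011%2Cn%3A2632268011&page={0}&bbn=11444086011&ie=UTF8&qid=1517650207").format(page_counter)
        response = requests.get(url, headers=headers, proxies=proxies, stream=True)
        if response.status_code != 200:
            break
        source = html.fromstring(response.content)
        new_links = source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href')
        if not new_links:
            # an empty result set is taken to mean we ran past the last page
            break
        links.extend(new_links)
        page_counter += 1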
    1 Answer  |  8 years ago
  •   Szabolcs    8 years ago

    Please take this snippet as an example and extend it with your own custom functions:

    from time import sleep
    from urllib.parse import urljoin
    
    import requests
    from lxml import html
    
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "en-US,en;q=0.8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }
    
    proxies = {
        'http': 'http://95.167.116.116:8080',
        'https': 'http://88.157.149.250:8080',
    }
    
    links = []
    url = 'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011%2Cn%3A11444072011%2Cn%3A11444086011%2Cn%3A2632268011&bbn=11444086011&ie=UTF8&qid=1517831374'
    
    while True:
        try:
            print('Fetching url [%s]...' % url)
            response = requests.get(url, headers=headers, proxies=proxies, stream=True)
            if response.status_code == 200:
                source = html.fromstring(response.content)
                links.extend(source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href'))
                try:
                    next_url = source.xpath('//*[@id="pagnNextLink"]/@href')[0]
                    url = urljoin('https://www.amazon.com', next_url)
                except IndexError:
                    break
        except Exception:
            print("Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            sleep(5)
            print("Was a nice sleep, now let me continue...")
    
    print(links)
    

    Essentially, it grabs the next page's link from the current page. If the next page's URL can be found, the loop follows it; if it cannot be found, it breaks out of the while loop and prints the collected links list.
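
    As suggested, the snippet can also be wrapped into a custom function. A minimal sketch assuming the same markup as above; the name iter_result_links is made up for illustration:

    from urllib.parse import urljoin

    import requests
    from lxml import html

    def iter_result_links(start_url, headers=None):
        # Hypothetical helper: follow the pagnNextLink chain and
        # yield every result href found along the way.
        url = start_url
        while url:
            response = requests.get(url, headers=headers, stream=True)
            if response.status_code != 200:
                break
            source = html.fromstring(response.content)
            yield from source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href')
            next_link = source.xpath('//*[@id="pagnNextLink"]/@href')
            # no "next" link on the page means we are on the last one
            url = urljoin('https://www.amazon.com', next_link[0]) if next_link else None

    # Usage: links = list(iter_result_links(url, headers=headers))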

    Hope it helps.