
How can I extract product info (title, price, review, asin) from all Amazon product pages? (Python, web scraping)

ryy77 · asked 6 years ago

I wrote a scraping program that is supposed to go through all of the Amazon product pages (each page holds up to 24 products; this is the template: https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215 ). When I run the program, it only scrapes the first page. Where should I modify the code? Do I have to move this line: driver.find_element_by_id("pagnNextString").click()? I have attached the code below. I would appreciate your help. Thank you.

The program:

    from time import sleep
    from urllib.parse import urljoin
    import csv
    import requests
    from lxml import html
    from selenium import webdriver
    import io
    
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "en-US,en;q=0.8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }
    
    proxies = {
          'http': 'http://198.1.122.29:80',
          'https': 'http://204.52.206.65:8080'
    }
    
    chrome_options = webdriver.ChromeOptions()
    
    chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
    
    driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                                  chrome_options=chrome_options)
    header = ['Product title', 'Product price', 'Review', 'ASIN']
    
    links = []
    url = 'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215'
    
    while True:
        try:
            print('Fetching url [%s]...' % url)
            response = requests.get(url, headers=headers, proxies=proxies, stream=True)
            if response.status_code == 200:
                try:
                    products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')
    
                    for product in products:
                        title = product.find_element_by_tag_name('h2').text
                        price = ([item.text for item in
                                      product.find_elements_by_xpath('.//a/span[contains(@class, "a-color-base")]')] + [
                                         "No price"])[0]
                        review = ([item.get_attribute('textContent') for item in
                                       product.find_elements_by_css_selector('i.a-icon-star>span.a-icon-alt')] + [
                                          "No review"])[0]
                        asin = product.get_attribute('data-asin') or "No asin"
    
                        try:
                            data = [title, price, review, asin]
                        except:
                            print('no items')
                        with io.open('csv/furniture.csv', "a", newline="", encoding="utf-8") as output:
                            writer = csv.writer(output)
                            writer.writerow(data)
                        driver.find_element_by_id("pagnNextString").click()
                except IndexError:
                    break
    
        except Exception:
            print("Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            sleep(5)
            print("Was a nice sleep, now let me continue...")
1 Answer  |  6 years ago
Andersson · answered 6 years ago
    url = urljoin('https://www.amazon.com', next_url)
    for i in range(len(url)):
        driver.get(url[i])
    

These lines do the following:

1. url = urljoin('https://www.amazon.com', next_url) gets a URL as a string, e.g. https://www.amazon.com/some_source , and assigns it to the url variable
2. for i in range(len(url)) iterates over the integer range 0, 1, 2, 3, ... len(url) and assigns each value to the i variable
3. driver.get(url[i]) navigates to a single character of that string, e.g. driver.get("h") , driver.get("t") , as the snippet below illustrates
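
A quick way to see what goes wrong (the URL string here is just an example):

    url = 'https://www.amazon.com/some_source'
    # range(len(url)) yields character positions, so url[i] is a single
    # character of the string, not a page URL
    for i in range(3):
        print(url[i])   # prints h, t, t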

I don't know what exactly you are trying to do, but I guess you need

    url = urljoin('https://www.amazon.com', next_url)
    driver.get(url)
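
Here next_url is assumed to be the relative href of the "next page" link. A minimal sketch of how it could be pulled from the current page with lxml, reusing the question's imports (the XPath around pagnNextString is an assumption about Amazon's old results markup, not something taken from the question):

    # hedged sketch: read the relative "next page" href from the page
    # source, assuming the link element wraps the pagnNextString span
    tree = html.fromstring(driver.page_source)
    hrefs = tree.xpath('//span[@id="pagnNextString"]/parent::a/@href')
    if hrefs:
        url = urljoin('https://www.amazon.com', hrefs[0])
        driver.get(url)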
    

UPDATE

If you need to go through all the pages, try adding

    driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()
    

after scraping each page.

Also note that for product in products can never lead to an IndexError , so you can avoid using try / except for that loop.
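
Putting the suggestion together with the selectors from the question, a minimal sketch of the paginated loop could look like this (it assumes Amazon's old results layout and the Selenium 3 style API used in the question; stopping on NoSuchElementException is an added assumption):

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Chrome()
    driver.get('https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215')

    while True:
        # scrape every product card on the current results page first
        for product in driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]'):
            title = product.find_element_by_tag_name('h2').text
            asin = product.get_attribute('data-asin') or 'No asin'
            # ... collect price/review and write the CSV row as in the question
        # only after the whole page is done, try to move on; stop when
        # there is no next-page link any more
        try:
            driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()
        except NoSuchElementException:
            break

    driver.quit()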