Why does my program return "No review v1" instead of the average review for products that have reviews? (web scraping, Python)

  • ryy77 · 6 years ago

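    My program prints "No review v1" for the last 4 products ("Chest of Drawers, Small Tall Accent Cabinet with Open Storage, Filing Cabinets and Collection for Bedroom or Office, Oak (5 Drawers)", "Modern Innovations Bedside Tray with Cup Holder and Cable Cord Insert for Use as Bunk Bed Shelf", "Mantua Cottage Style Wedgewood Blue Nightstand, Perfect for Seaside and Country Decor", "Step-style Folding Aluminum RV Step Platform, Non-Slip Surface, Sturdy, Lightweight, Max Load"), even though they have reviews. I don't know where the problem is. For the product "Milan SEERAT-AS-RST Adjustable Height Swivel Stool, Rust" (the first product) it returns "5 out of 5 stars" instead of "No review v1". The URL is https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A1063306%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532070774 and the problem is in the try/except block for the review, at lines 40-45. I have attached the code and the CSV. I would appreciate any help. Thank you!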

    Here is the CSV:

    [image: CSV output]

    Here is the program:

    import csv
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import requests
    from lxml import html
    import io
    
    links = [
        'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A1063306%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532070774'
     ]
    proxies = {
        'http': 'http://218.50.2.102:8080',
        'https': 'http://185.93.3.123:8080'
    }
    
    chrome_options = webdriver.ChromeOptions()
    
    chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
    
    driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                                  chrome_options=chrome_options)
    header = ['Product title', 'Product price', 'Review', 'ASIN']
    
    with open('csv/demo.csv', "w") as output:
        writer = csv.writer(output)
        writer.writerow(header)
    
    for i in range(len(links)):
        driver.get(links[i])
        for x in range(0,23):
            product_title = driver.find_elements_by_xpath('//li[@id="result_{}"]/div/div[3]/div/a'.format(x))
            title = [x.text for x in product_title]
    
            try:
                price = driver.find_element_by_xpath('//li[@id="result_{}"]/div/div[5]/div/a/span[2]'.format(x)).text
            except:
                price = 'No price v2'
                print('No price v2')
    
            try:
                review = driver.find_elements_by_css_selector('i.a-icon-star>span.a-icon-alt')[x].get_attribute('textContent')
    
            except:
                review = 'No review v1'
                print('No review v1')
    
            try:
                asin = driver.find_element_by_id('result_{}'.format(x)).get_attribute('data-asin')
    
            except:
                asin = 'No asin'
                print('No asin')
    
            try:
                data = [title[0], price, review, asin]
            except:
                print('no items v3 ')
            with io.open('csv/demo.csv', "a", newline="", encoding="utf-8") as output:
                writer = csv.writer(output)
                writer.writerow(data)
        print('I solved this link %s' % (links[i]))
        print('Number of product %s' % (i + 1))
        driver.quit()
    1 answer

  • Andersson · 6 years ago

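    Try the following loop. It looks up the title, price, review and ASIN inside each product element instead of indexing a page-wide list of matches, so a product without a review no longer shifts the results of the products that follow it: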

    # Collect every search-result container, then query inside each one
    products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')
    for product in products:
        title = product.find_element_by_tag_name('h2').text
        # find_elements_* returns [] when nothing matches, so the appended default becomes item [0]
        price = ([item.text for item in product.find_elements_by_xpath('.//a/span[contains(@class, "a-color-base")]')] + ["No price"])[0]
        review = ([item.get_attribute('textContent') for item in product.find_elements_by_css_selector('i.a-icon-star>span.a-icon-alt')] + ["No review"])[0]
        # get_attribute returns None when the attribute is missing
        asin = product.get_attribute('data-asin') or "No asin"
        print(title, price, review, asin)
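
    To see why indexing a page-wide list misbehaves, here is a minimal sketch that reproduces the effect without a browser. It uses the standard library's xml.etree.ElementTree in place of Selenium. The markup, product names, ASINs and the function names reviews_by_global_index and reviews_scoped are all hypothetical and heavily simplified, so treat it as an illustration of the idea rather than of Amazon's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: three products, the first one has no rating.
PAGE = """
<ul>
  <li id="result_0" data-asin="A0"><h2>Stool</h2></li>
  <li id="result_1" data-asin="A1"><h2>Cabinet</h2>
    <i class="a-icon-star"><span class="a-icon-alt">4.5 out of 5 stars</span></i></li>
  <li id="result_2" data-asin="A2"><h2>Tray</h2>
    <i class="a-icon-star"><span class="a-icon-alt">3.0 out of 5 stars</span></i></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)
products = root.findall("li")


def reviews_by_global_index(root, products):
    """Mimic the question: take the x-th rating found anywhere on the page."""
    all_ratings = root.findall(".//span[@class='a-icon-alt']")
    out = []
    for x in range(len(products)):
        try:
            out.append(all_ratings[x].text)
        except IndexError:
            # Fewer ratings than products: the last indices run off the end.
            out.append("No review v1")
    return out


def reviews_scoped(products):
    """Mimic the answer: search inside each product, with a default."""
    out = []
    for product in products:
        texts = [s.text for s in product.findall(".//span[@class='a-icon-alt']")]
        # An empty list plus the default leaves the default at index 0.
        out.append((texts + ["No review"])[0])
    return out


print(reviews_by_global_index(root, products))
print(reviews_scoped(products))
```

    With the page-wide index, the unrated first product picks up the rating of the second one, and the last product falls off the end of the list and gets "No review v1". The scoped lookup keeps each rating with its own product.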