How do I extract the average star rating from this Amazon page? (Python, web scraping)

  • ryy77  ·  6 years ago

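    How can I extract the average rating ("4.0 out of 5 stars") from an Amazon product page ( https://www.amazon.com/Windsor-Glider-Ottoman-White-Cushion/dp/B017XRDV5S/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520265105&sr=1-1&keywords=-gggg&th=1 ) and save it in CSV format? The rating appears on the left, just below the product title. I think it is dynamic content generated with JavaScript (which computes the star average). My code is attached below. I would appreciate your help.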

    import csv
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    
    links = [
        'https://www.amazon.com/Windsor-Glider-Ottoman-White-Cushion/dp/B017XRDV5S/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520265105&sr=1-1&keywords=-gggg&th=1'
    ]
    proxies = {
        'http': 'http://218.50.2.102:8080',
        'https': 'http://185.93.3.123:8080'
    }
    
    def get_information(driver, urls):
        # Write one CSV row per URL: the rating text (or a placeholder) and the link.
        with open('csv/sort_products.csv', "w", newline="", encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            writer.writerow(['Review', 'Link'])
            for url in urls:
                driver.get(url)
                try:
                    review = driver.find_element_by_xpath('//div[@id="averageCustomerReviews"]/span/span/span/a').text
                except NoSuchElementException:
                    # The rating element was not found on this page.
                    review = 'No review'
                    print('No review')
    
                writer.writerow([review, url])
                print(f'{url}\n')
    
    if __name__ == '__main__':
        chrome_options = webdriver.ChromeOptions()
    
        # Build "scheme=proxy" pairs joined by ';'. No quotes inside the argument,
        # since they would be passed to Chrome literally.
        chrome_options.add_argument('--proxy-server=%s' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
    
        # Raw string so the backslashes in the Windows path are not treated as escapes.
        driver = webdriver.Chrome(executable_path=r"C:\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                                  chrome_options=chrome_options)
        try:
            get_information(driver, links)
        finally:
            # Close the browser even if scraping fails.
            driver.quit()

1 answer

  • Andersson  ·  6 years ago

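    Try the following, which reads the rating text from the `title` attribute of the element with id `acrPopover`: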

    stars = driver.find_element_by_id('acrPopover').get_attribute('title')
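
    If a numeric value is more useful in the CSV than the raw text, the string returned above can be parsed. The following is only a minimal sketch, assuming the `title` attribute has the form "4.0 out of 5 stars" quoted in the question; `parse_star_rating` and `write_rating_rows` are hypothetical helper names, not part of Selenium or of the original answer.

```python
import csv
import re

# Hypothetical helpers for illustration only (not from the original answer).
# Assumes rating text shaped like "4.0 out of 5 stars".
_RATING_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s+out of\s+(\d+(?:\.\d+)?)\s+stars?", re.IGNORECASE
)


def parse_star_rating(title):
    """Return (rating, scale) as floats, or None if the text does not match."""
    match = _RATING_RE.search(title or "")
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def write_rating_rows(rows, outfile):
    """Write (title_text, url) pairs to outfile as CSV rows: rating, scale, link."""
    writer = csv.writer(outfile)
    writer.writerow(["Rating", "Scale", "Link"])
    for title_text, url in rows:
        parsed = parse_star_rating(title_text)
        # Fall back to the placeholder used in the question when parsing fails.
        rating, scale = parsed if parsed is not None else ("No review", "")
        writer.writerow([rating, scale, url])
```

    In the question's loop, `stars` from the line above could be collected together with `url` and passed to `write_rating_rows`. Note that recent Selenium releases removed the `find_element_by_*` helpers; there the equivalent lookup is `driver.find_element(By.ID, 'acrPopover')` with `from selenium.webdriver.common.by import By`.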