How to scrape two templates that share the same parent node into a CSV? (Python, web scraping)

  •  0
  • ryy77  · Tech Community  · 6 years ago

    The output should be in CSV format (I have attached the CSV and the code). This is a web scraping program that I wrote in Python. I would appreciate your help.

    from selenium import webdriver
    import csv
    import io
    
    # set the proxies to hide actual IP
    
    proxies = {
        'http': 'http://5.189.133.231:80',
        'https': 'https://27.111.43.178:8080'
    }
    
    chrome_options = webdriver.ChromeOptions()
    
    chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
    
    # Raw string so the backslashes in the Windows path are not read as escape sequences
    driver = webdriver.Chrome(executable_path=r"C:\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                              chrome_options=chrome_options)
    header = ['Product title', 'Product price', 'ASIN', 'Product Weight', 'Product dimensions', 'URL']
    
    # newline="" stops the csv module from writing blank rows on Windows
    with open('csv/sort_products.csv', "w", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(header)
    
    links = [
        'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
        'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
    ]
    
    for i in range(len(links)):
    
        driver.get(links[i])
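        asinFound = False
        # Reset the per-page fields so a value from the previous link cannot leak into this row
        asin = weight = dimension = 'N/A'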
        product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
        prod_title = [x.text for x in product_title]
    
        try:
            prod_price = driver.find_element_by_xpath('//span[@id="priceblock_ourprice"]').text
        except:
            prod_price = 'No price'
    
    
        if asinFound == False:  # try template one
            try:
                asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[5]/td').text
                asinFound=True
            except:
                print('no ASIN template one')
    
            try:
                weight = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[2]/td').text
            except:
                print('no weight template one')
    
            try:
                dimension = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
            except:
                print('no dimension template one')
    
        if asinFound == False:  # try template two
            try:
                asin = driver.find_element_by_xpath('//table[@id ="productDetails_detailBullets_sections1"]/tbody/tr[1]/td').text
                asinFound = True
            except:
                print('no ASIN template two')
    
            try:
                weight = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[2]/td').text
            except:
                print('no weight template two')
    
            try:
                dimension = driver.find_element_by_xpath('//table[@id ="productDetails_techSpec_section_1"]/tbody/tr[3]/td').text
            except:
                print('no dimension template two')
    
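        try:
            data = [prod_title[0], prod_price, asin, weight, dimension, links[i]]
        except (IndexError, NameError):
            # No title, or a field was never assigned: skip this link rather than write a stale or undefined row
            print('no data')
            continue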
    
        with io.open('csv/sort_products.csv', "a", newline="", encoding="utf-8") as output:
            writer = csv.writer(output)
            writer.writerow(data)

1 answer

  •  1
  •   SIM    6 years ago

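    You can combine selenium with BeautifulSoup. The Product information table uses a different id depending on the template, productDetails_detailBullets_sections1 or productDetails_techSpec_section_1, so the script below matches any id that starts with productDetails_ and picks each row by its label rather than by its position.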
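
    The id-prefix and label matching can be tried offline first. Here is a minimal sketch that swaps in the standard library's html.parser for selenium and BeautifulSoup so that it runs without a browser. The class DetailTableParser, the helper parse_details, and the sample markup with its values are all made up for illustration; real product pages will differ.

```python
from html.parser import HTMLParser


class DetailTableParser(HTMLParser):
    """Collect label -> value pairs from rows of tables whose id starts with a prefix."""

    def __init__(self, id_prefix="productDetails_"):
        super().__init__()
        self.id_prefix = id_prefix
        self.in_table = False   # True while inside a table with a matching id
        self.cell = None        # "th" or "td" while inside that kind of cell
        self.label = ""
        self.value = ""
        self.details = {}

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table_id = dict(attrs).get("id") or ""
            self.in_table = table_id.startswith(self.id_prefix)
        elif self.in_table and tag == "tr":
            self.label, self.value = "", ""
        elif self.in_table and tag in ("th", "td"):
            self.cell = tag

    def handle_data(self, data):
        if self.cell == "th":
            self.label += data
        elif self.cell == "td":
            self.value += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.in_table and self.label.strip():
            self.details[self.label.strip()] = self.value.strip()
        elif tag == "table":
            self.in_table = False


# Made-up markup imitating the two templates (hypothetical values).
TEMPLATE_ONE = """
<table id="productDetails_detailBullets_sections1">
  <tr><th> Product Dimensions </th><td> 10 x 10 x 10 inches </td></tr>
  <tr><th> Item Weight </th><td> 5 pounds </td></tr>
  <tr><th> ASIN </th><td> B00FLYWNYQ </td></tr>
</table>
"""
TEMPLATE_TWO = """
<table id="productDetails_techSpec_section_1">
  <tr><th> Item Weight </th><td> 2 pounds </td></tr>
  <tr><th> Product Dimensions </th><td> 4 x 4 x 4 inches </td></tr>
</table>
<table id="productDetails_detailBullets_sections1">
  <tr><th> ASIN </th><td> B01NGTKTDK </td></tr>
</table>
<table id="unrelated"><tr><th>ASIN</th><td>ignored</td></tr></table>
"""


def parse_details(html):
    """Return the label -> value mapping found in the given HTML text."""
    parser = DetailTableParser()
    parser.feed(html)
    parser.close()
    return parser.details


for page in (TEMPLATE_ONE, TEMPLATE_TWO):
    details = parse_details(page)
    # .get() supplies the default when a label is absent from the page
    print([details.get(key, "N/A") for key in ("Product Dimensions", "Item Weight", "ASIN")])
```

    Each printed list holds the dimension, weight, and ASIN for one template, and a label that is missing from the page comes back as N/A. The full script that follows does the same lookup on the live pages with BeautifulSoup selectors.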

    import csv
    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    links = [
        'https://www.amazon.com/Instant-Pot-Multi-Use-Programmable-Packaging/dp/B00FLYWNYQ/ref=sr_1_1?s=home-garden&ie=UTF8&qid=1520264922&sr=1-1&keywords=-gggh',
        'https://www.amazon.com/Amagle-Flexible-Batteries-Operated-Included/dp/B01NGTKTDK/ref=sr_1_2?s=furniture&ie=UTF8&qid=1520353343&sr=1-2&keywords=-jhgf'
    ]
    
    def get_information(driver,urls):
        with open("productDetails.csv","w",newline="") as infile:
            writer = csv.writer(infile)
            writer.writerow(['Title','Dimension','Weight','ASIN'])
    
            for url in urls:
                driver.get(url)
                soup = BeautifulSoup(driver.page_source,"lxml")
                title = soup.select_one("#productTitle").get_text(strip=True)
                # Pick each row by its label; the trailing ["N/A"] is the fallback when no row matches
                dimension = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Product Dimensions" in item.text]+["N/A"])[0]
                weight = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "Item Weight" in item.text]+["N/A"])[0]
                ASIN = ([item.select_one("td").get_text(strip=True) for item in soup.select("#prodDetails [id^='productDetails_'] tr") if "ASIN" in item.text]+["N/A"])[0]
    
                writer.writerow([title,dimension,weight,ASIN])
                print(f'{title}\n{dimension}\n{weight}\n{ASIN}\n')
    
    if __name__ == '__main__':
        driver = webdriver.Chrome()
        try:
            get_information(driver,links)
        finally:
            driver.quit()
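
    The `(...) + [default])[0]` idiom in the script above repeats the fallback for every field. One alternative, sketched here under assumptions, is to collect whatever was found in a dict and let csv.DictWriter fill the gaps through its restval parameter. The column names follow the header in the script above; rows_to_csv and the sample rows are hypothetical.

```python
import csv
import io

FIELDS = ["Title", "Dimension", "Weight", "ASIN"]


def rows_to_csv(rows, fields=FIELDS, default="N/A"):
    """Write dict rows to CSV text; keys absent from a row are filled with `default`."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields, restval=default, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows: the second product has no dimension or weight
rows = [
    {"Title": "Product A", "Dimension": "10 x 10 x 10 inches", "Weight": "5 pounds", "ASIN": "B00FLYWNYQ"},
    {"Title": "Product B", "ASIN": "B01NGTKTDK"},
]
print(rows_to_csv(rows))
```

    Writing to an in-memory buffer keeps the sketch free of file paths; passing an open file object (opened with newline="") to csv.DictWriter works the same way.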