
BeautifulSoup not returning the full HTML of the page

Andrew · 7 years ago

    This is part of my script:

    import requests
    from time import sleep
    from bs4 import BeautifulSoup
    
    keyword = "men jeans".replace(' ', '+')
    
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
    url = "https://www.amazon.com/s/field-keywords={}".format(keyword)
    
    request = requests.session()
    req = request.get(url, headers=headers)
    sleep(3)
    soup = BeautifulSoup(req.content, 'html.parser')
    print(soup)
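BeautifulSoup only parses whatever bytes `requests` hands it; when Amazon serves a bot-check page instead of the search results, the "missing" HTML was never in the response at all. A minimal sketch of detecting that case (the `Robot Check` title text and the sample HTML are assumptions about what a blocked response looks like, not guaranteed markers):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for req.content when the request is blocked;
# Amazon's interstitial historically used a "Robot Check" <title>.
blocked_html = "<html><head><title>Robot Check</title></head><body>Sorry.</body></html>"

soup = BeautifulSoup(blocked_html, "html.parser")
# If this is True, the short output comes from the server, not the parser.
is_blocked = soup.title is not None and "Robot Check" in soup.title.text
print(is_blocked)
```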
    
    2 Answers
    
    SIM · 7 years ago

    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    def fetch_item(driver, keyword):
        driver.get(url.format(keyword.replace(" ", "+")))
        # Parse the browser-rendered source, which includes JS-generated markup
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for item in soup.select("[id^='result_']"):
            try:
                name = item.select_one("h2").text
            except AttributeError:
                name = ""
            print(name)
    
    if __name__ == '__main__':
        url = "https://www.amazon.com/s/field-keywords={}"
        driver = webdriver.Chrome()
        try:
            fetch_item(driver, "men jeans")
        finally:
            driver.quit()
    

    When you run the above script, you should get 56 names or so in the results.
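The selector logic in this answer can be exercised without a browser on a static snippet; the `result_N` ids and `<h2>` markup below are assumptions mirroring the shape the answer relies on:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for driver.page_source; ids follow the old
# Amazon result-list pattern "result_0", "result_1", ...
html = """
<ul>
  <li id="result_0"><h2>Levi's 501 Original Fit Jeans</h2></li>
  <li id="result_1"><h2>Wrangler Authentics Relaxed Fit</h2></li>
  <li id="result_2"></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
names = []
for item in soup.select("[id^='result_']"):
    h2 = item.select_one("h2")        # None when the item has no <h2>
    names.append(h2.text if h2 else "")
print(names)
# → ["Levi's 501 Original Fit Jeans", "Wrangler Authentics Relaxed Fit", ""]
```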

    ThunderHorn · 7 years ago
    import requests
    from bs4 import BeautifulSoup
    
    keyword = "red car".replace(' ', '+')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
    # Reuse one session across pages instead of recreating it per iteration
    session = requests.session()
    
    for page in range(1, 21):
        url = "https://www.amazon.com/s/field-keywords=" + keyword + "?page=" + str(page)
        req = session.get(url, headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        results = soup.find_all("li", {"class": "s-result-item"})
    
        for item in results:
            try:
                print(item.find("h2", {"class": "s-access-title"}).text.replace('[SPONSORED]', ''))
                print(item.find("span", {"class": "sx-price-large"}).text.replace("\n", ' '))
                print('*' * 20)
            except AttributeError:
                # Skip items missing a title or price
                pass
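Hand-concatenating `+` and `?page=` into the URL is fragile. A hypothetical `search_url` helper using `urllib.parse.urlencode` builds the query string safely; note the `/s?` query-style endpoint here is an assumption, whereas the answer above uses a path-style URL:

```python
from urllib.parse import urlencode

def search_url(keyword, page):
    # urlencode handles spaces ("+") and parameter separators ("&") for us
    params = urlencode({"field-keywords": keyword, "page": page})
    return "https://www.amazon.com/s?" + params

print(search_url("red car", 2))
# → https://www.amazon.com/s?field-keywords=red+car&page=2
```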