
How do you get all pages when web scraping?

  •  0
  • outkast20  · 5 years ago

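    I am trying to scrape https://www.dickssportinggoods.com/f/all-mens-footwear but I don't know what else to write in my code. Basically, I want to pick one brand of shoes across all pages of the site. For example, if I choose New Balance, I want to print a list of the names of all shoes of that brand. Here is my code: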

    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen as uReq
    Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
    uClient = uReq(Url)
    Page = uClient.read()
    uClient.close()
    page_soup = soup(Page, "html.parser")
    for i in page_soup.findAll("div", {"class":"rs-facet-name-container"}):
        print(i.text)
    
        1
  •  0
  •   Pygirl    5 years ago

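    That site updates its elements with JavaScript, so BeautifulSoup alone is not enough; you have to use browser automation.

    The code below does not work, because the elements are updated a few milliseconds after the page loads. The page first shows every brand, then updates to show only the selected brand, so use automation.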

    from bs4 import BeautifulSoup as soup
    import time
    from urllib.request import urlopen as uReq
    brands_name = ['New Balance']  # brands to filter on; the loop below iterates over this list
    url_st = 'https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&filterFacets=X_BRAND'
    
    for idx, br in enumerate(brands_name):
        if idx==0:
            url_st += '%3A'+ '%20'.join(br.split(' '))
        else: 
            url_st += '%2C' + '%20'.join(br.split(' '))
    
    uClient = uReq(url_st)
    time.sleep(4)
    Page = uClient.read()
    uClient.close()
    
    page_soup = soup(Page, "html.parser") 
    for match in page_soup.find_all('div', class_='rs_product_description d-block'):
        print(match.text)
    
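The loop above percent-encodes the brand filter by hand (`%3A` for `:`, `%2C` for `,`, `%20` for a space). As a minimal sketch, the same string can be produced with `urllib.parse.quote` from the standard library. The helper name `build_listing_url` and the constant `BASE_URL` are hypothetical and only used for illustration; the URL shape is taken from the code above.

```python
from urllib.parse import quote

BASE_URL = "https://www.dickssportinggoods.com/f/mens-athletic-shoes"


def build_listing_url(brands, page_number=0):
    """Return the listing URL filtered to the given brands (hypothetical helper)."""
    # Build 'X_BRAND:Brand One,Brand Two', then encode ':' , ',' and ' '
    # as %3A, %2C and %20, matching the hand-written loop.
    facet = quote("X_BRAND:" + ",".join(brands), safe="")
    return f"{BASE_URL}?pageNumber={page_number}&filterFacets={facet}"


print(build_listing_url(["New Balance"]))
```

This only builds the URL. As explained above, fetching it with `urlopen` still returns the page before JavaScript has applied the filter.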

    Code (selenium + bs4):

    from bs4 import BeautifulSoup as soup
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time
    from webdriver_manager.chrome import ChromeDriverManager
    
    chrome_options = Options()
    # Headless mode is left disabled: the site opens a dialog that has to be closed first.
    # chrome_options.add_argument("--headless")
    # Selenium 4 style: the driver path goes through Service, options through `options=`.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
    driver.maximize_window()
    
    brands_name = ['New Balance']
    
    # Build the brand filter: the first brand follows '%3A' (':'), later ones follow '%2C' (',').
    filter_facet = 'filterFacets=X_BRAND'
    for idx, br in enumerate(brands_name):
        if idx == 0:
            filter_facet += '%3A' + '%20'.join(br.split(' '))
        else:
            filter_facet += '%2C' + '%20'.join(br.split(' '))
    
    url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&{filter_facet}"
    driver.get(url)
    time.sleep(4)  # give the JavaScript time to apply the brand filter
    # Close the dialog if it is shown; find_element raises when nothing matches.
    try:
        driver.find_element(By.CLASS_NAME, 'close').click()
    except NoSuchElementException:
        pass
    page_soup = soup(driver.page_source, 'html.parser')
    for match in page_soup.find_all('div', class_='rs_product_description d-block'):
        print(match.text)
    
    # Page links are labelled 1..N, while pageNumber in the URL is zero-based.
    page_num = page_soup.find_all('a', class_='rs-page-item')
    pnum = [int(pn.text) for pn in page_num if pn.text != '']
    last_page = max(pnum) if pnum else 1
    for pn in range(1, last_page):
        url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber={pn}&{filter_facet}"
        driver.get(url)
        time.sleep(2)
        page_soup = soup(driver.page_source, "html.parser")
        for match in page_soup.find_all('div', class_='rs_product_description d-block'):
            print(match.text)
    
    driver.quit()
    

    New Balance Men's 410v6 Trail Running Shoes
    New Balance Men's 623v3 Training Shoes
    .
    .
    .
    New Balance Men's Fresh Foam Beacon Running Shoes
    New Balance Men's Fresh Foam Cruz v2 SockFit Running Shoes
    New Balance Men's 470 Running Shoes
    New Balance Men's 996v3 Tennis Shoes
    New Balance Men's 1260 V7 Running Shoes
    New Balance Men's Fresh Foam Beacon Running Shoes
    

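    I have commented out headless Chrome because when the page opens you get a dialog with a button, and only after closing it can you fetch the product details. With headless automation you will not be able to do that. (I don't have a solution for this; I'm not very familiar with Selenium.)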

    Don't forget to install webdriver_manager with pip install webdriver_manager

        2
  •  0
  •   Ambarish Singh    5 years ago

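    You just need to use driver.find_element_by_xpath() (in Selenium 4, driver.find_element(By.XPATH, ...)). If you use Selenium, this is something you have to know.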
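To illustrate what an XPath expression for the product names might look like, here is a small sketch that uses the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) instead of Selenium, so it runs without a browser. The `SAMPLE` markup is a hand-written stand-in, not the real page; only the class name `rs_product_description d-block` and the two product names come from the answer above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the rendered page; the real markup may differ.
SAMPLE = """
<html><body>
  <div class="rs_product_description d-block">New Balance Men's 410v6 Trail Running Shoes</div>
  <div class="other">ignored</div>
  <div class="rs_product_description d-block">New Balance Men's 623v3 Training Shoes</div>
</body></html>
"""

# Select every <div> whose class attribute matches exactly.
PRODUCT_XPATH = ".//div[@class='rs_product_description d-block']"

root = ET.fromstring(SAMPLE.strip())
names = [el.text.strip() for el in root.findall(PRODUCT_XPATH)]
print(names)
```

With Selenium 4, an equivalent expression (written as `//div[@class='rs_product_description d-block']`) would be passed to `driver.find_elements(By.XPATH, ...)` once the page has finished rendering.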

        3
  •  0
  •   Diego Suarez    5 years ago

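    The page uses JavaScript to create the links you want, so you cannot scrape them directly; you need to replicate the requests the page makes. In this case the page sends a POST request: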

    Request URL: https://prod-catalog-product-api.dickssportinggoods.com/v1/search
    Request Method: POST
    Status Code: 200 OK
    Remote Address: [2600:1400:d:696::25db]:443
    Referrer Policy: no-referrer-when-downgrade
    

    This is the URL the POST request is sent to, followed by its payload:

    https://prod-catalog-product-api.dickssportinggoods.com/v1/search
    

    {selectedCategory: "12301_1714863", selectedStore: "1406", selectedSort: 1,…}
    isFamilyPage: true
    pageNumber: 0
    pageSize: 48
    searchTypes: []
    selectedCategory: "12301_1714863"
    selectedFilters: {X_BRAND: ["New Balance"]}   #<--- this is the info that you want to get
    selectedSort: 1
    selectedStore: "1406"
    storeId: 15108
    totalCount: 3360
    
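The payload carries both `pageSize` and `totalCount`, which suggests a way to work out how many pages to request. The sketch below assumes `totalCount` is the number of products matching the current selection, which the payload alone does not confirm; `page_count` is a hypothetical helper name.

```python
def page_count(total_count, page_size):
    """Number of pages needed to cover total_count items (ceiling division)."""
    return (total_count + page_size - 1) // page_size


# Values taken from the captured payload: totalCount 3360, pageSize 48.
pages = page_count(3360, 48)
print(pages)

# pageNumber is zero-based in the payload, so the pages to request are 0..pages-1.
page_numbers = list(range(pages))
```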

    The page may also require headers, so make sure you mimic the request the browser sends.
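As a rough sketch of replicating that request with only the standard library, the code below rebuilds the captured payload and wraps it in a POST request. Several things here are assumptions rather than facts about the API: that the body is sent as JSON, that a `Content-Type` and a browser-like `User-Agent` header are enough, and that the response is JSON. The helper names `build_search_request` and `fetch_search_page` are hypothetical.

```python
import json
import urllib.request

SEARCH_URL = "https://prod-catalog-product-api.dickssportinggoods.com/v1/search"


def build_search_request(brand, page_number=0):
    """Build (but do not send) a POST request mirroring the captured payload."""
    payload = {
        "isFamilyPage": True,
        "pageNumber": page_number,
        "pageSize": 48,
        "searchTypes": [],
        "selectedCategory": "12301_1714863",
        "selectedFilters": {"X_BRAND": [brand]},  # the brand filter
        "selectedSort": 1,
        "selectedStore": "1406",
        "storeId": 15108,
        "totalCount": 3360,
    }
    headers = {
        "Content-Type": "application/json",  # assumption: the API expects JSON
        "User-Agent": "Mozilla/5.0",         # assumption: a browser-like agent is needed
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(SEARCH_URL, data=body, headers=headers, method="POST")


def fetch_search_page(brand, page_number=0):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_search_request(brand, page_number)) as resp:
        return json.load(resp)
```

Calling `fetch_search_page("New Balance", 0)` would send the request. The structure of the response is not shown in the answer, so inspect it in the browser's network tab before relying on any particular field, and copy over any extra headers the browser sends if the server rejects the request.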