
BeautifulSoup with Selenium: using find_all to select a parent, then its children

  • simon5968  · Tech Community  · 1 year ago

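    I'm a bit stuck. I have the for loop below and have spent several days trying to get it working. I've been spending my evenings trying to extract the data from the HTML shown at the bottom. Doing it by hand is not an option, since there will be more than 1,000 items to collect.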

    import re
    from random import randint
    from time import sleep

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # login credentials
    username = "xxxxxxx"
    password = "xxxxxxx"

    # subcategory pages to scrape
    url_subcategorise_list = {
        "https://www./search/all-items?id=C99"
    }
    # url_subcategorise_list.add("")
    # print(url_subcategorise_list)

    # initialize the Chrome driver
    chromedriver_path = "C:\\Users\\xxx\\projects\\web_scrapper\\web_scrapper.exe"
    service = Service(chromedriver_path)
    driver = webdriver.Chrome(service=service)

    url = "https://www.xxxx.co.uk"
    driver.get(url)

    # at login page
    # find username/email field and send the username itself to the input field
    driver.find_element(By.ID, "username").send_keys(username)
    # find password input field and insert password as well
    driver.find_element(By.XPATH, "//input[@type='password']").send_keys(password)
    # submit the login form via its submit button
    driver.find_element(By.XPATH, "//input[@type='submit']").submit()

    # visit each subcategory page and parse its HTML
    for link in url_subcategorise_list:
        driver.get(link)
        get_url = driver.current_url
        print("The current url is:" + str(get_url))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        print(soup)

    # driver.implicitly_wait(10)

    # now leaving site: quit only after the loop, otherwise the browser
    # session is gone before the second link can be loaded
    driver.quit()
    

    Below is the part of the page HTML I need to select data from. I picked the nearest parent div that covers all the data I need:

    <div class="md-modal md-effect-12" id="modal-promo"> 
        <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3> 
            <div> 
                <div class="dninfo pure-g"> 
                    <div class="pure-u-1-2 img both"><img class="pure-img" 
                                                          src="/info/images/by-supplier/W72/6021.jpg"></div> 
                    <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div> 
                </div> 
                <input class="md-close" type="button" value="Close"></div> 
        </div> 
    </div> 
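
    One way to go from the parent div to its children, as a minimal sketch. It assumes every item modal carries the `md-effect-12` class and holds one `h3` and one `div.bullets`; the `html` string and the `extract_modals` helper are illustrative names, not part of the original script.

```python
from bs4 import BeautifulSoup

# Copy of the markup from the question, used as stand-in input.
html = """
<div class="md-modal md-effect-12" id="modal-promo">
    <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
        <div>
            <div class="dninfo pure-g">
                <div class="pure-u-1-2 img both"><img class="pure-img"
                     src="/info/images/by-supplier/W72/6021.jpg"></div>
                <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
            </div>
            <input class="md-close" type="button" value="Close"></div>
    </div>
</div>
"""


def extract_modals(markup):
    """Return one dict per modal div with its heading and bullet text."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # class_ matches a tag if ANY one of its classes equals the value,
    # so this also finds class="md-modal md-effect-12".
    for modal in soup.find_all("div", class_="md-effect-12"):
        heading = modal.find("h3")
        bullets = modal.find("div", class_="bullets")
        results.append({
            "heading": heading.get_text(strip=True) if heading else None,
            "bullets": bullets.get_text(strip=True) if bullets else None,
        })
    return results


items = extract_modals(html)
print(items)
```

    In the real script the `markup` argument would presumably be `driver.page_source`. Searching from `modal` rather than from `soup` keeps each heading paired with the bullets of the same item.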
    
    

    I need to get the text from the following elements:

    <h3>75mm(3") NO.1 POZI S/DRIVER</h3> 
    <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
    

    The only unique identifier is the image path /info/images/by-supplier/W72/6021.jpg
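
    Since the image path is the only unique identifier, one possible approach is to find the `img` by its `src` and climb to the enclosing modal with `find_parent`. This is a sketch that assumes the structure shown in the question; `find_by_image` and the sample `html` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Copy of the markup from the question, used as stand-in input.
html = """
<div class="md-modal md-effect-12" id="modal-promo">
  <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
    <div>
      <div class="dninfo pure-g">
        <div class="pure-u-1-2 img both"><img class="pure-img"
             src="/info/images/by-supplier/W72/6021.jpg"></div>
        <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
      </div>
      <input class="md-close" type="button" value="Close"></div>
  </div>
</div>
"""


def find_by_image(soup, image_src):
    """Return (heading, bullets) for the modal holding the image, else None."""
    # Start from the identifying <img>, matched on its exact src attribute.
    img = soup.find("img", src=image_src)
    if img is None:
        return None
    # Walk up the tree to the nearest ancestor div with class md-modal.
    modal = img.find_parent("div", class_="md-modal")
    if modal is None:
        return None
    heading = modal.find("h3")
    bullets = modal.find("div", class_="bullets")
    return (
        heading.get_text(strip=True) if heading else None,
        bullets.get_text(strip=True) if bullets else None,
    )


soup = BeautifulSoup(html, "html.parser")
result = find_by_image(soup, "/info/images/by-supplier/W72/6021.jpg")
print(result)
```

    Returning `None` for an unknown image keeps a long run over 1,000+ items from stopping at the first missing modal.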

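    divcollector = set()  # was never defined in the original attempt
    # tag.get('class') == ['md-effect-12'] never matched: the class list here is
    # ['md-modal', 'md-effect-12']. link.get("div") also read an attribute named
    # "div" (always None) rather than a child tag, so select the child directly.
    for bullets in soup.select('div.md-effect-12 div.bullets'):
        divcollector.add(bullets.get_text(strip=True))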
    