BeautifulSoup with Selenium: using find_all to select a parent, then select its children

beautifulsoup selenium-webdriver

simon5968 · Tech Community · 1 year ago

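I'm a bit stuck. I have the for loop below and have already spent several days trying to get it working. I've been working in the evenings on extracting data from the HTML shown at the bottom. Doing it by hand is not an option, because there will be more than 1,000 items to collect.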

import re
from random import randint
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# login credentials
username = "xxxxxxx"
password = "xxxxxxx"

# category pages to scrape (a set, so duplicate URLs are ignored)
url_subcategorise_list = {
    "https://www./search/all-items?id=C99"
}
# url_subcategorise_list.add("")
# print(url_subcategorise_list)

# initialize the Chrome driver
chromedriver_path = "C:\\Users\\xxx\\projects\\web_scrapper\\web_scrapper.exe"
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

url = "https://www.xxxx.co.uk"
driver.get(url)

# at login page
# find username/email field and send the username itself to the input field
driver.find_element(By.ID, "username").send_keys(username)
# find password input field and insert password as well
driver.find_element(By.XPATH, "//input[@type='password']").send_keys(password)
# submit the login form through its submit button
driver.find_element(By.XPATH, "//input[@type='submit']").submit()

# visit each category page and parse the rendered HTML
for link in url_subcategorise_list:
    driver.get(link)
    get_url = driver.current_url
    print("The current url is:" + str(get_url))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup)

# driver.implicitly_wait(10)

# now leaving site: quit once, after the loop, so the session
# is still alive for every URL in the set
driver.quit()

Below is the part of the page HTML I need to select data from. I picked the nearest parent div that covers all the data I need:

<div class="md-modal md-effect-12" id="modal-promo"> 
    <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3> 
        <div> 
            <div class="dninfo pure-g"> 
                <div class="pure-u-1-2 img both"><img class="pure-img" 
                                                      src="/info/images/by-supplier/W72/6021.jpg"></div> 
                <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div> 
            </div> 
            <input class="md-close" type="button" value="Close"></div> 
    </div> 
</div>

I need to get the text out of the following elements:

<h3>75mm(3") NO.1 POZI S/DRIVER</h3> 
<div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>

The only unique identifier is the image path, /info/images/by-supplier/W72/6021.jpg
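Since the image path is the only unique handle, one possible approach is to locate the `<img>` by its `src`, walk up to the enclosing modal with `find_parent`, and read the `<h3>` and the `bullets` div from there. This is only a minimal sketch, assuming the real page matches the snippet above; the inline `html` string stands in for `driver.page_source`, and the helper name `modal_text_for_image` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; in the real script this
# would be driver.page_source (assumption: same structure on the live page).
html = '''
<div class="md-modal md-effect-12" id="modal-promo">
    <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
        <div>
            <div class="dninfo pure-g">
                <div class="pure-u-1-2 img both"><img class="pure-img"
                                                      src="/info/images/by-supplier/W72/6021.jpg"></div>
                <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
            </div>
            <input class="md-close" type="button" value="Close"></div>
    </div>
</div>
'''


def modal_text_for_image(soup, src):
    """Hypothetical helper: (heading, bullets) text of the modal holding the image."""
    img = soup.find("img", src=src)
    if img is None:
        return None
    # Climb from the image to the nearest ancestor div carrying the md-modal class.
    modal = img.find_parent("div", class_="md-modal")
    if modal is None:
        return None
    # From the parent, step back down to the two children of interest.
    heading = modal.find("h3")
    bullets = modal.find("div", class_="bullets")
    return (
        heading.get_text(strip=True) if heading else None,
        bullets.get_text(strip=True) if bullets else None,
    )


soup = BeautifulSoup(html, "html.parser")
result = modal_text_for_image(soup, "/info/images/by-supplier/W72/6021.jpg")
print(result)
```

Returning `None` when the image or its modal is missing keeps a loop over 1,000+ items from crashing on the first page that lacks the expected markup.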

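# My attempt so far. It collects nothing useful: tag.get('class') returns the
# full class list ['md-modal', 'md-effect-12'], so the equality test never matches.
divcollector = set()
links = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['md-effect-12'])
for link in links:
    divcollector.add(link.get("div"))  # .get() reads an attribute, so this is None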
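The lambda above compares against the whole class list, and BeautifulSoup reports `class` as a list of every class on the tag, so `['md-modal', 'md-effect-12']` never equals `['md-effect-12']`. In addition, `link.get("div")` looks up an attribute named `div` rather than a child element. A sketch of one alternative follows: `class_` matches when any single class matches, and the children are then searched from each parent. The sample `html` and the function name `collect_modals` are assumptions for illustration, not part of the original script.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source, trimmed from the HTML in the question.
html = '''
<div class="md-modal md-effect-12" id="modal-promo">
  <div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
    <div>
      <div class="dninfo pure-g">
        <div class="pure-u-1-2 img both"><img class="pure-img" src="/info/images/by-supplier/W72/6021.jpg"></div>
        <div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
      </div>
      <input class="md-close" type="button" value="Close"></div>
  </div>
</div>
'''


def collect_modals(soup):
    """Hypothetical helper: one dict per modal with its heading, bullets text and image path."""
    rows = []
    # class_="md-effect-12" matches any div that has this class among its classes.
    for modal in soup.find_all("div", class_="md-effect-12"):
        heading = modal.find("h3")
        bullets = modal.select_one("div.bullets")  # CSS selector, scoped to this parent
        img = modal.find("img")
        rows.append({
            "heading": heading.get_text(strip=True) if heading else None,
            "bullets": bullets.get_text(strip=True) if bullets else None,
            "image": img.get("src") if img else None,
        })
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = collect_modals(soup)
for row in rows:
    print(row)
```

Keeping the image path next to the text in each row means the unique identifier travels with the data, which may help when merging results from many pages.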

0 replies | as of 1 year ago
