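I'm a bit stuck. I have the for loop below and have already spent several days trying to get it to work. I've been working in the evenings on pulling data out of the HTML shown at the bottom. Doing it by hand is not an option, because there will be more than 1,000 items to collect.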
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep
from random import randint
from selenium.webdriver.common.by import By
# login credentials
username="xxxxxxx"
password="xxxxxxx"
url_subcategorise_list = {
    "https://www./search/all-items?id=C99"
}
# url_subcategorise_list.add("")
# print(url_subcategorise_list)
# initialize the Chrome driver
chromedriver_path="C:\\Users\\xxx\\projects\\web_scrapper\\web_scrapper.exe"
service=Service(chromedriver_path)
driver=webdriver.Chrome(service=service)
url="https://www.xxxx.co.uk"
driver.get(url)
# at login page
# find username/email field and send the username itself to the input field
driver.find_element(By.ID, "username").send_keys(username)
# find password input field and insert password as well
driver.find_element(By.XPATH, "//input[@type='password']").send_keys(password)
# click login button
driver.find_element(By.XPATH, "//input[@type='submit']").submit()
# start driver get
# driver=webdriver.Chrome(service=service)
for link in url_subcategorise_list:
    driver.get(link)
    get_url=driver.current_url
    print("The current url is:"+str(get_url))
    soup=BeautifulSoup(driver.page_source, 'html.parser')
    print(soup)
driver.quit()
# driver.implicitly_wait(10)
# now leaving site
# driver.close()
Below is the part of the page HTML I need to select data from. I picked the next parent div up, since it covers all the data I need:
<div class="md-modal md-effect-12" id="modal-promo">
<div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
<div>
<div class="dninfo pure-g">
<div class="pure-u-1-2 img both"><img class="pure-img"
src="/info/images/by-supplier/W72/6021.jpg"></div>
<div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
</div>
<input class="md-close" type="button" value="Close"></div>
</div>
</div>
I need to get the data from the following text:
<h3>75mm(3") NO.1 POZI S/DRIVER</h3>
<div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
The only unique identifier is /info/images/by-supplier/W72/6021.jpg
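divcollector = set()
# "class" is multi-valued, so test membership; link.get("div") would only look up an attribute named "div"
links = soup.find_all(lambda tag: tag.name == 'div' and 'md-effect-12' in tag.get('class', []))
for link in links:
    if link.h3 is not None:
        divcollector.add(link.h3.get_text(strip=True))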
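Since the image path is the only unique identifier, one possible approach is to find the img tag by its src and then walk up to the enclosing modal div. The following is a minimal sketch run against the sample markup above rather than the live site. The names SAMPLE_HTML and extract_product are made up for illustration, and it assumes every product modal carries the md-modal class and contains one h3 and one bullets div, which may not hold for every page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; in the real script this
# would be driver.page_source instead.
SAMPLE_HTML = '''
<div class="md-modal md-effect-12" id="modal-promo">
<div class="md-content"><h3>75mm(3") NO.1 POZI S/DRIVER</h3>
<div>
<div class="dninfo pure-g">
<div class="pure-u-1-2 img both"><img class="pure-img"
src="/info/images/by-supplier/W72/6021.jpg"></div>
<div class="bullets pure-u-1-2">75mm(3") NO.1 POZI S/DRIVER</div>
</div>
<input class="md-close" type="button" value="Close"></div>
</div>
</div>
'''


def extract_product(html, image_src):
    """Return the heading and bullets text of the modal that contains
    the image with the given src, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")

    # Locate the image by its src attribute (the unique identifier).
    img = soup.find("img", src=image_src)
    if img is None:
        return None

    # Walk up to the enclosing modal; class_ matches a single class
    # even when the tag carries several (assumed: "md-modal").
    modal = img.find_parent("div", class_="md-modal")
    if modal is None:
        return None

    heading = modal.find("h3")
    bullets = modal.find("div", class_="bullets")
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "bullets": bullets.get_text(strip=True) if bullets else None,
    }


result = extract_product(SAMPLE_HTML, "/info/images/by-supplier/W72/6021.jpg")
print(result)
```

Inside the for loop, the same function could presumably be called with driver.page_source in place of SAMPLE_HTML, collecting each returned dict into a list.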