Why doesn't this Selenium web scraper return the whole website?

web-scraping selenium-webdriver python

Lukinator · 1 year ago

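For educational purposes, I am trying to write a web scraper with Selenium that displays stock market data from the Wall Street Journal. I want to get the number of advancing and declining issues from this page: https://www.wsj.com/market-data/stocks/us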

My web scraper looks like this:

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = "https://www.wsj.com/market-data/stocks/us"
driver.get(url)

time.sleep(10)

try:
    element = driver.find_element(By.XPATH, "/html/body/div/div/div/div/div[1]/div[2]/div/div[2]/table/tbody[1]/tr/td[2]")
    data = element.text
    print(f"Found data: {data}")
except Exception as e:
    print(f"Error: {e}")

driver.quit()

For example, if I try to find the number of advancing issues on the NYSE like this:

element = driver.find_element(By.XPATH, "/html/body/div/div/div/div/div[1]/div[2]/div/div[2]/table/tbody[1]/tr/td[2]")

I get the following error message:

Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div/div/div/div/div[1]/div[2]/div/div[2]/table/tbody[1]/tr/td[2]"}
  (Session info: chrome=131.0.6778.205); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

So I tried fetching the whole body to see whether I could reach the data at all:

element = driver.find_element(By.XPATH, "//body")

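However, I only received the full text of the header, the "Stock Indexes" section, and the "Stock News" section. I got nothing from the "Markets Diary" section or from the other sections further down the page. Increasing the wait time in time.sleep(10) did not change anything. Why am I not seeing the whole body?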

1 answer

SIGHUP · 1 year ago

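Selenium no longer requires user code to manage the browser driver: since version 4.6, the bundled Selenium Manager locates or downloads a matching driver automatically, so webdriver_manager is not needed (as the code below shows).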

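You also need to keep in mind that WebElement.text does not always reveal what you expect. The script below includes a useful utility function, etext, that handles this case.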
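To see the fallback in isolation, here is a minimal sketch that runs without a browser. WebElement.text returns only the text that is rendered as visible, so it can be an empty string for an element that is present in the DOM but not displayed, while the textContent property still holds the raw text. FakeElement is a hypothetical stand-in invented for this illustration; it is not part of Selenium and only mimics the two members that etext touches.

```python
from typing import Optional


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text: str, text_content: Optional[str]) -> None:
        self.text = text                    # what WebElement.text would return
        self._text_content = text_content   # what the textContent property holds

    def get_property(self, name: str) -> Optional[str]:
        # Only textContent is modelled; anything else is reported as missing.
        return self._text_content if name == "textContent" else None


def etext(e) -> str:
    # Prefer the rendered text; fall back to textContent when it is empty.
    if t := e.text:
        return t.strip()
    p = e.get_property("textContent")
    return p.strip() if isinstance(p, str) else ""


visible = FakeElement(" 509 ", " 509 ")
hidden = FakeElement("", " 2,284 ")
missing = FakeElement("", None)

print(repr(etext(visible)))
print(repr(etext(hidden)))
print(repr(etext(missing)))
```

The second call is the interesting one: the rendered text is empty, yet the function still returns the cell value from textContent. The answer's full script follows.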

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless=true")

URL = "https://www.wsj.com/market-data/stocks/us"

def etext(e: WebElement) -> str:
    if t := e.text:
        return t.strip()
    p = e.get_property("textContent")
    return p.strip() if isinstance(p, str) else ""

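# Used as a context manager, the driver quits the browser automatically on exit.
with webdriver.Chrome(options=OPTIONS) as driver:
    driver.get(URL)
    # Wait up to 10 seconds for each cell to appear in the DOM instead of sleeping.
    wait = WebDriverWait(driver, 10)
    ec = EC.presence_of_element_located
    # Advancing issues: first row of the table body
    selector = By.XPATH, "//*[@id='root']/div/div/div/div[2]/div[4]/div[1]/div[3]/table/tbody[2]/tr[1]/td[2]"
    # wait.until returns the located element or raises TimeoutException
    if td := wait.until(ec(selector)):
        print("Advancing", etext(td))
    # Declining issues: second row of the same table body
    selector = By.XPATH, "//*[@id='root']/div/div/div/div[2]/div[4]/div[1]/div[3]/table/tbody[2]/tr[2]/td[2]"
    if td := wait.until(ec(selector)):
        print("Declining", etext(td))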

Output:

Advancing 509
Declining 2,284
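
The values above are display strings, and the larger one carries a thousands separator. If you want to do arithmetic on them, they need converting first. A minimal sketch, assuming the cells always use a comma as the thousands separator as in the output shown; parse_count is a hypothetical helper name, not part of Selenium:

```python
def parse_count(text: str) -> int:
    """Convert a display string such as '2,284' into an int.

    Assumes a comma is the only thousands separator used.
    """
    return int(text.replace(",", "").strip())


advancing = parse_count("509")
declining = parse_count("2,284")

print(advancing, declining)
print("Net advancing:", advancing - declining)
```

With the figures from the output above, this prints 509 2284 and a net of -1775.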