
Scraping an HTML page after clicking a div tag with BeautifulSoup

web-crawler beautifulsoup html python javascript

Dinosaur · Tech Community · 11 months ago

I'm having some trouble scraping the questions and answers from this website:

https://tech12h.com/bai-hoc/trac-nghiem-lich-su-12-bai-1-su-hinh-thanh-trat-tu-gioi-moi-sau-chien-tranh-gioi-thu-hai

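The problem is that the answers only appear after I click a div labeled "Xem đáp án" (scroll down to the end of the page). It is not a link, just a div, and I assume a JavaScript event handler renders the content once the div is clicked.

How can I handle this with BeautifulSoup? I ran into a driver conflict problem when using Selenium on Ubuntu.

Thank you very much.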

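2 replies | most recent 11 months ago

SIGHUP 11 months ago

You don't need to download the Chrome driver explicitly. Modern Selenium versions (4.6 and later, which ship with Selenium Manager) can handle that for you.

So all you need is: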

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from collections.abc import Iterable

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless=true")

URL = "https://tech12h.com/bai-hoc/trac-nghiem-lich-su-12-bai-1-su-hinh-thanh-trat-tu-gioi-moi-sau-chien-tranh-gioi-thu-hai"

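# CSS selectors for the question paragraphs and the answer-option list items
CSS_P = "#accordionExample > p"
CSS_L = "#accordionExample > ul li"

def etext(e: WebElement) -> str:
    """Return the element's text, falling back to textContent when .text is empty."""
    # .text only returns text that is visible on the page
    if t := e.text:
        return t.strip()
    # textContent also covers elements that are present in the DOM but not displayed
    p = e.get_property("textContent")
    if isinstance(p, str):
        return p.strip()
    return ""

def get_common(driver: webdriver.Chrome, css: str) -> Iterable[str]:
    """Wait up to 10 seconds for the elements matching css, then yield their text."""
    wait = WebDriverWait(driver, 10)
    ec = EC.presence_of_all_elements_located
    selector = By.CSS_SELECTOR, css
    yield from map(etext, wait.until(ec(selector)))

def chunks(items: list[str], chunk=4) -> Iterable[list[str]]:
    """Yield consecutive slices of items, each at most chunk elements long."""
    for i in range(0, len(items), chunk):
        yield items[i : i + chunk]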

if __name__ == "__main__":
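    # Pass options by keyword: in older Selenium 4 releases the first positional argument was the driver path
    with webdriver.Chrome(options=OPTIONS) as driver: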
        driver.get(URL)
        questions = list(get_common(driver, CSS_P))
        answers = list(get_common(driver, CSS_L))
        assert len(questions) * 4 == len(answers)
        qanda = dict(zip(questions, chunks(answers)))
        for k, v in qanda.items():
            print(k)
            for a in v:
                print(f"\t{a}")

Output (partial):

Câu 1: Để kết thúc nhanh chiến tranh ở châu Âu và châu Á - Thái Bình - Dương, ba cường quốc đã thống nhất mục đích gì?
        A. Sử dụng bom nguyên tử để tiêu diệt phát xít Nhật.
        B. Hồng quân Liên Xô nhanh chóng tấn công vào tận sào huyệt của phát xít Đức ở Bec-lin.
        C. Tiêu diệt tận gốc chủ nghĩa phát xít Đức và quân phiệt Nhật.
        D. Tất cả các mục đích trên.
Câu 2: Sự kiện nào dẫn đến thành lập nước Cộng hòa Liên bang Đức?
        A. Nước Đức được hòan toàn thống nhất.
        B. Nước Đức đã tiêu diệt tận gốc chủ nghĩa phát xít.
        C. Mĩ, Anh, Pháp hợp nhất các vùng chiếm đóng.
        D. Tất cả các sự kiện trên.
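
The step that pairs each question with its answer options can be tried without a browser. The following is a minimal sketch, assuming (as the script's assert does) that every question has exactly four options listed in page order. The helper name pair_up and the sample strings are hypothetical and only serve as illustration.

```python
from collections.abc import Iterator


def chunks(items: list[str], chunk: int = 4) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each at most chunk elements long."""
    for i in range(0, len(items), chunk):
        yield items[i : i + chunk]


def pair_up(
    questions: list[str], answers: list[str], per_question: int = 4
) -> dict[str, list[str]]:
    """Map each question to its block of answer options, in page order.

    pair_up is a hypothetical helper; it raises instead of asserting so the
    check still runs when Python is started with -O.
    """
    if len(questions) * per_question != len(answers):
        raise ValueError(
            f"expected {len(questions) * per_question} answers, got {len(answers)}"
        )
    return dict(zip(questions, chunks(answers, per_question)))


# Hypothetical sample data standing in for the scraped text
sample_questions = ["Q1", "Q2"]
sample_answers = [
    "A. a1", "B. b1", "C. c1", "D. d1",
    "A. a2", "B. b2", "C. c2", "D. d2",
]

qanda = pair_up(sample_questions, sample_answers)
for question, options in qanda.items():
    print(question)
    for option in options:
        print(f"\t{option}")
```

Because the result is a dict keyed by the question text, two questions with identical wording would collapse into one entry; a list of (question, options) tuples avoids that if it matters.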


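Samu Németh Dinasour 11 months ago

I finally solved the problem!

The cause was a mismatch between the Chrome application version and the Chrome driver version (located in usr/local/bin).

I found this repository, which contains the driver for my Chrome application version: https://github.com/dreamshao/chromedriver

I downloaded the version that fits my setup and put it in the usr/local/bin folder.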
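
A quick way to spot this kind of mismatch is to compare the major version numbers that the browser and the driver report. The following is a rough sketch, not a definitive check: the binary names google-chrome and chromedriver are assumptions that may differ on your system, and the sample version strings in the comments are only illustrative.

```python
import re
import shutil
import subprocess
from typing import Optional


def major_version(version_output: str) -> Optional[int]:
    """Return the major number of the first dotted version in the text, if any."""
    match = re.search(r"(\d+)\.\d+(?:\.\d+)*", version_output)
    return int(match.group(1)) if match else None


def versions_match(browser_output: str, driver_output: str) -> bool:
    """True when both outputs contain a version and the major numbers agree."""
    browser = major_version(browser_output)
    driver = major_version(driver_output)
    return browser is not None and browser == driver


def binary_major(binary: str) -> Optional[int]:
    """Run `<binary> --version` and parse its major version; None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return major_version(result.stdout)


if __name__ == "__main__":
    # Assumed binary names; adjust them to whatever is installed on your PATH.
    browser_major = binary_major("google-chrome")
    driver_major = binary_major("chromedriver")
    print(f"browser major: {browser_major}, driver major: {driver_major}")
    if browser_major is None or driver_major is None:
        print("could not determine one of the versions")
    elif browser_major != driver_major:
        print("major versions differ; the driver probably needs replacing")
```

The parsing helpers can also be used on their own: versions_match("Google Chrome 120.0.6099.109", "ChromeDriver 114.0.5735.90") evaluates to False because the major numbers 120 and 114 differ.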