
Insert data into a CSV using Beautiful Soup and Selenium

beautifulsoup web-scraping selenium python

clattenburg cake · 3 years ago

I'm using beautifulsoup and selenium to scrape some data in Python. This is the code I'm running against the URL https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10 :

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

DRIVER_PATH = '$PATH/chromedriver.exe'

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

class_name = "matchHistoryRow__dartThrows"

def write_to_output(url):  
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.find_all("div", {"class": class_name}))
    return

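This is the pattern I'm working with. I want to get each pair of spans on either side of a colon and put them in separate columns of a CSV. The problem is that the class is sometimes on the span before the colon and sometimes on the one after it, so I don't know how to do this. For example: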

<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>

I would like this represented in a CSV as:

player_1_score,player_2_score
321,501
321,361
224,361

What is the best way to do this?


Andrej Kesely · 3 years ago

You can parse the scores with a regex (if the text is structured consistently, this is the simplest approach):

import re
import pandas as pd
from bs4 import BeautifulSoup


html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# 1. parse whole text from a row
txt = soup.select_one(".matchHistoryRow__dartThrows").get_text(
    strip=True, separator=" "
)

# 2. find scores with regex
scores = re.findall(r"(\d+)\s+:\s+(\d+)", txt)

# 3. create dataframe from regex
df = pd.DataFrame(scores, columns=["player_1_score", "player_2_score"])
print(df)
df.to_csv("data.csv", index=False)

Prints:

  player_1_score player_2_score
0            321            501
1            321            361
2            224            361

It also saves data.csv.
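To see why the regex step picks up only the score pairs, here is a minimal sketch that runs on a plain string. The string `txt` is an assumption: it is roughly what `get_text(strip=True, separator=" ")` should yield for the sample row in the question. The pattern uses `\s*` rather than `\s+`, a slightly more tolerant variant of the one above, so it also accepts `321:501` with no spaces.

```python
import re

# Assumed flattened text for the sample row (not fetched from the live page).
txt = "321 : 501 180 , 321 : 361 140+ , 224 : 361"

# A score pair is "digits, colon, digits". Throw labels such as "180" or
# "140+" are not next to a colon, so they never form a pair.
PAIR = re.compile(r"(\d+)\s*:\s*(\d+)")

pairs = PAIR.findall(txt)
for player_1, player_2 in pairs:
    print(player_1, player_2)
```

The separator in the `get_text` call matters here: without it, a throw label would be glued directly onto the second score (for example `501` followed immediately by `180`), and the regex would read them as one number.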

Another approach, without using re :

scores = [
    s.get_text(strip=True)
    for s in soup.select(
        ".matchHistoryRow__dartThrows > span > span:nth-of-type(1), .matchHistoryRow__dartThrows > span > span:nth-of-type(2)"
    )
]

df = pd.DataFrame(
    {"player_1_score": scores[::2], "player_2_score": scores[1::2]}
)

print(df)
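The slicing above works because the selector returns the spans in document order, so the flat list alternates between the two players. A minimal sketch of that pairing step, assuming the list holds the values from the sample row, and writing the CSV with the standard `csv` module instead of pandas (a swapped-in alternative, not part of the answer). The `io.StringIO` buffer is only a stand-in for a real file.

```python
import csv
import io

# Flat list in document order (assumed values from the sample row).
scores = ["321", "501", "321", "361", "224", "361"]

# Guard: an odd count means some row was missing one of its two score spans.
if len(scores) % 2:
    raise ValueError("expected an even number of scores")

# Even positions belong to player 1, odd positions to player 2.
rows = list(zip(scores[::2], scores[1::2]))

# Stand-in for: open("data.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["player_1_score", "player_2_score"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The length check is worth keeping when scraping a live page, since a single missing span would otherwise shift every later score into the wrong column.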

undetected Selenium · 3 years ago

Using Selenium and css-selectors: for the player 1 score you need span:first-child, and for the player 2 score you need span:nth-child(2). So you can use the following solution:

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# `driver` is the webdriver.Chrome instance created in the question
driver.get('https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
# take the first three score pairs once the elements are visible
player_1_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:first-child")))[:3]]
player_2_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:nth-child(2)")))[:3]]
df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
print(df)

Console output:

  player_1_score player_2_score
0            501            321
1            361            321
2            361            181

Writing to CSV:

df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
df.to_csv("my_data.csv", index=False)
