
Inserting data into a CSV using Beautiful Soup and Selenium

  •  0
  • clattenburg cake  · 3 years ago

    I'm using BeautifulSoup and Selenium in Python to scrape some data. This is the code I run against the URL https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10 :

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup
    
    DRIVER_PATH = '$PATH/chromedriver.exe'
    
    options = Options()
    options.headless = True
    options.add_argument("--window-size=1920,1200")
    
    driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
    
    class_name = "matchHistoryRow__dartThrows"
    
    def write_to_output(url):  
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        print(soup.find_all("div", {"class": class_name}))
        return
    

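    This is the pattern I'm trying to match: I want to get the pair of spans on either side of each colon and put them in separate columns of a CSV. The problem is that the class is sometimes on the span before the colon and sometimes on the span after it, so I don't know how to approach this. For example: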

    <div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
            <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
                class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
                title="140+ thrown">140+</span></span>, <span><span
                class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
    

    I'd like the result represented in the CSV like this:

    player_1_score,player_2_score
    321,501
    321,361
    224,361
    

    What is the best way to do this?

        1
  •  2
  •   Andrej Kesely    3 years ago

    You can parse the scores with a regex (if the text is structured this consistently, it is the simplest approach):

    import re
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    html_doc = """
    <div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
            <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
                class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
                title="140+ thrown">140+</span></span>, <span><span
                class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
    """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    # 1. parse whole text from a row
    txt = soup.select_one(".matchHistoryRow__dartThrows").get_text(
        strip=True, separator=" "
    )
    
    # 2. find scores with regex
    scores = re.findall(r"(\d+)\s+:\s+(\d+)", txt)
    
    # 3. create dataframe from regex
    df = pd.DataFrame(scores, columns=["player_1_score", "player_2_score"])
    print(df)
    df.to_csv("data.csv", index=False)
    

    Prints:

      player_1_score player_2_score
    0            321            501
    1            321            361
    2            224            361
    

    This creates data.csv.
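
    If pandas is not available, the same regex step can feed Python's standard csv module instead. The sketch below is only an illustration: the helper names extract_scores and write_scores_csv are hypothetical, and the pattern uses \s* rather than \s+ on the assumption that the row text might also arrive without spaces around the colon.

```python
import csv
import io
import re

# Same idea as the regex above; \s* also accepts "321:501" with no spaces
# (an assumption about how the row text might have been extracted).
SCORE_PAIR = re.compile(r"(\d+)\s*:\s*(\d+)")


def extract_scores(row_text):
    """Return (player_1_score, player_2_score) string pairs found in row_text."""
    return SCORE_PAIR.findall(row_text)


def write_scores_csv(row_texts, fileobj):
    """Write a header line, then one CSV line per score pair."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["player_1_score", "player_2_score"])
    for text in row_texts:
        writer.writerows(extract_scores(text))


# Text as produced by get_text(strip=True, separator=" ") on the sample row.
sample = "321 : 501 180 , 321 : 361 140+ , 224 : 361"
buffer = io.StringIO()
write_scores_csv([sample], buffer)
print(buffer.getvalue())
```

    To write to disk instead of an in-memory buffer, pass a file opened with open("data.csv", "w", newline="") as the second argument.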


    Another approach, without using re :

    scores = [
        s.get_text(strip=True)
        for s in soup.select(
            ".matchHistoryRow__dartThrows > span > span:nth-of-type(1), .matchHistoryRow__dartThrows > span > span:nth-of-type(2)"
        )
    ]
    
    df = pd.DataFrame(
        {"player_1_score": scores[::2], "player_2_score": scores[1::2]}
    )
    
    print(df)
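
    Since the question is about the class appearing on either side of the colon, a position-based walk may make the idea more explicit: take the first two child spans of each throw regardless of class, and record separately which side carries matchHistoryRow__dartServis. This is a minimal sketch, not part of the answer above; parse_throws, marked_side and SERVE_CLASS are hypothetical names, and what the class actually signifies on the site is not confirmed here.

```python
from bs4 import BeautifulSoup

SERVE_CLASS = "matchHistoryRow__dartServis"  # class name taken from the question

html_doc = """
<div class="matchHistoryRow__dartThrows"><span><span class="matchHistoryRow__dartServis">321</span>:<span>501</span>
        <span class="dartType dartType__180" title="180 thrown">180</span></span>, <span><span>321</span>:<span
            class="matchHistoryRow__dartServis">361</span><span class="dartType dartType__140"
            title="140+ thrown">140+</span></span>, <span><span
            class="matchHistoryRow__dartServis">224</span>:<span>361</span></span></div>
"""


def marked_side(first, second):
    """Return 1 or 2 for the span that carries SERVE_CLASS, or None if neither does."""
    if SERVE_CLASS in first.get("class", []):
        return 1
    if SERVE_CLASS in second.get("class", []):
        return 2
    return None


def parse_throws(html):
    """Return (player_1_score, player_2_score, marked_side) for every throw."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select("div.matchHistoryRow__dartThrows"):
        # Each direct child span of the row is one "score:score" throw.
        for throw in row.find_all("span", recursive=False):
            cells = throw.find_all("span", recursive=False)
            if len(cells) < 2:
                continue  # not a score pair
            first, second = cells[0], cells[1]
            rows.append(
                (
                    first.get_text(strip=True),
                    second.get_text(strip=True),
                    marked_side(first, second),
                )
            )
    return rows


rows = parse_throws(html_doc)
print(rows)
```

    The score columns depend only on position, so the class can move between the two spans without affecting them.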
    
        2
  •  1
  •   undetected Selenium    3 years ago

    For player 1's score you need span:first-child, and for player 2's score you need span:nth-child(2). So you can use the following solution:

    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # `driver` is the webdriver.Chrome instance created in the question's code
    driver.get('https://www.flashscore.co.uk/match/YwbnUyDn/#/match-summary/point-by-point/10')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    player_1_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:first-child")))[:3]]
    player_2_scores = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.matchHistoryRow__dartThrows span span:nth-child(2)")))[:3]]
    df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
    print(df)
    

    Console output:

      player_1_score player_2_score
    0            501            321
    1            361            321
    2            361            181
    

    Writing to CSV:

    df = pd.DataFrame(data=list(zip(player_1_scores, player_2_scores)), columns=['player_1_score', 'player_2_score'])
    df.to_csv("my_data.csv", index=False)
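
    One thing to watch with this approach is that the two lists are collected by separate selectors, and zip() silently stops at the shorter list if one selector matches fewer elements than the other. A small guard can surface that before the CSV is written. This is a sketch only; pair_scores is a hypothetical helper, and the sample values are the ones from the console output above.

```python
def pair_scores(player_1_scores, player_2_scores):
    """Pair the two score lists, refusing to drop rows if their lengths differ."""
    if len(player_1_scores) != len(player_2_scores):
        raise ValueError(
            f"score lists differ in length: "
            f"{len(player_1_scores)} vs {len(player_2_scores)}"
        )
    return list(zip(player_1_scores, player_2_scores))


# Values copied from the console output above.
pairs = pair_scores(["501", "361", "361"], ["321", "321", "181"])
print(pairs)
```

    The returned list can be passed to pd.DataFrame(data=..., columns=[...]) exactly as in the code above.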
    
