
How do I automatically scrape the following CSV?

  •  1
  • Sparkles  ·  1 year ago

    Page

    On the page above, clicking "Download CSV" downloads the CSV file to your computer. I want to set up a process that downloads this CSV every night. I'd be just as happy to scrape the data directly, but the CSV seems easier. I haven't been able to find anything on this. Help!

    3 replies  |  last activity 1 year ago
        1
  •  1
  •   Noah    1 year ago
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import os
    
    # URL of the webpage
    url = "https://baseballsavant.mlb.com/leaderboard/custom?year=2024&type=batter&filter=&min=q&selections=pa%2Ck_percent%2Cbb_percent%2Cwoba%2Cxwoba%2Csweet_spot_percent%2Cbarrel_batted_rate%2Chard_hit_percent%2Cavg_best_speed%2Cavg_hyper_speed%2Cwhiff_percent%2Cswing_percent&chart=false&x=pa&y=pa&r=no&chartType=beeswarm&sort=xwoba&sortDir=desc"
    
    # Send a GET request to the webpage
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the link to the CSV file ('text=' is deprecated in newer
        # BeautifulSoup versions; use 'string=' instead). Note: if the page
        # renders the download button with JavaScript, requests/BeautifulSoup
        # will not see it, and this lookup will come back empty.
        csv_link_tag = soup.find('a', string='Download CSV')
        if csv_link_tag is None:
            raise SystemExit("Could not find a 'Download CSV' link on the page.")
        
        # Resolve a possibly relative href against the page URL
        csv_link = urljoin(url, csv_link_tag['href'])
        
        # Download the CSV file
        csv_response = requests.get(csv_link)
        
        # Check if the request was successful
        if csv_response.status_code == 200:
            # Specify the directory to save the CSV file
            save_dir = "/path/to/save/directory"
            
            # Create the directory if it doesn't exist
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            
            # Save the CSV file
            with open(os.path.join(save_dir, 'data.csv'), 'wb') as f:
                f.write(csv_response.content)
            
            print("CSV file downloaded successfully.")
        else:
            print("Failed to download CSV file.")
    else:
        print("Failed to retrieve webpage.")
    
        2
  •  0
  •   n1c9    1 year ago
    import requests
    
    def get_daily_stats(url):
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Referer': 'https://baseballsavant.mlb.com/leaderboard/custom?year=2024&type=batter&filter=&min=q&selections=pa%2Ck_percent%2Cbb_percent%2Cwoba%2Cxwoba%2Csweet_spot_percent%2Cbarrel_batted_rate%2Chard_hit_percent%2Cavg_best_speed%2Cavg_hyper_speed%2Cwhiff_percent%2Cswing_percent&chart=false&x=pa&y=pa&r=no&chartType=beeswarm&sort=xwoba&sortDir=desc'
        })
        # Fail loudly if the request was rejected
        response.raise_for_status()
        with open('daily_stats.csv', 'wb') as f:
            f.write(response.content)
    
    def main():
        url = 'https://baseballsavant.mlb.com/leaderboard/custom?year=2024&type=batter&filter=&min=q&selections=pa%2Ck_percent%2Cbb_percent%2Cwoba%2Cxwoba%2Csweet_spot_percent%2Cbarrel_batted_rate%2Chard_hit_percent%2Cavg_best_speed%2Cavg_hyper_speed%2Cwhiff_percent%2Cswing_percent&chart=false&x=pa&y=pa&r=no&chartType=beeswarm&sort=xwoba&sortDir=desc&csv=true'
        get_daily_stats(url)
    
    if __name__ == '__main__':
        main()
    

    This downloads the CSV for you and saves it as daily_stats.csv in the folder the script lives in. You'll need requests installed: python -m pip install requests. As for doing it at night, that depends on what works best for you: do you want to run it yourself each evening, or set up a process on your computer that runs it automatically?
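    For the "automatically every night" option, on Linux or macOS a cron entry is a common choice (the Python path, script path, and time below are placeholders for your own setup):

    ```shell
    # Open your crontab for editing:
    #   crontab -e
    # Then add a line like this to run the script every night at 02:00,
    # appending output to a log file:
    0 2 * * * /usr/bin/python3 /path/to/daily_stats.py >> /path/to/daily_stats.log 2>&1
    ```

    On Windows, Task Scheduler can run the same script on a daily trigger.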

    I imagine this will stop working in 2025, but you can change the year in the URL when that happens.
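    Since the year is baked into the URL, it can also be derived at run time so the script keeps working next season. A sketch: the csv=true behavior is taken from the answer above, and the query string is shortened here for readability, so keep your full selections list in practice.

    ```python
    import datetime

    # Shortened URL template (keep the full 'selections=' query string from the
    # answer above); per that answer, csv=true is what triggers the CSV download.
    BASE = ("https://baseballsavant.mlb.com/leaderboard/custom"
            "?year={year}&type=batter&filter=&min=q&csv=true")

    def build_url(year=None):
        # Default to the current calendar year
        if year is None:
            year = datetime.date.today().year
        return BASE.format(year=year)

    print(build_url(2024))
    # https://baseballsavant.mlb.com/leaderboard/custom?year=2024&type=batter&filter=&min=q&csv=true
    ```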

        3
  •  0
  •   user24131350    1 year ago
    import requests
    import datetime
    
    def download_csv(url, filename):
        response = requests.get(url)
        if response.status_code == 200:
            with open(filename, 'wb') as f:
                f.write(response.content)
            print(f"CSV file downloaded successfully as {filename}")
        else:
            print("Failed to download CSV file")
    
    if __name__ == "__main__":
        # URL of the webpage where the CSV file is located
        csv_url = "https://example.com/download/csv"
    
        # Filename to save the CSV file as (customize as needed)
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d")
        csv_filename = f"data_{timestamp}.csv"
    
        # Download the CSV file
        download_csv(csv_url, csv_filename)
    

    This defines a function (download_csv) that takes a URL and a filename as input. It uses the requests library to fetch the contents of the page and save them to the specified filename on your computer.
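    Once a file like data_2024-06-01.csv is on disk, a quick sanity check with the standard library's csv module can confirm it parsed. The column names below are illustrative, not the real leaderboard headers:

    ```python
    import csv
    import io

    # Stand-in for the contents of a downloaded file; a real file would be
    # opened with open(csv_filename, newline="") instead of io.StringIO.
    sample = "player,pa,xwoba\nDoe,300,0.400\n"

    rows = list(csv.DictReader(io.StringIO(sample)))
    print(len(rows), rows[0]["xwoba"])  # 1 0.400
    ```

    An empty or HTML error page saved as .csv will show up immediately here as zero rows or nonsense headers, which is worth checking in a nightly job.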