
Scrape all links using BeautifulSoup

beautifulsoup web-scraping selenium-webdriver python

DJG · 2 years ago

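I am trying to scrape all the match report links from a page, but the page has a "Load more" button and I don't want to use Selenium. Is there a way to collect all the links without Selenium? Thanks in advance.

Here is what I tried: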

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.iplt20.com/news/match-reports')
soup = bs(r.text, 'lxml')

# The class string must stay on one line; a bare line break inside the quotes is a syntax error.
for match in soup.find_all('div', class_='latest-slider-wrap position-relative'):
    links = match.find('a')
    print(links['href'])

1 answer | last activity 2 years ago

Andrej Kesely · 2 years ago

Try:

import requests
from bs4 import BeautifulSoup

url = "https://www.iplt20.com/news/match-reports"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("#div-match-report a:has(li)"):
    print(a["href"])

Prints:

https://www.iplt20.com/news/4014/tata-ipl-2024-match-11-lsg-vs-pbks-match-report
https://www.iplt20.com/news/4012/tata-ipl-2024-match-10-rcb-vs-kkr-match-report
https://www.iplt20.com/news/4011/tata-ipl-2024-match-09-rr-vs-dc-match-report
https://www.iplt20.com/news/4009/tata-ipl-2024-match-08-srh-vs-mi-match-report
https://www.iplt20.com/news/4007/tata-ipl-2024-match-07-csk-vs-gt-match-report
https://www.iplt20.com/news/4006/tata-ipl-2024-match-06-rcb-vs-pbks-match-report
https://www.iplt20.com/news/4004/tata-ipl-2024-match-05-gt-vs-mi-match-report
https://www.iplt20.com/news/4003/tata-ipl-2024-match-04-rr-vs-lsg-match-report
https://www.iplt20.com/news/4001/tata-ipl-2024-match-03-kkr-vs-srh-match-report
https://www.iplt20.com/news/4000/tata-ipl-2024-match-02-pbks-vs-dc-match-report
https://www.iplt20.com/news/3999/tata-ipl-2024-match-01-csk-vs-rcb-match-report
https://www.iplt20.com/news/3976/tata-ipl-2023-final-csk-vs-gt-match-report
https://www.iplt20.com/news/3974/tata-ipl-2023-qualifier-2-gt-vs-mi-match-report
https://www.iplt20.com/news/3973/tata-ipl-2023-eliminator-lsg-vs-mi-match-report
https://www.iplt20.com/news/3972/tata-ipl-2023-qualifier-1-gt-vs-csk-match-report
https://www.iplt20.com/news/3971/tata-ipl-2023-match-70-rcb-vs-gt-match-report
https://www.iplt20.com/news/3970/tata-ipl-2023-match-69-mi-vs-srh-match-report
https://www.iplt20.com/news/3969/tata-ipl-2023-match-68-kkr-vs-lsg-match-report
https://www.iplt20.com/news/3968/tata-ipl-2023-match-67-dc-vs-csk-match-report
https://www.iplt20.com/news/3967/tata-ipl-2023-match-66-pbks-vs-rr-match-report
https://www.iplt20.com/news/3966/tata-ipl-2023-match-65-srh-vs-rcb-match-report

EDIT: To get all the links, you can use their Ajax pagination API:

import requests

api_url = "https://www.iplt20.com/add-more-match-report?page={page}&type=match-reports"

for page in range(1, 4):  # <-- adjust number of pages here
    print(f"{page=}")
    data = requests.get(api_url.format(page=page)).json()
    for d in data["newsResponce"]["data"]:
        print(f'https://www.iplt20.com/news/{d["id"]}/{d["titleUrlSegment"]}')

Prints:


...

page=2
https://www.iplt20.com/news/3964/tata-ipl-2023-match-64-pbks-vs-dc-match-report
https://www.iplt20.com/news/3963/tata-ipl-2023-match-63-lsg-vs-mi-match-report
https://www.iplt20.com/news/3962/tata-ipl-2023-match-62-gt-vs-srh-match-report
https://www.iplt20.com/news/3960/tata-ipl-2023-match-61-csk-vs-kkr-match-report
https://www.iplt20.com/news/3959/tata-ipl-2023-match-60-rr-vs-rcb-match-report
https://www.iplt20.com/news/3958/tata-ipl-2023-match-59-dc-vs-pbks-match-report
https://www.iplt20.com/news/3956/tata-ipl-2023-match-58-srh-vs-lsg-match-report
https://www.iplt20.com/news/3955/tata-ipl-2023-match-57-mi-vs-gt-match-report
https://www.iplt20.com/news/3953/tata-ipl-2023-match-56-kkr-vs-rr-match-report
https://www.iplt20.com/news/3952/tata-ipl-2023-match-55-csk-vs-dc-match-report
https://www.iplt20.com/news/3951/tata-ipl-2023-match-54-mi-vs-rcb-match-report
https://www.iplt20.com/news/3947/tata-ipl-2023-match-53-kkr-vs-pbks-match-report
https://www.iplt20.com/news/3946/tata-ipl-2023-match-52-rr-vs-srh-match-report
https://www.iplt20.com/news/3945/tata-ipl-2023-match-51-gt-vs-lsg-match-report
https://www.iplt20.com/news/3944/tata-ipl-2023-match-50-dc-vs-rcb-match-report
https://www.iplt20.com/news/3943/tata-ipl-2023-match-49-csk-vs-mi-match-report
https://www.iplt20.com/news/3942/tata-ipl-2023-match-48-rr-vs-gt-match-report
https://www.iplt20.com/news/3940/tata-ipl-2023-match-47-srh-vs-kkr-match-report
https://www.iplt20.com/news/3938/tata-ipl-2023-match-46-pbks-vs-mi-match-report
https://www.iplt20.com/news/3937/tata-ipl-2023-match-45-lsg-vs-csk-match-report
https://www.iplt20.com/news/3936/tata-ipl-2023-match-44-gt-vs-dc-match-report
page=3
https://www.iplt20.com/news/3934/tata-ipl-2023-match-43-lsg-vs-rcb-match-report
https://www.iplt20.com/news/3932/tata-ipl-2023-match-42-mi-vs-rr-match-report
https://www.iplt20.com/news/3931/tata-ipl-2023-match-41-csk-vs-pbks-match-report
https://www.iplt20.com/news/3930/tata-ipl-2023-match-40-dc-vs-srh-match-report

...
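
If you would rather not hard-code the number of pages, the loop over the Ajax API can stop by itself once a page comes back empty. The following is only a sketch under assumptions: it assumes the endpoint returns an empty `data` list after the last page (not verified here), the names `fetch_page` and `collect_links` are made up for illustration, and `urllib` stands in for `requests` only to keep the example free of third-party imports.

```python
import json
from urllib.request import urlopen

API_URL = "https://www.iplt20.com/add-more-match-report?page={page}&type=match-reports"


def fetch_page(page):
    """Download one page of the Ajax API and return the decoded JSON (hypothetical helper)."""
    with urlopen(API_URL.format(page=page), timeout=10) as resp:
        return json.load(resp)


def collect_links(fetch=fetch_page, max_pages=50):
    """Collect report links page by page until a page has no items.

    `max_pages` is a safety cap so the loop always terminates, even if the
    empty-page assumption turns out to be wrong for this site.
    """
    links = []
    for page in range(1, max_pages + 1):
        data = fetch(page)
        # "newsResponce" is the key used by the API in the answer above.
        items = (data.get("newsResponce") or {}).get("data") or []
        if not items:  # assumption: an empty page means there is nothing more
            break
        for d in items:
            links.append(f'https://www.iplt20.com/news/{d["id"]}/{d["titleUrlSegment"]}')
    return links
```

Calling `collect_links()` with no arguments would query the live site; passing your own `fetch` function (for example one built on `requests.get(...).json()`, as in the answer) lets you reuse the same loop with a different HTTP client or test it offline.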