
Scrape all links using BeautifulSoup

  •  2
  • DJG  · 2 years ago

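    I am trying to scrape all the match report links from the page, but the page has a "Load more" button and I don't want to use Selenium. Is there a way to collect all the links without Selenium? Thanks in advance.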

    Here is what I tried:

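    from bs4 import BeautifulSoup as bs
    import requests
    
    r = requests.get('https://www.iplt20.com/news/match-reports')
    soup = bs(r.text, 'lxml')
    
    # The class string must stay on one line; a line break inside the quotes is a syntax error.
    for match in soup.find_all('div', class_='latest-slider-wrap position-relative'):
        link = match.find('a')
        if link is not None:  # skip wrappers that contain no anchor
            print(link['href'])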
    
    1 answer  |  as of 2 years ago
  •  1
  •   Andrej Kesely    2 years ago

    Try:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.iplt20.com/news/match-reports"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    for a in soup.select("#div-match-report a:has(li)"):
        print(a["href"])
    

    Prints:

    https://www.iplt20.com/news/4014/tata-ipl-2024-match-11-lsg-vs-pbks-match-report
    https://www.iplt20.com/news/4012/tata-ipl-2024-match-10-rcb-vs-kkr-match-report
    https://www.iplt20.com/news/4011/tata-ipl-2024-match-09-rr-vs-dc-match-report
    https://www.iplt20.com/news/4009/tata-ipl-2024-match-08-srh-vs-mi-match-report
    https://www.iplt20.com/news/4007/tata-ipl-2024-match-07-csk-vs-gt-match-report
    https://www.iplt20.com/news/4006/tata-ipl-2024-match-06-rcb-vs-pbks-match-report
    https://www.iplt20.com/news/4004/tata-ipl-2024-match-05-gt-vs-mi-match-report
    https://www.iplt20.com/news/4003/tata-ipl-2024-match-04-rr-vs-lsg-match-report
    https://www.iplt20.com/news/4001/tata-ipl-2024-match-03-kkr-vs-srh-match-report
    https://www.iplt20.com/news/4000/tata-ipl-2024-match-02-pbks-vs-dc-match-report
    https://www.iplt20.com/news/3999/tata-ipl-2024-match-01-csk-vs-rcb-match-report
    https://www.iplt20.com/news/3976/tata-ipl-2023-final-csk-vs-gt-match-report
    https://www.iplt20.com/news/3974/tata-ipl-2023-qualifier-2-gt-vs-mi-match-report
    https://www.iplt20.com/news/3973/tata-ipl-2023-eliminator-lsg-vs-mi-match-report
    https://www.iplt20.com/news/3972/tata-ipl-2023-qualifier-1-gt-vs-csk-match-report
    https://www.iplt20.com/news/3971/tata-ipl-2023-match-70-rcb-vs-gt-match-report
    https://www.iplt20.com/news/3970/tata-ipl-2023-match-69-mi-vs-srh-match-report
    https://www.iplt20.com/news/3969/tata-ipl-2023-match-68-kkr-vs-lsg-match-report
    https://www.iplt20.com/news/3968/tata-ipl-2023-match-67-dc-vs-csk-match-report
    https://www.iplt20.com/news/3967/tata-ipl-2023-match-66-pbks-vs-rr-match-report
    https://www.iplt20.com/news/3966/tata-ipl-2023-match-65-srh-vs-rcb-match-report
    

    EDIT: To get all the links, you can use the site's Ajax pagination API:

    import requests
    
    api_url = "https://www.iplt20.com/add-more-match-report?page={page}&type=match-reports"
    
    for page in range(1, 4):  # <-- adjust number of pages here
        print(f"{page=}")
        data = requests.get(api_url.format(page=page)).json()
        for d in data["newsResponce"]["data"]:
            print(f'https://www.iplt20.com/news/{d["id"]}/{d["titleUrlSegment"]}')
    

    Prints:

    
    ...
    
    page=2
    https://www.iplt20.com/news/3964/tata-ipl-2023-match-64-pbks-vs-dc-match-report
    https://www.iplt20.com/news/3963/tata-ipl-2023-match-63-lsg-vs-mi-match-report
    https://www.iplt20.com/news/3962/tata-ipl-2023-match-62-gt-vs-srh-match-report
    https://www.iplt20.com/news/3960/tata-ipl-2023-match-61-csk-vs-kkr-match-report
    https://www.iplt20.com/news/3959/tata-ipl-2023-match-60-rr-vs-rcb-match-report
    https://www.iplt20.com/news/3958/tata-ipl-2023-match-59-dc-vs-pbks-match-report
    https://www.iplt20.com/news/3956/tata-ipl-2023-match-58-srh-vs-lsg-match-report
    https://www.iplt20.com/news/3955/tata-ipl-2023-match-57-mi-vs-gt-match-report
    https://www.iplt20.com/news/3953/tata-ipl-2023-match-56-kkr-vs-rr-match-report
    https://www.iplt20.com/news/3952/tata-ipl-2023-match-55-csk-vs-dc-match-report
    https://www.iplt20.com/news/3951/tata-ipl-2023-match-54-mi-vs-rcb-match-report
    https://www.iplt20.com/news/3947/tata-ipl-2023-match-53-kkr-vs-pbks-match-report
    https://www.iplt20.com/news/3946/tata-ipl-2023-match-52-rr-vs-srh-match-report
    https://www.iplt20.com/news/3945/tata-ipl-2023-match-51-gt-vs-lsg-match-report
    https://www.iplt20.com/news/3944/tata-ipl-2023-match-50-dc-vs-rcb-match-report
    https://www.iplt20.com/news/3943/tata-ipl-2023-match-49-csk-vs-mi-match-report
    https://www.iplt20.com/news/3942/tata-ipl-2023-match-48-rr-vs-gt-match-report
    https://www.iplt20.com/news/3940/tata-ipl-2023-match-47-srh-vs-kkr-match-report
    https://www.iplt20.com/news/3938/tata-ipl-2023-match-46-pbks-vs-mi-match-report
    https://www.iplt20.com/news/3937/tata-ipl-2023-match-45-lsg-vs-csk-match-report
    https://www.iplt20.com/news/3936/tata-ipl-2023-match-44-gt-vs-dc-match-report
    page=3
    https://www.iplt20.com/news/3934/tata-ipl-2023-match-43-lsg-vs-rcb-match-report
    https://www.iplt20.com/news/3932/tata-ipl-2023-match-42-mi-vs-rr-match-report
    https://www.iplt20.com/news/3931/tata-ipl-2023-match-41-csk-vs-pbks-match-report
    https://www.iplt20.com/news/3930/tata-ipl-2023-match-40-dc-vs-srh-match-report
    
    ...
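
    If you would rather not hard-code the number of pages, the loop above can be sketched as a generator that stops at the first empty page. This is only a sketch: it assumes the API returns an empty `data` list once there is nothing left to load, which I have not verified. The names `iter_report_links`, `fetch_page_http` and the `max_pages` safety cap are hypothetical helpers, not part of the site or of any library.

```python
from typing import Callable, Iterator

BASE_URL = "https://www.iplt20.com/news"
API_URL = "https://www.iplt20.com/add-more-match-report?page={page}&type=match-reports"


def iter_report_links(
    fetch_page: Callable[[int], dict], max_pages: int = 50
) -> Iterator[str]:
    """Yield report URLs page by page until a page comes back empty.

    fetch_page takes a page number and returns the decoded JSON payload.
    max_pages is a safety cap (hypothetical default) against endless loops.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        # "newsResponce" (sic) is the key used in the loop above.
        records = (payload.get("newsResponce") or {}).get("data") or []
        if not records:
            # Assumption: an empty page means there are no more reports.
            break
        for record in records:
            yield f'{BASE_URL}/{record["id"]}/{record["titleUrlSegment"]}'


def fetch_page_http(page: int) -> dict:
    """Fetch one page of the Ajax API over HTTP (needs the requests package)."""
    import requests  # imported lazily so the generator itself has no dependency

    response = requests.get(API_URL.format(page=page), timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()
```

    To run it against the site, iterate over `iter_report_links(fetch_page_http)` and print each link. Passing the fetch function in as an argument also lets you try the loop with canned data before making any requests.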