
Web crawler cannot process more than one page

  •  3
  • Petris  · 7 years ago

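    I am trying to extract some information about MTG cards from a web page with the program below, but I keep retrieving the data of the initial page (InitUrl). The crawler never moves on to the following pages. I have started to believe that I am not using the correct URLs, or that there is some restriction in urllib that has escaped my attention. This is the code I have been struggling with for weeks: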

    import re
    from math import ceil
    from urllib.request import urlopen as uReq, Request
    from bs4 import BeautifulSoup as soup
    
    InitUrl = "https://mtgsingles.gr/search?q=dragon"
    NumOfCrawledPages = 0
    URL_Next = ""
    NumOfPages = 4   # depth of pages to be retrieved
    
    query = InitUrl.split("?")[1]
    
    
    for i in range(0, NumOfPages):
        if i == 0:
            Url = InitUrl
        else:
            Url = URL_Next
    
        print(Url)
    
        UClient = uReq(Url)  # downloading the url
        page_html = UClient.read()
        UClient.close()
    
        page_soup = soup(page_html, "html.parser")
    
        cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    
        for card in cards:
            card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
    
            if len(card.div.contents) > 3:
                cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
            else:
                cardP_T = "Does not exist"
    
            cardType = card.contents[3].text
            print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
    
        try:
            URL_Next = InitUrl + "&page=" + str(i + 2)
    
            print("The next URL is: " + URL_Next + "\n")
        except IndexError:
            print("Crawling process completed! No more infomation to retrieve!")
        else:
            NumOfCrawledPages += 1
            Url = URL_Next
        finally:
            print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
    
    2 answers  |  most recent 7 years ago
        1
  •  1
  •   jlaur    7 years ago

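    One reason your code fails is that you do not use cookies. The site seems to require them in order to allow paging.

    A clean and simple way to extract the data you are interested in looks like this: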

    import requests
    from bs4 import BeautifulSoup
    
    # the site actually uses this url under the hood for paging - check out Google Dev Tools
    paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
    return_list = []
    # the page-scroll will only work when we support cookies
    # so we fetch the page in a session
    session = requests.Session()
    session.get("https://mtgsingles.gr/")
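
    Because the question uses urllib rather than requests, the same idea can be sketched with the standard library: an opener built around a CookieJar stores the cookies from the first response and sends them back on later requests. This is a minimal sketch, assuming the site only needs the cookies set by its front page. The helper names make_cookie_opener and fetch are made up for illustration, and the User-Agent header is an assumption, not something the site is known to require.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores cookies and resends them later."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url):
    """Download url through the cookie-aware opener and return the decoded body."""
    # The User-Agent value is an assumption; some sites reject urllib's default.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")


# One opener is reused for every request so that cookies carry over.
opener, jar = make_cookie_opener()
```

    Calling fetch(opener, "https://mtgsingles.gr/") once before fetching the paging URLs would then play the same role as the session.get call above.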
    

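    Every page except the last one has a "next" button, so we can use that knowledge to keep looping until the button goes away. When it does, meaning the last page has been reached, the button is replaced by an "li" tag with the class "next hidden". That element only exists on the last page.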
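
    The stop condition described above can be tried in isolation before it is wired into a loop. The sketch below uses only the standard library's html.parser instead of BeautifulSoup. The two HTML fragments are invented for illustration, and the real markup of the site may differ.

```python
from html.parser import HTMLParser


class LastPageDetector(HTMLParser):
    """Sets is_last_page once an <li> with both 'next' and 'hidden' classes is seen."""

    def __init__(self):
        super().__init__()
        self.is_last_page = False

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if "next" in classes and "hidden" in classes:
            self.is_last_page = True


def is_last_page(html):
    """Return True if the pager in html carries the hidden 'next' item."""
    detector = LastPageDetector()
    detector.feed(html)
    return detector.is_last_page


# Hypothetical pager fragments, invented for illustration only.
MIDDLE_PAGE = '<ul><li class="next"><a href="?page=3">Next</a></li></ul>'
LAST_PAGE = '<ul><li class="next hidden"><a href="#">Next</a></li></ul>'

print(is_last_page(MIDDLE_PAGE))  # False
print(is_last_page(LAST_PAGE))    # True
```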

    Now we can start looping:

    page = 1 # set count for start page
    keep_paging = True # use flag to end loop when last page is reached
    while keep_paging:
        print("[*] Extracting data for page {}".format(page))
        r = session.get(paging_url.format(page))
        soup = BeautifulSoup(r.text, "html.parser")
        items = soup.select('.iso-item.item-row-view.clearfix')
        for item in items:
            name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
            toughness_element = item.find('div', class_='card-power-toughness')
            try:
                toughness = toughness_element.get_text().strip()
            except AttributeError:  # find() returned None, so the card has no power/toughness
                toughness = None
            cardtype = item.find('div', class_='cardtype').get_text()
            card_dict = {
                "name": name,
                "toughness": toughness,
                "cardtype": cardtype
            }
            return_list.append(card_dict)
    
        if soup.select('li.next.hidden'): # this element only exists if the last page is reached
            keep_paging = False
            print("[*] Scraper is done. Quitting...")
        else:
            page += 1
    
    # do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
    

    This keeps paging until no more pages exist, no matter how many subpages the site has.
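
    Once the loop has finished, return_list is a plain list of dicts, so it can also be written out without pandas. A minimal sketch using the standard library's csv module follows. The sample cards are invented for illustration, and the function name cards_to_csv is hypothetical.

```python
import csv
import io

FIELDNAMES = ["name", "toughness", "cardtype"]


def cards_to_csv(cards, stream):
    """Write a list of card dicts to stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    # The csv module writes None as an empty cell, so cards without
    # power/toughness need no special handling.
    writer.writerows(cards)


# Made-up sample data standing in for return_list.
sample_cards = [
    {"name": "Shivan Dragon", "toughness": "5/5", "cardtype": "Creature - Dragon"},
    {"name": "Dragon Fodder", "toughness": None, "cardtype": "Sorcery"},
]

buffer = io.StringIO()
cards_to_csv(sample_cards, buffer)
print(buffer.getvalue())
```

    To write a real file, open it with open("cards.csv", "w", newline="", encoding="utf-8") and pass the file object instead of the StringIO buffer.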

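    My point in the comment above was simply that if an exception is raised in your code, the page count is never incremented. That is probably not what you want, which is why I suggested reading up on how the whole try/except/else/finally construct behaves.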
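
    The order in which those clauses run can be checked with a small experiment. This is only an illustrative sketch: the function names run, ok and fails are made up, and the IndexError is raised by hand to mimic the exception the question's code tries to catch.

```python
def run(step):
    """Record which clauses execute when step() succeeds or raises IndexError."""
    trace = []
    try:
        step()
        trace.append("try")
    except IndexError:
        trace.append("except")
    else:
        # Runs only when the try block finished without an exception.
        trace.append("else")
    finally:
        # Runs in both cases.
        trace.append("finally")
    return trace


def ok():
    return None


def fails():
    raise IndexError("no more pages")


print(run(ok))     # ['try', 'else', 'finally']
print(run(fails))  # ['except', 'finally']
```

    A counter that is incremented in the else clause, like NumOfCrawledPages in the question, is therefore skipped whenever the except branch is taken, while the finally branch still prints its message.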

        2
  •  0
  •   Paul Würtz    7 years ago

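    I was baffled too, because every request returned the same response and ignored the page parameter. As a dirty workaround, you can set page-size to a number high enough to fetch all the items you need in a single request (this parameter works, for some reason):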

      import re
      from math import ceil
      import requests
      from bs4 import BeautifulSoup as soup
    
      InitUrl = Url = "https://mtgsingles.gr/search"
      NumOfCrawledPages = 0
      URL_Next = ""
      NumOfPages = 2   # depth of pages to be retrieved
    
      query = "dragon"
      cardSet=set()
    
      for i in range(1, NumOfPages):
          page_html = requests.get(InitUrl,params={"page":i,"q":query,"page-size":999})
          print(page_html.url)
          page_soup = soup(page_html.text, "html.parser")
    
          cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    
          for card in cards:
              card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
    
              if len(card.div.contents) > 3:
                  cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
              else:
                  cardP_T = "Does not exist"
    
              cardType = card.contents[3].text
              cardString=card_name + "\n" + cardP_T + "\n" + cardType + "\n"
              cardSet.add(cardString)
              print(cardString)
          NumOfCrawledPages += 1
          print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " +str(len(cards)) +"(cards)\n")
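
    To see which URL such a params dict produces without sending a request, the query string can be built with the standard library. This is a rough sketch of the encoding step only; the function name build_search_url is hypothetical, and the page-size parameter is taken from the answer above rather than from any documented API of the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://mtgsingles.gr/search"


def build_search_url(query, page=1, page_size=999):
    """Return the search URL with page, q and page-size encoded as a query string."""
    params = {"page": page, "q": query, "page-size": page_size}
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("dragon"))
# https://mtgsingles.gr/search?page=1&q=dragon&page-size=999
```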