
Pagination not working

  •  1
  •  raju  · 7 years ago

    I am trying to learn Scrapy.

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com/']
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            quotes = response.xpath('//*[@class="quote"]')
    
            for quote in quotes:
                text = quote.xpath(".//*[@class='text']/text()").extract_first()
                author = quote.xpath("//*[@itemprop='author']/text()").extract_first()
                tags = quote.xpath(".//*[@class='tag']/text()").extract();
    
                item = {
                    'author_name':author,
                    'text':text,
                    'tags':tags
                }
                yield item
        next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=absolute_next_page_url,callback=self.parse)
    

    But Scrapy only parses the first page. What is wrong with this code? I copied it from a YouTube tutorial.

    Please help.

    2 Answers  |  7 years ago
        1
  •  3
  •   alecxe    7 years ago

    It's just that every request except the very first one is being filtered out as "offsite". This is because of the extra / in your allowed_domains value:

    allowed_domains = ['quotes.toscrape.com/']
                        # REMOVE THIS SLASH^
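
    For reference, applying the fix leaves only the domain name, with no scheme, path, or trailing slash:

    allowed_domains = ['quotes.toscrape.com']

    With the slash removed, Scrapy's offsite filtering recognizes the follow-up requests to quotes.toscrape.com as on-site, so the pagination requests are no longer dropped.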
    
        2
  •  0
  •   Jalees Developer    5 years ago

    Remove or comment out allowed_domains. Optionally, remove the trailing semicolon from the tags line. Also, indent the following code so it sits inside the parse method:

    next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
    absolute_next_page_url = response.urljoin(next_page_url)
    yield scrapy.Request(url=absolute_next_page_url,callback=self.parse)
    

    The code then becomes:

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        #allowed_domains = ['quotes.toscrape.com/']
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            quotes = response.xpath('//*[@class="quote"]')
    
            for quote in quotes:
                text = quote.xpath(".//*[@class='text']/text()").extract_first()
                author = quote.xpath("//*[@itemprop='author']/text()").extract_first()
                tags = quote.xpath(".//*[@class='tag']/text()").extract()
    
                item = {
                    'author_name':author,
                    'text':text,
                    'tags':tags
                }
                yield item
    
            next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=absolute_next_page_url,callback=self.parse)
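
    One optional refinement that is not part of the original answer: on the last page there is no "next" link, so extract_first() returns None and the spider ends up re-requesting the current page (which Scrapy's duplicate filter then drops). A small guard, sketched here with the same XPath as above, makes the end of pagination explicit:

    next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
    if next_page_url:  # the last page has no "next" link, so stop following
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    You can then run the spider and save the scraped items, for example with scrapy crawl quotes -o quotes.json.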