
Scraping a Page within a Page sometimes doesn't enter the second Page

scrapy web-scraping python


Toleo · 7 years ago

The spider I'm using is:

import scrapy

questions = {}

class SovSpider(scrapy.Spider):
    name = 'StackOverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        for link in response.css('a.question-hyperlink::attr(href)').extract():

            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)

            yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)


    def parse_questions(self, response):
        questions["title"] = response.css('a.question-hyperlink::text').extract_first()
        questions["user"] = response.css('.user-details a::text').extract_first()

        yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()), callback=self.parse_user)

        yield questions

    def parse_user(self, response):
        questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()

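I'm trying to practice scraping a page and then following a URL found on that same page to scrape a second page [Page1(Scraped) -[Page1[Url-Inside]]> Page2(Scrape)]

What the spider is supposed to do:

Scrape the question URLs from the Questions page
Scrape the question title from each of those question pages
Scrape the user reputation from the user page, using the URL scraped from the question page

For example, my question here should result in: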

{"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo", "user_reputation": 455}

The problem is that about 3/4 of the time only the parse_questions part is returned, like this:

{"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo"}

Only sometimes does the complete item come back. What is going wrong here?

1 Answer | 7 years ago


mxmn · 7 years ago

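The problem is that you yield the parse_user request at the same time as you yield questions. Items and requests are handled by different parts of Scrapy (items go to the item pipelines, requests to the scheduler and downloader), so they are not processed one after the other. The item is usually exported before parse_user has had a chance to add user_reputation.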
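To make the ordering issue concrete, here is a toy model in plain Python. It is not Scrapy's engine: the run function is a hypothetical scheduler that queues requests for later and exports items the moment they are yielded, and the title and reputation values are placeholder data.

```python
from collections import deque


def parse_user(item):
    # Runs later, when the queued "request" is finally processed.
    item["user_reputation"] = 455
    return []  # yields nothing, like parse_user in the question


def parse_questions():
    item = {"title": "Some question", "user": "Toleo"}
    # A request is only scheduled here; its callback does not run yet.
    yield ("request", parse_user, item)
    # The item is handed over immediately, before parse_user has run.
    yield ("item", item)


def run(start_callback):
    """Toy scheduler (an assumption for illustration, not Scrapy internals)."""
    pending = deque()
    exported = []
    for kind, *payload in start_callback():
        if kind == "request":
            pending.append(payload)            # [callback, argument], run later
        else:
            exported.append(dict(payload[0]))  # snapshot taken at export time
    while pending:
        callback, argument = pending.popleft()
        callback(argument)
    return exported


exported = run(parse_questions)
print(exported)
```

In this model the exported item never contains user_reputation. In a real crawl the relative timing is not fixed, which would be consistent with the reputation showing up only some of the time.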

It is better to pass the partially filled questions dict to parse_user through the request's meta, and to yield questions only inside parse_user:

def parse_questions(self, response):
    # Build a fresh dict for each question page instead of sharing a global.
    questions = {}

    questions["title"] = response.css('a.question-hyperlink::text').extract_first()
    questions["user"] = response.css('.user-details a::text').extract_first()

    # Hand the partial item to parse_user through the request's meta.
    yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()),
                         callback=self.parse_user,
                         meta={'questions': questions})

def parse_user(self, response):
    # Retrieve the partial item, complete it, and only then yield it.
    questions = response.meta.get('questions')
    questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
    yield questions

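You should also create a new questions dict on each call to parse_questions, as above. It is not supposed to be a global variable, because every callback would then write into the same object.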
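A small plain-Python illustration of why a shared module-level dict is risky when several responses are in flight. The function names and sample data are hypothetical. The point is only that a second call overwrites what the first one stored before the first item is finished.

```python
shared = {}  # module-level dict, like the global questions in the original spider


def start_with_global(title, user):
    # Every call writes into the same dict object.
    shared["title"] = title
    shared["user"] = user
    return shared


def start_with_local(title, user):
    # Every call builds its own dict, as in the corrected parse_questions.
    return {"title": title, "user": user}


# Two question pages are handled before either user page has come back.
first_global = start_with_global("Question A", "alice")
second_global = start_with_global("Question B", "bob")

first_local = start_with_local("Question A", "alice")
second_local = start_with_local("Question B", "bob")

print(first_global is second_global)            # both names point at one dict
print(first_global["title"], first_local["title"])
```

With the global dict, the data for Question A has already been replaced by Question B by the time a user page could be merged in. The per-call dicts stay independent.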

You should also correct parse like this:

def parse(self, response):
    # One request per question link on the current listing page.
    for link in response.css('a.question-hyperlink::attr(href)').extract():
        yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)

    # Outside the loop: request the next listing page once per page.
    yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)

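In your version the next-page request sits inside the loop, so one is yielded for every link on the page. That is not a serious problem, since Scrapy has a duplicate filter (dupefilter) that drops the repeated requests, but yielding it once after the loop is more efficient.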
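As a rough illustration of the efficiency point, the sketch below counts how many next-page requests get created with the request inside versus outside the loop. The duplicate filter is modelled as a plain set of URLs, which is a simplification for this example and not how Scrapy's filter is implemented. The 50 links per page and the URLs are made-up figures.

```python
def next_page_requests(links, next_url, inside_loop):
    """Return the next-page URLs a parse-like callback would request."""
    requests = []
    for _ in links:
        if inside_loop:
            requests.append(next_url)  # one next-page request per link
    if not inside_loop:
        requests.append(next_url)      # a single request after the loop
    return requests


def filter_duplicates(urls):
    """Toy duplicate filter: keep only the first occurrence of each URL."""
    seen = set()
    kept = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


links = [f"/questions/{n}" for n in range(50)]  # hypothetical question links
next_url = "/questions?page=2"                  # hypothetical next-page link

inside = next_page_requests(links, next_url, inside_loop=True)
outside = next_page_requests(links, next_url, inside_loop=False)

print(len(inside), len(filter_duplicates(inside)))
print(len(outside), len(filter_duplicates(outside)))
```

Both variants end up fetching the next page once, but the in-loop version builds many request objects only for the filter to discard them.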