
How to prevent Scrapy from extracting duplicates against an existing JSON list

Toleo · asked 7 years ago

    With this spider:

    import scrapy

    class RedditSpider(scrapy.Spider):
        name = 'Reddit'
        allowed_domains = ['reddit.com']
        start_urls = ['https://old.reddit.com']

        def parse(self, response):
            # Follow every comments link on the front page.
            for link in response.css('li.first a.comments::attr(href)').extract():
                yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

        def parse_topics(self, response):
            topics = {}
            topics["title"] = response.css('a.title::text').extract_first()
            topics["author"] = response.css('p.tagline a.author::text').extract_first()

            # The score attribute is missing on some posts, so default to "0".
            score = response.css('div.score.likes::attr(title)').extract_first()
            topics["score"] = score if score is not None else "0"

            if int(topics["score"]) > 10000:
                # For highly scored posts, also visit the author's profile page.
                author_url = response.css('p.tagline a.author::attr(href)').extract_first()
                yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
            else:
                yield topics

        def parse_user(self, response):
            topics = response.meta.get('topics')

            users = {}
            users["name"] = topics["author"]
            users["karma"] = response.css('span.karma::text').extract_first()

            # Yield both the user record and the topic it came from.
            yield users
            yield topics
    

    I get these results:

    [
      {"name": "Username", "karma": "00000"},
      {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
      {"name": "Username2", "karma": "00000"},
      {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
      {"name": "Username3", "karma": "00000"},
      {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
      {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
      ....
    ]
    

    But I run this spider every day to scrape the latest day of the week, so if today is the 7th day of the week, I get duplicates of the 6 days before today, like so:

    day1: result_day1
    day2: result_day2, result_day1
    day3: result_day3, result_day2, result_day1
    . . . . . . .
    day7: result_day7, result_day6, result_day5, result_day4, result_day3, result_day2, result_day1
    

    All the data is stored in a JSON file, as shown above. What I want to do is tell the spider to check whether an extracted result already exists in that JSON file: if it does, skip it; if it does not, add it to the file.

    Is this possible with Scrapy?
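
    Something like the following is what I have in mind: an item pipeline that loads the previously stored JSON file when the spider opens and drops any item that is already in it. This is only a rough sketch; the class name, the hard-coded file name 06.json, and the keying scheme are my own guesses:

    import json
    import os

    from scrapy.exceptions import DropItem

    class JsonDuplicatesPipeline(object):
        # Hypothetical: yesterday's output file. In practice this would
        # be derived from the current date rather than hard-coded.
        SEEN_FILE = '06.json'

        def open_spider(self, spider):
            # Load every previously stored item into a set of hashable keys.
            self.seen = set()
            if os.path.exists(self.SEEN_FILE):
                with open(self.SEEN_FILE) as f:
                    for item in json.load(f):
                        self.seen.add(tuple(sorted(item.items())))

        def process_item(self, item, spider):
            # The items here are flat dicts of strings, so their sorted
            # key/value pairs make a usable dedup key.
            key = tuple(sorted(dict(item).items()))
            if key in self.seen:
                raise DropItem('duplicate item: %r' % (item,))
            self.seen.add(key)
            return item

    enabled with something like ITEM_PIPELINES = {'myproject.pipelines.JsonDuplicatesPipeline': 300} in settings.py (the module path is a placeholder for wherever the pipeline lives).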

    For example:

    If yesterday's results (06.json) were:

    [
      {"name": "Username", "karma": "00000"},
      {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
      {"name": "Username2", "karma": "00000"},
      {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
      {"name": "Username3", "karma": "00000"},
      {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
      {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
    ]
    

    and today's results (07.json) are:

    [
      {"name": "Username", "karma": "00000"},
      {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
      {"name": "Username2", "karma": "00000"},
      {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
      {"name": "Username3", "karma": "00000"},
      {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
      {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
      {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
    ]
    

    then the result for today's list (07.json) should be:

    [
      {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
    ]
    

    after filtering.
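
    For reference, this is the kind of after-the-fact filtering I can already do outside Scrapy with plain Python; what I am asking for is a way to make the spider skip the duplicates while crawling. The output file name 07_filtered.json is just a placeholder:

    import json

    # Compare yesterday's output with today's and keep only the new items.
    with open('06.json') as f:
        yesterday = json.load(f)
    with open('07.json') as f:
        today = json.load(f)

    seen = {tuple(sorted(d.items())) for d in yesterday}
    new_items = [d for d in today if tuple(sorted(d.items())) not in seen]

    with open('07_filtered.json', 'w') as f:
        json.dump(new_items, f)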