
How do I make this spider export a JSON file for each list of items?

  •  0
  •  Toleo · 7 years ago

    My file Reddit.py below contains a spider:

    import scrapy
    
    class RedditSpider(scrapy.Spider):
        name = 'Reddit'
        allowed_domains = ['reddit.com']
        start_urls = ['https://old.reddit.com']
    
        def parse(self, response):
            # Follow every comments link found on the front page
            for link in response.css('li.first a.comments::attr(href)').extract():
                yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)
    
        def parse_topics(self, response):
            topics = {}
            topics["title"] = response.css('a.title::text').extract_first()
            topics["author"] = response.css('p.tagline a.author::text').extract_first()
    
            if response.css('div.score.likes::attr(title)').extract_first() is not None:
                topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
            else:
                topics["score"] = "0"
    
            # Only visit the author's page for high-scoring topics
            if int(topics["score"]) > 10000:
                author_url = response.css('p.tagline a.author::attr(href)').extract_first()
                yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
            else:
                yield topics
    
        def parse_user(self, response):
            topics = response.meta.get('topics')
    
            users = {}
            users["name"] = topics["author"]
            users["karma"] = response.css('span.karma::text').extract_first()
    
            yield users
            yield topics
    
    

    It fetches all of the URLs from the old.reddit front page, then scrapes each URL's title, author, and score.

    What I added is the second part, which checks whether the score is higher than 10000; if it is, the spider goes to the user's page and scrapes their karma from it.

    I know that I could scrape the karma from the topic's page, but I want to do it this way, since it is the user's page that I scrape rather than the topic's page.

    What I want to do is export the topics list, which contains title, author, score, into a JSON file named topics.json, and whenever a topic's score is higher than 10000, export the users list, which contains name, karma, into a JSON file named users.json.

    I only know how to use the command line:

    scrapy runspider Reddit.py -o Reddit.json
    

    which exports all the lists into a single JSON file named Reddit.json, but with a bad structure like this:

    [
      {"name": "Username", "karma": "00000"},
      {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
      {"name": "Username2", "karma": "00000"},
      {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
      {"name": "Username3", "karma": "00000"},
      {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
      {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
      ....
    ]
    

    I am completely clueless about Scrapy's Item Pipeline as well as Item Exporters & Feed Exporters, how to implement them in my spider, or how to use them at all; I tried to understand it from the documentation, but it seems I can't figure out how to use it in my spider.


    The end result I want is two files:

    topics.json

    [
     {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
     {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
     {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
     {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
     ....
    ]
    

    users.json

    [
      {"name": "Username", "karma": "00000"},
      {"name": "Username2", "karma": "00000"},
      {"name": "Username3", "karma": "00000"},
      ....
    ]
    

    while also removing duplicates from the lists.

    2 Replies  |  7 years ago
        1
  •  1
  •   Tarun Lalwani    7 years ago

    Applying the approach from the following answer:

    Export scrapy items to different files

    I created a sample scraper:

    import scrapy
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']
    
        def parse(self, response):
            yield {"type": "unknown item"}
            yield {"title": "ExampleTitle1", "author": "Username", "score": "11000"}
            yield {"name": "Username", "karma": "00000"}
            yield {"name": "Username2", "karma": "00000"}
            yield {"someothertype": "unknown item"}
    
            yield {"title": "ExampleTitle2", "author": "Username2", "score": "12000"}
            yield {"title": "ExampleTitle3", "author": "Username3", "score": "13000"}
            yield {"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
            yield {"name": "Username3", "karma": "00000"}
    

    Then in exporters.py:

    from scrapy.exporters import JsonItemExporter
    from scrapy.extensions.feedexport import FileFeedStorage
    
    
    class JsonMultiFileItemExporter(JsonItemExporter):
        # Item types that get a dedicated output file
        types = ["topics", "users"]
    
        def __init__(self, file, **kwargs):
            super().__init__(file, **kwargs)
            self.files = {}
            self.kwargs = kwargs
    
            # Open one nested JSON exporter per item type (topics.json, users.json)
            for itemtype in self.types:
                storage = FileFeedStorage(itemtype + ".json")
                file = storage.open(None)
                self.files[itemtype] = JsonItemExporter(file, **self.kwargs)
    
        def start_exporting(self):
            super().start_exporting()
            for exporter in self.files.values():
                exporter.start_exporting()
    
        def finish_exporting(self):
            super().finish_exporting()
            for exporter in self.files.values():
                exporter.finish_exporting()
                exporter.file.close()
    
        def export_item(self, item):
            # Route each item to a file based on the fields it carries
            if "title" in item:
                itemtype = "topics"
            elif "karma" in item:
                itemtype = "users"
            else:
                itemtype = "self"
    
            if itemtype == "self" or itemtype not in self.files:
                # Unrecognized items stay in the default feed file
                super().export_item(item)
            else:
                self.files[itemtype].export_item(item)
    

    Add the following to settings.py:

    FEED_EXPORTERS = {
        'json': 'testing.exporters.JsonMultiFileItemExporter',
    }
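    
    This registers the custom exporter for the json feed format. Assuming the project is named testing (as the exporter path above suggests), running the spider with a plain JSON feed should pick it up, e.g.:
    
    scrapy crawl example -o example.json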
    

    Running the scraper generates 3 files:

    example.json

    [
    {"type": "unknown item"},
    {"someothertype": "unknown item"}
    ]
    

    topics.json

    [
    {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
    {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
    {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
    {"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
    ]
    

    users.json

    [
    {"name": "Username", "karma": "00000"},
    {"name": "Username2", "karma": "00000"},
    {"name": "Username3", "karma": "00000"}
    ]
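    
    The question also asks for duplicates to be removed; the exporter above does not handle that, but a minimal sketch of a standard Scrapy dedup pipeline (a hypothetical addition, keyed on name for users and title for topics) could drop them before they ever reach the exporter:
    
    from scrapy.exceptions import DropItem
    
    
    class DuplicatesPipeline:
        def __init__(self):
            self.seen = set()
    
        def process_item(self, item, spider):
            # Use whichever identifying field the item carries as the dedup key
            key = item.get("name") or item.get("title")
            if key in self.seen:
                raise DropItem("Duplicate item found: %s" % key)
            self.seen.add(key)
            return item
    
    It would be enabled with something like ITEM_PIPELINES = {'testing.pipelines.DuplicatesPipeline': 100} in settings.py, assuming it lives in the project's pipelines.py.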
    
        2
  •  0
  •   Apalala    7 years ago

    Your spider yields two items each time it crawls a user's page. What if:

    def parse_user(self, response):
        topics = response.meta.get('topics')
    
        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()
        topics["users"] = users
    
        yield topics
    

    You can then post-process the JSON however you need.
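    
    For instance, a minimal post-processing sketch (assuming the combined feed was written to Reddit.json, as in the question's command line, and that each topic carries the nested "users" key from the snippet above) could split the items back apart and drop duplicate users:
    
    import json
    
    # Load the combined feed, e.g. from `scrapy runspider Reddit.py -o Reddit.json`
    with open("Reddit.json") as f:
        items = json.load(f)
    
    topics, users, seen = [], [], set()
    for item in items:
        user = item.pop("users", None)  # detach the nested user dict, if any
        topics.append(item)
        if user is not None and user["name"] not in seen:  # de-duplicate by name
            seen.add(user["name"])
            users.append(user)
    
    with open("topics.json", "w") as f:
        json.dump(topics, f, indent=1)
    with open("users.json", "w") as f:
        json.dump(users, f, indent=1)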

    By the way, I don't understand why you use the plural (topics) when dealing with a single element (a single topic).
