
Scrapy can't find class using CSS selector

farhan jatt  ·  3 years ago

    I am testing whether I can scrape a website with scrapy. I get a response from the site, but I can't access the elements or data I want. My selector looks correct, and although I am a beginner with scrapy I don't think there is a mistake in the commands. I want to get the tag with the class results-race-name. I ran it through the scrapy shell and used the following commands:

    In [1]: fetch('https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/')
    
    2022-01-07 15:08:58 [scrapy.core.engine] INFO: Spider opened
    2022-01-07 15:09:01 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://greyhoundbet.racingpost.com/robots.txt> (referer: None)
    2022-01-07 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/> (referer: None)
    
    In [2]: view(response)
    Out[2]: True
    
    In [3]: response.css('.results-race-name').extract()
    Out[3]: []
    

    Note: view(response) only shows the page up to the loading indicator.
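
    For reference, a quick way to confirm whether that markup is present in the downloaded HTML at all is to search the raw response body for the class name; a minimal check, assuming the same shell session (this snippet is an illustration, not part of the original transcript):

    # If the class name is missing from the raw HTML, the element is rendered
    # client-side and response.css() has nothing to match against.
    print('results-race-name' in response.text)
    print(response.css('.results-race-name').get())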

SuperUser  ·  3 years ago

    This is not a CSS problem. The data is created dynamically. You can get it from a JSON file (open devtools in the browser, click the Network tab, look at the JSON requests and take what you need).

    In [1]: req = scrapy.Request('https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cm
       ...: eetings')
    
    In [2]: fetch(req)
    [scrapy.core.engine] INFO: Spider opened
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings> (referer: None)
    
    In [3]: json_data = response.json()
    
    In [4]: for data in json_data['meetings']['tracks']['1']['races']:
       ...:     print(data['track'])
       ...:
    Newcastle
    Swindon
    Kinsley
    
    In [5]: for data in json_data['meetings']['tracks']['2']['races']:
       ...:     print(data['track'])
       ...:
    Monmore
    Crayford
    Hove
    Harlow
    Henlow
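
    The loops above hardcode the track ids '1' and '2'. A minimal sketch that walks every track instead, assuming each entry under 'tracks' carries the same 'races' list shown above:

    # Iterate over every track id rather than hardcoding '1' and '2';
    # each race dict exposes the track name under the 'track' key, as above.
    for track_id, track in json_data['meetings']['tracks'].items():
        for race in track['races']:
            print(track_id, race['track'])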
    

    EDIT:

    spider.py:

    import scrapy
    
    
    class ExampleSpider(scrapy.Spider):
        name = "exampleSpider"
        start_urls = ['https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings']
    
        def parse(self, response):
            json_data = response.json()
    
            for data in json_data['meetings']['tracks']['1']['races']:
                yield {'race': data['track']}
    
            for data in json_data['meetings']['tracks']['2']['races']:
                yield {'race': data['track']}
    

    Example for spider
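
    The endpoint takes the date as the r_date query parameter, so the same spider can be pointed at other days. A hedged sketch of such a variant (the -a date spider argument and the loop over all track ids are assumptions added here, not part of the original answer):

    import scrapy


    class RaceDateSpider(scrapy.Spider):
        # hypothetical variant of the spider above, parameterised by date
        name = "raceDateSpider"

        def __init__(self, date="2021-01-01", *args, **kwargs):
            # run as: scrapy crawl raceDateSpider -a date=2021-01-02
            super().__init__(*args, **kwargs)
            self.start_urls = [
                "https://greyhoundbet.racingpost.com/results/blocks.sd"
                f"?r_date={date}&blocks=header%2Cmeetings"
            ]

        def parse(self, response):
            json_data = response.json()
            # walk every track id instead of hardcoding '1' and '2'
            for track in json_data["meetings"]["tracks"].values():
                for race in track["races"]:
                    yield {"race": race["track"]}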

    main.py:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    if __name__ == "__main__":
        spider = 'exampleSpider'
        settings = get_project_settings()
        settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        process = CrawlerProcess(settings)
        process.crawl(spider)
        process.start()
    

    How to run scrapy from a script
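
    As a possible extension (not part of the original answer), the yielded items can be written straight to a file when running from a script by setting FEEDS before constructing the CrawlerProcess; a minimal sketch, where 'races.json' is an arbitrary output path chosen for illustration:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    if __name__ == "__main__":
        settings = get_project_settings()
        settings['USER_AGENT'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/74.0.3729.169 Safari/537.36')
        # FEEDS tells Scrapy to export every yielded item to the given file.
        settings['FEEDS'] = {'races.json': {'format': 'json'}}
        process = CrawlerProcess(settings)
        process.crawl('exampleSpider')
        process.start()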
