
Scrapy-Splash extracts too little text from the page

  •  0
  • SY9  · Tech Community  · 6 years ago

    I am trying to extract text from IEEE journal abstract pages using Scrapy with scrapy-splash.
    The target URL is: https://ieeexplore.ieee.org/document/4383371/
    The OS is Windows 10, and Chrome's developer tools were used to identify the XPath for scraping.

    The code is shown below.

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_splash import SplashRequest
    from scrapy.crawler import CrawlerProcess
    from fake_useragent import UserAgent
    
    '''main code begins'''
    class IEEE_proc_Item(scrapy.Item):
        abstract = scrapy.Field()
        URL = scrapy.Field()
    
    class IeeeProcScraperSpider(scrapy.Spider):
        name = 'ieee_proc_scraper'
        allowed_domains = ['ieeexplore.ieee.org']
    
        start_urls = ['https://ieeexplore.ieee.org/document/7067026/',
                      ]
    
        def start_requests(self):
            # Render each page through Splash so that JS-generated content
            # should be present before parsing.
            for url in self.start_urls:
                yield SplashRequest(url=url, callback=self.parse, args={'wait': 1},
                                    errback=lambda failure, item=IEEE_proc_Item(): self._errorback(failure, item))
    
        def parse(self, response):
            item = IEEE_proc_Item()
    
            try:
                item['abstract'] = response.xpath('//p/text()').extract()
            except Exception:
                item['abstract'] = 'None'
    
            try:
                item['URL'] = response.url
            except Exception:
                item['URL'] = 'None'
            yield item
    
        def _errorback(self, failure, item):
            pass
    
    process = CrawlerProcess({
        'USER_AGENT': UserAgent().chrome,
        'FEED_FORMAT': 'json',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ['abstract', 'URL'],
        'FEED_URI': 'Test.json',
        })
    
    process.crawl(IeeeProcScraperSpider)
    process.start()
    
    '''main code ends'''
    

    In fact, the abstract in the target URL has a unique HTML path, shown below. However, I am simply extracting the text of every <p> tag for now, since the abstract text is apparently enclosed in a <div>.

    Actual Xpath: //*[@id="4383371"]/div[2]/div[1]/div/div/div
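    Browser-generated absolute XPaths like the one above are position-based and brittle; anchoring on a stable attribute is usually more robust. A minimal sketch of the difference, using only the standard library on a synthetic snippet (the real IEEE DOM would need to be inspected; the class name here is an illustrative assumption):

```python
import xml.etree.ElementTree as ET

# Synthetic stand-in for the rendered page; not the real IEEE markup.
html = """
<html><body>
  <div id="4383371">
    <div class="abstract-text">
      <div>This paper presents an example abstract.</div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(html)

# Brittle: absolute, position-based path (breaks if the layout shifts).
absolute = root.find("./body/div/div/div")

# More robust: anchor on a stable attribute and collect all nested text.
node = root.find(".//div[@class='abstract-text']")
text = "".join(node.itertext()).strip()
print(text)  # -> This paper presents an example abstract.
```

    In a Scrapy callback the equivalent would be a relative expression such as `response.xpath("//div[@class='abstract-text']//text()")` rather than the absolute `//*[@id=...]/div[2]/...` path.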
    

    However, I get unexpected output, shown below.
    Output

    2018-09-15 00:29:37 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
    2018-09-15 00:29:37 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2k  26 Jan 2017), cryptography 2.1.4, Platform Windows-10-10.0.17134-SP0
    2018-09-15 00:29:37 [scrapy.crawler] INFO: Overridden settings: {'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['abstract', 'URL'], 'FEED_FORMAT': 'json', 'FEED_URI': 'Test.json', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
    2018-09-15 00:29:37 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.feedexport.FeedExporter',
     'scrapy.extensions.logstats.LogStats']
    2018-09-15 00:29:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-09-15 00:29:37 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-09-15 00:29:37 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-09-15 00:29:37 [scrapy.core.engine] INFO: Spider opened
    2018-09-15 00:29:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-09-15 00:29:37 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-09-15 00:29:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ieeexplore.ieee.org/document/4383371/> (referer: None)
    2018-09-15 00:29:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ieeexplore.ieee.org/document/4383371/>
    
    {'URL': 'https://ieeexplore.ieee.org/document/4383371/',
     'abstract': 'For IEEE to continue sending you helpful information on our '
             'products\r'
             '                and services, please consent to our updated '
             'Privacy Policy.I have read and accepted the.SubscribeA '
             "not-for-profit organization, IEEE is the world's largest "
             'technical professional organization dedicated to advancing '
             'technology for the benefit of humanity.© Copyright 2018 IEEE - '
             'All rights reserved. Use of this web site signifies your '
             'agreement to the terms and conditions.US & Canada:Worldwide:'}
    2018-09-15 00:29:39 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-09-15 00:29:39 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: IEEE_proc_abstract_1998to2018_180914.json
    2018-09-15 00:29:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 309,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 57312,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 9, 14, 15, 29, 39, 597179),
     'item_scraped_count': 1,
     'log_count/DEBUG': 3,
     'log_count/INFO': 8,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
    'scheduler/enqueued/memory': 1,
    'start_time': datetime.datetime(2018, 9, 14, 15, 29, 37, 549966)}
    2018-09-15 00:29:39 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Scrapy only extracts text from the footer section of the page. This output looks like the page before JS execution, yet SplashRequest appears to be working.
    So now I would like to know what causes this and how I can solve it for this page.
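    For reference, scrapy-splash only takes effect when its middlewares are registered in the project settings; the "Enabled downloader middlewares" list in the log above does not show a Splash middleware. The scrapy-splash README documents a settings block along these lines (sketched here as a dict that would be merged into the `CrawlerProcess(...)` settings; `SPLASH_URL` assumes a Splash instance running locally on port 8050):

```python
# Settings scrapy-splash documents as required for SplashRequest to be
# routed through a Splash server (SPLASH_URL assumes a local instance).
SPLASH_SETTINGS = {
    'SPLASH_URL': 'http://localhost:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}
```

    Without these, a SplashRequest is downloaded like an ordinary request and no JS is executed.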

    I would be very glad for, and grateful to receive, any advice, answers, or insights on this problem.
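    One JS-free alternative sometimes suggested for IEEE Xplore pages is to parse metadata JSON embedded in the raw HTML instead of rendering the page. The variable name and JSON shape below are assumptions that would need to be verified against the live page source; the snippet only demonstrates the extraction pattern on a synthetic string:

```python
import json
import re

# Synthetic stand-in for the raw HTML of a document page. The
# "xplGlobal.document.metadata" name is an assumption to verify.
raw_html = '''
<script>
  xplGlobal.document.metadata={"abstract":"An example abstract.","title":"Example"};
</script>
'''

match = re.search(r'xplGlobal\.document\.metadata=(\{.*?\});', raw_html, re.S)
if match:
    metadata = json.loads(match.group(1))
    print(metadata['abstract'])  # -> An example abstract.
```

    If such a blob exists in the plain HTTP response, the abstract could be pulled out with a plain scrapy.Request and no Splash at all.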

    0 replies  |  until 6 years ago