
SSL handshake failure using a proxy with Scrapy

  •  1
  • Luis Ramon Ramirez Rodriguez  ·  7 years ago

    I am trying to set up a proxy in a Scrapy project. I followed the instructions from this answer:

    "1 - Create a new file called middlewares.py, save it in your Scrapy project, and add the following code to it:"

    import base64
    class ProxyMiddleware(object):
        # overwrite process request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
    
            # Use the following lines if your proxy requires authentication
            proxy_user_pass = "USERNAME:PASSWORD"
            # setup basic authentication for the proxy
            encoded_user_pass = base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
    

    To get the proxies I am using a free subscription from: https://proxy.webshare.io/

    Filling in the port, user and address they provide:

    import base64
    class ProxyMiddleware(object):
        # overwrite process request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] =  "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
    
            # Use the following lines if your proxy requires authentication
            proxy_user_pass = "sarnencj:password"
            # setup basic authentication for the proxy
            encoded_user_pass = base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
    

    But when I run the spider I get the following error:

    2018-04-30 21:44:30 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
    

    EDIT

    The middlewares in settings are as follows:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
        'moocs.middlewares.ProxyMiddleware': 100,
    }
    

    Full log

    2018-05-02 12:28:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: moocs)
    2018-05-02 12:28:38 [scrapy] INFO: Optional features available: ssl, http11, boto
    2018-05-02 12:28:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'BOT_NAME': 'moocs'}
    2018-05-02 12:28:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
    2018-05-02 12:28:39 [boto] DEBUG: Retrieving credentials from metadata server.
    2018-05-02 12:28:39 [boto] ERROR: Caught exception reading instance data
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
        r = opener.open(req, timeout=timeout)
      File "/usr/lib/python2.7/urllib2.py", line 404, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 422, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
        raise URLError(err)
    URLError: <urlopen error [Errno 101] Network is unreachable>
    2018-05-02 12:28:40 [boto] ERROR: Unable to read instance data, giving up
    2018-05-02 12:28:40 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
      ScrapyDeprecationWarning)
    
    2018-05-02 12:28:40 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2018-05-02 12:28:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2018-05-02 12:28:40 [scrapy] INFO: Enabled item pipelines: 
    2018-05-02 12:28:40 [scrapy] INFO: Spider opened
    2018-05-02 12:28:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-05-02 12:28:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-05-02 12:28:42 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
    2018-05-02 12:28:44 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
    2018-05-02 12:28:45 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
    2018-05-02 12:28:45 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
    2018-05-02 12:28:45 [scrapy] INFO: Closing spider (finished)
    2018-05-02 12:28:45 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
     'downloader/request_bytes': 909,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 5, 2, 16, 58, 45, 996708),
     'log_count/DEBUG': 5,
     'log_count/ERROR': 3,
     'log_count/INFO': 7,
     'log_count/WARNING': 1,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2018, 5, 2, 16, 58, 40, 255414)}
    2018-05-02 12:28:45 [scrapy] INFO: Spider closed (finished)
    

    EDIT

    I tried setting the proxy in the spider class:

    import scrapy
    from scrapy import  Request
    from scrapy.loader import ItemLoader
    
    from urlparse import urljoin 
    from moocs.items import MoocsItem,MoocsReviewItem
    
    
    
    class MoocsSpiderSpider(scrapy.Spider):
        name = "moocs_spider"
        #allowed_domains = ["https://www.coursetalk.com/subjects/data-science/courses"]
        start_urls = (
            'https://www.coursetalk.com/subjects/data-science/courses',
        )
    
        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {
                'moocs.middlewares.ProxyMiddleware': 100
            }
        }
        def parse(self, response):
            #print response.body#xpath()
            courses_xpath = '//*[@class="course-listing-card"]//a[contains(@href, "/courses/")]/@href'
            courses_url = [urljoin(response.url,relative_url)  for relative_url in response.xpath(courses_xpath).extract()]  
            for course_url in courses_url[0:30]:
                print course_url
                yield Request(url=course_url, callback=self.parse_reviews)
    

    In middlewares.py:

    class ProxyMiddleware(object):
        # overwrite process request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] =  "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
    

    Now I get a different error:

    2018-05-03 18:07:17 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: Could not open CONNECT tunnel.
    2018-05-03 18:07:17 [scrapy] INFO: Closing spider (finished)
    2018-05-03 18:07:17 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
     'downloader/request_bytes': 245,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'finish_reason': 'finished',
    

    EDIT 2

    I am using Linux Mint 17, and Scrapy is not installed in a virtual environment.

    From "pip freeze":

    Warning: cannot find svn location for apsw==3.8.2-r1
    BeautifulSoup==3.2.1
    CherryPy==3.2.2
    EasyProcess==0.2.2
    Flask==0.11.1
    GDAL==2.1.0
    GraphLab-Create==1.6.1
    Jinja2==2.8
    Mako==0.9.1
    Markdown==2.4
    MarkupSafe==0.18
    PAM==0.4.2
    Pillow==2.3.0
    PyAudio==0.2.7
    PyInstaller==2.1
    PyVirtualDisplay==0.2
    PyYAML==3.11
    Pygments==2.0.2
    Routes==2.0
    SFrame==2.1
    SQLAlchemy==0.8.4
    Scrapy==1.0.3
    Send2Trash==1.5.0
    Shapely==1.5.17
    Sphinx==1.2.2
    Theano==0.8.2
    Twisted==16.2.0
    Twisted-Core==13.2.0
    Twisted-Names==13.2.0
    Twisted-Web==13.2.0
    Werkzeug==0.11.10
    adblockparser==0.7
    ## FIXME: could not find svn URL in dependency_links for this package:
    apsw==3.8.2-r1
    apt-xapian-index==0.45
    apturl==0.4.1ubuntu4
    argparse==1.2.1
    backports-abc==0.4
    backports.ssl-match-hostname==3.4.0.2
    beautifulsoup4==4.4.1
    bokeh==0.11.1
    boto==2.41.0
    branca==0.1.1
    bz2file==0.98
    captcha-solver==0.1.1
    certifi==2015.9.6.2
    characteristic==14.3.0
    chardet==2.0.1
    click==5.1
    cloudpickle==0.2.1
    colorama==0.2.5
    command-not-found==0.3
    configglue==1.1.2
    cssselect==0.9.1
    cssutils==0.9.10
    cymem==1.31.2
    debtagshw==0.1
    decorator==4.0.2
    defer==1.0.6
    deluge==1.3.6
    dirspec==13.10
    dnspython==1.11.1
    docutils==0.11
    drawnow==0.71.1
    duplicity==0.6.23
    enum34==1.1.6
    feedparser==5.1.3
    folium==0.2.1
    functools32==3.2.3-2
    futures==3.0.5
    gensim==0.13.1
    geocoder==1.8.2
    geolocation-python==0.2.2
    geopandas==0.2.1
    geopy==1.11.0
    gmplot==1.1.1
    googlemaps==2.4.2
    gyp==0.1
    html5lib==0.999
    httplib2==0.8
    ipykernel==4.0.3
    ipython==4.0.0
    ipython-genutils==0.1.0
    ipywidgets==4.0.3
    itsdangerous==0.24
    jsonschema==2.6.0
    jupyter==1.0.0
    jupyter-client==5.2.2
    jupyter-console==4.0.2
    jupyter-core==4.4.0
    jupyterlab==0.31.8
    jupyterlab-launcher==0.10.5
    lockfile==0.8
    lxml==3.3.3
    matplotlib==1.3.1
    mechanize==0.2.5
    mistune==0.7.1
    mpmath==0.19
    murmurhash==0.26.4
    mysql-connector-python==1.1.6
    nbconvert==4.0.0
    nbformat==4.3.0
    netifaces==0.8
    nltk==3.2.1
    nose==1.3.1
    notebook==5.4.0
    numpy==1.14.0
    oauth2==1.9.0.post1
    oauthlib==1.1.2
    oneconf==0.3.7
    opencage==1.1.4
    pandas==0.22.0
    paramiko==1.10.1
    path.py==7.6
    patsy==0.4.1
    pexpect==3.1
    pickleshare==0.5
    piston-mini-client==0.7.5
    plac==0.9.6
    plotly==2.0.6
    preshed==0.46.4
    protobuf==2.5.0
    psutil==5.0.1
    psycopg2==2.4.5
    ptyprocess==0.5
    py==1.4.31
    pyOpenSSL==0.13
    pyasn1==0.1.9
    pyasn1-modules==0.0.8
    pycrypto==2.6.1
    pycups==1.9.66
    pycurl==7.19.3
    pygobject==3.12.0
    pyinotify==0.9.4
    pymongo==3.3.0
    pyparsing==2.0.1
    pyserial==2.7
    pysmbc==1.0.14.1
    pyspatialite==3.0.1
    pysqlite==2.6.3
    pytesseract==0.2.0
    pytest==2.9.2
    python-Levenshtein==0.12.0
    python-apt==0.9.3.5
    python-dateutil==2.6.1
    python-debian==0.1.21-nmu2ubuntu2
    python-libtorrent==0.16.13
    pytz==2017.3
    pyxdg==0.25
    pyzmq==14.7.0
    qt5reactor==0.3
    qtconsole==4.0.1
    queuelib==1.4.2
    ratelim==0.1.6
    reportlab==3.0
    repoze.lru==0.6
    requests==2.10.0
    requests-oauthlib==0.6.2
    roman==2.0.0
    scikit-learn==0.17
    scipy==0.17.1
    scrapy-random-useragent==0.1
    scrapy-splash==0.7.1
    seaborn==0.7.0
    selenium==2.53.6
    semver==2.6.0
    service-identity==14.0.0
    sessioninstaller==0.0.0
    shub==1.3.4
    simpledbf==0.2.6
    simplegeneric==0.8.1
    simplejson==3.3.1
    singledispatch==3.4.0.3
    six==1.11.0
    smart-open==1.3.3
    smartystreets.py==0.2.4
    spacy==0.101.0
    sputnik==0.9.3
    spyder==2.3.9
    statsmodels==0.6.1
    stevedore==0.14.1
    subprocess32==3.2.7
    sympy==1.0
    system-service==0.1.6
    terminado==0.8.1
    tesseract==0.1.3
    textblob==0.11.1
    textrazor==1.2.2
    thinc==5.0.8
    tornado==4.3
    traitlets==4.3.2
    tweepy==3.3.0
    uTidylib==0.2
    urllib3==1.7.1
    utils==0.9.0
    vboxapi==1.0
    vincent==0.4.4
    virtualenv==15.0.2
    virtualenv-clone==0.2.4
    virtualenvwrapper==4.1.1
    w3lib==1.12.0
    wordcloud==1.2.1
    wsgiref==0.1.2
    yelp==1.0.2
    zope.interface==4.0.5
    

    I ran:

    curl -v --proxy "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128" "https://www.coursetalk.com/subjects/data-science/courses" and see if it works or not 
    

    It does work and loads the page:

    > Host: www.coursetalk.com:443
    > Proxy-Authorization: Basic c2FybmVuY2otdXMtMTprZDk5NzIybDJrN3k=
    > User-Agent: curl/7.35.0
    > Proxy-Connection: Keep-Alive
    > 
    < HTTP/1.1 200 Connection established
    < Date: Fri, 04 May 2018 22:02:00 GMT
    < Age: 0
    < Transfer-Encoding: chunked
    * CONNECT responded chunked
    < Proxy-Connection: keep-alive
    < Server: Webshare
    < 
    * Proxy replied OK to CONNECT request
    * successfully set certificate verify locations:
    *   CAfile: none
      CApath: /etc/ssl/certs
    * SSLv3, TLS handshake, Client hello (1):
    * SSLv3, TLS handshake, Server hello (2):
    * SSLv3, TLS handshake, CERT (11):
    * SSLv3, TLS handshake, Server key exchange (12):
    * SSLv3, TLS handshake, Server finished (14):
    * SSLv3, TLS handshake, Client key exchange (16):
    * SSLv3, TLS change cipher, Client hello (1):
    * SSLv3, TLS handshake, Finished (20):
    

    EDIT 3

    This is the current log:

    2018-05-04 19:17:07 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: moocs)
    2018-05-04 19:17:07 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.6 (default, Jun 22 2015, 18:00:18) - [GCC 4.8.2], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Linux-3.13.0-107-generic-i686-with-LinuxMint-17-qiana
    2018-05-04 19:17:07 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'DOWNLOAD_DELAY': 3, 'BOT_NAME': 'moocs'}
    2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['moocs.middlewares.ProxyMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-05-04 19:17:07 [py.warnings] WARNING: /media/luis/DATA/articulos/moocs/scripts/moocs/moocs/pipelines.py:9: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
      from scrapy.xlib.pydispatch import dispatcher
    
    2018-05-04 19:17:07 [scrapy.middleware] INFO: Enabled item pipelines:
    ['moocs.pipelines.MultiCSVItemPipeline']
    2018-05-04 19:17:07 [scrapy.core.engine] INFO: Spider opened
    2018-05-04 19:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-05-04 19:17:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    ^C2018-05-04 19:17:08 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
    2018-05-04 19:17:08 [scrapy.core.engine] INFO: Closing spider (shutdown)
    
    2 Answers  |  7 years ago
        1
  •  1
  •   Tarun Lalwani  ·  7 years ago

    I think the issue may be with how you have written your ProxyMiddleware . I updated your code and ran it as below:

    from scrapy import Spider

    class Test(Spider):
        name ="proxyapp"
        start_urls = ["https://www.coursetalk.com/subjects/data-science/courses"]
    
    
        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {
                'jobs.middlewares.ProxyMiddleware': 100
            }
        }
    
        def parse(self, response):
            print(response.text)
    

    And the middlewares.py

    class ProxyMiddleware(object):
        # overwrite process request
        def process_request(self, request, spider):
            # Set the location of the proxy
            request.meta['proxy'] =  "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
    

    I ran the code and it works fine

    Working Fine

    The Scrapy version I tested with is below

    Scrapy==1.5.0
    

    To be 100% sure whether the proxy is working or not, I ran it against ipinfo.io/json

    Proxy Info

    Believe me, I am not sitting in Delaware, or even in the US.
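
    If you want to run the same check from inside Scrapy, here is a minimal sketch (assuming the moocs.middlewares.ProxyMiddleware shown above is importable from your project): a throwaway spider that fetches ipinfo.io/json through the proxy and logs the exit IP it reports.

    import json

    import scrapy


    class ProxyCheckSpider(scrapy.Spider):
        # Throwaway helper spider: fetches ipinfo.io/json through the same
        # proxy middleware so the exit IP of the requests can be inspected.
        name = "proxy_check"
        start_urls = ["https://ipinfo.io/json"]

        custom_settings = {
            'DOWNLOADER_MIDDLEWARES': {
                # assumes the OP's project layout; adjust the dotted path as needed
                'moocs.middlewares.ProxyMiddleware': 100,
            }
        }

        def parse(self, response):
            info = json.loads(response.text)
            # ipinfo.io reports the address it sees the request coming from
            self.logger.info("Exit IP: %s (%s, %s)",
                             info.get("ip"), info.get("city"), info.get("country"))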

        2
  •  -3
  •   Amit Basuri  ·  7 years ago

    Enable HttpProxyMiddleware and pass the proxy URL in the request meta.

    Spider

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                request = scrapy.Request(url=url, callback=self.parse)
                request.meta['proxy'] = "http://username:password@some_proxy_server:port"
                yield request
    
        def parse(self, response):
            pass
    

    Settings

    DOWNLOADER_MIDDLEWARES = {
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 10,
       }