
How to scrape a website protected by Sucuri

  •  1
  • Yuseferi  · Tech Community  · 8 years ago

    Following the Scrapy documentation, I want to scrape data from several websites. My code works fine on ordinary sites, but when the target is protected by Sucuri I get no data at all; the Sucuri firewall seems to block me from reaching the site's markup.

    The target website is http://www.dwarozh.net/ and this is a snippet of my spider:

    from scrapy import Spider
    from scrapy.selector import Selector
    import scrapy
    
    from Stack.items import StackItem
    from bs4 import BeautifulSoup
    from scrapy import log
    from scrapy.utils.response import open_in_browser
    
    
    class StackSpider(Spider):
        name = "stack"
        start_urls = [
            "http://www.dwarozh.net/sport/",
        ]

        def parse(self, response):
            mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
            for mItem in mItems:
                item = StackItem()
                item['title'] = mItem.xpath('a/h2/text()').extract_first()
                item['url'] = mItem.xpath('viewa/@href').extract_first()
                yield item
    

    This is the response I get:

    <html><title>You are being redirected...</title>
    <noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
    <script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
    

    2 Answers  |  as of 8 years ago
        1
  •  4
  •   Eugene Lisitsky    8 years ago

    The site uses cookie- and User-Agent-based protection. You can check this yourself: open the developer tools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then in the Network tab right-click the request for that page and choose "Copy as cURL". Open a terminal and run the cURL command:

    $ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed
    

    You will see the normal HTML. If you remove the cookies or the User-Agent from the request, you get the protection page instead.
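    As a quick sanity check (a hypothetical session; the title string is taken from the blocked response shown in the question), fetching the page in a plain scrapy shell without any cookies should return the redirect stub instead of the article markup:

    $ scrapy shell 'http://www.dwarozh.net/sport/'
    >>> response.xpath('//title/text()').extract_first()
    'You are being redirected...'
    >>> response.xpath('//div[@class="news-more-img"]/ul/li')
    []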

    Let's check it in Scrapy:

    $ scrapy shell
    >>> from scrapy import Request
    >>> cookie_str = '''here; your; cookies; from; browser; go;'''
    >>> cookies = dict(pair.split('=') for pair in cookie_str.split('; '))
    >>> cookies  # check them
    {'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49',
     'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9',
     '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
    >>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54. Safari/5'})
    >>> fetch(r)
    >>> response.xpath('//div[@class="news-more-img"]/ul/li')
    [<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>,
     <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>,
     <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>,
     <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>,
     <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]
    

    I modified your spider, because I don't have the source code for some of your components.

    from scrapy import Spider, Request
    from scrapy.selector import Selector
    import scrapy
    
    #from Stack.items import StackItem
    #from bs4 import BeautifulSoup
    from scrapy import log
    from scrapy.utils.response import open_in_browser
    
    
    class StackSpider(Spider):
        name = "dwarozh"
        start_urls = [
            "http://www.dwarozh.net/sport/",
        ]
        _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
        _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

        def start_requests(self):
            cookies = dict(pair.split('=') for pair in self._cookie_str.split('; '))
            return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                    for url in self.start_urls]

        def parse(self, response):
            mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
            for mItem in mItems:
                item = {}  # StackItem()
                item['title'] = mItem.xpath('a/h2/text()').extract_first()
                # 'viewa/@href' is kept from your original spider; it is probably
                # meant to be 'a/@href', which is why 'url' is None in the run below.
                item['url'] = mItem.xpath('viewa/@href').extract_first()
                yield {'url': item['url'], 'title': item['title']}
    

    Let's run it:

    $ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
    /Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
      return f(*args, **kwds)
    2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
    2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
    2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.feedexport.FeedExporter',
     'scrapy.extensions.logstats.LogStats']
    2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
    []
    2016-12-10 00:18:55 [scrapy] INFO: Spider opened
    2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
    2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
    2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
    {'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'}
    2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
    {'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'}
    2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
    {'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'}
    2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
    {'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'}
    2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
    {'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'}
    2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
    2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
    2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 950,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 15121,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
     'item_scraped_count': 5,
     'log_count/DEBUG': 8,
     'log_count/INFO': 8,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
    2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
    url,title
    ,"
    لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە"
    ,"
    هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید"
    ,"
    گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا"
    ,"
    بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە"
    ,"
    كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە"
    

    You may need to refresh the cookies from time to time; you can use PhantomJS for that. A sketch that glues these steps together appears after step 4 below.

    Update:

    How to get the cookies with PhantomJS:

    1. Install PhantomJS.

    2. Make a script like this, dwarosh.js:

      var page = require('webpage').create();
      page.settings.userAgent = 'SpecialAgent';
      page.open('http://www.dwarozh.net/sport/', function(status) {
        console.log("Status: " + status);
        if(status === "success") {
          page.render('example.png');
          page.evaluate(function() {
            return document.title;
          });
        }
        for (var i=0; i<page.cookies.length; i++) {
          var c = page.cookies[i];
          console.log(c.name, c.value);
        };
        phantom.exit();
      });
      
    3.   $ phantomjs --cookies-file=cookie.txt dwarosh.js
        TypeError: undefined is not an object (evaluating  'activeElement.position().left')
      
        http://www.dwarozh.net/sport/js/script.js:5
        https://code.jquery.com/jquery-1.10.2.min.js:4 in c
        https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
        https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
        https://code.jquery.com/jquery-1.10.2.min.js:4 in q
      Status: success
      __auc 250ab0a9158ee9e73eeeac78bba
      __asc 250ab0a9158ee9e73eeeac78bba
      _gat 1
      _ga GA1.2.260482211.1481472111
      ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
      sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
      __cfduid d9059962a4c12e0f....1
      
    4. Take the cookie sucuri_cloudproxy_uuid_3e07984e4 and try it with curl and the same User-Agent:

      $ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
      *   Trying 104.25.209.23...
      * Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
      > GET /sport/ HTTP/1.1
      > Host: www.dwarozh.net
      > User-Agent: SpecialAgent
      > Accept: */*
      > Cookie:     sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
      >
      < HTTP/1.1 200 OK
      < Date: Sun, 11 Dec 2016 16:17:04 GMT
      < Content-Type: text/html; charset=utf-8
      < Transfer-Encoding: chunked
      < Connection: keep-alive
      < Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
      < Cache-Control: private
      < Vary: Accept-Encoding
      < Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
      < X-AspNet-Version: 4.0.30319
      < X-XSS-Protection: 1; mode=block
      < X-Frame-Options: SAMEORIGIN
      < X-Content-Type-Options: nosniff
      < X-Sucuri-ID: 15008
      < Server: cloudflare-nginx
      < CF-RAY: 30fa3ea1335237b0-ARN
      <
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml">
      <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
      Dwarozh : Sport
      </title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
      <script src="js/jquery-2.1.1.js" type="text/javascript"></script>
      <script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
      <script src="js/script.js" type="text/javascript"></script>
      <link href="css/styles.css" rel="stylesheet"/>
      <script src="js/classie.js" type="text/javascript"></script>
      <script type="text/javascript">
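    To refresh the cookies automatically, the steps above can be glued together from Python. This is a minimal, untested sketch: it assumes the dwarosh.js script from step 2 is in the working directory, that cookies are printed one "name value" pair per line as in step 3, and the helper and spider names are made up for illustration. The User-Agent must match the one set in the script, since Sucuri ties the cookie to it.

    import subprocess

    from scrapy import Request, Spider


    def fetch_sucuri_cookies():
        # Run the PhantomJS script from step 2 and capture its stdout.
        out = subprocess.check_output(
            ['phantomjs', '--cookies-file=cookie.txt', 'dwarosh.js'],
            universal_newlines=True)
        cookies = {}
        for line in out.splitlines():
            # Cookie lines look like "name value"; skip "Status: ..." and the
            # JS error noise, which all contain a colon (crude, but enough here).
            parts = line.strip().split(' ', 1)
            if len(parts) == 2 and ':' not in line:
                cookies[parts[0]] = parts[1]
        return cookies


    class DwarozhSpider(Spider):
        name = "dwarozh-fresh"
        start_urls = ["http://www.dwarozh.net/sport/"]

        def start_requests(self):
            cookies = fetch_sucuri_cookies()
            for url in self.start_urls:
                # 'SpecialAgent' matches page.settings.userAgent in dwarosh.js;
                # the Sucuri cookie is only valid for that User-Agent.
                yield Request(url=url, cookies=cookies,
                              headers={'User-Agent': 'SpecialAgent'})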
      
        2
  •  3
  •   Mekanik    8 years ago

    The general solution for parsing dynamic content is to first use something that can run JavaScript (e.g. http://phantomjs.org/), then save the HTML and feed it to your parser.

    This also helps to bypass some JS-based protections.

    phantomjs is a standalone executable that loads a URI like a real browser and evaluates all the JS. You can run it from Python via subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave])

    For an example jsProgram you can check https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js

    To save the HTML from the JS program, use fs.write(htmlFileToSave, page.content, "w");
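    Put together, a rough sketch of that flow (the file names and the savehtml.js program are assumptions, not part of this answer; savehtml.js would be rasterize.js adapted to do the fs.write above):

    import subprocess

    from scrapy.selector import Selector

    phantomjs_path = 'phantomjs'      # PhantomJS executable on PATH
    js_program_path = 'savehtml.js'   # assumed: loads argv[1], writes page.content to argv[2]
    url = 'http://www.dwarozh.net/sport/'
    html_file = 'page.html'

    # Let PhantomJS load the page, run its JS, and dump the final HTML.
    subprocess.call([phantomjs_path, js_program_path, url, html_file])

    # Feed the rendered HTML to a normal Scrapy selector.
    with open(html_file, encoding='utf-8') as f:
        sel = Selector(text=f.read())
    print(sel.xpath('//div[@class="news-more-img"]/ul/li/a/h2/text()').extract())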

    I tested this approach on dwarozh.net and it works, but you should figure out how to plug it into your scrapy pipeline.

    Specifically for your example, you can try to parse the served JavaScript "manually" to extract the cookie details needed to load the actual page. Bear in mind, though, that the Sucuri algorithm can change at any moment, and any solution based on cookie or JS decoding will break when it does.
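    For instance, the blocked response shown in the question just evals a base64 payload. A small sketch to decode it for inspection (the blocked.html file name is hypothetical, and the decoded JS still has to be interpreted to compute the cookie value):

    import base64
    import re

    # blocked.html: the "You are being redirected..." page saved from the question.
    blocked_html = open('blocked.html', encoding='utf-8').read()

    # The Sucuri loader assigns the base64 payload to S='...'.
    payload_b64 = re.search(r"S='([^']+)'", blocked_html).group(1)
    payload = base64.b64decode(payload_b64).decode('utf-8')

    # The decoded JS concatenates character fragments into document.cookie and
    # reloads the page; mimic those string operations (or run them in a JS
    # engine) to recover the sucuri_cloudproxy cookie value.
    print(payload)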