The site uses cookie- and User-Agent-based protection. You can check it like this: open the developer tools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then in the Network tab right-click the page's request and choose "Copy as cURL".
Open a terminal and run the cURL command:
$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed
You will see the normal HTML. If you remove the cookies or the User-Agent from the request, you get the captcha page instead.
Let's check it in scrapy shell:
$ scrapy shell
>>> from scrapy import Request
>>> cookie_str = '''here; your; cookies; from; browser; go;'''
>>> cookies = dict(pair.split('=') for pair in cookie_str.split('; '))
>>> cookies # check them
{'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49', 'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9', '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
>>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54. Safari/5'})
>>> fetch(r)
>>> response.xpath('//div[@class="news-more-img"]/ul/li')
[<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]
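A note on the one-liner above: `dict(pair.split('=') for pair in cookie_str.split('; '))` raises an error if any cookie value itself contains an `=` (base64-encoded values often do). A slightly more defensive sketch, splitting only on the first `=` (the helper name is mine, not from the original):

```python
# Parse a raw "Cookie:" header string (as copied from the browser) into a
# dict suitable for scrapy.Request(cookies=...). str.partition splits on
# the first '=' only, so values containing '=' survive intact.
def parse_cookie_header(cookie_str):
    cookies = {}
    for pair in cookie_str.split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

print(parse_cookie_header('_gat=1; _ga=GA1.2.19688; __atuvc=1%7C49'))
# → {'_gat': '1', '_ga': 'GA1.2.19688', '__atuvc': '1%7C49'}
```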
I modified your spider a bit, because I don't have the source code of some of its components.
from scrapy import Spider, Request
from scrapy.selector import Selector
import scrapy
#from Stack.items import StackItem
#from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        cookies = dict(pair.split('=') for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}
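The per-item extraction in parse can be sketched against a minimal fragment using the stdlib ElementTree (just a stand-in for Scrapy's selectors here; the fragment is mine). Note that the spider's `'viewa/@href'` expression matches nothing, which is why url comes out as None in the log below; `'a/@href'` is presumably what was intended:

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed stand-in for one <li> from the listing page.
fragment = '<li><a href="details.aspx?jimare=10507"><h2>Example title</h2></a></li>'
li = ET.fromstring(fragment)
item = {
    'title': li.findtext('a/h2'),     # mirrors mItem.xpath('a/h2/text()')
    'url': li.find('a').get('href'),  # what 'a/@href' would return
}
print(item)
# → {'title': 'Example title', 'url': 'details.aspx?jimare=10507'}
```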
Let's run it:
$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
return f(*args, **kwds)
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-10 00:18:55 [scrapy] INFO: Spider opened
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nÙÛØ³ØªÛ ÛØ§Ø±ÛØ²Ø§ÙØ§ÙÛ Ø±ÛØ§Úµ Ù
ÛØ¯Ø±Ûد Ø¨Û ÛØ§Ø±Û سبÛÛ ÚØ§Ú¯ÛÛÛÙØ±Ø§Ù Ù¾ÛÙØ¬ ÛØ§Ø±ÛØ²Ø§Ù Ø¯ÙØ±Ø®Ø±Ø§ÙÛÙÛ'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nÙÛÙØ§ÚµÛÚ©Û ÙØ§Ø®ÛØ´ Ø¨Û ÙØ§ÙØ¯ÛØ±Ø§ÙÛ Ø±ÛØ§Úµ Ù
ÛØ¯Ø±Ûد'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nگرÙگترÛÙ Ù
Ø§ÙØ´ÛØªÛ Ø¦ÛÙ
Ø±Û ÙÛÛÙÛ Ø±ÛÚÙØ§Ù
ÛکاÙÛ Ø¦ÛØ³Ù¾Ø§ÙÛØ§'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nبÛÙÛØ±Ù
Û ÛÛÙØ§ Ù¾ÛÙÙØ§ØªÛÛ ÙÙ
ÙÙÙÛÛ Ø¬ÛÙÙÛÛ Ø´ÛØ´ÛÙ
Ù Ú©ÛØªØ§ÛÛ ÚØ§Ù
Ù¾ÛÛÙØ³ ÙÛÚ¯Û Ø¨ÚµØ§Ù Ú©Ø±Ø¯ÛÙÛ'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nÙÚÛ ÛØ§Ø±ÛزاÙÛÙ Ø¯ÛØ¨ÛØªÛ ÙÛÛÛ Ø¯Ø±ÙØ³Øª بÙÙÙÛ ØªÛÙ¾ÛÙÛ ØªÛÙÙ
Û'}
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 950,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 15121,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
'item_scraped_count': 5,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
url,title
,"
ÙÛØ³ØªÛ ÛØ§Ø±ÛØ²Ø§ÙØ§ÙÛ Ø±ÛØ§Úµ Ù
ÛØ¯Ø±Ûد Ø¨Û ÛØ§Ø±Û سبÛÛ ÚØ§Ú¯ÛÛÛÙØ±Ø§Ù Ù¾ÛÙØ¬ ÛØ§Ø±ÛØ²Ø§Ù Ø¯ÙØ±Ø®Ø±Ø§ÙÛÙÛ"
,"
ÙÛÙØ§ÚµÛÚ©Û ÙØ§Ø®ÛØ´ Ø¨Û ÙØ§ÙØ¯ÛØ±Ø§ÙÛ Ø±ÛØ§Úµ Ù
ÛØ¯Ø±Ûد"
,"
گرÙگترÛÙ Ù
Ø§ÙØ´ÛØªÛ Ø¦ÛÙ
Ø±Û ÙÛÛÙÛ Ø±ÛÚÙØ§Ù
ÛکاÙÛ Ø¦ÛØ³Ù¾Ø§ÙÛØ§"
,"
بÛÙÛØ±Ù
Û ÛÛÙØ§ Ù¾ÛÙÙØ§ØªÛÛ ÙÙ
ÙÙÙÛÛ Ø¬ÛÙÙÛÛ Ø´ÛØ´ÛÙ
Ù Ú©ÛØªØ§ÛÛ ÚØ§Ù
Ù¾ÛÛÙØ³ ÙÛÚ¯Û Ø¨ÚµØ§Ù Ú©Ø±Ø¯ÛÙÛ"
,"
ÙÚÛ ÛØ§Ø±ÛزاÙÛÙ Ø¯ÛØ¨ÛØªÛ ÙÛÛÛ Ø¯Ø±ÙØ³Øª بÙÙÙÛ ØªÛÙ¾ÛÙÛ ØªÛÙÙ
Û"
You will probably need to refresh the cookies from time to time. You can do that with PhantomJS.
UPDATE:
How to get the cookies with PhantomJS:
- Install PhantomJS.
- Make a script like this, dwarosh.js:
var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.dwarozh.net/sport/', function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        page.render('example.png');
        page.evaluate(function() {
            return document.title;
        });
    }
    for (var i = 0; i < page.cookies.length; i++) {
        var c = page.cookies[i];
        console.log(c.name, c.value);
    }
    phantom.exit();
});
- Run it:
$ phantomjs --cookies-file=cookie.txt dwarosh.js
TypeError: undefined is not an object (evaluating 'activeElement.position().left')
http://www.dwarozh.net/sport/js/script.js:5
https://code.jquery.com/jquery-1.10.2.min.js:4 in c
https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
https://code.jquery.com/jquery-1.10.2.min.js:4 in q
Status: success
__auc 250ab0a9158ee9e73eeeac78bba
__asc 250ab0a9158ee9e73eeeac78bba
_gat 1
_ga GA1.2.260482211.1481472111
ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
__cfduid d9059962a4c12e0f....1
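To feed the refreshed cookies back into the Scrapy spider, the script's stdout can be parsed into a cookie dict. A sketch (the helper name is mine; it assumes one "name value" pair per line and skips status/error lines whose first token ends with a colon):

```python
# Parse the "name value" lines printed by dwarosh.js into a dict for
# scrapy.Request(cookies=...). Lines like "Status: success" are skipped
# because their first token ends with ':'.
def cookies_from_phantom_output(output):
    cookies = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and not parts[0].endswith(':'):
            cookies[parts[0]] = parts[1]
    return cookies

sample = "Status: success\n_gat 1\n_ga GA1.2.260482211.1481472111"
print(cookies_from_phantom_output(sample))
# → {'_gat': '1', '_ga': 'GA1.2.260482211.1481472111'}
```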
- Take the sucuri_cloudproxy_uuid_3e07984e4 cookie and try it with curl and the same User-Agent:
$ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
* Trying 104.25.209.23...
* Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
> GET /sport/ HTTP/1.1
> Host: www.dwarozh.net
> User-Agent: SpecialAgent
> Accept: */*
> Cookie: sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
>
< HTTP/1.1 200 OK
< Date: Sun, 11 Dec 2016 16:17:04 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
< Cache-Control: private
< Vary: Accept-Encoding
< Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
< X-AspNet-Version: 4.0.30319
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< X-Sucuri-ID: 15008
< Server: cloudflare-nginx
< CF-RAY: 30fa3ea1335237b0-ARN
<
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
Dwarozh : Sport
</title><meta content="Ø¯ÙØ§ÚÛÚ Ø³Ù¾ÛØ±Øª ÙÛÙØ§ÚµÛ ÙØ§ÙØ®ÛØÙÛÙØ§ÚµÛ جÛÙØ§ÙÛØ ÙÛØ±Ø²Ø´ÛکاÙÛ Ø¯ÛÚ©Û" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
<script src="js/jquery-2.1.1.js" type="text/javascript"></script>
<script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
<script src="js/script.js" type="text/javascript"></script>
<link href="css/styles.css" rel="stylesheet"/>
<script src="js/classie.js" type="text/javascript"></script>
<script type="text/javascript">