
Assembling one Scrapy item from multiple parse callbacks

scrapy python

Maciek · Tech community · 6 years ago

Find all 'pages' nodes in the XML file.
Parse all of those pages, collect data, and find additional pages.

Scrapy script:

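from pprint import pprint

from scrapy import Request
from scrapy.spiders import XMLFeedSpider


class test_spider(XMLFeedSpider):
    name = 'test'
    start_urls = ['https://www.example.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'test.test_pipe': 100,
        },
    }
    itertag = 'pages'

    def parse1(self, response, node):
        # Follow each 'pages' node to its XML export.
        yield Request('https://www.example.com/' + node.xpath('@id').extract_first() + '/xml-out', callback=self.parse2)

    # XMLFeedSpider calls parse_node() for every itertag node, so alias it to parse1.
    parse_node = parse1

    def parse2(self, response):
        yield {'COLLECT1': response.xpath('/@id').extract_first()}
        # `root` (an XPath prefix) is not defined in the original post.
        for text in (response.xpath(root + '/node[@id="page"]/text()').extract_first() or '').split('^'):
            if text:  # skip empty fragments; `is not ''` compares identity, not value
                yield Request(
                    'https://www.example.com/' + text,
                    callback=self.parse3,
                    dont_filter=True
                )

    def parse3(self, response):
        yield {'COLLECT2': response.xpath('/@id').extract_first()}


# Note: ITEM_PIPELINES above references 'test.test_pipe'; that path must point at this class.
class listings_pipe(object):
    def process_item(self, item, spider):
        pprint(item)
        return item  # pipelines should return the item so later stages receive it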

{'COLLECT1':'some data','COLLECT2':['some data','some data',…]}

Is there a way to call the pipeline after each parse1 event, with everything merged into a single item like the one above?

1 answer | 6 years ago

ThunderMind · 6 years ago

In your parse2 method, use meta to pass your collection1 on to parse3. Then, in parse3, retrieve collection1 from meta and extract collection2, so both end up in the same item.

For more information about meta, see the Scrapy documentation for Request.meta.
