Return the first encountered result in Scrapy

web-crawler scrapy web-scraping python-3.x python

Mohsin ForRealHomie · Tech Community · 8 years ago

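Problem statement:

After parsing, I send each URL to parse_links to extract email addresses from it.

If I find an email address on one of those links, I want to stop iterating and return that result.

For example:

Suppose the loop has 2 URLs: example.com/contact and example.com/about

If an email address is found on example.com/contact, I don't want to visit the second URL. Instead, I am getting email addresses from all the links.

Here is my code: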

def parse(self, response):
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]

    for url in sorted(urls, reverse=True):
        request = Request(url, callback=self.parse_links)
        yield request

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = r'[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re('%s' % mailrex)
    if result:
        item['emails'] = result  # here how can I send first value and ignore other results
    return item

After running the spider, I get the following output:

2017-01-30 20:31:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/contact/>
{'emails': ['abc@example.com']}  # first result

2017-01-30 20:31:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/about/>
{'emails': ['xyz@example.com']}  # second result

I only want the first one.

1 reply | active until 5 years ago

mizhgun · 8 years ago

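Because Scrapy is asynchronous, you cannot even be sure that responses will reach the callback in the same order in which the requests were issued. What you can do is collect the list of URLs, pass it along in meta, and visit the URLs one at a time, like this: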

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]

    try:
        # Request one URL and pass the remaining ones to the callback via meta
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        # No links were extracted, so there is nothing to request
        pass

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = r'[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re(mailrex)
    if result:
        # Emails found: return the item and stop; the remaining URLs are never requested
        item['emails'] = result
        return item
    # No emails found: request the next URL from the list
    try:
        urls = response.meta['urls']
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        # The list is exhausted and no page contained an email
        pass
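
The control flow of this answer (take one URL, check it, and only move on to the next when nothing was found) can be tried without running a crawler. The following is a minimal sketch in plain Python, not Scrapy code: `FAKE_PAGES`, `fetch` and `first_emails` are hypothetical names introduced for illustration, and the page bodies are made-up stand-ins for downloaded responses.

```python
import re

# Same pattern as in the question, compiled once
MAIL_RE = re.compile(r'[\w.-]+@[\w.-]+')

# Hypothetical stand-in for downloaded pages: URL -> page body
FAKE_PAGES = {
    'http://example.com/about/': '<a href="mailto:xyz@example.com">Mail</a>',
    'http://example.com/contact/': '<a href="mailto:abc@example.com">Mail</a>',
    'http://example.com/team/': '<a href="/home">Home</a>',
}


def fetch(url):
    """Hypothetical fetcher; in a real spider Scrapy downloads the page."""
    return FAKE_PAGES.get(url, '')


def first_emails(urls, fetch_page=fetch):
    """Visit URLs one at a time and stop at the first page with emails."""
    pending = list(urls)  # copy, so the caller's list is not mutated
    visited = []
    while pending:
        url = pending.pop()  # like urls.pop() in the answer: last URL first
        visited.append(url)
        emails = MAIL_RE.findall(fetch_page(url))
        if emails:
            # Stop here: the URLs still in `pending` are never fetched
            return {'url': url, 'emails': emails, 'visited': visited}
    return None  # no page contained an email address


urls = [
    'http://example.com/about/',
    'http://example.com/contact/',
    'http://example.com/team/',
]
result = first_emails(urls)
print(result['url'], result['emails'], result['visited'])
```

Note that `pop()` takes URLs from the end of the list, so the visiting order is the reverse of the list order. In this sketch `/team/` is checked first, `/contact/` yields an address, and `/about/` is never fetched.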
