
How do I remove URLs from text?

  •  2
  • James  ·  16 years ago

    @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

    I want to remove every hyperlink and get back the plain text:

    @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
    
    3 answers  |  latest 13 years ago
        1
  •  1
  •   hobodave    16 years ago
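    foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
    # Delete anything that starts with http:// and continues with word characters, dots, colons, or slashes.
    r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
    puts r
    # => @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
    # (a trailing space remains where the URL used to be)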
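
    The pattern above only covers plain http:// links made of word characters, dots, colons, and slashes. A slightly broader sketch, assuming a URL is any run of non-whitespace that starts with http:// or https:// (the constant and method names here are made up for illustration, not part of the original answer):

```ruby
# Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = %r{https?://\S+}

# Hypothetical helper: drop every match, then tidy the leftover spacing.
def remove_urls_by_regex(text)
  text.gsub(URL_PATTERN, '').squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts remove_urls_by_regex(tweet)
# => @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
```

    Because \S+ is greedy, punctuation glued to the end of a link (such as a closing period) is removed along with it, which may or may not be what you want.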
    
        2
  •  1
  •   the Tin Man    13 years ago

    This is an old question, but a good one. Here is an answer that relies on Ruby's built-in URI:

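    require 'uri'
    
    text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
    
    # Match only strings that begin with a scheme URI knows about (HTTP, FTP, MAILTO, ...).
    schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
    
    # Remove each extracted candidate that really starts with a known scheme.
    URI.extract(text).each do |url|
      text.gsub!(url, '') if url[schemes_regex]
    end
    
    # Collapse the double spaces left behind by removed URLs.
    puts text.squeeze(' ')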
    

    First I define the text to search:

    irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
    => "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
    

    Next comes a regex built from the schemes URI recognizes:

    irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
    => /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i
    

    Then I walk the extracted candidates and delete the ones that start with a known scheme:

    irb(main):008:0* URI.extract(text).each do |url|
    irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
    irb(main):010:1> end
    

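    These are the URLs that URI.extract found. It wrongly reports BreakingNews: because of the trailing colon. I don't think it is very sophisticated, but it works fine for normal use: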

    => ["BreakingNews:", "http://news.bnonews.com/u4z3"]
    

    irb(main):012:0* puts text.squeeze(' ')
    @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands 
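
    The BreakingNews: false positive mentioned above can also be avoided at the source, since URI.extract accepts an optional list of schemes as its second argument. A minimal sketch, assuming only http and https links matter (the method name strip_extracted_urls is made up for illustration; recent Ruby documentation marks URI.extract as obsolete, although it is still available):

```ruby
require 'uri'

# Hypothetical helper: remove every URL that URI.extract finds for the given schemes.
def strip_extracted_urls(text, schemes = %w[http https])
  cleaned = text.dup
  # Longest URLs first, so a URL that is a prefix of another does not leave a fragment behind.
  URI.extract(text, schemes).sort_by { |url| -url.length }.each do |url|
    cleaned.gsub!(url, '')
  end
  # Collapse doubled spaces and trim the ends.
  cleaned.squeeze(' ').strip
end

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_extracted_urls(text)
# => @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
```

    With the scheme list passed in, the separate schemes_regex check is no longer needed, because candidates such as BreakingNews: are never returned in the first place.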
    
        3
  •  -1
  •   vulcan_hacker    16 years ago

    This can be done the quick and dirty way or the sophisticated way. Here is the sophisticated way:

    require 'rubygems'
    require 'hpricot' # you may need to install this gem
    require 'open-uri'
    
    ## first, get the URL of the embedded/framed HTML file
    start_url = 'http://news.bnonews.com/u4z3'
    doc = Hpricot(open(start_url))
    news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/) 
    
    ## now get the news text; it is in the 3rd <p> tag of the framed HTML file
    doc2 = Hpricot(open(news_html_url.to_s))
    news_text = doc2.at('//p[3]').to_plain_text
    puts news_text
    

    http://wiki.github.com/why/hpricot/an-hpricot-showcase

    http://code.whytheluckystiff.net/doc/hpricot/
