
How do I remove URLs from text?

regex ruby

2

James · 16 years ago

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I want to remove all hyperlinks and get back plain text:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

3 answers | latest 13 years ago

1

hobodave 16 years ago

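foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
# Match http:// or https:// followed by word characters, dots, colons, and slashes.
# rstrip drops the space that was left in front of the removed URL.
r = foo.gsub(/https?:\/\/[\w\.:\/]+/, '').rstrip
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands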
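
The character class above only covers word characters, dots, colons, and slashes, so a URL containing a hyphen or a query string would be cut short. As a minimal sketch, assuming that any run of non-whitespace characters after http:// or https:// belongs to the URL, a broader version might look like this (strip_urls is a hypothetical helper name, not part of the original answer):

```ruby
# Hypothetical helper: removes http/https URLs and tidies the leftover spaces.
# Assumption: a URL ends at the first whitespace character.
def strip_urls(text)
  text.gsub(%r{https?://\S+}, '') # drop each URL
      .squeeze(' ')               # collapse runs of spaces left behind
      .strip                      # trim leading and trailing whitespace
end

puts strip_urls("@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3")
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
```

One trade-off of matching on \S+ is that punctuation directly after a link, such as a closing period, is removed along with it.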

2

1

the Tin Man 13 years ago

This is an old question, but a good one. Here's an answer that relies on Ruby's built-in URI:

require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

URI.extract(text).each do |url|
  text.gsub!(url, '') if (url[schemes_regex])
end

puts text.squeeze(' ')

I define the text to search:

irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i

irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end

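These are the URLs that URI.extract finds. It incorrectly reports BreakingNews: because of the trailing colon. I don't think it's very sophisticated, but it's fine for normal use: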

=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]

irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
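
URI.extract also accepts a list of schemes as a second argument, which avoids the BreakingNews: false positive without a separate filtering regex. A minimal sketch, assuming only http and https links should be removed (remove_web_urls is a hypothetical helper name; note that recent Ruby versions mark URI.extract as obsolete and warn about it in verbose mode):

```ruby
require 'uri'

# Hypothetical helper: removes only the http/https URLs that URI.extract finds.
def remove_web_urls(text)
  cleaned = text.dup # work on a copy so the caller's string is untouched
  URI.extract(text, %w[http https]).each do |url|
    cleaned.gsub!(url, '') # the String pattern is matched literally
  end
  cleaned.squeeze(' ').strip
end

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts remove_web_urls(text)
```

Working on a copy rather than calling gsub! on the input keeps the helper usable with frozen string literals.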

3

-1

vulcan_hacker 16 years ago

It can be done the quick and dirty way, or the sophisticated way. I'm showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## first, get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/) 

## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/