
Parse text and keep the original formatting - Ruby/Rails

  •  1
  • Katie H  · 11 years ago

    I'm trying to parse a URL and return the text in its original format.

    I'm using this gem: https://github.com/cantino/ruby-readability

    Here's what I have:

    require 'rubygems'
    require 'readability'
    require 'open-uri'
    
    source = open(@string).read
    @text = Readability::Document.new(source).content
    

    This only gives me the text with the HTML tags that are used for formatting.

    I've tried:

    @text = Readability::Document.new(source, tags: []).content
    

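    This just strips the HTML tags from the text, and the text is all run together. I'm trying to grab the text and keep the line breaks and spacing, without any HTML tags. I then want to run the text through some additional algorithms. I'm not displaying the text in any view; if I were, I would simply call the simple_format helper.

    For example, for this URL: http://blogs.discovermagazine.com/d-brief/2014/08/01/physicist-invents-color-changing-ice-cream/

    I'd like to save the text in its unaltered format: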

    @text = %Q[Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.
    
    The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.
    
    What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. 
    
    “As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.
    
    Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.]
    
    2 replies  |  as of 11 years ago
        1
  •  1
  •   the Tin Man    11 years ago

    At its most basic, you could try:

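    require 'nokogiri'
    require 'open-uri'
    require 'readability'

    # On Ruby 3.0+, call URI.open here: a bare open no longer accepts URLs.
    @text = Readability::Document.new(
      Nokogiri::HTML(open('url_to_content')).text
    ).content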
    

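    Nokogiri is the de facto standard for parsing XML and HTML in Ruby.

    Nokogiri::HTML(open('url_to_content')) is the basis for 99.99% of web page parsing. text returns the content of the text nodes in the document.

    That said, you really have to dig into the page and extract only the part that contains your text, because the page itself has links to other pages, ads, and so on, all of which get picked up when you use text against the root node.

    It looks like a CSS selector of 'div.entry p' will get you close: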

    doc.search('div.entry p').text 
    

    This returns:

    doc.search('div.entry p').text
    "The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. \n“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.Thanks for your expert opinion!I definitely see your point. 
How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not. But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.“Love elixir” had better not be what it sounds like…mmmm love elixer,.."
    

    The printed output looks a little better, and shows that there are some line ends embedded in the text:

    puts doc.search('div.entry p').text
    The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending.
    “As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.Thanks for your expert opinion!I definitely see your point. How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not. But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.“Love elixir” had better not be what it sounds like…mmmm love elixer,..
    

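    If you want a better idea of the text as it is displayed, the blank line that follows each <p> tag has to be accounted for too, which is easy to do with a small tweak to the code: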

    puts doc.search('div.entry p').map(&:text)
    The color-changing flavor, Xamaleon. (Credit: Manual Linares, Cocinatis)
    Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.
    The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.
    What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending.
    “As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.
    Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.
    Interesting, but I hope he’s tested thoroughly to make sure it’s safe for consumption in the long term. It would be pretty difficult to do safety testing if you’re not revealing the recipe. And the last part, the ‘African and Peruvian herbs’ sounds so snake oil salesmanly that I’m almost calling BS on the whole article.
    Thanks for your expert opinion!
    I definitely see your point. How dare he quit his job, venture into the booming food intusdry, and make something totally amazing only to keep the recipe a secret so some asshat can’t just steal his idea. What a jerk.
    Yes yes the physicist is coming off all snakey oily and salesmanly. The guy is selling ice cream.
    There are plenty of naturally occurring aphrodisiacs. You shouldn’t judge the whole article just because you didn’t know, or don’t agree.
    I kinda go in the same direction with the safety testing if any synthetic ingredients are included, and transparency is not.
    But if the recipe is all natural with only food ingredients found in other foods, (including the accelerator elixir..) such as red cabbage, fructose, etc to mimic Ph paper, it’s be a different story….
    Looks like red cabbage PH indicator at work. I don’t think Mr. Linares’ physics degree came into play in “inventing” this.
    “Love elixir” had better not be what it sounds like…
    mmmm love elixer,..
    

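    Here's what's happening:

    • doc.search('div.entry p') returns a NodeSet, which is akin to an Array, containing the <p> nodes. search is one of several similar methods Nokogiri provides for finding all matching nodes in a document.
    • map(&:text) iterates over the NodeSet and returns the text of each element, effectively returning each paragraph.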
        2
  •  1
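  •   Mark Silverberg    11 years ago

    Here's a long and crufty way to take what ruby-readability gives you and get it closer to what you want. You'll need to test it to see whether it works for the other articles you want to scrape.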

    Readability::Document.new(source, :blacklist => ".wp-caption-text", :tags => ["div","p"]).content.gsub("\n","").gsub("\r","").gsub("\t","").gsub("&#13;","").gsub("<div>","").gsub("</div>","").strip

    Output:

    => "<p>Believe it or not, it is possible to make ice cream even better. Manuel Linares, a former physicist turned cook, has invented a variant on the classic treat that changes colors as you lick it.</p><p>The new creamy concoction called Xamaleón — an homage to “chameleon” — transitions from periwinkle to pink when it touches the tongue, and tastes similar to “tutti-frutti,” Phys.org reports. The ice cream’s colorful trick relies on both changes in temperature and reactions to acids in the human mouth. However, Linares isn’t revealing any more details about his secret recipe.</p><p>What we do know is the ice cream is made with natural ingredients like strawberries, banana, vanilla and almonds. Additionally, Linares sprays what he calls a “love elixir” on the ice cream after it’s scooped to help accelerate the reaction. We probably won’t know the whole story behind Xamaleón until Linares secures a patent for his creation, which is pending. </p><p>“As a physicist I know that there are various possibilities that might work and I was delighted when I managed to crack it and create an ice cream that changes color,” Linares said.</p><p>Earlier this year Linares opened an ice cream shop in Blanes, his hometown in Spain, and has plans for more exotic ice cream flavors in the future. Up next, he says: An ice cream made with Peruvian and African medicinal plants that will provide an aphrodisiac effect.</p>

    I left the <p> tags in so that you can add line breaks or do whatever you want with them.
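
    For instance, a minimal sketch of turning those <p> tags into blank lines with plain String methods might look like this. The sample string and the helper name paragraphs_to_plain_text are made up for illustration, and the sketch assumes the markup contains nothing but bare <p>...</p> pairs, as in the output above. It does not decode HTML entities.

```ruby
# Hypothetical sample shaped like the cleaned-up output above.
html = "<p>First paragraph.</p><p>Second paragraph. </p>"

# Replaces paragraph boundaries with a blank line and drops the remaining tags.
def paragraphs_to_plain_text(html)
  html.gsub(%r{</p>\s*<p>}, "\n\n") # boundary between two paragraphs
      .gsub(%r{</?p>}, "")          # leftover opening and closing tags
      .strip
end

plain = paragraphs_to_plain_text(html)
puts plain
```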