
Can someone define a regular expression to match the HTML code below?

  •  -2
  • vijaysylvester  · Tech Community  · 14 years ago

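    I'm doing some web scraping, and I'm looking for div elements with a specific class name and markup.

    Here is my target: I want to extract everything inside the div with the class s_specs_box s_box_4.

    Can someone provide a regular expression in .NET terms (that is, one that can be passed directly to the Regex constructor) that matches a div like the one below?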

    <div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>

    Thanks in advance,

    Vijay

    3 replies  |  as of 14 years ago
        1
  •  4
  •   SLaks    14 years ago

    You cannot parse HTML with regular expressions.

    Instead, you should use the HTML Agility Pack in C# or jQuery in JavaScript.

    For example:

    var html = document.DocumentNode.Descendants("div")
        .First(div => div.GetAttributeValue("class", null) == "s_specs_box s_box_4")
        .InnerHtml;
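
    The snippet above assumes a `document` that has already been loaded. A minimal self-contained sketch of the same idea is shown here, assuming the HtmlAgilityPack NuGet package is referenced. The names `SpecsBoxExtractor`, `ExtractSpecsBoxes` and `pageHtml` are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // external NuGet package, not part of the .NET base library

public static class SpecsBoxExtractor
{
    // Hypothetical helper: returns the inner HTML of every <div> whose class
    // attribute is exactly "s_specs_box s_box_4" (the class from the question).
    public static string[] ExtractSpecsBoxes(string pageHtml)
    {
        var document = new HtmlDocument();
        document.LoadHtml(pageHtml); // parse the markup from a string

        return document.DocumentNode
            .Descendants("div")
            // Exact string comparison: a div with extra classes or a
            // different class order would not be selected.
            .Where(div => div.GetAttributeValue("class", "") == "s_specs_box s_box_4")
            .Select(div => div.InnerHtml)
            .ToArray();
    }
}
```

    Because the parser builds a tree, a div nested inside the target div does not cut the result short; the nested markup is simply part of the returned inner HTML.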
    
        3
  •  0
  •   Mike Clark    14 years ago

    This works for the sample data you provided:

    string subject = "<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>";
    Match match = Regex.Match(subject,
        @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
        RegexOptions.Singleline);
    Console.WriteLine(match.Success);
    string result = match.Groups[1].Value;
    Console.WriteLine(result);
    

    Disclaimer 1: This will not work if the target <div> has a nested <div> inside it.

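    Disclaimer 2: Do not use regex to parse HTML in production code, or on unknown future input. If you are only using it to batch-convert a few dozen HTML files on your hard drive and then verifying the results by hand, that is fine. For new, unknown input, it should not be trusted.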
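
    The lazy `(.*?)` in the pattern above stops at the first closing `</div>` it meets, so a target div that contains another div would be cut short. If a regex must be used anyway, .NET balancing groups can count nested div tags. The following is only a sketch under that assumption, not part of the original answer: `NestedDivMatcher`, `ExtractInner` and the group names `inner` and `depth` are hypothetical, and the pattern still ignores comments, scripts, single-quoted attributes and extra classes, so the caution about unknown input applies here as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivMatcher
{
    // Hypothetical helper pattern using .NET balancing groups.
    private static readonly Regex SpecsBox = new Regex(
        @"<div\b[^>]*\bclass\s*=\s*""s_specs_box\s+s_box_4""[^>]*>  # opening tag of the target div
          (?<inner>
            (?>
                <div\b[^>]*> (?<depth>)     # nested opening tag: push onto the depth stack
              | </div\s*>    (?<-depth>)    # nested closing tag: pop, fails when the stack is empty
              | (?!</?div\b) .              # any character that does not start a div tag
            )*
          )
          (?(depth)(?!))                    # fail if a nested div was never closed
          </div\s*>                         # closing tag of the target div",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Returns the inner HTML of the first matching div, or null if none matches.
    public static string ExtractInner(string html)
    {
        Match match = SpecsBox.Match(html);
        return match.Success ? match.Groups["inner"].Value : null;
    }
}
```

    For an input such as `<div class="s_specs_box s_box_4"><p>a</p><div>inner</div><p>b</p></div>`, the lazy pattern captures only `<p>a</p><div>inner`, while `NestedDivMatcher.ExtractInner` is meant to return the full content up to the matching closing tag.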
