
Remove all attributes from HTML tags

php

Andres SK · 15 years ago

I have this HTML code:

<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>

But it should become (for all possible HTML tags):

<p>
<strong>hello</strong>
</p>

9 replies | last reply 7 years ago

136

Community CDub 8 years ago

Adapted from my answer on a similar question

$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';

echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $text);

// <p><strong>hello</strong></p>

The regexp breaks down as follows:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 [^>]*?        # Match anything other than '>', zero or more times, non-greedy (won't eat the /)
 (\/?)         # Capture Group $2 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

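Add some quoting and use the replacement text <$1$2>; this should strip any text after the tag name up to the end of the tag, whether that is /> or just >.

Please note: this will not necessarily work on all input, as the anti-HTML+regexp crowd will tell you. There are a few failure cases, most notably <p style=">">, which ends up as <p>">, plus a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.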
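The failure case described above is easy to reproduce. The following is a minimal sketch that reuses the pattern from this answer; the variable names and sample strings are only illustrative.

```php
<?php
// Same pattern as in the answer above.
$pattern = "/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i";

// Ordinary and self-closing tags are handled as intended.
$ok = preg_replace($pattern, '<$1$2>', '<p style="padding:0px;">hi<br class="x"/></p>');
echo $ok, "\n";      // <p>hi<br/></p>

// A '>' inside a quoted attribute value ends the match too early.
$broken = preg_replace($pattern, '<$1$2>', '<p style=">">hi</p>');
echo $broken, "\n";  // <p>">hi</p>
```

The second call shows why a pattern that only looks for the next '>' cannot tell markup apart from attribute values.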

eozzy 8 years ago

Here is how to do it with native DOM:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html);                  // load HTML into it
$xpath = new DOMXPath($dom);            // create a new XPath
$nodes = $xpath->query('//*[@style]');  // Find elements with a style attribute
foreach ($nodes as $node) {              // Iterate over found elements
    $node->removeAttribute('style');    // Remove style attribute
}
echo $dom->saveHTML();                  // output cleaned HTML

If you want to remove all possible attributes from all possible tags, do this:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
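Note that loadHTML() wraps a fragment in a doctype plus html and body elements, so saveHTML() on the whole document prints more than was passed in. Below is a rough sketch of one way to get only the fragment back. The function name strip_all_attributes and the wrapper div are assumptions made for this illustration, not part of the answer above, and character encoding is not handled here.

```php
<?php
// Hypothetical helper, not from the original answer.
function strip_all_attributes(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings for sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its top-level nodes can be found again afterwards.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every attribute node and remove it from its owning element.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        $attr->ownerElement->removeAttribute($attr->nodeName);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = strip_all_attributes('<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>');
echo $clean, "\n";
```

Wrapping in a div is only one option; it was chosen here because it works for fragments with several top-level nodes.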

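Yacoby 15 years ago

I would avoid regex here, since HTML is not a regular language, and use an HTML parser such as Simple HTML DOM instead.

You can get a list of an element's attributes by using attr. For example: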

$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
  ["id"]=>
  string(5) "hello"
}
*/

foreach ( $html->find("div", 0)->attr as &$value ){
    $value = null;
}

print $html;
//<div>World</div>

TobiasDeVil 11 years ago

$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;

// Result:
// Hello <b>world</b>. Its beautiful day.

Greg K 13 years ago

Regex is too fragile for HTML parsing. For your example, the following will strip the attributes:

echo preg_replace(
    "|<(\w+)([^>/]+)?|",
    "<$1",
    "<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);

Update

Made the second capture optional and stopped it from stripping the '/' from closing tags:

|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|

A demonstration that this regular expression works:

$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
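To illustrate the fragility mentioned at the top of this answer: because the second group stops at the first '/', an attribute value that contains a slash is only partly removed. A minimal sketch with a made-up URL:

```php
<?php
// Same pattern and replacement as in the answer above.
$pattern = '|<(\w+)([^>/]+)?|';

$html = '<a href="http://example.com/">link</a>';
$result = preg_replace($pattern, '<$1', $html);
echo $result, "\n";  // <a//example.com/">link</a>
```

The match ends at the slash in "http://", so the rest of the attribute value is left behind in the output.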

Sp4cecat 13 years ago

To do specifically what the asker wants, it's simply:

$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]*>#", "\\1>", $html );

That is, he wants to strip everything except the tag name from the opening tag. Of course, it won't work for self-closing tags.
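A short sketch of the self-closing limitation. The pattern here uses [^>]* rather than [^>]+ so that a tag without attributes, such as <strong>, is matched whole instead of losing its last letter to backtracking; the sample markup is made up for this illustration.

```php
<?php
// Variant of the pattern above: keep the tag name, drop everything up to '>'.
$pattern = '#(<[a-zA-Z0-9]+)[^>]*>#';

$html = '<p class="x">line<br />break <img src="a.png"/></p>';
$result = preg_replace($pattern, '$1>', $html);
echo $result, "\n";  // <p>line<br>break <img></p>
```

The trailing slash of <br /> and <img .../> is swallowed along with the attributes, which matters if the output has to stay valid XHTML.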

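Brandon Orth 12 years ago

Hope this helps. It may not be the fastest approach, especially for large blocks of HTML. If anyone has suggestions for speeding it up, let me know.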

function StringEx($str, $start, $end)
{ 
    $str_low = strtolower($str);
    $pos_start = strpos($str_low, $start);
    $pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
    if($pos_end==0) return false;
    if ( ($pos_start !== false) && ($pos_end !== false) )
    {  
        $pos1 = $pos_start + strlen($start);
        $pos2 = $pos_end - $pos1;
        $RData = substr($str, $pos1, $pos2);
        if($RData=='') { return true; }
        return $RData;
    } 
    return false;
}

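$S = '<';
$E = '>';
// Replace each tag with a placeholder so that it is not matched again on the next pass.
while (($RData = StringEx($DATA, $S, $E)) !== false) {
    // Strict check: a loose == true matches every non-empty string and the loop would never end.
    if ($RData === true) { $RData = ''; }
    $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA);
}
// Turn the placeholders back into brackets; each tag is left as an empty '<>'.
$DATA = str_ireplace('||||||', $S.$E, $DATA);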

Tizón 12 years ago

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

Greg Randall 7 years ago

Here is a simple way to strip attributes. It copes well with malformed HTML.

<?php
  $string = '<p style="padding:0px;">
    <strong style="padding:0;margin:0;">hello</strong>
    </p>';

  //get all html elements on a line by themselves
  $string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string); 

  //find lines starting with a '<' and any letters or numbers up to the first space. throw everything after the space away.
  $string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);

  echo $string_attribute_free;
?>