Regex is capturing the whole string

  •  3
  • user216441  ·  16 years ago

    (public|private +)?function +([a-zA-Z_$][0-9a-zA-Z_$]*) *\\(([0-9a-zA-Z_$, ]*)\\) *{(.*)}
    

    is being matched against the following string:

    public function messenger(text){
    sendMsg(text);
    }
    private function sendMsg(text){
    alert(text);
    }
    

    I want it to capture both functions, but instead it is capturing: $1: "" $2: "messenger" $3: "text" $4: "sendMsg(text); } private function sendMsg(text){ alert(text);"

    By the way, I'm using JavaScript.

    4 answers  |  as of 16 years ago
        1
  •  3
  •   user187291    16 years ago

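    Since you accepted my (incorrect) answer in the other thread, I feel obliged to post a proper solution. It won't be quick or short, but hopefully it helps a bit.

    Here is how I would write a regexp-based parser for a C-like language, if I had to.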

    <script>
    /* 
    Let's start with this simple utility function. It's a
    kind of stubborn version of String.replace() - it
    checks the string over and over again, until nothing
    more can be replaced
    */
    
    function replaceAll(str, regexp, repl) {
        str = str.toString();
        while(str.match(regexp))
            str = str.replace(regexp, repl);
        return str;
    }
    
    /*
    Next, we need a function that removes specific
    constructs from the text and replaces them with
    special "markers", which are "invisible" for further
    processing. The matches are collected in a buffer so
    that they can be restored later.
    */
    
    function isolate(type, str, regexp, buf) {
        return replaceAll(str, regexp, function($0) {
            buf.push($0);
            return "<<" + type + (buf.length - 1) + ">>";
        });
    } 
    
    /*
    The following restores "isolated" strings from the
    buffer:
    */
    
    function restore(str, buf) {
        return replaceAll(str, /<<[a-z]+(\d+)>>/g, function($0, $1) {
            return buf[parseInt($1)];
        });
    }
    
    /*
    Write down the grammar. Javascript regexps are
    notoriously hard to read (there is no "comment"
    option like in perl), therefore let's use more
    readable format with spacing and substitution
    variables. Note that "$string" and "$block" rules are
    actually "isolate()" markers.
    */
    
    var grammar = {
        $nothing: "",
        $space:  "\\s",
        $access: "public $space+ | private $space+ | $nothing",
        $ident:  "[a-z_]\\w*",
        $args:   "[^()]*",
        $string: "<<string [0-9]+>>",
        $block:  "<<block [0-9]+>>",
        $fun:    "($access) function $space* ($ident) $space* \\( ($args) \\) $space* ($block)"
    }
    
    /*
    This compiles the grammar to pure regexps - one for
    each grammar rule:
    */
    
    function compile(grammar) {
        var re = {};
        for(var p in grammar)
            re[p] = new RegExp(
                replaceAll(grammar[p], /\$\w+/g, 
                        function($0) { return grammar[$0] }).
                replace(/\s+/g, ""), 
            "gi");
        return re;
    }
    
    /*
    Let's put everything together
    */
    
    function findFunctions(code, callback) {
        var buf = [];
    
        // isolate strings
        code = isolate("string", code, /"(\\.|[^"\\])*"/g, buf);
    
        // isolate blocks in curly brackets {...}
        code = isolate("block",  code, /{[^{}]*}/g, buf);
    
        // compile our grammar
        var re = compile(grammar);
    
        // and perform an action for each function we can find
        code.replace(re.$fun, function() {
            var p = [];
            // skip the trailing (offset, input) arguments that replace() appends
            for(var i = 1; i < arguments.length - 2; i++)
                p.push(restore(arguments[i], buf));
            return callback.apply(this, p)
        });
    }
    </script>
    

    Now we're ready to test. Our parser must be able to handle escaped strings and arbitrarily nested blocks.

    <code>
    public function blah(arg1, arg2) {
        if("some string" == "public function") {
            callAnother("{hello}")
            while(something) {
                alert("escaped \" string");
            }
        }
    }
    
    function yetAnother() { alert("blah") }
    </code>
    
    <script>
    window.onload = function() {
        var code = document.getElementsByTagName("code")[0].innerHTML;
        findFunctions(code, function(access, name, args, body) {
            document.write(
                "<br>" + 
                "<br> access= " + access +
                "<br> name= "   + name +
                "<br> args= "   + args +
                "<br> body= "   + body
            )
        });
    }
    </script> 
    
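To make the isolate/restore idea easier to follow, here is a minimal standalone sketch of just the block step. The names `isolateBlocks` and `restoreBlocks` and the sample input are made up for this illustration and are not part of the answer's code. Each pass replaces only brace pairs that contain no other braces, so nested blocks collapse from the inside out.

```javascript
// Illustrative helper (hypothetical name): collapse {...} blocks into markers,
// innermost first, and record each intermediate string.
function isolateBlocks(code) {
  const buf = [];
  const steps = [];
  const innermost = /\{[^{}]*\}/g; // a brace pair with no braces inside it
  while (code.match(innermost)) {
    code = code.replace(innermost, (match) => {
      buf.push(match);
      return "<<block" + (buf.length - 1) + ">>";
    });
    steps.push(code);
  }
  return { code, buf, steps };
}

// Illustrative helper (hypothetical name): expand markers until none are left.
function restoreBlocks(code, buf) {
  const marker = /<<block(\d+)>>/g;
  while (code.match(marker)) {
    code = code.replace(marker, (_, index) => buf[parseInt(index, 10)]);
  }
  return code;
}

const input = "f() { if (a) { b(); } else { c(); } }";
const isolated = isolateBlocks(input);
console.log(isolated.steps);
// pass 1: "f() { if (a) <<block0>> else <<block1>> }"
// pass 2: "f() <<block2>>"
console.log(restoreBlocks(isolated.code, isolated.buf) === input); // true
```

After the second pass the whole body is a single marker, which is why the `$block` grammar rule can be a simple, non-recursive pattern.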
        2
  •  3
  •   outis    16 years ago

    The * operator is greedy: it consumes as many characters as possible. Try *?, the non-greedy form:

    /((?:(?:public|private)\s+)?)function\s+([a-zA-Z_$][\w$]*)\s*\(([\w$, ]*)\)\s*{(.*?)}/
    

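    Note that \w is equivalent to [a-zA-Z0-9_] but can also be used inside a character class. Also note that this will not match functions that contain blocks, such as: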

    function foo() {
        for (p in this) {
          ...
        }
    }
    

    Matching nested blocks requires recursion (which JS regular expressions don't have), which is why you need a proper parser.
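As a rough check of both points, the sketch below runs the lazy pattern over the question's input and then over a function with a nested block. It makes a few assumptions that are not in the answer: the input is joined onto one line (`.` does not match newlines in JavaScript without the `s` flag), the `g` flag is added so that every function is found, and the braces are escaped. The variable names are illustrative.

```javascript
// The question's input on a single line (assumption: no line breaks).
const source =
  "public function messenger(text){ sendMsg(text); } " +
  "private function sendMsg(text){ alert(text); }";

// Same shape as the pattern above, with escaped braces and the g flag added.
const lazy =
  /((?:(?:public|private)\s+)?)function\s+([a-zA-Z_$][\w$]*)\s*\(([\w$, ]*)\)\s*\{(.*?)\}/g;

const found = [...source.matchAll(lazy)].map((m) => ({
  access: m[1].trim(),
  name: m[2],
  args: m[3],
  body: m[4].trim(),
}));
console.log(found);
// two entries: messenger (public) and sendMsg (private)

// The limitation: the lazy match stops at the first "}", cutting off a nested block.
const nested = "function foo() { for (p in this) { work(p); } done(); }";
const truncated = [...nested.matchAll(lazy)][0][4].trim();
console.log(truncated); // "for (p in this) { work(p);"
```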

        3
  •  1
  •   Tim Green    16 years ago

    Try changing

    (.*)
    

    to

    (.*?)
    
        4
  •  1
  •   Chad Birch    16 years ago

    Change the last part of your regex from this:

    {(.*)}
    

    to this:

    {(.*?)}
    

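    This makes the wildcard "non-greedy", so it won't run all the way to the last } in the input.

    Note that this will break if any of the function code contains a } character, but at that point you're dealing with nesting, which regular expressions have never been good at.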
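One common way around the nesting problem, sketched here as an illustration rather than as anything proposed in this thread, is to use the regex only for the function header and then count braces by hand to find where the body ends. The names `findClosingBrace` and `extractFunctions` are hypothetical, and the sketch assumes no braces appear inside string literals or comments.

```javascript
// Return the index of the "}" that closes the "{" at openIndex, or -1 if unbalanced.
function findClosingBrace(text, openIndex) {
  let depth = 0;
  for (let i = openIndex; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}" && --depth === 0) return i;
  }
  return -1;
}

// Match each header with a regex, then take the body by brace counting.
function extractFunctions(source) {
  const header =
    /((?:public|private)\s+)?function\s+([a-zA-Z_$][\w$]*)\s*\(([\w$, ]*)\)\s*\{/g;
  const results = [];
  let m;
  while ((m = header.exec(source)) !== null) {
    const open = header.lastIndex - 1; // the "{" that ends the header
    const close = findClosingBrace(source, open);
    if (close === -1) break; // unbalanced input: stop scanning
    results.push({
      access: (m[1] || "").trim(),
      name: m[2],
      args: m[3],
      body: source.slice(open + 1, close).trim(),
    });
    header.lastIndex = close + 1; // resume the search after this body
  }
  return results;
}

const sample =
  "public function foo() { for (p in this) { work(p); } }\n" +
  "private function bar(x){ alert(x); }";
const parsed = extractFunctions(sample);
console.log(parsed.map((f) => f.name + ": " + f.body));
```

Because the body is never matched with `.`, line breaks in the input are not a problem here.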