What is the regular expression for grabbing a token out of a URL?

boost regex c++

Nick Strupat · 14 years ago

Suppose I have strings like these:

bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff

What is the regex that matches The_Token_I_Want, another_token, and YET_ANOTHER_TOKEN?

7 answers | 14 years ago

Greg Bacon 14 years ago

Appendix B of RFC 2396 gives a regular expression for breaking a URI into its components, and we can adapt it to your case.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######

This leaves The_Token_I_Want in $6, which is the "hashderlined" subexpression above. (Note that the hashes are not part of the pattern.) See it live:

#! /usr/bin/perl

$_ = "http://domain.com/133742/The_Token_I_Want.zip";    
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
  print "$6\n";
}
else {
  print "no match\n";
}

Output:

$ ./prog.pl
The_Token_I_Want

Update: I see in the comments that you're using boost::regex, so remember to escape the backslashes in your C++ program.

#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
                     "/([^.]+)"
                   //  ####### I CAN HAZ HASHDERLINE PLZ
                     "[^?#]*)(\\?([^#]*))?(#(.*))?");

  const char * const urls[] = {
    "http://domain.com/133742/The_Token_I_Want.zip",
    "http://domain.com/12345/another_token.zip",
    "http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
  };

  BOOST_FOREACH(const char *url, urls) {
    std::cout << url << ":\n";

    std::string t;
    boost::cmatch m;
    if (boost::regex_match(url, m, token))
      t = m[6];
    else
      t = "<no match>";

    std::cout << "  - " << t << '\n';  // print t so a failed match shows "<no match>"
  }

  return 0;
}

Output:

http://domain.com/133742/The_Token_I_Want.zip:
  - The_Token_I_Want
http://domain.com/12345/another_token.zip:
  - another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
  - YET_ANOTHER_TOKEN
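
For readers without Boost, the same pattern can probably be tried with std::regex from C++11. This is only a sketch under that assumption: the function name token_from_url is made up for illustration, and the input is assumed to be a bare URL, because regex_match requires the whole string to match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 6 of the adapted RFC 2396
// pattern (the text between the last '/' of the path and the next '.'),
// or an empty string when the URL does not match.
// Assumption: std::regex (ECMAScript grammar) stands in for boost::regex.
std::string token_from_url(const std::string& url)
{
  // A raw string literal avoids doubling the backslash in "\?".
  static const std::regex uri(
      R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?)");

  std::smatch m;
  if (std::regex_match(url, m, uri))
    return m[6].str();
  return std::string();
}
```

Calling token_from_url on each of the three URLs from the question should give the same tokens as the Boost program.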

Thomas 14 years ago

/a href="http:\/\/domain\.com\/[0-9]+\/([a-zA-Z_]+)\.zip"/

You might want to add more characters to [a-zA-Z_]+.
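
Since the strings in the question carry surrounding HTML, an anchored pattern like this one is applied by searching rather than by matching the whole string. Here is a rough sketch, assuming std::regex; the name tokens_in_html is hypothetical, and adding digits and hyphens to the character class is my own guess at what a token may contain.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the token from every href of the form
// http://domain.com/<digits>/<token>.zip found in a chunk of HTML.
std::vector<std::string> tokens_in_html(const std::string& html)
{
  // Dots are escaped so they match a literal '.'; the class is widened
  // with 0-9 and '-' (an assumption, not part of the original answer).
  static const std::regex link(
      R"re(a href="http://domain\.com/[0-9]+/([A-Za-z0-9_-]+)\.zip")re");

  std::vector<std::string> tokens;
  for (std::sregex_iterator it(html.begin(), html.end(), link), end;
       it != end; ++it)
    tokens.push_back((*it)[1].str());  // group 1 is the token
  return tokens;
}
```

Passing the three lines from the question as one string should return the three tokens in order.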

Sadeq 14 years ago

You can use:

(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+

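([[:alnum:]._-]+) is the group that captures the token, and in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.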
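
To illustrate the group numbering, here is a small sketch that reads group 2 from C++. It assumes std::regex, whose default ECMAScript grammar accepts [[:alnum:]] inside brackets, and it assumes the dot before the extension is escaped as \. so that it only matches a literal dot. The name token_from_line is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: searches one line of HTML and returns capture
// group 2 (the file name without its extension), or "" if nothing matches.
std::string token_from_line(const std::string& line)
{
  // Group 1 is (http|ftp); group 2 is the token.
  // Assumption: the dot before the extension is written as "\.".
  static const std::regex link(
      R"((http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+)");

  std::smatch m;
  if (std::regex_search(line, m, link))
    return m[2].str();
  return std::string();
}
```

With an unescaped dot, the greedy second group would be free to swallow part of the extension, which is why the sketch escapes it.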

Jet 14 years ago

Try this:

/(?:f|ht)tps?:\/{2}(?:www\.)?domain[^\/]+.([^\/]+)([^\/]+)/i

or

/\w{3,5}:\/{2}(?:w{3}\.)?domain[^\/]+.([^\/]+)([^\/]+)/i

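Quentin 14 years ago

First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them to find the hrefs. Don't try to grab the token straight out of the string.

Then:

The glib answer would be: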

/(The_Token_I_Want.zip)/

You might want to be a little more precise than a single example.

I'm guessing you are actually looking for:

/([^/]+)$/
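
Once the href has been pulled out by a parser, applying that last pattern is short. A minimal sketch, assuming std::regex and an already extracted href; the name last_segment is hypothetical. Note that the result still carries the .zip extension.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns everything after the last '/' of an href,
// or "" when the href ends with a slash.
std::string last_segment(const std::string& href)
{
  static const std::regex tail("([^/]+)$");  // non-slash run anchored at the end

  std::smatch m;
  return std::regex_search(href, m, tail) ? m[1].str() : std::string();
}
```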

Shaggy Frog 14 years ago

m/The_Token_I_Want/

You have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?

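Jesse Collins 14 years ago

It would be better to use something smarter than a regex. For example, if you're using C#, you could use the System.Uri class to parse it for you.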
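
Standard C++ has no URI class, so the closest regex-free equivalent there is plain string searching. The following is only a sketch: token_without_regex is a made-up name, and it assumes the URL has no query string or fragment, as in the question's examples.

```cpp
#include <string>

// Hypothetical helper: returns the last path segment of a URL with
// everything from its first '.' onward removed.
// Assumption: no "?query" or "#fragment" follows the path.
std::string token_without_regex(const std::string& url)
{
  const std::string::size_type slash = url.rfind('/');
  const std::string::size_type start =
      (slash == std::string::npos) ? 0 : slash + 1;

  // First dot after the last slash marks the start of the extension.
  const std::string::size_type dot = url.find('.', start);
  if (dot == std::string::npos)
    return url.substr(start);          // no extension: keep the whole segment
  return url.substr(start, dot - start);
}
```

For anything beyond these simple URLs, a real URI parsing library is likely the safer choice.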