
Splitting two differently formatted strings into multiple parts using different delimiters

regex javascript

user3142695 · Tech Community · 7 years ago

I have a user input string that can come in two different formats, each with some minor variations:

Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564

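What I need to get is:

Author string: Some AB, Author C, Names DEF or Some AB, Author C, Names DEF et al
Article title string: The title string or The title string?
Journal name string: T journal name
Year value: 2018
Issue value: 10
Page numbers: 560-564

So I have to split the string at the delimiters . or (1234) , ; and : .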

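I don't have a working regex, and I don't know how to handle the two formats, which have the year value in different positions.

I started with something like this: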

string.split(/^\(\d+\)\s*/)

But that gives me an array, so how do I go on from there?

3 answers | last activity 7 years ago

wp78de 7 years ago

I would also suggest matching with a pattern rather than splitting:

^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)

Or a more readable version with named capture groups*:

^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)

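I second R. Schifini's approach of using negated character classes to find the required pieces.
To tell the two formats apart, I added two optional groups for the year, one for format 1 and one for format 2, and packed the rest into further capture groups. The only thing left to do is to check whether the year is in the second group or the fifth.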
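That last check can be sketched with the named-group pattern. This is only a minimal sketch, assuming an engine that supports ES2018 named capture groups; parseCitation is a hypothetical helper name, not something from the answer itself.

```javascript
// Sketch only: parseCitation is a hypothetical helper.
// Assumes ES2018 named capture groups are available.
const citationRe = /^(?<author>[^.(]+)(?:\((?<yearf1>\d{4})\)|\.)\s*(?<title>[^?!.]*.)\s*(?<journal>[^0-9,]+)(?<yearf2>\d{4})?[,; ]*(?<issue>[^,: ]*)[,;: ]*(?<pages>\d+(?:[–-]\d+)?)/;

function parseCitation(line) {
  const m = citationRe.exec(line);
  if (m === null) return null; // the line does not look like either format
  const g = m.groups;
  return {
    author: g.author.trim(),
    // A group that did not take part in the match is undefined,
    // so the year comes from whichever of the two groups matched.
    year: g.yearf1 || g.yearf2 || null,
    title: g.title.trim(),
    journal: g.journal.trim(),
    issue: g.issue.trim(),
    pages: g.pages,
  };
}

console.log(parseCitation("Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564"));
console.log(parseCitation("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```

Both calls report the year as "2018", although it comes from yearf1 in the first line and from yearf2 in the second.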

Demo

Code sample:

const regex = /^([^.(]+)(?:\((\d{4})\)|\.)\s*([^?!.]*.)\s*([^0-9,]+)(\d{4})?[,; ]*([^,: ]*)[,;: ]*(\d+(?:[–-]\d+)?)/gm;
const str = `Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564
Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564
Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    const array = {}; // declared locally instead of leaking an implicit global
    m.forEach((match, groupIndex) => {
        switch(groupIndex) {
        case 0:
            console.log(`Full match: ${match}`);
            break;
        case 1:
            array['author'] = match.trim();
            break;
        case 2:
            if(match)
                array['year'] = match;
            break;
        case 3:
            array['title'] = match.trim();
            break;
        case 4:
            array['journal'] = match.trim();
            break;
        case 5:
            if(match)
                array['year'] = match.trim();
            break;
        case 6:
            array['issue'] = match.trim();
            break;
        case 7:
            array['pages'] = match.trim();
            break;        
        default:
            console.log(`Unknown match, group ${groupIndex}: ${match}`);
        }
    });
    console.log(JSON.stringify(array));
}

* At the time of writing, named capture groups in JavaScript are not supported in all major browsers. Remove them, or use Steve Levithan's XRegExp library, which works around the problem.

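R. Schifini 7 years ago

Since there is no single consistent delimiter, in most cases you will have to extract the required parts one at a time.

For these examples, the following gets the authors, the article title and the journal: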

str.match(/^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/)

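^([^.(]*) captures from the start up to the first ( or .
[^ ]* skips a possible year such as (2018) in front of the article title.
([^?.]*.) captures the article title.
([^0-9,]*) captures the journal name.

The match returns an array with four elements. The three captures are at indices 1 to 3.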
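As a small illustration of reading those indices, here is a sketch; splitCitation is a hypothetical name. The captures include the spaces around each part, so the sketch trims them.

```javascript
// Sketch only: splitCitation is a hypothetical helper around the pattern above.
const partsRe = /^([^.(]*)[^ ]*([^?.]*.)([^0-9,]*)/;

function splitCitation(line) {
  const m = line.match(partsRe);
  if (m === null) return null;
  // m[0] is the full match; m[1]..m[3] are the three captures.
  const [, author, title, journal] = m.map((part) => part.trim());
  return { author, title, journal };
}

console.log(splitCitation("Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564"));
console.log(splitCitation("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```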

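See Regex101.

Matching the numbers is doable as well. Try a separate regexp to capture them. The year can be tricky, because a four-digit number could also be a page number.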
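One way this could look is sketched below, under assumptions taken only from the five sample lines: a year is either in parentheses or is a four-digit number followed by a semicolon, and the line ends with the issue and the pages. extractNumbers is a hypothetical name.

```javascript
// Sketch only: extractNumbers is a hypothetical helper built on the sample formats.
function extractNumbers(line) {
  // Format 1 keeps the year in parentheses in front of the title.
  const paren = line.match(/\((\d{4})\)/);
  // The end of the line is "[year;] issue[,:] pages". A year is accepted here
  // only when a semicolon follows it, so that a four-digit page number
  // is not mistaken for the year.
  const tail = line.match(/(?:(\d{4})\s*;\s*)?(\d+)\s*[,:]\s*(\d+(?:[–-]\d+)?)\s*$/);
  if (tail === null) return null;
  return {
    year: (paren && paren[1]) || tail[1] || null,
    issue: tail[2],
    pages: tail[3],
  };
}

console.log(extractNumbers("Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564"));
console.log(extractNumbers("Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564"));
```

Inputs that deviate from the samples, for example a year without a semicolon after it, would need further rules.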

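xxxmatko 7 years ago

Instead of trying to come up with a complicated regex (which imho is not possible in this case), you can write a function that parses the string. Based on your sample data, it could look like this: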

var str = [
  "Some AB, Author C, Names DEF,(2018) The title string. T journal name, 10, 560–564",
  "Some AB, Author C, Names DEF (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF et al (2018) The title string? T journal name 10:560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018; 10: 560-564",
  "Some AB, Author C, Names DEF. The title string. T journal name 2018;10:560-564"
];

function parse(str) {
  var result = [];
  var tmp = "";
  for (var i = 0; i < str.length; i++) {
    var c = str.charAt(i);
  	
    if(c === ",") {
      if(str.charAt(i + 1) === "(") {
          result.push(tmp.trim());
          i++;
          tmp = "";
          continue;
      }
      
      if((str.charAt(i + 1) === " ") && !isNaN(str.charAt(i + 2))) {
        result.push(tmp.trim());
        i++;
        tmp = "";
        continue;
      }
    }
    
    if((c === ".") || (c === "?") || (c === ":")) {
      if(str.charAt(i + 1) === " ") {
        result.push(tmp.trim());
        i++;
        tmp = "";
        continue;
      }
    }

    if((c === "(") || (c === ")") || (c === ";")  || (c === ":")) {
      result.push(tmp.trim());
      tmp = "";
      if(str.charAt(i + 1) === " ") {
        i++;
      }
      continue;
    }
    
    if((c === " ") && !isNaN(str.charAt(i + 1))){
      result.push(tmp.trim());
      tmp = "";
      continue;
    }
    
    tmp += c;
  }
  result.push(tmp.trim());
  
  if(!isNaN(result[3])) {
    result = [result[0], result[3], result[1], result[2], result[4], result[5]];
  }

  return result;
}

for(var j = 0; j < str.length; j++) {
  console.info(parse(str[j]));
}