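I'm looking for a way to tokenize (split) a string on whitespace and punctuation. The whitespace should be discarded, but each punctuation character must be kept in the result as its own token. The tokens will be used to identify the language of a snippet (python, java, html, c…).
The input string can be, for example: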
    class Foldermanagement():
        def __init__(self):
            self.today = invoicemng.gettoday()
            ...
The output I expect is a list of tokens like this:
    ['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_', 'init', ..., 'self', '.', 'today', '=', ...]
Any solution is welcome. Thanks in advance.
I think this is what you're looking for:
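    import itertools
    import re
    import string

    text = """
    class Foldermanagement():
        def __init__(self):
            self.today = invoicemng.gettoday()
    """

    # Every punctuation or whitespace character is a separator.
    separators = string.punctuation + string.whitespace
    separators_re = "|".join(re.escape(x) for x in separators)

    # re.split yields the text between separators and re.findall yields the
    # separators themselves; interleave the two. zip_longest keeps the final
    # piece, which plain zip would drop when the text does not end in a separator.
    tokens = itertools.zip_longest(
        re.split(separators_re, text),
        re.findall(separators_re, text),
        fillvalue="",
    )
    flattened = itertools.chain.from_iterable(tokens)

    # Drop empty strings and whitespace; keep words and punctuation.
    cleaned = [x for x in flattened if x and not x.isspace()]
    # ['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_',
    #  'init', '_', '_', '(', 'self', ')', ':', 'self', '.', 'today', '=',
    #  'invoicemng', '.', 'gettoday', '(', ')']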
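
If you'd rather not interleave two lists, a single `re.findall` can probably do the same job. This is only a sketch and not part of the code above: the names `tokenize` and `_TOKEN_RE` are made up for illustration, and it assumes that "punctuation" means exactly the characters in `string.punctuation`, so `_` is split off as its own token just like in the expected output.

```python
import re
import string

# Escape the punctuation so it is safe to place inside a character class.
_PUNCT = re.escape(string.punctuation)

# A token is either one punctuation character, or a run of characters
# that are neither punctuation nor whitespace.
_TOKEN_RE = re.compile(rf"[{_PUNCT}]|[^{_PUNCT}\s]+")


def tokenize(source):
    """Return words and single punctuation characters, skipping whitespace."""
    return _TOKEN_RE.findall(source)


if __name__ == "__main__":
    print(tokenize("self.today = invoicemng.gettoday()"))
```

Because the pattern matches the tokens directly instead of splitting around separators, there are no empty strings to filter out, and a trailing word with no separator after it is still returned.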