Because of the way it handles special characters, the tokenizer does not map its tokens onto an object that is easy to loop over. Naively:
# naive word-based chunking: split on spaces and take MAX_LEN words at a time
START = 0
substr = []
while START + MAX_LEN < len(mytext.split(" ")):
    substr.append(" ".join(mytext.split(" ")[START:START + MAX_LEN]))
    START = START + MAX_LEN
# note: any tail shorter than MAX_LEN words is dropped by this loop
tokens = [tokenizer(chunk) for chunk in substr]
However,
" ".join(mytext.split(" ")[0:MAX_LEN])
is not the same as
tokenizer(text)
You can see the difference here:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000 # about 10k words, ~12k tokens
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
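To make the mismatch concrete, here is a small check of my own (continuing the session above, and assuming MAX_LEN is meant to be the model's 4096 limit): a chunk of MAX_LEN whitespace-separated words still encodes to more than MAX_LEN tokens.

MAX_LEN = 4096  # assumed: the model's maximum sequence length
chunk = " ".join(mytext.split(" ")[0:MAX_LEN])
n_tokens = len(tokenizer(chunk)["input_ids"])
# n_tokens comes out well above 4096 for this text, so a word-based window
# is no guarantee of staying under the model's token limit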
What is the argument to tokenizer that handles this, or, if there is none, what is the generally accepted procedure for iterating over longer documents?
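The closest thing I can come up with myself (just a sketch of what I am after, not something I know to be the accepted way) is to encode the whole document once and then iterate over token-index windows instead of word windows:

# sketch of a token-level iteration (my own guess, not necessarily the accepted
# procedure): encode once without special tokens, then slice the ids into
# windows that respect the model's 4096-token limit
MODEL_MAX = 4096
ids = tokenizer(mytext, add_special_tokens=False)["input_ids"]
window = MODEL_MAX - 2  # leave room for the <s> and </s> special tokens
chunks = [tokenizer.build_inputs_with_special_tokens(ids[i:i + window])
          for i in range(0, len(ids), window)]
# each entry in chunks is a list of token ids of length at most 4096

But this splits chunks mid-sentence, which is part of why I am asking whether there is a built-in argument or an accepted procedure.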