Because of the way it handles special characters, the tokenizer does not map its tokens onto an object that is easy to loop over. Naively:
# naive word-based chunking: split on spaces and take MAX_LEN words at a time
START = 0
substr = []
while START + MAX_LEN < len(mytext.split(" ")):
    substr.append(" ".join(mytext.split(" ")[START:START + MAX_LEN]))
    START = START + MAX_LEN
# note: any tail shorter than MAX_LEN words is dropped by this loop
tokens = [tokenizer(chunk) for chunk in substr]
However,
" ".join(mytext.split(" ")[0:MAX_LEN])
is not the same as
tokenizer(text)
You can see the difference here:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000 # about 10k words, ~12k tokens
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
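To make the mismatch concrete, here is a small check of my own (continuing the session above, and assuming MAX_LEN is meant to be the model's 4096 limit): a chunk of MAX_LEN whitespace-separated words still encodes to more than MAX_LEN tokens.

MAX_LEN = 4096  # assumed: the model's maximum sequence length
chunk = " ".join(mytext.split(" ")[0:MAX_LEN])
n_tokens = len(tokenizer(chunk)["input_ids"])
# n_tokens comes out well above 4096 for this text, so a word-based window
# is no guarantee of staying under the model's token limit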
What is the argument to tokenizer that handles this, or, if there is none, what is the generally accepted procedure for iterating over longer documents?
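The closest thing I can come up with myself (just a sketch of what I am after, not something I know to be the accepted way) is to encode the whole document once and then iterate over token-index windows instead of word windows:

# sketch of a token-level iteration (my own guess, not necessarily the accepted
# procedure): encode once without special tokens, then slice the ids into
# windows that respect the model's 4096-token limit
MODEL_MAX = 4096
ids = tokenizer(mytext, add_special_tokens=False)["input_ids"]
window = MODEL_MAX - 2  # leave room for the <s> and </s> special tokens
chunks = [tokenizer.build_inputs_with_special_tokens(ids[i:i + window])
          for i in range(0, len(ids), window)]
# each entry in chunks is a list of token ids of length at most 4096

But this splits chunks mid-sentence, which is part of why I am asking whether there is a built-in argument or an accepted procedure.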