
Iterating over a Huggingface tokenizer with a remainder


Because of how it handles special characters, the tokenizer does not map its tokens onto an object that is easy to loop over. Naively:

    MAX_LEN = 4096  # the model's maximum sequence length

    # First window, naively measured in whitespace words:
    subst = " ".join(mytext.split(" ")[0:MAX_LEN])

    words = mytext.split(" ")
    start = 0
    substr = []
    while start < len(words):  # also covers the final, shorter remainder window
        substr.append(" ".join(words[start:start + MAX_LEN]))  # append, not substr[i]
        start += MAX_LEN
        tokens = tokenizer(substr[-1])  # tokenize the current window, not the whole text

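For comparison, here is a token-level sketch of the same loop (my own, not from the question, using the standard PreTrainedTokenizer helpers add_special_tokens and build_inputs_with_special_tokens); the reason this matters is demonstrated below.

    # Encode once without special tokens, then slice the token IDs;
    # reserve 2 slots per window for the <s>/</s> tokens Longformer adds.
    ids = tokenizer(mytext, add_special_tokens=False)["input_ids"]
    window = MAX_LEN - 2
    token_chunks = [
        tokenizer.build_inputs_with_special_tokens(ids[j:j + window])
        for j in range(0, len(ids), window)  # the last slice keeps the remainder
    ]
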

However, " ".join(mytext.split(" ")[0:MAX_LEN]) is not equivalent to what tokenizer(text) sees.

You can see the difference below:

    >>> from transformers import LongformerTokenizer
    >>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
    
    >>> mytext = "This is a long sentence. " * 2000 # about 10k words, ~12k tokens
    
    >>> len(mytext.split(" "))
    10001
    
    >>> encoded_input = tokenizer(mytext) 
    Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
    
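The warning makes the gap concrete: the tokenizer works on subword pieces and adds special tokens, so one whitespace word can become several model tokens. Comparing the two counts directly, using the encoded_input from above:

    >>> len(encoded_input["input_ids"])  # tokens the model counts, with special tokens
    12003
    >>> len(mytext.split(" "))           # whitespace "words"
    10001
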

What function parameters does tokenizer take for this, or, if none are available, what is the generally accepted iteration procedure for longer documents?
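
For illustration, a sketch of one approach (my own, assuming the fast tokenizer variant is acceptable): the __call__ method of fast tokenizers accepts truncation, max_length, stride, and return_overflowing_tokens, which together split a long document into model-sized windows and keep the remainder as the final window.

    from transformers import LongformerTokenizerFast

    tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096')

    # A fast tokenizer can return the overflow as extra windows instead of
    # silently dropping it; stride sets the overlap between windows.
    enc = tokenizer(
        mytext,
        max_length=4096,
        truncation=True,
        stride=50,                      # illustrative overlap; tune as needed
        return_overflowing_tokens=True,
    )

    # enc["input_ids"] is now a list of windows, each at most 4096 tokens,
    # with special tokens already added per window.
    for window_ids in enc["input_ids"]:
        ...  # run each window through the model

With several documents batched at once, enc also carries overflow_to_sample_mapping, which maps each window back to its source text.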
