
Implementing do_sample for a custom GPT-Neo model

  • Olexander Korenyuk  · Tech Community  · 4 years ago
    import numpy as np
    import torch
    from transformers import GPTNeoForCausalLM, GPT2Tokenizer
    import coremltools as ct
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    
    sentence_fragment = "The Oceans are"
    
    class NEO(torch.nn.Module):
        def __init__(self, model):
            super(NEO, self).__init__()
            self.next_token_predictor = model
        
        def forward(self, x):
            sentence = x
            predictions, _ = self.next_token_predictor(sentence)
            token = torch.argmax(predictions[-1, :], dim=0, keepdim=True)
            sentence = torch.cat((sentence, token), 0)
            return sentence
    
    token_predictor = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", torchscript=True).eval()
    
    context = torch.tensor(tokenizer.encode(sentence_fragment))
    random_tokens = torch.randint(10000, (5,))
    traced_token_predictor = torch.jit.trace(token_predictor, random_tokens)
    
    model = NEO(model=traced_token_predictor)
    scripted_model = torch.jit.script(model)
    
    # Custom model
    
    sentence_fragment = "The Oceans are"
    
    for i in range(10):
        context = torch.tensor(tokenizer.encode(sentence_fragment))
        torch_out = scripted_model(context)
        sentence_fragment = tokenizer.decode(torch_out)
    print("Custom model: {}".format(sentence_fragment))
    
    # Stock model
    
    model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", torchscript=True).eval()
    
    sentence_fragment = "The Oceans are"
    
    input_ids = tokenizer(sentence_fragment, return_tensors="pt").input_ids
    gen_tokens = model.generate(input_ids, do_sample=True, max_length=20)
    gen_text = tokenizer.batch_decode(gen_tokens)[0]
    print("Stock model: "+gen_text)
    

    Run 1

    Output:


    Custom model: The Oceans are the most important source of water for the entire world
    
    Stock model: The Oceans are on the rise. The American Southwest is thriving, but the southern United States still
    

    Run 2

    Output:


    Custom model: The Oceans are the most important source of water for the entire world. 
    
    Stock model: The Oceans are the land of man
    
    This is a short video of the Australian government
    

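    The custom model always returns the same output. With do_sample=True, however, the stock model.generate returns a different result on every call. I have spent a lot of time trying to work out how do_sample works in transformers, so I would appreciate your help.

    How should I write the custom model so that each call produces a different result?

    Thanks!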
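
For context, the difference between the two behaviours can be reproduced on a toy logits vector. The sketch below is only an illustration: the five logit values are made up, and it assumes that sampling in generate boils down to drawing from the softmax distribution (the real method also applies further logits processing first). torch.argmax always picks the same token, while torch.multinomial draws a token at random in proportion to its probability.

```python
import torch

# Toy logits over a 5-token vocabulary (made-up values, for illustration only).
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])

# Greedy decoding: argmax is deterministic, so repeated calls always agree.
greedy = [torch.argmax(logits).item() for _ in range(5)]

# Sampling: turn the logits into probabilities and draw from that distribution.
probs = torch.softmax(logits, dim=-1)
sampled = [torch.multinomial(probs, num_samples=1).item() for _ in range(5)]

print("greedy :", greedy)   # token 0 every time
print("sampled:", sampled)  # can differ from run to run
```

This is why the forward above, which uses torch.argmax, cannot produce varied output however often it is called.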

  •   Olexander Korenyuk    4 years ago

    So the answer was to implement sampling :D

    class NEO(torch.nn.Module):
        def __init__(self, model):
            super(NEO, self).__init__()
            self.next_token_predictor = model
        
        def forward(self, x):
            sentence = x
            predictions, _ = self.next_token_predictor(sentence)
            # Take the indices of the k=2 highest-scoring tokens at the last position.
            # Two candidates per step are enough: over N generated tokens that
            # already allows 2**N different continuations.
            _, topK = torch.topk(predictions[-1, :], 2, dim=0)
            # Pick one of the two indices uniformly at random and append it to the sentence.
            perm = torch.randperm(topK.size(0))
            idx = perm[:1]
            token = topK[idx.long()]
            sentence = torch.cat((sentence, token), 0)
            return sentence
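
A possible variation, in case uniform choice between two candidates is too coarse: weight the top-k candidates by their softmax probability and draw with torch.multinomial. This is a sketch of a different technique from the one above, not a copy of what generate does. The helper name sample_top_k and the defaults k=50 and temperature=1.0 are my own assumptions, and I have not checked it under torch.jit.script or through a Core ML conversion.

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    """Draw one token id from the k most likely entries of a 1-D logits vector.

    Hypothetical helper for illustration; k and temperature are assumed defaults.
    """
    # Keep only the k highest logits (temperature < 1 sharpens, > 1 flattens).
    top_logits, top_ids = torch.topk(logits / temperature, k, dim=0)
    # Renormalise over the k survivors and draw one position among them.
    probs = torch.softmax(top_logits, dim=0)
    choice = torch.multinomial(probs, num_samples=1)
    # Map the drawn position back to a vocabulary id; shape (1,), ready for torch.cat.
    return top_ids[choice]

# Stand-in for predictions[-1, :] with a 100-token vocabulary (random values).
demo_logits = torch.randn(100)
token = sample_top_k(demo_logits, k=5)
print(token)
```

Inside forward this would replace the topk/randperm lines with something like token = sample_top_k(predictions[-1, :]). Likely tokens are then chosen more often than unlikely ones, while the output still varies between calls.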
    