
Long input text causes LangChain to fail

  •  2
  • DAR  · Tech Community  · 1 year ago

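    I am using LangChain's LLMChain to process large documents. When the input text exceeds the language model's token limit, I run into a number of problems: long inputs slow the chain down significantly, and sometimes it fails outright with a token overflow.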

    Here is the sample code:

    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    from langchain.llms import OpenAI
    
    
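    # PromptTemplate takes keyword arguments only; from_template infers the {text} variable.
    prompt = PromptTemplate.from_template("Analyze the following text and summarize it: {text}")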
    llm_chain = LLMChain(llm=OpenAI(model="gpt-3.5-turbo"), prompt=prompt)
    
    long_text = "This is a very long document..."  # Assume this text is extremely long
    
    response = llm_chain.run(text=long_text)
    print(response)
    
    

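    The problem is that when long_text is too long, the chain either fails because of the token limit or runs for a very long time. I understand that models like GPT-3.5-turbo have a fixed token limit, but I don't want to handle this manually every time I call my function.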

    Can LangChain's input tokenization be adjusted dynamically so that long texts are still processed efficiently?

    2 replies  |  as of 1 year ago
        1
  •  1
  •   Lisan Al Gaib    1 year ago

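    To handle long input text in LangChain and manage tokenization dynamically, you can calculate the number of tokens available for the text by accounting for the length of the prompt and any other fixed text that is part of the input.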
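    The budget calculation described above can be sketched in plain Python. This is a minimal illustration, not LangChain's API: `count_tokens_roughly`, `available_input_tokens`, and the `reserved_for_output` parameter are hypothetical names, and the whitespace-based counter is only a stand-in for the model's real tokenizer.

```python
from typing import Callable


def count_tokens_roughly(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word.

    Swap in the model's real tokenizer for accurate counts.
    """
    return len(text.split())


def available_input_tokens(
    prompt_template: str,
    context_window: int,
    reserved_for_output: int,
    count_tokens: Callable[[str], int] = count_tokens_roughly,
) -> int:
    """Return the tokens left for the document after the fixed prompt and the reply."""
    # Count only the fixed part of the prompt by filling the placeholder with "".
    fixed_prompt_tokens = count_tokens(prompt_template.format(text=""))
    budget = context_window - fixed_prompt_tokens - reserved_for_output
    if budget <= 0:
        raise ValueError("Prompt and reserved output leave no room for input text.")
    return budget


template = "Analyze the following text and summarize it: {text}"
budget = available_input_tokens(template, context_window=4096, reserved_for_output=512)
print(budget)  # 4096 - 7 prompt words - 512 reserved = 3577
```

    Passing the counter in as a parameter keeps the arithmetic independent of any particular tokenizer library.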

    KarolZmijewski's answer is good, but I would like to add an extra function, process_dynamic_tokenization, to address the problem.

    from transformers import GPT2Tokenizer
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate
    from langchain.llms import OpenAI
    
    
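    # Note: the GPT-2 tokenizer only approximates gpt-3.5-turbo's token counts.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    llm_chain = LLMChain(
        llm=OpenAI(model="gpt-3.5-turbo"),
        # PromptTemplate takes keyword arguments only; from_template infers {text}.
        prompt=PromptTemplate.from_template("Analyze the following text and summarize it: {text}")
    )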
    
    def split_text_into_chunks(text, max_tokens):
        tokens = tokenizer.tokenize(text)
        chunks = []
        for i in range(0, len(tokens), max_tokens):
            chunk_tokens = tokens[i:i + max_tokens]
            chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
            chunks.append(chunk_text)
        return chunks
    
    def process_long_text(text, max_tokens):
        chunks = split_text_into_chunks(text, max_tokens)
        summaries = []
        for chunk in chunks:
            response = llm_chain.run(text=chunk)
            summaries.append(response)
        return " ".join(summaries)
    
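    def process_dynamic_tokenization(text, prompt_template, llm_chain, max_model_tokens=4096,
                                     reserved_output_tokens=512):
        # Count the tokens used by the fixed part of the prompt (placeholder left empty).
        prompt_tokens = len(tokenizer.tokenize(prompt_template.template.format(text="")))
        # The context window covers the prompt, the chunk, and the model's reply, so
        # keep some room for the reply. The default of 512 is an illustrative assumption.
        available_tokens = max_model_tokens - prompt_tokens - reserved_output_tokens
        
        if available_tokens <= 0:
            raise ValueError("Prompt template is too long for the model's token limit.")
        
        # Note: process_long_text uses the module-level llm_chain defined above.
        return process_long_text(text, max_tokens=available_tokens)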
    
    long_text = "This is a very long document..."  
    
    final_summary = process_dynamic_tokenization(
        text=long_text,
        prompt_template=PromptTemplate.from_template("Analyze the following text and summarize it: {text}"),
        llm_chain=llm_chain,
        max_model_tokens=4096
    )
    
    print(final_summary)
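
    One thing to keep in mind with the approach above: for a very long document, the joined per-chunk summaries can themselves exceed the limit. A possible extension, sketched here under stated assumptions, is to summarize the summaries again until the text fits in one chunk. `split_words`, `summarize_recursively`, and `fake_summarize` are hypothetical helpers; the `summarize` callable stands in for something like `llm_chain.run`, and chunks are measured in words rather than real tokens to keep the example self-contained.

```python
from typing import Callable, List


def split_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_recursively(
    text: str,
    summarize: Callable[[str], str],
    max_words: int,
    max_rounds: int = 5,
) -> str:
    """Summarize chunk by chunk, then re-summarize the joined summaries until one chunk remains."""
    for _ in range(max_rounds):
        chunks = split_words(text, max_words)
        if len(chunks) <= 1:
            # The text now fits in a single call.
            return summarize(text)
        # "Map" step: summarize each chunk, then join the partial summaries.
        text = " ".join(summarize(chunk) for chunk in chunks)
    # Guard against a summarizer that never shrinks its input.
    raise RuntimeError("Text did not shrink to a single chunk within max_rounds.")


def fake_summarize(chunk: str) -> str:
    """Stub summarizer for demonstration: keeps the first three words."""
    return " ".join(chunk.split()[:3])


document = " ".join(f"w{i}" for i in range(100))
result = summarize_recursively(document, fake_summarize, max_words=10)
print(result)  # w0 w1 w2
```

    The `max_rounds` guard is a design choice: it stops the loop if the summaries are not getting shorter, instead of calling the model indefinitely.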
    
        2
  •  0
  •   Karol Zmijewski    1 year ago

    You can try using the built-in utility:

    from langchain.chains import LLMChain
    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
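    # PromptTemplate takes keyword arguments only; from_template infers the {text} variable.
    prompt = PromptTemplate.from_template("Analyze the following text and summarize it: {text}")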
    llm_chain = LLMChain(llm=OpenAI(model="gpt-3.5-turbo"), prompt=prompt)
    
    long_text = "This is a very long document..."  # Assume this text is extremely long
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    text_chunks = splitter.split_text(long_text)
    
    response_chunks = []
    for text_chunk in text_chunks:
        response_chunk = llm_chain.run(text=text_chunk)
        # Append the variable assigned above (the original name was undefined).
        response_chunks.append(response_chunk)
    
    response = "\n".join(response_chunks)
    print(response)
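
    To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraph and line boundaries); `split_with_overlap` is a hypothetical helper that only shows how consecutive chunks share text. Note that by default the splitter above measures `chunk_size` in characters, not tokens, so 1000 means 1000 characters.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive.")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size.")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current window already reaches the end of the text.
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

    The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of sending some text to the model twice.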