4, Apr 2024 by .
To run this blog, you will need the following:
Linux: see supported Linux distributions
ROCm: see the installation instructions
AMD GPU: see the list of compatible GPUs
Large Language Models (LLMs), such as ChatGPT, are powerful tools capable of performing many complex writing tasks. However, they do have limitations, notably:
Lack of access to up-to-date information: LLMs are “frozen in time” as their training data is inherently outdated, which means they cannot access the latest news or information.
Limited applicability for domain-specific tasks: LLMs are not specifically trained for domain-specific tasks using domain-specific data, which can result in less relevant or incorrect responses for specialized use cases.
To address these limitations, there are two primary approaches to introduce up-to-date and domain-specific data:
Fine-tuning: This involves providing the LLM with up-to-date, domain-specific prompt-and-completion text pairs. However, this approach can be costly, particularly if the data used for fine-tuning changes frequently, requiring frequent updates.
Contextual prompting: This involves inserting up-to-date data as context into the prompt, which the LLM can then use as additional information when generating a response. However, this approach has limitations, as not all up-to-date, domain-specific documents may fit into the context of the prompt.
To overcome these obstacles, Retrieval Augmented Generation (RAG) can be used. RAG is a technique that enhances the accuracy and reliability of an LLM by exposing it to up-to-date, relevant information. It works by automatically splitting external documents into chunks of a specified size, retrieving the most relevant chunks based on the query, and augmenting the input prompts to use these chunks as context to answer the user’s query. This approach allows for the creation of domain-specific applications without the need for fine-tuning or manual information insertion into contextual prompts.
A popular framework used by the AI community for RAG is LlamaIndex. It’s a framework for building LLM applications that focus on ingesting, structuring, and accessing private or domain-specific data. Its tools facilitate the integration of custom out-of-distribution data into LLMs.
To get started, install the transformers, accelerate, and llama-index that you’ll need for RAG:
Then, import the LlamaIndex libraries:
Many open-source LLMs require a preamble before each prompt or a specific structuring of the prompt, which you can encode using system_prompt or messages_to_prompt before generation. Additionally, queries may need an additional wrapper, which you can specify with the query_wrapper_prompt. All this information is typically available on the Hugging Face model card for the model you’re using. In this case, you’ll use zephyr-7b-alpha for RAG, so you can pull the expected prompt format here.
LlamaIndex supports using LLMs from Hugging Face directly by passing the model name to the HuggingFaceLLM class. You can specify model parameters, such as which device to use and how much quantization using model_kwargs. You can specify parameters that control the LLM generation strategy, like top_k, top_p, and temperature, in generate_kwargs. You can also specify parameters that control the length of the output, such as max_new_tokens, directly in the class. To learn more about these parameters and how they affect generation, take a look at Hugging Face’s text generation documentation.
To demonstrate the shortcomings mentioned earlier, prompt your LLM and ask it how does Paul Graham recommend to work hard.
At first glance the generated response looks accurate and reasonable. The LLM knows we’re talking about Paul Graham and working hard. The recommended steps for working hard even look reasonable. However, these are not Paul Graham’s suggestions for working hard. LLMs are known to ‘hallucinate’ when there’s a gap in their knowledge, making false, yet plausible, statements.
A simple way to overcome ‘hallucination’ of facts is to engineer the prompt to include external contextual information. Let’s copy the body of text from this essay on How to Work Hard.
You can automatically copy the text using the BeautifulSoup library in Python:
Now, modify the original question and include the updated information when asking the question:
Now, input this prompt into your LLM and note the response:
By prompting the LLM to use the essay as context, you’re constraining the LLM to generate content using information within the prompt, producing an accurate response. Now, try generating a response with RAG and compare it to the contextual prompting.
To build a RAG application, you first need to call ServiceContext, which establishes the language and embedding models to use, as well as key parameters (such as chunk_size and chunk_overlap) that determine how the documents are parsed.
When performing RAG, documents are broken into smaller chunks. The chunk_size parameter specifies how many tokens in length each of the chunks should be, while chunk_overlap specifies how many tokens each chunk should overlap with its adjacent chunks.
Set the llm parameter using the llm variable you used in the preceding experiments. For the embedding model, use bge-base (shown to be top-performing for retrieval tasks) to embed the document chunks.
Next, build your vector index using VectorStoreIndex, which takes your documents and passes them to the embedding model for chunking and embedding. Then call query_engine to prepare the index for queries, specifying similarity_top_k to return the top eight most similar chunks of the document to the input query.
Your RAG application is now ready to be queried. Let’s query with the original question:
The response is quite similar to the contextual prompt engineered example. This isn’t surprising as it’s using the same contextual information to generate the response. Prompt engineering requires manually specifying the context, while you can think of RAG as an advanced and automated form of prompt engineering that leverages databases of documents to retrieve the most optimal context to guide the generation process.
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.