26 июня 2024

How to Use Retrieval-Augmented Generation (RAG) locally

In this blog post, we'll explore how to use Retrieval-Augmented Generation (RAG) for building more effective and engaging conversational AI applications. We'll cover the basics of RAG, its benefits, and provide step-by-step instructions on how to develop your own RAG mechanism for local use.


What is RAG?

RAG (Retrieval-Augmented Generation) combines the strengths of two prominent approaches in natural language processing (NLP): retrieval-based models and generation-based models. In traditional generation-based methods, an AI system generates text from scratch using the patterns it learned during pre-training. However, this approach often leads to limited creativity, a lack of context-specific knowledge, and poor coherence.

In contrast, retrieval-based models fetch relevant information from large corpora, custom datasets, or databases. While these models excel at providing accurate responses grounded in existing text, they can struggle with novel or ambiguous contexts.


RAG vs. fine-tuning an LLM

Fine-tuning an LLM (Large Language Model) means continuing the training of a pre-trained model for a specific task. Initially, an LLM is trained on a massive dataset to learn general language patterns. This is followed by further training on a narrower dataset tailored to a particular application, such as customer service or code generation [1].

In contrast, RAG (Retrieval-Augmented Generation) is useful when you need your LLM to generate responses based on large amounts of up-to-date, context-specific data. For instance, enriching an LLM's responses with data from your data lake or archived documents.


RAG offers a cost-efficient alternative to fine-tuning because it requires far fewer resources and much less specialised knowledge. With only a small amount of Python code, you can give your model access to the most relevant and up-to-date data.

Many web and desktop client applications, such as Open WebUI or AnythingLLM, already incorporate RAG features. However, private companies that want to upload and embed their own data for use by their LLM models often need to build a custom RAG system.

In this blog post, I will provide a step-by-step guide on how to implement RAG in Python using the Llama 3 model.


The RAG mechanism can be summarised as follows: the source documents are loaded and split into chunks, each chunk is converted into a vector embedding and stored in a vector database, and at query time the most similar chunks are retrieved and passed to the LLM as extra context for generating the answer. A condensed code sketch of this pipeline follows the component list below.


To build a local RAG system, you'll need the following components.

  1. Sources. Source documents, which might be .doc, .txt, or .pdf files located on your network.

  2. Load. A loader that loads the documents and splits them into chunks.

  3. Transform. Transform the chunks into a form suitable for embedding.

  4. Embedding model. An embedding model takes a chunk as input and outputs an embedding, i.e. a vector representation of the text.

  5. Vector DB. A vector database for storing the embeddings.

  6. LLM model. A pre-trained model that uses the retrieved context to answer the user's query.
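
To make the flow concrete before diving into the individual steps, here is a condensed sketch of the whole pipeline. It reuses the same classes, model names, and parameters that appear in Steps 7 through 17 below; the only difference is that the retrieved context is pasted into the prompt by hand here, whereas the full walkthrough builds a proper RetrievalQA chain in Step 16.

# Condensed sketch of the pipeline built step by step below
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import SQLiteVSS
from langchain.llms import Ollama

documents = TextLoader("PATH_TO_THE_FILE/stateoftheunion2023.txt").load()   # 1-2. sources + load
docs = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(documents)  # 3. transform
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")    # 4. embedding model
db = SQLiteVSS.from_texts([d.page_content for d in docs], embedding,
                          table="state_union", db_file="/tmp/vss.db")       # 5. vector DB

question = "What did the president say about Nancy Pelosi?"
context = db.similarity_search(question)                                    # retrieve similar chunks
llm = Ollama(model="llama3")                                                # 6. LLM model

# Assemble the augmented prompt by hand (Step 16 does this with a RetrievalQA chain)
prompt = (
    "Answer using only this context:\n"
    f"{context[0].page_content}\n\n"
    f"Question: {question}"
)
print(llm.invoke(prompt))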

To get started, here is a summary of the key components I will be using:

  1. LLM server: Ollama local server

  2. LLM model: Llama 3 8B

  3. Embedding model: all-MiniLM-L6-v2

  4. Vector database: SQLiteVSS (sqlite3)

  5. Framework: LangChain

  6. OS: macOS

  7. Programming language: Python 3.11.3


The instructions and the example I'll be following are available in the resources section [2]. Running the original example as-is assumes solid Python expertise, so here I'll walk you through each necessary step to save time and effort.

If you're comfortable with Python and your environment is already set up, feel free to skip the steps prior to Step 7.


Step 1. Download and install Python

Don't use your system Python to run the application. Install Python 3 with Homebrew.

brew install python

Step 2. Install pip

python3 -m ensurepip

Step 3. Install JupyterLab and Jupyter Notebook

python3 -m pip install jupyterlab
python3 -m pip install notebook

Step 4. Run JupyterLab or Jupyter Notebook

python3 -m jupyterlab

The preceding command launches a web-based interface for building, running, and managing your Python notebooks at http://localhost:8888/lab.


Step 5. Install sqlite3

brew install sqlite3

After installing sqlite3, add the following lines to your .bash_profile or .zshrc file (on Apple Silicon Macs, Homebrew installs under /opt/homebrew instead of /usr/local, so adjust the paths accordingly):

export PATH="/usr/local/opt/sqlite/bin:$PATH"
export LDFLAGS="-L/usr/local/opt/sqlite/lib"
export CPPFLAGS="-I/usr/local/opt/sqlite/include"
export PYTHON_CONFIGURE_OPTS="--enable-loadable-sqlite-extensions"

Reopen the terminal.
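
Before moving on, you can optionally confirm that your Python build supports loadable SQLite extensions, which sqlite-vss needs:

import sqlite3

print("SQLite version:", sqlite3.sqlite_version)

# enable_load_extension is only available when Python was built with
# --enable-loadable-sqlite-extensions; an AttributeError here means the
# setup above needs to be revisited before sqlite-vss will work.
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
print("Loadable SQLite extensions are supported.")
conn.close()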


Step 6. Install the required packages

Add a new launcher or notebook in your local JupyterLab and add a first cell with the following code, which installs all the necessary packages.

# Install the necessary packages (the %pip magic runs pip inside the notebook's kernel)
%pip install --upgrade langchain
%pip install -U langchain-community
%pip install -U langchain-huggingface
%pip install sentence-transformers
%pip install --upgrade --quiet sqlite-vss

Step 7.

Import all the packages

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import SQLiteVSS
from langchain.document_loaders import TextLoader

Download the source document (the 2023 State of the Union transcript, stateoftheunion2023.txt) and place it somewhere in your workspace.


Step 8. Load the source document from the directory.

# Load the document using a LangChain text loader
loader = TextLoader("PATH_TO_THE_FILE/stateoftheunion2023.txt")
documents = loader.load()

Step 9. Split the document into chunks

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
texts = [doc.page_content for doc in docs]

Here, you can print the chunks to the console to verify that the document was parsed correctly.

print(texts)

Step 10. Embed the chunks


# Use the sentence-transformers package with the all-MiniLM-L6-v2 embedding model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Step 11. Load the embeddings into the sqlite3 database.

# Load the text embeddings in SQLiteVSS in a table named state_union
db = SQLiteVSS.from_texts(
    texts = texts,
    embedding = embedding_function,
    table = "state_union",
    db_file = "/tmp/vss.db"
)

Step 12. Optional. Query the database and study the embeddings.

Use DBeaver or Visual Studio Code (with a SQLite extension) to query the local sqlite3 database. The database file is /tmp/vss.db.
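
You can also take a quick look from Python itself. The snippet below lists the tables in the file and counts the rows in the state_union table created in Step 11; it assumes SQLiteVSS stores the chunks in a plain table of that name, so treat it as a rough sketch:

import sqlite3

# Open the vector database file created in Step 11
conn = sqlite3.connect("/tmp/vss.db")

# List the tables SQLiteVSS created
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print("table:", name)

# Count the stored chunks (assumes the chunks live in a plain table named state_union)
print("chunks:", conn.execute("SELECT count(*) FROM state_union").fetchone()[0])

conn.close()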


Step 13. Run a semantic (similarity) search

# First, we will do a simple retrieval using similarity search
# Query
question = "What did the president say about Nancy Pelosi?"
data = db.similarity_search(question)
# print results
print(data[0].page_content)

The result should be similar to the following:

State of The Union 2023: Biden's Full Speech

Mr. Speaker. Madam Vice President. Our First Lady and Second Gentleman.

Members of Congress and the Cabinet. Leaders of our military.

Mr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court.

And you, my fellow Americans.

I start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy.

Mr. Speaker, I look forward to working together.

I also want to congratulate the new leader of the House Democrats and the first Black House Minority Leader in history, Hakeem Jeffries.

Congratulations to the longest serving Senate Leader in history, Mitch McConnell.

And congratulations to Chuck Schumer for another term as Senate Majority Leader, this time with an even bigger majority.

And I want to give special recognition to someone who I think will be considered the greatest Speaker in the history of this country, Nancy Pelosi.

Step 14. Run the local Ollama server.


ollama run llama3

Step 15. Import the LangChain LLM package and connect to the local server

# LLM
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model = "llama3",
    verbose = True,
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]),
)
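
Optionally, a quick test call confirms that LangChain can reach the local Ollama server before the QA chain is wired up (invoke is the current LangChain call style; the streamed reply is printed by the callback handler configured above):

# Quick connectivity test against the local Ollama server
llm.invoke("Reply with a single word: ready?")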

Step 16. Build the QA chain with a LangChain prompt

# QA chain
from langchain.chains import RetrievalQA
from langchain import hub
# LangChain Hub is a repository of LangChain prompts shared by the community
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")
qa_chain = RetrievalQA.from_chain_type(
    llm,
    # we create a retriever to interact with the db using an augmented context
    retriever = db.as_retriever(), 
    chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT},
)

Step 17. Print the result.

result = qa_chain({"query": question})
print(result["result"])

This runs the chain and prints the answer (the streaming callback handler also echoes it to the console as it is generated). The query result should be something like the following:

The president referred to Nancy Pelosi as "someone who I think will be considered the greatest Speaker in the history of this country."

Note that it may take a few minutes to respond, depending on your local computer resources.

In this case, the LLM generates a concise answer to the query based on the retrieved chunks. During the semantic similarity search, the query is embedded and compared against the stored vectors, and the vector database returns the most similar chunks along with their similarity scores.
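
If you want to inspect those scores yourself, the vector store also offers a scored variant of the similarity search. The exact score semantics (a distance, where lower usually means more similar) can vary between LangChain versions, so treat this as a sketch:

# Retrieve the top chunks together with their similarity (distance) scores
for doc, score in db.similarity_search_with_score(question, k=4):
    print(f"score={score:.4f}  {doc.page_content[:80]}...")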


Resources:

  1. Retrieval-Augmented Generation vs Fine-Tuning: What’s Right for You?

  2. https://www.infoworld.com/article/3715181/fully-local-retrieval-augmented-generation-step-by-step.html