Introduction
Large Language Models (LLMs) are remarkable tools that can answer almost any question in a convincing way. Unfortunately, this ability to convince is a double-edged sword: behind apparently logical reasoning, the stated facts can be completely false. This can happen when the relevant documents are missing from the training corpus, because they are too recent, too rare, or simply confidential.
In addition, in some use cases where the LLM user bears responsibility for the answer, it is necessary to provide the sources used to answer the question. This is the case with contractual clauses, for example, where the answer itself matters less than how it was reached.
These use cases are grouped under the term "knowledge-intensive applications". In this context, the goal is mainly to access knowledge, as opposed to translation or reformulation applications, for example. We therefore want to use the comprehension and synthesis power of LLMs while guaranteeing that a controlled knowledge base is exploited, all while citing its sources. This best of both worlds exists, and it is based on RAG: Retrieval-Augmented Generation.
A RAG is an LLM-based system that searches for information in a user-controlled corpus and then synthesizes an answer from the retrieved elements.
Main components
A RAG is an architecture involving several components. Before presenting the complete architecture, let's zoom in on the main parts.
Embedding
The first concept behind a RAG is embedding. A computer cannot manipulate words or sentences directly; they must be transformed into numerical values on which computations are possible. It is important that this transformation preserve semantic proximity, that is, two concepts that are semantically close should be mapped to vectors that are numerically close.
Fortunately, there are pre-trained models adapted to this task: the sentence embedding models derived from BERT. This operation can require a fair amount of computation, even if these models are smaller than LLMs. Recently, models have even been proposed specifically to improve inference times.
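As an illustration, here is a minimal sketch using the sentence-transformers library; the model name below is one common choice among many and is not prescribed by this article.

```python
# Minimal sketch of sentence embedding with the sentence-transformers library.
# The model name is one common choice; any sentence-embedding model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small pre-trained BERT-derived model

sentences = [
    "The contract can be terminated with three months' notice.",
    "Termination requires a ninety-day notice period.",
    "The cafeteria serves lunch at noon.",
]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

# Semantically close sentences end up numerically close:
print(util.cos_sim(embeddings, embeddings))
```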
Document Chunking
The corpus containing the knowledge to be exploited by the RAG cannot be used as-is. It must be cut into small pieces, the chunks (with, potentially, an overlap). As we have seen, these chunks cannot be used directly either and must be transformed via embedding. In addition, it is important to keep the metadata around each chunk, for example the source file it comes from, the chapter within the document, the last update date, etc.
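To make the idea concrete, here is a simplified chunking sketch in plain Python; real pipelines typically rely on the splitters provided by libraries such as LangChain or LlamaIndex, and the file name used here is purely illustrative.

```python
# Simplified fixed-size chunking with overlap, keeping metadata for each chunk.
def chunk_document(text, source, chunk_size=500, overlap=50):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({
                "text": piece,
                "metadata": {"source": source, "start_char": start},
            })
    return chunks

# "contract.txt" is a placeholder document, not a file referenced by this article.
with open("contract.txt", encoding="utf-8") as f:
    chunks = chunk_document(f.read(), source="contract.txt")
```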
The corpus can be large, and storing and retrieving these chunks raises new problems; this is where vector databases come in.
Specialized information system: Vector databases
To answer a given question, a RAG calculates an embedding of the question and looks for the relevant chunks. As embeddings preserve the notion of semantic distance, finding relevant chunks amounts to finding chunks that are close, in terms of distance, in the embedding space. We can therefore formalize the problem of finding relevant documents as "find the k closest chunks in the embedding space". This operation should be inexpensive, even with a large corpus. That's where vector DBs come in.
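In its simplest form, this nearest-neighbour search can be written in a few lines of NumPy; the sketch below is a brute-force version of the operation that vector databases optimize.

```python
# Brute-force "find the k closest chunks in the embedding space" using cosine similarity.
import numpy as np

def top_k_chunks(question_embedding, chunk_embeddings, k=5):
    q = question_embedding / np.linalg.norm(question_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity of each chunk with the question
    return np.argsort(scores)[::-1][:k]   # indices of the k most similar chunks
```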
To solve these problems, it is no longer practical to use relational databases such as MySQL, or NoSQL stores such as Redis: the way they store information is not adapted to the types of requests made in a RAG.
Fortunately, there are databases designed specifically for certain tasks, such as TimescaleDB for time series or PostGIS for geographic data. Here, it is the vector DBs that answer our problem. They store embeddings in an optimized manner so that it is possible to find the k vectors closest to a given vector. This is where we find players such as ChromaDB, Qdrant, or PgVector.
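As an example, here is what storing and querying chunks could look like with ChromaDB; Qdrant or PgVector expose similar operations. The variables chunks, chunk_embeddings and question_embedding are assumed to come from the previous sketches.

```python
# Sketch with ChromaDB; chunks, chunk_embeddings and question_embedding
# are assumed to come from the previous sketches.
import chromadb

client = chromadb.Client()                      # in-memory instance, for illustration
collection = client.create_collection(name="docs")

# Store each chunk with its embedding and its metadata (source file, chapter, ...).
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    embeddings=[e.tolist() for e in chunk_embeddings],
    documents=[c["text"] for c in chunks],
    metadatas=[c["metadata"] for c in chunks],
)

# Retrieve the k chunks closest to the question embedding.
results = collection.query(query_embeddings=[question_embedding.tolist()], n_results=5)
```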
At the end of this stage, the RAG has therefore found in the database the k chunks most relevant to the question asked. They are then passed to the LLM to produce the final answer.
LLM
The user's question and the chunks are assembled to constitute a prompt, which is provided as input to an LLM such as Llama 2, Falcon, etc., in the usual manner. It should be noted that generating a response from elements provided to the LLM is an easier task than having to generate everything from scratch. Thus, even with a "small" LLM (7B parameters), very relevant results are already obtained.
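The prompt itself is just text; the template below is one possible way of assembling the question and the retrieved chunks, not a standard imposed by the architecture.

```python
# One possible prompt template combining the question and the retrieved chunks.
def build_prompt(question, retrieved_chunks):
    context = "\n\n".join(
        f"[Source: {c['metadata']['source']}]\n{c['text']}" for c in retrieved_chunks
    )
    return (
        "Answer the question using only the context below, and cite the sources you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```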
The RAG architecture
With the above elements, we can present the complete architecture of a RAG. First of all, the vector DB is populated with the embeddings and the metadata of the chunks from the document base. When a request arrives, its embedding is calculated. The k most relevant chunks are retrieved from the vector DB. The question and the chunks are combined to create a prompt. This prompt is passed to an LLM, which provides the response.
This response can also be enriched with the sources used, thanks to the metadata associated with the chunks.
Technically, this general architecture is now classical, and we have identified the various building blocks necessary for its construction. In practice, Python libraries such as LangChain or LlamaIndex make it possible to choose the LLM, the embedding model, the vector DB, as well as their parameters, and to combine them efficiently.
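As an illustration, a minimal end-to-end pipeline with LlamaIndex could look like the sketch below. The API names correspond to recent versions of the library and may differ; it also assumes that an LLM and an embedding model are configured (OpenAI by default) and that "data/" is a placeholder folder containing the documents.

```python
# Minimal RAG pipeline with LlamaIndex (assumes a configured LLM/embedding backend).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data/").load_data()    # "data/" is a placeholder folder
index = VectorStoreIndex.from_documents(documents)        # chunking + embeddings + vector store
query_engine = index.as_query_engine(similarity_top_k=3)  # retrieve k chunks, then call the LLM

response = query_engine.query("What is the notice period defined in the contract?")
print(response)               # generated answer
print(response.source_nodes)  # chunks and metadata used, i.e. the sources
```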
Conclusion
The RAG architecture therefore offers numerous advantages compared to the use of an LLM alone. We can list the following elements:
- Reliability: by using a knowledge base controlled by the user, the probability of hallucination is greatly reduced.
- Trust: the sources used for the generation are provided. If the user bears significant responsibility for the answer, they can refer directly to the sources.
- Efficiency: the LLM used to generate the response can be much smaller than a GPT-4 for comparable results. This architecture avoids fine-tuning a model on the corpus. Moreover, even with few documents, it is possible to build a relevant RAG.
- Flexibility: the knowledge base can be updated simply by adding documents to the vector DB.
This architecture is already very effective with generic building blocks. It is possible to improve performance further through fine-tuning: the LLM can be re-trained on a question-documents-answer dataset, which can for example be generated by a larger model (such as GPT-4), thus improving the quality of the generated responses.
In addition, it is possible to modify this architecture to incorporate the ability to use tools; we then speak of "agents". In this context, before answering the question directly, an LLM can choose to use a tool, for example an online search, an API, a calculator, etc. Thus, the LLM can itself choose the query to make to the vector DB, and it can combine the RAG with other tools, such as online search.
While RAGs offer many advantages, they are still machine learning systems. Their performance must be measured carefully and their use monitored continuously, in particular because the knowledge base evolves over time. The metrics to observe are the subject of in-depth study, but they include, for example, performance metrics on the quality of the results as well as safety metrics, such as those around toxicity.
Want to know more about how to deploy a domain-specific LLM and run a RAG test with your own data? Make an appointment here.