Retrieval-Augmented Generation (RAG) Workflow
Standard Workflow of RAG
- User Prompt
  - The system receives a user query or prompt as input.
- Embedding Generation
  - The user prompt is tokenized and transformed into a vector representation by an embedding model (e.g., Sentence-BERT or another transformer encoder), as sketched below.
  - This vector captures the semantic meaning of the query.
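A minimal sketch of the embedding step using the sentence-transformers library; the model name and example query are assumptions for illustration, and any encoder that produces fixed-size vectors would work:

```python
from sentence_transformers import SentenceTransformer

# Load a Sentence-BERT-style encoder (the model name is illustrative).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the user prompt into a single dense vector capturing its meaning.
query = "How do I reset my account password?"
query_vector = model.encode(query)  # NumPy array, shape (384,) for this model
```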
- Similarity Search in FAISS
  - The query vector is compared against the stored document vectors in the FAISS index to find the most semantically similar documents, using a metric such as cosine (inner-product) similarity or Euclidean distance.
  - FAISS returns the top-N most relevant results as document IDs with their similarity scores (see the sketch below).
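A sketch of the FAISS search, assuming the document vectors were embedded and indexed offline; `IndexFlatIP` performs exact inner-product search, which equals cosine similarity once the vectors are L2-normalized (the random vectors here stand in for real embeddings):

```python
import faiss
import numpy as np

dim = 384                       # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)  # exact inner-product (cosine after normalization)

# Stand-in for document embeddings built offline; float32 is required by FAISS.
doc_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(doc_vectors)  # normalize so inner product == cosine similarity
index.add(doc_vectors)

# Normalize the query vector the same way, then take the top-N hits.
query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)
scores, doc_ids = index.search(query_vector, 5)  # top-5 positions + scores
```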
- Retrieval of Raw Text
  - Because FAISS stores only vectors, the document IDs it returns are used to fetch the corresponding raw text from an external data store (e.g., Elasticsearch, MongoDB), as sketched below.
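What the lookup might look like against MongoDB; the connection string, database, collection, and field names are all assumptions for illustration. FAISS returns row positions, so a position-to-ID map saved alongside the index translates hits back to application-level document IDs:

```python
from pymongo import MongoClient

# Top positions returned by FAISS in the previous step (illustrative values).
faiss_positions = [17, 332, 104]

# Position-to-ID map, assumed to be saved alongside the index at build time.
id_map = {17: "doc-017", 332: "doc-332", 104: "doc-104"}
hit_ids = [id_map[p] for p in faiss_positions]

# Fetch the raw text for those IDs from the document store.
client = MongoClient("mongodb://localhost:27017")
collection = client["rag"]["documents"]
retrieved_texts = [doc["text"] for doc in collection.find({"_id": {"$in": hit_ids}})]
```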
- Contextual Generation
  - The retrieved text and the original user prompt are fed into a generative model (e.g., T5, GPT).
  - The model conditions its output on the combined query and retrieved documents to produce the final response (see the sketch below).
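A sketch of the generation step with a small T5-family model via Hugging Face transformers; the model choice and the prompt template are assumptions, since the exact format is tuned per application:

```python
from transformers import pipeline

# Load a small sequence-to-sequence model (the model name is illustrative).
generator = pipeline("text2text-generation", model="google/flan-t5-small")

# Combine the retrieved passages with the original question into one prompt.
retrieved_texts = ["Passwords can be reset from the account settings page."]
query = "How do I reset my account password?"
prompt = (
    "Answer the question using only the context.\n"
    f"Context: {' '.join(retrieved_texts)}\n"
    f"Question: {query}"
)

answer = generator(prompt, max_new_tokens=64)[0]["generated_text"]
print(answer)
```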
Why Similarity Search Comes First
- Semantic Matching: Ensures the generative model receives semantically relevant content, improving response quality.
- Efficiency: Narrows a large corpus down to a small set of candidate documents, which keeps generation fast and the combined prompt within the model's context window.
When Text Search Might Be Used
- Supplementary Retrieval: Keyword (full-text) search can complement the vector search, for example to apply metadata filters or to match exact terms.
- Hybrid Models:
  - Combine vector similarity search with full-text search to balance semantic and lexical matching (see the sketch after this list).
  - Refine or filter results against more complex conditions after the initial vector search.
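One common hybrid scheme is a weighted sum of the vector similarity and a lexical score such as BM25. Below is a sketch using the rank_bm25 package; the corpus, the vector scores, and the weight alpha are all illustrative assumptions:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "resetting your password from account settings",
    "billing and invoices overview",
    "two-factor authentication setup",
]

# Lexical side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.split() for doc in corpus])
lexical_scores = bm25.get_scores("reset account password".split())

# Semantic side: cosine similarities from the FAISS step, aligned with the
# corpus order (values here are illustrative).
vector_scores = [0.82, 0.10, 0.35]

def minmax(xs):
    """Scale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

# Weighted combination; alpha is a tunable assumption, not a standard value.
alpha = 0.7
combined = [alpha * v + (1 - alpha) * l
            for v, l in zip(minmax(vector_scores), minmax(lexical_scores))]
best = max(range(len(corpus)), key=combined.__getitem__)
print(corpus[best], combined[best])
```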
Conclusion
- Primary Approach: RAG begins with a similarity search against a vector index (e.g., FAISS) to identify semantically relevant documents.
- Text Search: Used as a complementary step for filtering or for more complex retrieval strategies.