So you want to build a chatbot that actually knows things — not just a glorified autocomplete, but one that reads your documents, understands context, and gives real answers. In this tutorial, we'll go from zero to a fully deployed Retrieval-Augmented Generation (RAG) chatbot using LangChain, OpenAI GPT-4o (or a fully local Ollama model as a drop-in alternative), FAISS as our vector store, and a clean FastAPI backend with a lightweight single-page chat UI.
By the end, you'll have a working app you can point at any document corpus — PDFs, Markdown files, web pages — and chat with it intelligently.
Prerequisites
Before diving in, make sure you're comfortable with:
- Python 3.11+
- Basic REST API concepts
- A basic grasp of how LLMs work (tokens, embeddings, context windows)
You'll also need one of the following:
- An OpenAI API key (for GPT-4o + `text-embedding-3-small`)
- Ollama installed locally (for models like `llama3`, `mistral`, or `phi3`) — completely free and private
The Architecture at a Glance
Before writing a single line of code, let's understand what we're building:
┌──────────────────────────────────────────────────┐
│                  Your Documents                  │
│        (PDFs, Markdown, HTML, text files)        │
└────────────────────┬─────────────────────────────┘
                     │ Document Loader
                     ▼
┌──────────────────────────────────────────────────┐
│             Text Splitter / Chunker              │
└────────────────────┬─────────────────────────────┘
                     │ Chunks
                     ▼
┌──────────────────────────────────────────────────┐
│        Embedding Model (OpenAI / Ollama)         │
└────────────────────┬─────────────────────────────┘
                     │ Vectors
                     ▼
┌──────────────────────────────────────────────────┐
│                FAISS Vector Store                │
└────────────────────┬─────────────────────────────┘
                     │ Retriever
                     ▼
┌──────────────────────────────────────────────────┐
│   LangChain RAG Chain (Prompt + LLM + Memory)    │
└────────────────────┬─────────────────────────────┘
                     │ Answer
                     ▼
┌──────────────────────────────────────────────────┐
│          FastAPI Backend + Web Frontend          │
└──────────────────────────────────────────────────┘
This pattern is called Retrieval-Augmented Generation (RAG). Instead of fine-tuning a model (expensive and slow), we:
- Break documents into chunks and embed them into vector space
- At query time, embed the user's question and retrieve the k most relevant chunks
- Feed those chunks + the question into the LLM as context
- The LLM synthesizes a grounded answer
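To make that concrete, here is roughly what the query-time half looks like in LangChain. This is a minimal sketch, not the tutorial's final code: it assumes a FAISS index has already been built (Step 3), uses the OpenAI provider for brevity, and skips the memory, MMR retrieval, and grounding prompt that the full chain in Step 4 adds. The question is a hypothetical example.

```python
# Minimal query-time RAG sketch — illustrative only.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load a previously built index (see Step 3)
vectorstore = FAISS.load_local(
    "vectorstore/faiss_index",
    OpenAIEmbeddings(model="text-embedding-3-small"),
    allow_dangerous_deserialization=True,
)

question = "What was revenue in 2024?"                 # hypothetical user question
docs = vectorstore.similarity_search(question, k=5)   # embed the question + retrieve top-k chunks
context = "\n\n".join(d.page_content for d in docs)

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
answer = llm.invoke(f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```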
Step 1: Project Setup
Create your project structure:
mkdir rag-chatbot && cd rag-chatbot
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install langchain langchain-community langchain-openai \
faiss-cpu pypdf unstructured fastapi uvicorn \
python-multipart python-dotenv tiktoken
Your project tree:
rag-chatbot/
├── backend/
│   ├── ingest.py       # Document ingestion pipeline
│   ├── chain.py        # LangChain RAG chain
│   ├── main.py         # FastAPI app
│   └── config.py       # Config & env vars
├── frontend/
│   └── index.html      # Simple chat UI
├── docs/               # Drop your documents here
├── vectorstore/        # FAISS index saved here
└── .env
Create your .env:
# Use ONE of the following:
OPENAI_API_KEY=sk-... # For OpenAI
OLLAMA_BASE_URL=http://localhost:11434 # For Ollama
LLM_PROVIDER=openai # "openai" or "ollama"
OPENAI_MODEL=gpt-4o
OLLAMA_MODEL=llama3
EMBED_MODEL=text-embedding-3-small # OpenAI embeddings
OLLAMA_EMBED_MODEL=nomic-embed-text # Ollama embeddings
CHUNK_SIZE=1000
CHUNK_OVERLAP=150
RETRIEVER_K=5
Step 2: Configuration Module
# backend/config.py
import os
from dotenv import load_dotenv
load_dotenv()
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")
OLLAMA_EMBED = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 1000))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 150))
RETRIEVER_K = int(os.getenv("RETRIEVER_K", 5))
VECTORSTORE_PATH = "vectorstore/faiss_index"
DOCS_DIR = "docs"
Step 3: Document Ingestion Pipeline
This is the heart of your RAG system. We'll load documents, split them smartly, embed them, and persist the vector store.
# backend/ingest.py
import os
from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from config import (
    CHUNK_SIZE, CHUNK_OVERLAP, VECTORSTORE_PATH,
    DOCS_DIR, LLM_PROVIDER, EMBED_MODEL, OLLAMA_BASE_URL, OLLAMA_EMBED,
)


def get_embeddings():
    """Return the appropriate embedding model based on config."""
    if LLM_PROVIDER == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(model=EMBED_MODEL)
    else:
        from langchain_community.embeddings import OllamaEmbeddings
        return OllamaEmbeddings(
            base_url=OLLAMA_BASE_URL,
            model=OLLAMA_EMBED,
        )


def load_documents(docs_dir: str) -> list:
    """Load all supported documents from a directory."""
    docs = []
    loaders = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
        ".md": UnstructuredMarkdownLoader,
    }
    for filepath in Path(docs_dir).rglob("*"):
        suffix = filepath.suffix.lower()
        if suffix in loaders:
            print(f" Loading: {filepath}")
            loader = loaders[suffix](str(filepath))
            docs.extend(loader.load())
    print(f"\n✅ Loaded {len(docs)} document pages/sections.")
    return docs


def split_documents(docs: list) -> list:
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    chunks = splitter.split_documents(docs)
    print(f"✅ Split into {len(chunks)} chunks.")
    return chunks


def build_vectorstore(chunks: list):
    """Embed chunks and build/save the FAISS index."""
    embeddings = get_embeddings()
    print("⏳ Embedding chunks — this may take a while...")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    os.makedirs(os.path.dirname(VECTORSTORE_PATH), exist_ok=True)
    vectorstore.save_local(VECTORSTORE_PATH)
    print(f"✅ Vector store saved to '{VECTORSTORE_PATH}'")
    return vectorstore


def ingest():
    print("📂 Loading documents...")
    docs = load_documents(DOCS_DIR)
    chunks = split_documents(docs)
    build_vectorstore(chunks)


if __name__ == "__main__":
    ingest()
Run it once after dropping your files into docs/:
python backend/ingest.py
Why RecursiveCharacterTextSplitter?
It tries to split on natural boundaries — double newlines first (paragraphs), then single newlines, then sentences. This keeps semantic meaning intact far better than a naive character split.
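You can see this behaviour with a toy snippet — the sample text below is made up, and the tiny chunk size is just to force a split:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = (
    "RAG keeps answers grounded in your own documents.\n\n"
    "It retrieves relevant chunks at query time. The LLM then answers from them."
)
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=10)
for chunk in splitter.split_text(sample):
    print(repr(chunk))
# The split lands on the blank line between paragraphs, not mid-word or mid-sentence.
```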
Step 4: Building the LangChain RAG Chain
# backend/chain.py
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

from config import (
    LLM_PROVIDER, OPENAI_MODEL, OLLAMA_MODEL, OLLAMA_BASE_URL,
    VECTORSTORE_PATH, RETRIEVER_K,
)


def get_llm():
    """Instantiate the LLM based on provider config."""
    if LLM_PROVIDER == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(
            model=OPENAI_MODEL,
            temperature=0.2,
            streaming=True,
        )
    else:
        from langchain_community.chat_models import ChatOllama
        return ChatOllama(
            base_url=OLLAMA_BASE_URL,
            model=OLLAMA_MODEL,
            temperature=0.2,
        )


def get_embeddings():
    """Reuse the ingest-time embedding factory so queries are embedded
    with the same model that built the index."""
    from ingest import get_embeddings as _get
    return _get()


SYSTEM_PROMPT_TEMPLATE = """You are a helpful, knowledgeable assistant.
Use ONLY the following retrieved context to answer the question.
If the context does not contain sufficient information, say so honestly — do not hallucinate.

Context:
{context}

Chat History:
{chat_history}

Question: {question}

Answer (be concise, cite sources where possible):"""

QA_PROMPT = PromptTemplate(
    input_variables=["context", "chat_history", "question"],
    template=SYSTEM_PROMPT_TEMPLATE,
)


def build_chain():
    """Build and return the conversational RAG chain."""
    embeddings = get_embeddings()
    vectorstore = FAISS.load_local(
        VECTORSTORE_PATH, embeddings, allow_dangerous_deserialization=True
    )
    retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance — reduces redundancy
        search_kwargs={"k": RETRIEVER_K, "fetch_k": RETRIEVER_K * 3},
    )
    llm = get_llm()
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=6,  # Keep the last 6 exchanges in memory
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        combine_docs_chain_kwargs={"prompt": QA_PROMPT},
        return_source_documents=True,
        verbose=False,
    )
    return chain
Why MMR Retrieval?
Plain similarity search can return 5 nearly identical chunks. Maximal Marginal Relevance (MMR) balances relevance and diversity, so your context window gets richer coverage of the topic instead of redundant repetition.
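For comparison, here is the configuration difference — a sketch assuming the `vectorstore` loaded in `build_chain()`; the optional `lambda_mult` parameter trades relevance (1.0) against diversity (0.0):

```python
# Plain similarity: the k nearest chunks, which may be near-duplicates
plain = vectorstore.as_retriever(search_kwargs={"k": 5})

# MMR: fetch a wider candidate pool, then keep k results that are
# relevant to the query but dissimilar to each other
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 15, "lambda_mult": 0.5},
)
```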
Step 5: FastAPI Backend
# backend/main.py
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

from chain import build_chain

app = FastAPI(title="RAG Chatbot API", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Build the chain once on startup (expensive — don't rebuild per request)
print("🔗 Initialising RAG chain...")
rag_chain = build_chain()
print("✅ Chain ready.")


class ChatRequest(BaseModel):
    question: str
    session_id: str = "default"  # Reserved for per-session memory; this demo shares one chain (and history) across users


class SourceDoc(BaseModel):
    source: str
    page: int | None = None
    snippet: str


class ChatResponse(BaseModel):
    answer: str
    sources: list[SourceDoc]


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    if not req.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty.")
    result = rag_chain.invoke({"question": req.question})
    sources = []
    for doc in result.get("source_documents", []):
        meta = doc.metadata
        sources.append(SourceDoc(
            source=meta.get("source", "Unknown"),
            page=meta.get("page"),
            snippet=doc.page_content[:200].replace("\n", " "),
        ))
    return ChatResponse(answer=result["answer"], sources=sources)


@app.get("/health")
async def health():
    return {"status": "ok"}


# Serve the frontend from /frontend (mounted last so the API routes above take precedence)
app.mount("/", StaticFiles(directory="frontend", html=True), name="frontend")
Start the server from the project root — the --app-dir flag puts backend/ on the import path so main.py can import chain and config, while relative paths like vectorstore/ and frontend/ still resolve against the project directory:
uvicorn main:app --app-dir backend --reload --port 8000
Step 6: The Web Interface
A clean single-file chat UI that calls the /chat endpoint and renders source citations under each answer (token-by-token streaming is covered in the Pro Tips below):
<!-- frontend/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>RAG Chatbot</title>
  <style>
    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
    body {
      font-family: 'Segoe UI', system-ui, sans-serif;
      background: #0f172a; color: #e2e8f0;
      display: flex; flex-direction: column; height: 100vh;
    }
    header {
      padding: 1rem 1.5rem;
      background: #1e293b;
      border-bottom: 1px solid #334155;
      font-size: 1.2rem; font-weight: 700; color: #38bdf8;
    }
    #messages {
      flex: 1; overflow-y: auto; padding: 1.5rem;
      display: flex; flex-direction: column; gap: 1rem;
    }
    .msg { max-width: 75%; padding: 0.75rem 1rem; border-radius: 12px; line-height: 1.6; }
    .msg.user { background: #1d4ed8; align-self: flex-end; }
    .msg.bot { background: #1e293b; align-self: flex-start; border: 1px solid #334155; }
    .sources { font-size: 0.75rem; color: #94a3b8; margin-top: 0.5rem; }
    .sources span { display: block; }
    #form {
      display: flex; gap: 0.75rem; padding: 1rem 1.5rem;
      background: #1e293b; border-top: 1px solid #334155;
    }
    #input {
      flex: 1; padding: 0.65rem 1rem; border-radius: 8px;
      background: #0f172a; color: #e2e8f0; border: 1px solid #475569;
      font-size: 1rem; outline: none;
    }
    #input:focus { border-color: #38bdf8; }
    button {
      padding: 0.65rem 1.25rem; border-radius: 8px; border: none;
      background: #0ea5e9; color: #fff; font-weight: 600;
      cursor: pointer; transition: background 0.2s;
    }
    button:hover { background: #38bdf8; }
    button:disabled { background: #334155; cursor: not-allowed; }
  </style>
</head>
<body>
  <header>🤖 RAG Chatbot</header>
  <div id="messages"></div>
  <form id="form">
    <input id="input" placeholder="Ask a question about your documents..." autocomplete="off" />
    <button id="send-btn" type="submit">Send</button>
  </form>
  <script>
    const form = document.getElementById("form");
    const input = document.getElementById("input");
    const btn = document.getElementById("send-btn");
    const msgs = document.getElementById("messages");

    function addMessage(text, role, sources = []) {
      const div = document.createElement("div");
      div.className = `msg ${role}`;
      div.textContent = text;
      if (sources.length) {
        const s = document.createElement("div");
        s.className = "sources";
        s.innerHTML = "📎 Sources: " + sources.map(src =>
          `<span>${src.source}${src.page != null ? ` (p.${src.page})` : ""} — ${src.snippet}…</span>`
        ).join("");
        div.appendChild(s);
      }
      msgs.appendChild(div);
      msgs.scrollTop = msgs.scrollHeight;
    }

    form.addEventListener("submit", async (e) => {
      e.preventDefault();
      const q = input.value.trim();
      if (!q) return;
      addMessage(q, "user");
      input.value = "";
      btn.disabled = true;
      try {
        const res = await fetch("/chat", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ question: q }),
        });
        const data = await res.json();
        addMessage(data.answer, "bot", data.sources);
      } catch {
        addMessage("⚠️ Something went wrong. Is the server running?", "bot");
      } finally {
        btn.disabled = false;
        input.focus();
      }
    });
  </script>
</body>
</html>
Open http://localhost:8000 and start chatting with your documents. 🎉
Step 7: Switching Between OpenAI and Ollama
The entire swap happens in your .env file — no code changes:
# Use OpenAI (cloud, costs money, higher quality)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# OR use Ollama (local, free, private)
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3
OLLAMA_EMBED_MODEL=nomic-embed-text
To pull an Ollama model:
ollama pull llama3
ollama pull nomic-embed-text # For embeddings
No code changes needed — the get_llm() and get_embeddings() factory functions handle everything.
Best Practices
1. Chunk Size Matters
| Document Type | Recommended Chunk Size | Overlap (chars) |
|---|---|---|
| Technical docs | 800–1200 chars | 150 |
| Legal contracts | 500–800 chars | 100 |
| Conversational | 300–500 chars | 80 |
| Code files | 1200–2000 chars | 200 |
Smaller chunks = more precise retrieval. Larger chunks = more context per chunk. Always tune empirically.
2. Persist Your Vector Store
Never re-embed on every startup. FAISS.save_local() / load_local() is fast (milliseconds) and free after the initial build.
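A convenient load-or-build guard — a sketch that reuses the helpers from ingest.py and config.py:

```python
import os

from langchain_community.vectorstores import FAISS

from config import VECTORSTORE_PATH, DOCS_DIR
from ingest import get_embeddings, load_documents, split_documents, build_vectorstore

if os.path.exists(VECTORSTORE_PATH):
    # Millisecond load of the existing index
    vectorstore = FAISS.load_local(
        VECTORSTORE_PATH, get_embeddings(), allow_dangerous_deserialization=True
    )
else:
    # One-off (slow) embedding pass
    vectorstore = build_vectorstore(split_documents(load_documents(DOCS_DIR)))
```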
3. Use Source Metadata
Always attach source, page, and title metadata to your document chunks. This lets the UI show citations and helps with debugging hallucinations.
# Add metadata manually when loading
from langchain_core.documents import Document

doc = Document(
    page_content="...",
    metadata={"source": "annual_report_2025.pdf", "page": 12, "section": "Revenue"},
)
4. Temperature for RAG
Keep temperature low (0.0–0.3) for factual Q&A bots. Higher temperatures produce more creative, but less grounded, answers.
5. Memory Window
ConversationBufferWindowMemory(k=6) keeps the last 6 exchanges. Don't go much higher without a summarisation strategy — you'll blow your context window on long conversations.
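If you do need longer conversations, one option is to swap the window memory for a summarising buffer. A sketch (a drop-in for the memory object in build_chain(); the token budget is an example value):

```python
from langchain.memory import ConversationSummaryBufferMemory

from chain import get_llm

memory = ConversationSummaryBufferMemory(
    llm=get_llm(),            # the summariser can be the same model as the chat LLM
    max_token_limit=1000,     # older turns get compressed into a summary past this budget
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)
```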
Common Mistakes
❌ Embedding on Every Request
Never call FAISS.from_documents() inside your API handler. Build the index once and load it. This mistake turns a fast retrieval (milliseconds) into a slow embedding pipeline (minutes).
❌ Ignoring Chunk Overlap
Without overlap, a sentence split across two chunks may be semantically broken in both. Always set chunk_overlap to at least 10–15% of chunk_size.
❌ Trusting the LLM Without Retrieval Grounding
Without a strict prompt that says "use only the provided context", the LLM will happily hallucinate. Your system prompt is your first line of defence.
❌ Using an Outdated Embedding Model
Don't use text-embedding-ada-002 (old, 2022) when text-embedding-3-small is cheaper and better. For Ollama, nomic-embed-text dramatically outperforms generic alternatives.
❌ Storing Raw API Keys in Code
Always use .env + python-dotenv. Never commit API keys to Git. Consider a secrets manager (AWS Secrets Manager, HashiCorp Vault) in production.
🚀 Pro Tips
- Hybrid Search: Combine vector similarity with BM25 keyword search using `langchain_community.retrievers.BM25Retriever` plus `EnsembleRetriever` for dramatically better retrieval on technical or jargon-heavy documents (see the sketch after this list).
- Re-ranking: After retrieving k candidates, run them through a cross-encoder re-ranker (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2` via `sentence-transformers`) before passing them to the LLM. This catches retrieval ranking errors.
- Query Expansion: Have the LLM rewrite the user's question into 2–3 alternative phrasings before retrieval — the union of results covers more semantic ground.
- Streaming Responses: Replace the `await fetch(...)` pattern with a `ReadableStream` reader for token-by-token streaming from the backend. Use FastAPI's `StreamingResponse` plus LangChain's async generator interface.
- Evaluation: Use RAGAS (Retrieval Augmented Generation Assessment) to measure faithfulness, answer relevancy, and context recall automatically — invaluable before pushing to production.
- Metadata Filters: FAISS supports filtering by metadata at retrieval time. Tag chunks by department, date range, or document type and filter user queries accordingly.
- Async Everything: In production, use `chain.ainvoke()` instead of `chain.invoke()` and `async def` throughout your FastAPI handlers to serve concurrent users efficiently.
Deploying to Production
Docker
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY backend/ ./backend/
COPY frontend/ ./frontend/
COPY vectorstore/ ./vectorstore/
# .env is deliberately NOT copied into the image — secrets are injected at runtime via --env-file
CMD ["uvicorn", "main:app", "--app-dir", "backend", "--host", "0.0.0.0", "--port", "8000"]
docker build -t rag-chatbot .
docker run -p 8000:8000 --env-file .env rag-chatbot
Production Checklist
- Replace FAISS with Chroma, Qdrant, or pgvector for multi-user / persistent production deployments
- Add authentication (JWT, OAuth2) to the `/chat` endpoint
- Implement rate limiting with `slowapi` or a reverse proxy (see the sketch after this checklist)
- Log queries and answers to a database for analytics and feedback loops
- Set up health checks and auto-restart via `systemd` or Kubernetes
- Enable HTTPS via an Nginx reverse proxy or Caddy
- Store the FAISS index on persistent volume storage (not the container filesystem)
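For the rate-limiting item, a minimal slowapi sketch applied to the /chat handler in backend/main.py — the "20/minute" budget is an example value:

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # rate-limit per client IP
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat", response_model=ChatResponse)
@limiter.limit("20/minute")                      # example per-client budget
async def chat(request: Request, req: ChatRequest):
    ...  # same handler body as before; slowapi needs the raw Request parameter
```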
📌 Key Takeaways
- RAG beats fine-tuning for most document Q&A use cases — it's faster, cheaper, and easier to update.
- Chunk strategy is critical: wrong chunk sizes lead to poor retrieval quality regardless of LLM power.
- OpenAI and Ollama are interchangeable with LangChain — use cloud for quality, local for privacy.
- MMR retrieval reduces redundancy in retrieved chunks and improves answer quality.
- Low temperature + grounded prompt is the combination that minimizes hallucination in RAG systems.
- Always show sources — it builds user trust and makes debugging far easier.
- The vector store is built once and served many times — protect this investment with persistence.
Conclusion
You've now built a complete, production-ready RAG chatbot from scratch. The pipeline — load documents → split → embed → store → retrieve → generate — is the foundation of virtually every enterprise knowledge-base AI deployed today.
What you built is modular by design: swap FAISS for Qdrant, swap FastAPI for Django, or swap the frontend for a Next.js app — the LangChain chain in the middle stays the same. That's the real power of this architecture.
From here, natural next steps include:
- Adding user authentication and multi-tenancy (separate vector stores per org)
- Experimenting with agentic tools (web search, calculator, code execution) via LangChain Agents
- Building an evaluation harness with RAGAS to continuously measure answer quality
- Integrating a feedback loop where users can thumbs-up/thumbs-down answers to improve retrieval over time
The gap between a prototype and production chatbot is smaller than ever. Now go build something useful.