So you want to build a chatbot that actually knows things — not just a glorified autocomplete, but one that reads your documents, understands context, and gives real answers. In this tutorial, we'll go from zero to a fully deployed Retrieval-Augmented Generation (RAG) chatbot using LangChain, OpenAI GPT-4o (or a fully local Ollama model as a drop-in alternative), FAISS as our vector store, and a clean FastAPI backend with a lightweight single-page chat UI.
By the end, you'll have a working app you can point at any document corpus — PDFs, Markdown files, web pages — and chat with it intelligently.
Prerequisites
Before diving in, make sure you're comfortable with:
- Python 3.11+
- Basic REST API concepts
- A basic grasp of how LLMs work (tokens, embeddings, context windows)
You'll also need one of the following:
- An OpenAI API key (for GPT-4o + `text-embedding-3-small`)
- Ollama installed locally (for models like `llama3`, `mistral`, or `phi3`) — completely free and private
The Architecture at a Glance
Before writing a single line of code, let's understand what we're building:
┌──────────────────────────────────────────────────┐
│                  Your Documents                  │
│        (PDFs, Markdown, HTML, text files)        │
└────────────────────┬─────────────────────────────┘
                     │ Document Loader
                     ▼
┌──────────────────────────────────────────────────┐
│             Text Splitter / Chunker              │
└────────────────────┬─────────────────────────────┘
                     │ Chunks
                     ▼
┌──────────────────────────────────────────────────┐
│        Embedding Model (OpenAI / Ollama)         │
└────────────────────┬─────────────────────────────┘
                     │ Vectors
                     ▼
┌──────────────────────────────────────────────────┐
│                FAISS Vector Store                │
└────────────────────┬─────────────────────────────┘
                     │ Retriever
                     ▼
┌──────────────────────────────────────────────────┐
│   LangChain RAG Chain (Prompt + LLM + Memory)    │
└────────────────────┬─────────────────────────────┘
                     │ Answer
                     ▼
┌──────────────────────────────────────────────────┐
│          FastAPI Backend + Web Frontend          │
└──────────────────────────────────────────────────┘
This pattern is called Retrieval-Augmented Generation (RAG). Instead of fine-tuning a model (expensive and slow), we:
- Break documents into chunks and embed them into vector space
- At query time, embed the user's question and retrieve the k most relevant chunks
- Feed those chunks + the question into the LLM as context
- The LLM synthesizes a grounded answer
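To make that concrete, here is roughly what the query-time half looks like in LangChain. This is a minimal sketch, not the tutorial's final code: it assumes a FAISS index has already been built (Step 3), uses the OpenAI provider for brevity, and skips the memory, MMR retrieval, and grounding prompt that the full chain in Step 4 adds. The question is a hypothetical example.

```python
# Minimal query-time RAG sketch — illustrative only.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load a previously built index (see Step 3)
vectorstore = FAISS.load_local(
    "vectorstore/faiss_index",
    OpenAIEmbeddings(model="text-embedding-3-small"),
    allow_dangerous_deserialization=True,
)

question = "What was revenue in 2024?"                 # hypothetical user question
docs = vectorstore.similarity_search(question, k=5)   # embed the question + retrieve top-k chunks
context = "\n\n".join(d.page_content for d in docs)

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
answer = llm.invoke(f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```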
Step 1: Project Setup
Create your project structure:
mkdir rag-chatbot && cd rag-chatbot
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install langchain langchain-community langchain-openai \
faiss-cpu pypdf unstructured fastapi uvicorn \
python-multipart python-dotenv tiktoken
Your project tree:
rag-chatbot/
├── backend/
│   ├── ingest.py       # Document ingestion pipeline
│   ├── chain.py        # LangChain RAG chain
│   ├── main.py         # FastAPI app
│   └── config.py       # Config & env vars
├── frontend/
│   └── index.html      # Simple chat UI
├── docs/               # Drop your documents here
├── vectorstore/        # FAISS index saved here
└── .env
Create your .env:
# Use ONE of the following:
OPENAI_API_KEY=sk-... # For OpenAI
OLLAMA_BASE_URL=http://localhost:11434 # For Ollama
LLM_PROVIDER=openai # "openai" or "ollama"
OPENAI_MODEL=gpt-4o
OLLAMA_MODEL=llama3
EMBED_MODEL=text-embedding-3-small # OpenAI embeddings
OLLAMA_EMBED_MODEL=nomic-embed-text # Ollama embeddings
CHUNK_SIZE=1000
CHUNK_OVERLAP=150
RETRIEVER_K=5
Step 2: Configuration Module
# backend/config.py
import os
from dotenv import load_dotenv
load_dotenv()
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")
OLLAMA_EMBED = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 1000))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 150))
RETRIEVER_K = int(os.getenv("RETRIEVER_K", 5))
VECTORSTORE_PATH = "vectorstore/faiss_index"
DOCS_DIR = "docs"
Step 3: Document Ingestion Pipeline
This is the heart of your RAG system. We'll load documents, split them smartly, embed them, and persist the vector store.
# backend/ingest.py
import os
from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from config import (
    CHUNK_SIZE, CHUNK_OVERLAP, VECTORSTORE_PATH,
    DOCS_DIR, LLM_PROVIDER, EMBED_MODEL, OLLAMA_BASE_URL, OLLAMA_EMBED,
)


def get_embeddings():
    """Return the appropriate embedding model based on config."""
    if LLM_PROVIDER == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(model=EMBED_MODEL)
    else:
        from langchain_community.embeddings import OllamaEmbeddings
        return OllamaEmbeddings(
            base_url=OLLAMA_BASE_URL,
            model=OLLAMA_EMBED,
        )


def load_documents(docs_dir: str) -> list:
    """Load all supported documents from a directory."""
    docs = []
    loaders = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
        ".md": UnstructuredMarkdownLoader,
    }
    for filepath in Path(docs_dir).rglob("*"):
        suffix = filepath.suffix.lower()
        if suffix in loaders:
            print(f" Loading: {filepath}")
            loader = loaders[suffix](str(filepath))
            docs.extend(loader.load())
    print(f"\n✅ Loaded {len(docs)} document pages/sections.")
    return docs


def split_documents(docs: list) -> list:
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    chunks = splitter.split_documents(docs)
    print(f"✅ Split into {len(chunks)} chunks.")
    return chunks


def build_vectorstore(chunks: list):
    """Embed chunks and build/save the FAISS index."""
    embeddings = get_embeddings()
    print("⏳ Embedding chunks — this may take a while...")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    os.makedirs(os.path.dirname(VECTORSTORE_PATH), exist_ok=True)
    vectorstore.save_local(VECTORSTORE_PATH)
    print(f"✅ Vector store saved to '{VECTORSTORE_PATH}'")
    return vectorstore


def ingest():
    print("📂 Loading documents...")
    docs = load_documents(DOCS_DIR)
    chunks = split_documents(docs)
    build_vectorstore(chunks)


if __name__ == "__main__":
    ingest()
Run it once after dropping your files into docs/:
python backend/ingest.py
Why RecursiveCharacterTextSplitter?
It tries to split on natural boundaries — double newlines first (paragraphs), then single newlines, then sentences. This keeps semantic meaning intact far better than a naive character split.
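You can see this behaviour with a toy snippet — the sample text below is made up, and the tiny chunk size is just to force a split:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = (
    "RAG keeps answers grounded in your own documents.\n\n"
    "It retrieves relevant chunks at query time. The LLM then answers from them."
)
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=10)
for chunk in splitter.split_text(sample):
    print(repr(chunk))
# The split lands on the blank line between paragraphs, not mid-word or mid-sentence.
```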
Step 4: Building the LangChain RAG Chain
# backend/chain.py
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

from config import (
    LLM_PROVIDER, OPENAI_MODEL, OLLAMA_MODEL, OLLAMA_BASE_URL,
    VECTORSTORE_PATH, RETRIEVER_K,
)


def get_llm():
    """Instantiate the LLM based on provider config."""
    if LLM_PROVIDER == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(
            model=OPENAI_MODEL,
            temperature=0.2,
            streaming=True,
        )
    else:
        from langchain_community.chat_models import ChatOllama
        return ChatOllama(
            base_url=OLLAMA_BASE_URL,
            model=OLLAMA_MODEL,
            temperature=0.2,
        )


def get_embeddings():
    """Reuse the ingest-time embedding factory so queries are embedded
    with the same model that built the index."""
    from ingest import get_embeddings as _get
    return _get()


SYSTEM_PROMPT_TEMPLATE = """You are a helpful, knowledgeable assistant.
Use ONLY the following retrieved context to answer the question.
If the context does not contain sufficient information, say so honestly — do not hallucinate.

Context:
{context}

Chat History:
{chat_history}

Question: {question}

Answer (be concise, cite sources where possible):"""

QA_PROMPT = PromptTemplate(
    input_variables=["context", "chat_history", "question"],
    template=SYSTEM_PROMPT_TEMPLATE,
)


def build_chain():
    """Build and return the conversational RAG chain."""
    embeddings = get_embeddings()
    vectorstore = FAISS.load_local(
        VECTORSTORE_PATH, embeddings, allow_dangerous_deserialization=True
    )
    retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance — reduces redundancy
        search_kwargs={"k": RETRIEVER_K, "fetch_k": RETRIEVER_K * 3},
    )
    llm = get_llm()
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=6,  # Keep the last 6 exchanges in memory
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        combine_docs_chain_kwargs={"prompt": QA_PROMPT},
        return_source_documents=True,
        verbose=False,
    )
    return chain
Why MMR Retrieval?
Plain similarity search can return 5 nearly identical chunks. Maximal Marginal Relevance (MMR) balances relevance and diversity, so your context window gets richer coverage of the topic instead of redundant repetition.
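For comparison, here is the configuration difference — a sketch assuming the `vectorstore` loaded in `build_chain()`; the optional `lambda_mult` parameter trades relevance (1.0) against diversity (0.0):

```python
# Plain similarity: the k nearest chunks, which may be near-duplicates
plain = vectorstore.as_retriever(search_kwargs={"k": 5})

# MMR: fetch a wider candidate pool, then keep k results that are
# relevant to the query but dissimilar to each other
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 15, "lambda_mult": 0.5},
)
```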
Step 5: FastAPI Backend
# backend/main.py
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

from chain import build_chain

app = FastAPI(title="RAG Chatbot API", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Build the chain once on startup (expensive — don't rebuild per request)
print("🔗 Initialising RAG chain...")
rag_chain = build_chain()
print("✅ Chain ready.")


class ChatRequest(BaseModel):
    question: str
    session_id: str = "default"  # Reserved for per-session memory; this demo shares one chain (and history) across users


class SourceDoc(BaseModel):
    source: str
    page: int | None = None
    snippet: str


class ChatResponse(BaseModel):
    answer: str
    sources: list[SourceDoc]


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    if not req.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty.")
    result = rag_chain.invoke({"question": req.question})
    sources = []
    for doc in result.get("source_documents", []):
        meta = doc.metadata
        sources.append(SourceDoc(
            source=meta.get("source", "Unknown"),
            page=meta.get("page"),
            snippet=doc.page_content[:200].replace("\n", " "),
        ))
    return ChatResponse(answer=result["answer"], sources=sources)


@app.get("/health")
async def health():
    return {"status": "ok"}


# Serve the frontend from /frontend (mounted last so the API routes above take precedence)
app.mount("/", StaticFiles(directory="frontend", html=True), name="frontend")
Start the server from the project root — the --app-dir flag puts backend/ on the import path so main.py can import chain and config, while relative paths like vectorstore/ and frontend/ still resolve against the project directory:
uvicorn main:app --app-dir backend --reload --port 8000
Step 6: The Web Interface
A clean single-file chat UI that calls the /chat endpoint and renders source citations under each answer (token-by-token streaming is covered in the Pro Tips below):
<!-- frontend/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>RAG Chatbot</title>
  <style>
    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
    body {
      font-family: 'Segoe UI', system-ui, sans-serif;
      background: #0f172a; color: #e2e8f0;
      display: flex; flex-direction: column; height: 100vh;
    }
    header {
      padding: 1rem 1.5rem;
      background: #1e293b;
      border-bottom: 1px solid #334155;
      font-size: 1.2rem; font-weight: 700; color: #38bdf8;
    }
    #messages {
      flex: 1; overflow-y: auto; padding: 1.5rem;
      display: flex; flex-direction: column; gap: 1rem;
    }
    .msg { max-width: 75%; padding: 0.75rem 1rem; border-radius: 12px; line-height: 1.6; }
    .msg.user { background: #1d4ed8; align-self: flex-end; }
    .msg.bot { background: #1e293b; align-self: flex-start; border: 1px solid #334155; }
    .sources { font-size: 0.75rem; color: #94a3b8; margin-top: 0.5rem; }
    .sources span { display: block; }
    #form {
      display: flex; gap: 0.75rem; padding: 1rem 1.5rem;
      background: #1e293b; border-top: 1px solid #334155;
    }
    #input {
      flex: 1; padding: 0.65rem 1rem; border-radius: 8px;
      background: #0f172a; color: #e2e8f0; border: 1px solid #475569;
      font-size: 1rem; outline: none;
    }
    #input:focus { border-color: #38bdf8; }
    button {
      padding: 0.65rem 1.25rem; border-radius: 8px; border: none;
      background: #0ea5e9; color: #fff; font-weight: 600;
      cursor: pointer; transition: background 0.2s;
    }
    button:hover { background: #38bdf8; }
    button:disabled { background: #334155; cursor: not-allowed; }
  </style>
</head>
<body>
  <header>🤖 RAG Chatbot</header>
  <div id="messages"></div>
  <form id="form">
    <input id="input" placeholder="Ask a question about your documents..." autocomplete="off" />
    <button id="send-btn" type="submit">Send</button>
  </form>
  <script>
    const form = document.getElementById("form");
    const input = document.getElementById("input");
    const btn = document.getElementById("send-btn");
    const msgs = document.getElementById("messages");

    function addMessage(text, role, sources = []) {
      const div = document.createElement("div");
      div.className = `msg ${role}`;
      div.textContent = text;
      if (sources.length) {
        const s = document.createElement("div");
        s.className = "sources";
        s.innerHTML = "📎 Sources: " + sources.map(src =>
          `<span>${src.source}${src.page != null ? ` (p.${src.page})` : ""} — ${src.snippet}…</span>`
        ).join("");
        div.appendChild(s);
      }
      msgs.appendChild(div);
      msgs.scrollTop = msgs.scrollHeight;
    }

    form.addEventListener("submit", async (e) => {
      e.preventDefault();
      const q = input.value.trim();
      if (!q) return;
      addMessage(q, "user");
      input.value = "";
      btn.disabled = true;
      try {
        const res = await fetch("/chat", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ question: q }),
        });
        const data = await res.json();
        addMessage(data.answer, "bot", data.sources);
      } catch {
        addMessage("⚠️ Something went wrong. Is the server running?", "bot");
      } finally {
        btn.disabled = false;
        input.focus();
      }
    });
  </script>
</body>
</html>
Open http://localhost:8000 and start chatting with your documents. 🎉
Step 7: Switching Between OpenAI and Ollama
The entire swap happens in your .env file — no code changes:
# Use OpenAI (cloud, costs money, higher quality)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# OR use Ollama (local, free, private)
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3
OLLAMA_EMBED_MODEL=nomic-embed-text
To pull an Ollama model:
ollama pull llama3
ollama pull nomic-embed-text # For embeddings
No code changes needed — the get_llm() and get_embeddings() factory functions handle everything.
Best Practices
1. Chunk Size Matters
| Document Type | Recommended Chunk Size | Overlap (chars) |
|---|---|---|
| Technical docs | 800–1200 chars | 150 |
| Legal contracts | 500–800 chars | 100 |
| Conversational | 300–500 chars | 80 |
| Code files | 1200–2000 chars | 200 |
Smaller chunks = more precise retrieval. Larger chunks = more context per chunk. Always tune empirically.
2. Persist Your Vector Store
Never re-embed on every startup. FAISS.save_local() / load_local() is fast (milliseconds) and free after the initial build.
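A convenient load-or-build guard — a sketch that reuses the helpers from ingest.py and config.py:

```python
import os

from langchain_community.vectorstores import FAISS

from config import VECTORSTORE_PATH, DOCS_DIR
from ingest import get_embeddings, load_documents, split_documents, build_vectorstore

if os.path.exists(VECTORSTORE_PATH):
    # Millisecond load of the existing index
    vectorstore = FAISS.load_local(
        VECTORSTORE_PATH, get_embeddings(), allow_dangerous_deserialization=True
    )
else:
    # One-off (slow) embedding pass
    vectorstore = build_vectorstore(split_documents(load_documents(DOCS_DIR)))
```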
3. Use Source Metadata
Always attach source, page, and title metadata to your document chunks. This lets the UI show citations and helps with debugging hallucinations.
# Add metadata manually when loading
from langchain_core.documents import Document

doc = Document(
    page_content="...",
    metadata={"source": "annual_report_2025.pdf", "page": 12, "section": "Revenue"},
)
4. Temperature for RAG
Keep temperature low (0.0–0.3) for factual Q&A bots. Higher temperatures produce more creative, but less grounded, answers.
5. Memory Window
ConversationBufferWindowMemory(k=6) keeps the last 6 exchanges. Don't go much higher without a summarisation strategy — you'll blow your context window on long conversations.
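If you do need longer conversations, one option is to swap the window memory for a summarising buffer. A sketch (a drop-in for the memory object in build_chain(); the token budget is an example value):

```python
from langchain.memory import ConversationSummaryBufferMemory

from chain import get_llm

memory = ConversationSummaryBufferMemory(
    llm=get_llm(),            # the summariser can be the same model as the chat LLM
    max_token_limit=1000,     # older turns get compressed into a summary past this budget
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)
```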
Common Mistakes
❌ Embedding on Every Request
Never call FAISS.from_documents() inside your API handler. Build the index once and load it. This mistake turns a fast retrieval (milliseconds) into a slow embedding pipeline (minutes).
❌ Ignoring Chunk Overlap
Without overlap, a sentence split across two chunks may be semantically broken in both. Always set chunk_overlap to at least 10–15% of chunk_size.
❌ Trusting the LLM Without Retrieval Grounding
Without a strict prompt that says "use only the provided context", the LLM will happily hallucinate. Your system prompt is your first line of defence.
❌ Using an Outdated Embedding Model
Don't use text-embedding-ada-002 (old, 2022) when text-embedding-3-small is cheaper and better. For Ollama, nomic-embed-text dramatically outperforms generic alternatives.
❌ Storing Raw API Keys in Code
Always use .env + python-dotenv. Never commit API keys to Git. Consider a secrets manager (AWS Secrets Manager, HashiCorp Vault) in production.
🚀 Pro Tips
- Hybrid Search: Combine vector similarity with BM25 keyword search using `langchain_community.retrievers.BM25Retriever` plus `EnsembleRetriever` for dramatically better retrieval on technical or jargon-heavy documents (see the sketch after this list).
- Re-ranking: After retrieving k candidates, run them through a cross-encoder re-ranker (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2` via `sentence-transformers`) before passing them to the LLM. This catches retrieval ranking errors.
- Query Expansion: Have the LLM rewrite the user's question into 2–3 alternative phrasings before retrieval — the union of results covers more semantic ground.
- Streaming Responses: Replace the `await fetch(...)` pattern with a `ReadableStream` reader for token-by-token streaming from the backend. Use FastAPI's `StreamingResponse` plus LangChain's async generator interface.
- Evaluation: Use RAGAS (Retrieval Augmented Generation Assessment) to measure faithfulness, answer relevancy, and context recall automatically — invaluable before pushing to production.
- Metadata Filters: FAISS supports filtering by metadata at retrieval time. Tag chunks by department, date range, or document type and filter user queries accordingly.
- Async Everything: In production, use `chain.ainvoke()` instead of `chain.invoke()` and `async def` throughout your FastAPI handlers to serve concurrent users efficiently.
Deploying to Production
Docker
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY backend/ ./backend/
COPY frontend/ ./frontend/
COPY vectorstore/ ./vectorstore/
# .env is deliberately NOT copied into the image — secrets are injected at runtime via --env-file
CMD ["uvicorn", "main:app", "--app-dir", "backend", "--host", "0.0.0.0", "--port", "8000"]
docker build -t rag-chatbot .
docker run -p 8000:8000 --env-file .env rag-chatbot
Production Checklist
- Replace FAISS with Chroma, Qdrant, or pgvector for multi-user / persistent production deployments
- Add authentication (JWT, OAuth2) to the `/chat` endpoint
- Implement rate limiting with `slowapi` or a reverse proxy (see the sketch after this checklist)
- Log queries and answers to a database for analytics and feedback loops
- Set up health checks and auto-restart via `systemd` or Kubernetes
- Enable HTTPS via an Nginx reverse proxy or Caddy
- Store the FAISS index on persistent volume storage (not the container filesystem)
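For the rate-limiting item, a minimal slowapi sketch applied to the /chat handler in backend/main.py — the "20/minute" budget is an example value:

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # rate-limit per client IP
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat", response_model=ChatResponse)
@limiter.limit("20/minute")                      # example per-client budget
async def chat(request: Request, req: ChatRequest):
    ...  # same handler body as before; slowapi needs the raw Request parameter
```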
📌 Key Takeaways
- RAG beats fine-tuning for most document Q&A use cases — it's faster, cheaper, and easier to update.
- Chunk strategy is critical: wrong chunk sizes lead to poor retrieval quality regardless of LLM power.
- OpenAI and Ollama are interchangeable with LangChain — use cloud for quality, local for privacy.
- MMR retrieval reduces redundancy in retrieved chunks and improves answer quality.
- Low temperature + grounded prompt is the combination that minimizes hallucination in RAG systems.
- Always show sources — it builds user trust and makes debugging far easier.
- The vector store is built once and served many times — protect this investment with persistence.
Conclusion
You've now built a complete, production-ready RAG chatbot from scratch. The pipeline — load documents → split → embed → store → retrieve → generate — is the foundation of virtually every enterprise knowledge-base AI deployed today.
What you built is modular by design: swap FAISS for Qdrant, swap FastAPI for Django, or swap the frontend for a Next.js app — the LangChain chain in the middle stays the same. That's the real power of this architecture.
From here, natural next steps include:
- Adding user authentication and multi-tenancy (separate vector stores per org)
- Experimenting with agentic tools (web search, calculator, code execution) via LangChain Agents
- Building an evaluation harness with RAGAS to continuously measure answer quality
- Integrating a feedback loop where users can thumbs-up/thumbs-down answers to improve retrieval over time
The gap between a prototype and production chatbot is smaller than ever. Now go build something useful.