
Ollama in Practice: Running Local LLMs on Your Dev Machine

A complete developer guide to installing Ollama, pulling and managing open-source LLMs, running in-terminal chat sessions, and integrating local models with Node.js and Python backends — all without sending a single token to the cloud.

May 10, 2026 · 18 min read · Niraj Kumar

TL;DR: Ollama lets you download, run, and query open-source large language models entirely on your local machine — no API keys, no cloud costs, no data leaving your network. This guide walks you through everything from installation to production-ready backend integrations in Node.js and Python.


Introduction

The landscape of AI development has shifted dramatically. In 2026, you no longer need a cloud subscription, a corporate credit card, or a network connection to harness the power of large language models. Tools like Ollama have made it trivially easy to download state-of-the-art open-source models and run them directly on your development machine — whether that's a MacBook Pro with Apple Silicon, a Linux workstation with an NVIDIA GPU, or even a modest laptop running CPU inference.

Why does this matter? Because local LLMs unlock a new category of use cases:

  • Privacy-first AI features: Process sensitive documents, source code, or personal data without it ever leaving your machine.
  • Zero-latency prototyping: No API rate limits, no cold starts, no waiting on network round-trips.
  • Cost-free experimentation: Run thousands of inference calls while you iterate — no bill at the end of the month.
  • Offline capability: Build AI-powered apps that work on a plane, in a bunker, or anywhere without reliable internet.

This guide is structured for developers who are comfortable with the terminal and have some experience building backend services. If you've never touched an LLM API before, that's fine — we'll cover the concepts you need as we go. By the end, you'll have Ollama running locally, understand how to manage models, and have working code integrations in both Node.js and Python.

Let's get into it.


What Is Ollama?

Ollama is an open-source tool that packages the llama.cpp inference engine, a model management system, and a REST API server into one clean, cross-platform binary. Think of it as Docker, but for language models.

Here's what it handles for you under the hood:

  • Model downloads and versioning: Pull models from the Ollama model registry (similar to Docker Hub) with a single command.
  • Hardware acceleration: Automatically detects and uses Apple Metal (on macOS), CUDA (NVIDIA GPUs), or ROCm (AMD GPUs), falling back gracefully to CPU.
  • Quantization management: Models come pre-quantized in formats like Q4_K_M or Q8_0, dramatically reducing memory requirements without crippling quality.
  • OpenAI-compatible API: Ollama's REST API mirrors the OpenAI chat completions format, making it a drop-in replacement in many projects.
  • Model file system: You can create custom model variants using Modelfile — Ollama's equivalent of a Dockerfile — to bake in system prompts, temperature settings, and more.

How It Compares to Alternatives

| Tool | Ease of Use | GPU Support | OpenAI-Compatible API | Model Library |
|------|-------------|-------------|-----------------------|---------------|
| Ollama | ⭐⭐⭐⭐⭐ | ✅ Full | ✅ Yes | Large |
| LM Studio | ⭐⭐⭐⭐ | ✅ Full | ✅ Yes | Large |
| llama.cpp (raw) | ⭐⭐ | ✅ Full | ❌ No | Manual |
| Jan.ai | ⭐⭐⭐⭐ | ✅ Full | ✅ Yes | Medium |

Ollama wins on developer ergonomics. Its CLI, REST API, and growing ecosystem of SDKs make it the go-to choice for integrating local LLMs into real applications.


System Requirements

Before installing, check that your machine meets the minimum requirements:

Memory (RAM)

  • 7B parameter models: at least 8 GB RAM (16 GB recommended)
  • 13B parameter models: at least 16 GB RAM
  • 70B parameter models: 64 GB RAM or a GPU with enough VRAM

GPU (Optional but Highly Recommended)

  • Apple Silicon (M1/M2/M3/M4): Excellent unified memory means even M1 MacBooks handle 7B-13B models comfortably.
  • NVIDIA GPU: CUDA-accelerated inference is significantly faster than CPU. A 3090 or 4090 handles 70B models in 4-bit quantization.
  • AMD GPU: ROCm support has matured considerably; most modern Radeon cards work well.

Storage

  • Models range from ~4 GB (7B, 4-bit) to ~40 GB (70B, 4-bit). Set aside 20-50 GB of free disk space for a comfortable model collection.

Operating Systems

  • macOS 12 Monterey or later
  • Linux (Ubuntu 22.04+, Fedora 38+, and others)
  • Windows 10/11 (via native installer or WSL2)

Installing Ollama

macOS and Linux

The quickest installation method is the official one-liner:

curl -fsSL https://ollama.com/install.sh | sh

This script detects your OS and architecture, downloads the appropriate binary, installs it to /usr/local/bin, and sets up a background service. On macOS, it also installs a menu bar app.

Verify the installation:

ollama --version
# ollama version 0.6.x

Windows

Download the .exe installer from ollama.com and run it. The installer configures Ollama as a Windows service and adds ollama to your system PATH.

Running as a Service

On Linux, Ollama runs as a systemd service automatically after installation:

# Check service status
sudo systemctl status ollama

# Start / Stop / Restart
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama

# View logs
journalctl -u ollama -f

On macOS, the background process starts automatically. You can also start it manually:

ollama serve

This launches the Ollama API server at http://localhost:11434.


Pulling and Managing Models

The Ollama Model Library

Ollama hosts a growing library of curated open-source models at ollama.com/library. Popular choices as of mid-2026 include:

| Model | Parameters | Best For |
|-------|------------|----------|
| llama3.3 | 70B | General purpose, reasoning |
| llama3.2 | 3B / 1B | Edge devices, fast inference |
| mistral | 7B | Instruction following, coding |
| qwen2.5-coder | 7B / 32B | Code generation, debugging |
| deepseek-r1 | 7B–70B | Chain-of-thought reasoning |
| phi4 | 14B | Efficient, high-quality reasoning |
| nomic-embed-text | – | Text embeddings / RAG |
| llava | 7B / 13B | Vision + language (multimodal) |

Pulling a Model

# Pull a model (downloads to ~/.ollama/models)
ollama pull llama3.2

# Pull a specific variant/quantization
ollama pull llama3.3:70b-instruct-q4_K_M

# Pull a code-focused model
ollama pull qwen2.5-coder:7b

Model tags follow the format name:size-variant-quantization. When you omit the tag, Ollama pulls the recommended default for that model.

Listing and Removing Models

# List downloaded models
ollama list

# NAME                         ID              SIZE    MODIFIED
# llama3.2:latest              a80c4f17acd5    2.0 GB  2 days ago
# qwen2.5-coder:7b             2b0496514337    4.7 GB  5 hours ago
# nomic-embed-text:latest      0a109f422b47    274 MB  1 week ago

# Remove a model
ollama rm llama3.2

# Show model details and Modelfile
ollama show llama3.2

Checking Running Models

# See what's currently loaded in memory
ollama ps

# NAME              ID              SIZE      PROCESSOR    UNTIL
# llama3.2:latest   a80c4f17acd5    3.5 GB    100% GPU     4 minutes from now

Ollama keeps models loaded in memory for a short period after their last use, improving response times for follow-up calls.
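
You can also control this window per request. Here's a minimal sketch, assuming your SDK version forwards the API's keep_alive option:

# keep_warm.py — pin llama3.2 in memory for 30 minutes after this call
import ollama

ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "ping"}],
    keep_alive="30m",  # -1 would keep the model loaded indefinitely
)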


Running In-Terminal Chat

Once a model is pulled, you can start an interactive chat session directly in your terminal:

ollama run llama3.2

You'll see a prompt:

>>> Send a message (/? for help)

Type your message and hit Enter. Ollama streams the response token-by-token, giving you that familiar streaming feel. Exit with /bye or Ctrl+D.

Useful In-Terminal Commands

# Multi-line input (use """ to open/close)
>>> """
... Write a Python function that
... implements binary search
... """

# Switch to a different model mid-session
>>> /load qwen2.5-coder:7b

# Set the system prompt mid-session
>>> /set system "You are a senior Go engineer. Be concise."

# Show current model info
>>> /show info

# Clear conversation history
>>> /clear

Single-Shot Queries from the Shell

For scripting and automation, you can pipe input directly:

# Ask a question non-interactively
echo "Explain the CAP theorem in two sentences." | ollama run llama3.2

# Pass a file as context
cat my_code.py | ollama run qwen2.5-coder:7b "Review this code for bugs."

# Capture the output
RESULT=$(ollama run llama3.2 "Generate a UUID and nothing else.")
echo "Generated: $RESULT"

This makes Ollama scriptable in bash pipelines, CI steps, and shell utilities — a powerful pattern for automation.


Understanding the Ollama REST API

Ollama exposes a REST API at http://localhost:11434. While you can use it directly with curl or any HTTP client, it's the foundation for all SDK integrations.

Key Endpoints

POST /api/generate        — Raw completion (non-chat)
POST /api/chat            — Chat completions (OpenAI-style)
POST /api/embeddings      — Generate text embeddings
GET  /api/tags            — List available models
GET  /api/ps              — List running models
POST /api/pull            — Pull a model programmatically
DELETE /api/delete        — Delete a model
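
The SDKs wrap these endpoints as well. Here's a sketch of a programmatic model pull with streamed progress, assuming your SDK version supports stream=True on pull():

# pull_model.py — drive POST /api/pull from Python
import ollama

for progress in ollama.pull("llama3.2", stream=True):
    # Each event carries a status string such as "pulling manifest" or "success"
    print(progress["status"])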

Quick Test with curl

# Test the generate endpoint
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

# Test chat completions (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      { "role": "user", "content": "What is Rust?" }
    ]
  }'

The /v1/ prefix indicates the OpenAI-compatible endpoint — any library built for the OpenAI API can be pointed at this URL instead.


Integrating Ollama with Node.js

Setup

Install the official Ollama JavaScript SDK:

npm install ollama
# or
pnpm add ollama

If you prefer using the OpenAI SDK (useful for drop-in replacement scenarios):

npm install openai

Basic Chat Completion

// chat.mjs
import ollama from "ollama";

const response = await ollama.chat({
  model: "llama3.2",
  messages: [
    {
      role: "user",
      content: "Explain async/await in JavaScript in plain English.",
    },
  ],
});

console.log(response.message.content);

Run it:

node chat.mjs

Streaming Responses

For real-time output (critical for UIs and long responses):

// stream-chat.mjs
import ollama from "ollama";

const stream = await ollama.chat({
  model: "llama3.2",
  messages: [{ role: "user", content: "Write a short poem about compilers." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
console.log(); // newline after stream ends

Multi-Turn Conversation with History

// conversation.mjs
import ollama from "ollama";
import * as readline from "readline/promises";

const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
const messages = [];

console.log("Chat with Ollama (type 'exit' to quit)\n");

while (true) {
  const userInput = await rl.question("You: ");
  if (userInput.toLowerCase() === "exit") break;

  messages.push({ role: "user", content: userInput });

  const stream = await ollama.chat({
    model: "llama3.2",
    messages,
    stream: true,
  });

  process.stdout.write("Assistant: ");
  let fullResponse = "";

  for await (const chunk of stream) {
    const text = chunk.message.content;
    process.stdout.write(text);
    fullResponse += text;
  }

  console.log("\n");
  messages.push({ role: "assistant", content: fullResponse });
}

rl.close();

Express.js Streaming API Endpoint

Here's a production-style Express endpoint that streams Ollama responses to a browser client:

// server.mjs
import express from "express";
import ollama from "ollama";

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  const { message, history = [] } = req.body;

  if (!message) {
    return res.status(400).json({ error: "message is required" });
  }

  const messages = [
    { role: "system", content: "You are a helpful coding assistant." },
    ...history,
    { role: "user", content: message },
  ];

  // Set headers for Server-Sent Events
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  try {
    const stream = await ollama.chat({
      model: "qwen2.5-coder:7b",
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.message.content;
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }

    res.write("data: [DONE]\n\n");
    res.end();
  } catch (err) {
    console.error("Ollama error:", err);
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

app.listen(3000, () => console.log("Server running at http://localhost:3000"));

Using Embeddings in Node.js

// embeddings.mjs
import ollama from "ollama";

async function getEmbedding(text) {
  const response = await ollama.embeddings({
    model: "nomic-embed-text",
    prompt: text,
  });
  return response.embedding;
}

function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

const queryEmbedding = await getEmbedding("How do I handle errors in async code?");
const docEmbedding = await getEmbedding("Error handling with try/catch in async/await");

const similarity = cosineSimilarity(queryEmbedding, docEmbedding);
console.log(`Semantic similarity: ${similarity.toFixed(4)}`);
// e.g., Semantic similarity: 0.9241

Integrating Ollama with Python

Setup

Install the official Python SDK:

pip install ollama
# or with uv (recommended in 2026)
uv add ollama

For OpenAI-compatible usage:

pip install openai

Basic Chat Completion

# chat.py
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "What is the difference between a process and a thread?"}
    ]
)

print(response["message"]["content"])

Streaming with Python

# stream_chat.py
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain MapReduce step by step."}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

print()  # final newline

Async Chat with FastAPI

# main.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama
import json

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"
    history: list[dict] = []

@app.post("/chat")
async def chat(request: ChatRequest):
    messages = [
        {"role": "system", "content": "You are a helpful technical assistant."},
        *request.history,
        {"role": "user", "content": request.message},
    ]

    def generate():
        try:
            stream = ollama.chat(
                model=request.model,
                messages=messages,
                stream=True,
            )
            for chunk in stream:
                content = chunk["message"]["content"]
                yield f"data: {json.dumps({'content': content})}\n\n"
            yield "data: [DONE]\n\n"
        except Exception as e:
            yield f"data: {json.dumps({'error': str(e)})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

@app.get("/models")
async def list_models():
    models = ollama.list()
    return {"models": [m["name"] for m in models["models"]]}

Run it:

uvicorn main:app --reload
Embeddings and Semantic Search in Python

Ollama's embedding models also power fully local semantic search. Here's a compact example:

# semantic_search.py
import ollama
import numpy as np
from dataclasses import dataclass

@dataclass
class Document:
    id: str
    content: str
    embedding: list[float] | None = None

def embed(text: str) -> list[float]:
    result = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return result["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sample knowledge base
documents = [
    Document(id="1", content="Python decorators wrap functions to add behavior."),
    Document(id="2", content="Docker containers isolate application environments."),
    Document(id="3", content="Async/await allows non-blocking I/O in Python."),
    Document(id="4", content="Git branches enable parallel feature development."),
]

# Pre-compute embeddings
for doc in documents:
    doc.embedding = embed(doc.content)

def search(query: str, top_k: int = 2) -> list[tuple[Document, float]]:
    query_embedding = embed(query)
    scored = [
        (doc, cosine_similarity(query_embedding, doc.embedding))
        for doc in documents
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Run a search
results = search("How does Python handle concurrency?")
for doc, score in results:
    print(f"Score: {score:.4f} | {doc.content}")

Creating Custom Models with Modelfile

One of Ollama's most powerful features is Modelfile, which lets you create derivative models with baked-in behavior:

# Modelfile
FROM llama3.2

# Set a persistent system prompt
SYSTEM """
You are CodeReviewer, an expert software engineer specializing in Python and JavaScript.
When reviewing code:
- Identify bugs, anti-patterns, and security issues
- Suggest specific improvements with code examples
- Be concise but thorough
- Format your output with clear sections: Issues, Suggestions, and Verdict
"""

# Tune generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

Build and run your custom model:

# Build the model
ollama create code-reviewer -f ./Modelfile

# Run it
ollama run code-reviewer

# Or use it in the API
ollama run code-reviewer "Review this: def divide(a,b): return a/b"
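
The custom name works anywhere a model name is accepted, including the SDKs. Here's a quick sketch with the Python client:

# use_custom_model.py
import ollama

response = ollama.chat(
    model="code-reviewer",
    messages=[{"role": "user", "content": "Review this: def divide(a, b): return a / b"}],
)
print(response["message"]["content"])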

Using OpenAI SDK as Drop-in Replacement

Because Ollama exposes an OpenAI-compatible /v1/ endpoint, you can use the OpenAI SDK with zero API key:

# openai_compat.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK but not validated
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the TCP handshake."}],
)

print(response.choices[0].message.content)

This is invaluable for migrating existing codebases to local inference with minimal changes.
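
Streaming works through the compatibility layer as well. Here's a short sketch using the same client:

# openai_stream.py — streaming through Ollama's /v1 endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain DNS resolution briefly."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk's delta carries no content
        print(delta, end="", flush=True)
print()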


🚀 Pro Tips

1. Preload models to eliminate cold start latency

Ollama unloads models after inactivity. For latency-sensitive workflows, keep a model warm by sending periodic keep-alive requests, or set OLLAMA_KEEP_ALIVE=-1 to keep it loaded indefinitely:

OLLAMA_KEEP_ALIVE=-1 ollama serve

2. Use smaller quantizations for speed, larger for quality

  • q4_K_M — Best speed-quality trade-off for most use cases
  • q8_0 — Near full-precision quality, needs more VRAM
  • f16 — Full precision, only use if you have the hardware

3. Run multiple models simultaneously

Ollama supports parallel model loading. Set OLLAMA_MAX_LOADED_MODELS to allow more than one model in memory:

OLLAMA_MAX_LOADED_MODELS=3 ollama serve

4. Expose Ollama on your local network

By default, Ollama only listens on localhost. To share it across your LAN:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

5. Use num_ctx wisely

Context window size directly impacts VRAM usage. Don't set it larger than your actual workload needs:

# Override context length for the current session
ollama run llama3.2
>>> /set parameter num_ctx 4096

6. Pipe Ollama into your git workflow

# Auto-generate commit messages from your diff
git diff --staged | ollama run llama3.2 \
  "Write a conventional commit message for this diff. Output only the message."

7. Benchmark before choosing a model

# Run a quick benchmark
ollama run llama3.2 --verbose "Count from 1 to 10." 2>&1 | grep "eval rate"
# eval rate:    47.23 tokens/s

Best Practices

Security Considerations

  • Never expose Ollama to the public internet without authentication. The API has no built-in auth — use a reverse proxy (Nginx, Caddy) with basic auth or JWT tokens.
  • Validate model outputs before using them in security-critical paths (SQL queries, shell commands, HTML rendering).
  • Pin model versions in production using explicit tags (llama3.2:3b-instruct-q4_K_M) rather than :latest.

Performance Optimization

  • Co-locate CPU and GPU tasks: Use GPU-accelerated models for generation, but CPU for fast embedding lookups.
  • Batch embedding requests: Instead of making one embedding call per document, batch them where possible (see the sketch after this list).
  • Profile your inference: Use --verbose mode to understand token generation speed and adjust model choice accordingly.
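
Here's a sketch of a batched call, assuming an ollama-python version that exposes embed() with list input (backed by the newer /api/embed endpoint):

# batch_embed.py
import ollama

docs = [
    "Python decorators wrap functions to add behavior.",
    "Docker containers isolate application environments.",
]

# One request embeds every document, avoiding N separate round-trips
result = ollama.embed(model="nomic-embed-text", input=docs)
vectors = result["embeddings"]  # one vector per input, in order
print(len(vectors), len(vectors[0]))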

Production Readiness

  • Health checks: Poll GET /api/tags to confirm Ollama is up before handling requests.
  • Graceful degradation: If local inference is overloaded, consider queuing requests or falling back to a cloud provider.
  • Logging: Structure your logs to include model name, prompt token count, response time, and completion token count for observability.
// Example structured logging wrapper
import ollama from "ollama";

async function trackedChat(model, messages) {
  const start = Date.now();
  const response = await ollama.chat({ model, messages });
  const duration = Date.now() - start;

  console.log(JSON.stringify({
    level: "info",
    event: "llm_inference",
    model,
    prompt_tokens: response.prompt_eval_count,
    completion_tokens: response.eval_count,
    duration_ms: duration,
  }));

  return response;
}

Common Mistakes to Avoid

❌ Not accounting for context window limits: Models have a maximum context length (e.g., 8192 tokens). Sending a prompt plus history that exceeds it silently truncates earlier messages. Always track token usage and trim history strategically (one rough approach is sketched below).
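
One rough approach, sketched below, estimates size by treating roughly four characters as one token (a heuristic, not an exact count):

# trim_history.py
def trim_history(messages: list[dict], max_tokens: int = 6000) -> list[dict]:
    """Drop the oldest non-system turns until the estimated size fits."""
    def estimated_tokens(msgs: list[dict]) -> int:
        return sum(len(m["content"]) // 4 for m in msgs)

    trimmed = list(messages)
    while len(trimmed) > 1 and estimated_tokens(trimmed) > max_tokens:
        del trimmed[1]  # index 0 is assumed to hold the system prompt
    return trimmed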

❌ Treating model output as trusted input: LLM outputs should never be directly interpolated into SQL queries, shell commands, or HTML without sanitization. Apply the same trust level as user input.

❌ Using a 70B model when a 7B will do: Bigger isn't always better. For simple classification, extraction, or templated generation tasks, a well-prompted 7B model is often as accurate and 5-10x faster. Always benchmark both options.

❌ Blocking the event loop in Node.js: Ollama calls are I/O-bound. Always use await with the SDK's async methods and never call them in a synchronous context.

❌ Forgetting to set a system prompt: Without a system prompt, models behave inconsistently. Always define the model's role and constraints explicitly for production workloads.

❌ Hardcoding the model name: Put model names in environment variables or configuration files. This makes it trivial to swap models across environments:

# .env
OLLAMA_MODEL=qwen2.5-coder:7b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_BASE_URL=http://localhost:11434
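
Here's a minimal loader sketch for those variables, using only the standard library:

# config.py
import os

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.2")
OLLAMA_EMBED_MODEL = os.getenv("OLLAMA_EMBED_MODEL", "nomic-embed-text")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")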

📌 Key Takeaways

  • Ollama makes local LLM inference accessible — one install command gets you a fully functional model server with a REST API.
  • The OpenAI-compatible /v1/ endpoint means you can drop Ollama into most existing AI projects with minimal code changes.
  • Model choice matters: 7B models are fast and cheap on consumer hardware; 70B models need significant resources but shine on complex reasoning.
  • Modelfiles let you create reusable, configured model variants — think of them as versioned AI personas for your team.
  • Both the Node.js and Python SDKs support streaming out of the box, enabling responsive UIs and real-time terminal output.
  • Security is your responsibility: Ollama has no built-in auth — treat it like an internal service and protect it accordingly.
  • Local inference excels at privacy-sensitive workloads, high-frequency automation, and development iteration — not as a universal replacement for cloud APIs on every task.

Conclusion

We've covered a lot of ground: from installing Ollama and understanding its architecture, to managing models via CLI, to building streaming chat endpoints in both Node.js and Python. We've also touched on embeddings, semantic search, custom Modelfiles, and the production considerations that separate a hobby project from a reliable service.

What makes Ollama genuinely exciting isn't just the technology — it's what it enables. A world where every developer can run a capable language model on their laptop changes how we prototype, how we think about data privacy, and how we build AI-native features without vendor lock-in.

The models keep getting better, the hardware keeps getting more capable, and the tooling — as Ollama demonstrates — keeps getting more ergonomic. Local AI isn't a compromise. For many use cases, it's the right choice.

Now go pull a model and build something.



Written by

Niraj Kumar

Software Developer — building scalable systems for businesses.