# Build a Local RAG Pipeline with n8n and Ollama: Query Your Company Documents with AI
Every company has a knowledge problem. Policies live in PDFs nobody reads. Process documentation sits in shared drives. New employees ask the same questions that were answered in a Confluence page three years ago. RAG --- Retrieval-Augmented Generation --- solves this by letting an AI model answer questions using your actual documents as source material, not its training data.
This tutorial shows you how to build a complete RAG pipeline that runs entirely on local hardware. No cloud API keys. No per-query costs. No data leaving your network.
## What RAG Is in Plain Terms
A large language model knows what it learned during training. It does not know your vacation policy, your deployment checklist, or your client onboarding process. RAG fixes this by adding a retrieval step before generation:
- Your question arrives
- The system searches your documents for relevant passages
- Those passages are injected into the prompt as context
- The LLM generates an answer grounded in your actual documentation
The result: accurate, source-backed answers instead of hallucinated generalities.
## Architecture Overview
Here is the full pipeline from document ingestion to answer generation:
```mermaid
graph LR
    A[Documents<br/>PDF, DOCX, TXT] --> B[Chunker<br/>Split into passages]
    B --> C[Embedding Model<br/>nomic-embed-text]
    C --> D[Vector Database<br/>ChromaDB]
    E[User Question] --> F[Embed Question]
    F --> G[Vector Search<br/>Top 5 matches]
    G --> H[Build Prompt<br/>Context + Question]
    D --> G
    H --> I[Ollama LLM<br/>Llama 3.1 8B]
    I --> J[Answer with Sources]
    style A fill:#0B1628,color:#FAFAFA
    style J fill:#F5A623,color:#0B1628
```
## What You Need

| Component | Purpose | Install |
|---|---|---|
| n8n | Workflow orchestration | `docker run -d -p 5678:5678 n8nio/n8n` |
| Ollama | Local LLM + embedding inference | `brew install ollama` |
| ChromaDB | Vector database | `pip install chromadb` |
| Llama 3.1 8B | Answer generation model | `ollama pull llama3.1:8b` |
| nomic-embed-text | Embedding model | `ollama pull nomic-embed-text` |
Hardware requirement: a Mac Mini M4 with 16GB+ RAM handles all of this comfortably. For the full hardware guide, see our edge AI hardware recommendations.
## Step 1: Ingest and Chunk Documents
The ingestion workflow in n8n watches a folder for new documents, splits them into chunks, embeds each chunk, and stores the vectors in ChromaDB.
Create an n8n workflow with a Schedule Trigger that runs every 15 minutes:
```json
{
  "nodes": [
    {
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 15 }]
        }
      }
    },
    {
      "name": "Read Files",
      "type": "n8n-nodes-base.readWriteFile",
      "parameters": {
        "operation": "list",
        "folderPath": "/data/company-docs/"
      }
    }
  ]
}
```
For each document, split the text into overlapping chunks of 500 tokens with 50-token overlap. The overlap ensures that a sentence straddling a chunk boundary appears intact in at least one chunk, so no information is lost at the edges.
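The chunking step can be sketched as a small helper. This is a hypothetical function, not an n8n built-in: it splits on whitespace as a rough proxy for tokens, whereas a production pipeline would count tokens with the embedding model's actual tokenizer.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Whitespace words stand in for tokens here; swap in a real tokenizer
    for exact counts.
    """
    words = text.split()
    step = chunk_size - overlap  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        # Stop once the current chunk reaches the end of the document
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=500` and `overlap=50`, consecutive chunks share their last and first 50 words, matching the 10% overlap rule used later in the tuning section.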
### Embed Each Chunk via Ollama
Use an HTTP Request node to call the Ollama embeddings endpoint:
```json
{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "{{ $json.chunk_text }}"
  }
}
```
The response's `embeddings` field contains one 768-dimensional vector per input. Store each vector in ChromaDB along with the original text and metadata (filename, page number, chunk index).
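Packaging chunks for storage can be sketched as follows. The helper name and the exact metadata fields are assumptions for illustration; it builds the parallel lists that ChromaDB's `collection.add()` expects (a page number could be added to each metadata dict when the loader provides one).

```python
def build_add_payload(filename, chunks, embeddings):
    """Package embedded chunks into the parallel lists ChromaDB expects.

    Each chunk gets a deterministic id ("<filename>:<index>") so re-ingesting
    the same document overwrites rather than duplicates its entries.
    """
    return {
        "ids": [f"{filename}:{i}" for i in range(len(chunks))],
        "documents": chunks,
        "embeddings": embeddings,
        "metadatas": [
            {"filename": filename, "chunk_index": i}
            for i in range(len(chunks))
        ],
    }

# Usage against a real collection (not executed here):
# collection.add(**build_add_payload("HR-Policy-2026.pdf", chunks, vectors))
```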
## Step 2: Query Pipeline
When a user asks a question, the query pipeline embeds the question, searches ChromaDB for the top 5 most similar chunks, and passes them as context to Ollama for answer generation.
First, embed the question with the same model used during ingestion:

```json
{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "What is our vacation policy?"
  }
}
```
### Vector Search
Query ChromaDB with the question embedding to retrieve the most relevant document chunks:
```python
import chromadb

# Connect to the persistent store written by the ingestion workflow
client = chromadb.PersistentClient(path="/data/chromadb")
collection = client.get_collection("company_docs")

# Retrieve the 5 chunks closest to the question embedding
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5,
    include=["documents", "metadatas", "distances"],
)
```
### Generate Answer with Ollama
Build the prompt with retrieved context and send it to Ollama:
```json
{
  "url": "http://localhost:11434/api/chat",
  "method": "POST",
  "body": {
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "system",
        "content": "Answer using ONLY the provided context. If the context does not contain the answer, say so. Cite the source document."
      },
      {
        "role": "user",
        "content": "Context:\n{{ $json.retrieved_chunks }}\n\nQuestion: What is our vacation policy?"
      }
    ],
    "stream": false
  }
}
```
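With `"stream": false`, Ollama returns a single JSON object: the assistant reply sits under `message`, and nanosecond timing counters allow a rough tokens-per-second estimate. A small parsing sketch (the helper name is ours):

```python
def parse_chat_response(resp):
    """Extract the answer and a tokens/sec estimate from /api/chat output.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    answer = resp["message"]["content"]
    tok_per_s = None
    if resp.get("eval_count") and resp.get("eval_duration"):
        tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return answer, tok_per_s
```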
## Step 3: Practical Example
An employee asks: “What is our vacation policy?”
The pipeline:
- Embeds the question using `nomic-embed-text` (2ms)
- Searches ChromaDB and finds 5 relevant chunks from `HR-Policy-2026.pdf` (8ms)
- Builds a prompt with those chunks as context
- Ollama generates: “According to the HR Policy document (Section 4.2), employees receive 23 working days of paid vacation per year. Requests must be submitted 15 days in advance through the HR portal. Unused days can be carried over to Q1 of the following year.”
Total response time on a Mac Mini M4: under 3 seconds. Cost per query: zero.
## Comparison: Local RAG vs Cloud RAG
| Factor | Local (Ollama + ChromaDB) | Cloud (OpenAI + Pinecone) |
|---|---|---|
| Cost per query | EUR 0.00 | EUR 0.01-0.05 |
| Monthly cost (1000 queries/day) | EUR 19 electricity | EUR 300-1,500 |
| Latency | 1-3 seconds | 2-5 seconds |
| Data privacy | Full --- never leaves network | Requires DPA + trust |
| GDPR compliance | Built-in | Requires processor agreement |
| Setup complexity | Medium | Low |
| Model quality (general) | Good (8B models) | Excellent (GPT-4o) |
| Model quality (domain) | Excellent after fine-tuning | Good with prompt engineering |
For a deeper cost analysis, see our cloud vs local AI cost breakdown.
## Performance Tuning
Three settings that make the biggest difference:
- Chunk size: 500 tokens works for most documents. Use 300 for dense technical manuals, 800 for conversational content.
- Overlap: 10% of chunk size prevents information loss at boundaries.
- Top-K retrieval: Start with 5. Increase to 8-10 for complex questions that span multiple document sections.
For model selection guidance, our best local LLM models comparison covers the tradeoffs between Llama, Mistral, and Qwen for different use cases.
## Automation with n8n
The real power comes from connecting this pipeline to your existing tools. n8n can trigger the RAG query from:
- A Slack message in a `#ask-hr` channel
- A form submission on your intranet
- An email to a designated address
- A scheduled digest that answers the top 10 unanswered questions from the week
For more n8n automation patterns, see our n8n AI automation tutorial.
## Related reading
- Automate Code Reviews with AI: n8n + Ollama Workflow Tutorial
- Google Gemma 3: The First Multimodal Open Model That Fits on a Mac Mini
- Google Gemma 4: The Open Model Family That Changed Our Entire Stack
## Sources
- n8n Documentation: AI Workflows --- Official guide for building AI-powered workflows in n8n
- Ollama API Reference --- Complete API documentation for embeddings and chat endpoints
- LangChain RAG Tutorial --- Reference architecture for RAG pipeline design patterns
A RAG pipeline turns your static company documents into an interactive knowledge base that any employee can query in natural language. With n8n orchestrating the workflow and Ollama handling inference locally, the entire system runs on a single Mac Mini with no recurring costs and no data leaving your network. If you want help deploying a RAG system for your organization, reach out to us. We build these pipelines for Spanish SMEs every week.