
Build a Local RAG Pipeline with n8n and Ollama: Query Your Company Documents with AI

Jacobo Gonzalez Jaspe

Every company has a knowledge problem. Policies live in PDFs nobody reads. Process documentation sits in shared drives. New employees ask the same questions that were answered in a Confluence page three years ago. RAG --- Retrieval-Augmented Generation --- solves this by letting an AI model answer questions using your actual documents as source material, not its training data.

This tutorial shows you how to build a complete RAG pipeline that runs entirely on local hardware. No cloud API keys. No per-query costs. No data leaving your network.

What RAG Is in Plain Terms

A large language model knows what it learned during training. It does not know your vacation policy, your deployment checklist, or your client onboarding process. RAG fixes this by adding a retrieval step before generation:

  1. Your question arrives
  2. The system searches your documents for relevant passages
  3. Those passages are injected into the prompt as context
  4. The LLM generates an answer grounded in your actual documentation

The result: accurate, source-backed answers instead of hallucinated generalities.
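The retrieval step can be illustrated with a toy example. The sketch below uses hand-made three-dimensional vectors and a tiny two-passage corpus (in the real pipeline, 768-dimensional vectors come from nomic-embed-text) to show how cosine similarity picks the passage that becomes the prompt context:

```python
import math

# Toy corpus: each passage paired with a hand-made embedding vector.
# In the real pipeline these vectors come from nomic-embed-text.
corpus = [
    ("Employees receive 23 days of paid vacation per year.", [0.9, 0.1, 0.0]),
    ("Deployments require a green CI run and one approval.", [0.1, 0.9, 0.0]),
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(question_vec, k=1):
    """Return the k passages most similar to the question vector."""
    ranked = sorted(corpus, key=lambda item: cosine(question_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A question about vacation embeds close to the first passage,
# so that passage is injected into the prompt as context.
context = retrieve([0.8, 0.2, 0.0])
prompt = f"Context:\n{context[0]}\n\nQuestion: What is our vacation policy?"
```

Steps 2 and 3 of the list above are exactly this: rank passages by similarity to the question vector, then paste the winners into the prompt.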

Architecture Overview

Here is the full pipeline from document ingestion to answer generation:

graph LR
    A[Documents<br/>PDF, DOCX, TXT] --> B[Chunker<br/>Split into passages]
    B --> C[Embedding Model<br/>nomic-embed-text]
    C --> D[Vector Database<br/>ChromaDB]

    E[User Question] --> F[Embed Question]
    F --> G[Vector Search<br/>Top 5 matches]
    G --> H[Build Prompt<br/>Context + Question]
    D --> G
    H --> I[Ollama LLM<br/>Llama 3.1 8B]
    I --> J[Answer with Sources]

    style A fill:#0B1628,color:#FAFAFA
    style J fill:#F5A623,color:#0B1628

What You Need

| Component | Purpose | Install |
| --- | --- | --- |
| n8n | Workflow orchestration | docker run -d -p 5678:5678 n8nio/n8n |
| Ollama | Local LLM + embedding inference | brew install ollama |
| ChromaDB | Vector database | pip install chromadb |
| Llama 3.1 8B | Answer generation model | ollama pull llama3.1:8b |
| nomic-embed-text | Embedding model | ollama pull nomic-embed-text |

Hardware requirement: a Mac Mini M4 with 16GB+ RAM handles all of this comfortably. For the full hardware guide, see our edge AI hardware recommendations.

Step 1: Ingest and Chunk Documents

The ingestion workflow in n8n watches a folder for new documents, splits them into chunks, embeds each chunk, and stores the vectors in ChromaDB.

Create an n8n workflow with a Schedule Trigger that runs every 15 minutes:

{
  "nodes": [
    {
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 15 }]
        }
      }
    },
    {
      "name": "Read Files",
      "type": "n8n-nodes-base.readWriteFile",
      "parameters": {
        "operation": "list",
        "folderPath": "/data/company-docs/"
      }
    }
  ]
}

For each document, split the text into overlapping chunks of roughly 500 tokens with a 50-token overlap. The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk.
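A minimal chunker along these lines might look as follows. It splits on words rather than tokens for simplicity (a production workflow would use a tokenizer), but the 500/50 defaults match the figures above:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks.

    Word-based approximation of token chunking: each chunk holds up to
    chunk_size words, and consecutive chunks share `overlap` words so
    content at a boundary appears whole in at least one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

In n8n this logic would live in a Code node between the file-reading step and the embedding step.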

Embed Each Chunk via Ollama

Use an HTTP Request node to call the Ollama embeddings endpoint:

{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "{{ $json.chunk_text }}"
  }
}

The response carries the vectors in an embeddings field: one 768-dimensional vector per input. Store each vector in ChromaDB along with the original chunk text and its metadata (filename, page number, chunk index).
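A sketch of parsing that response and shaping the record for ChromaDB, with a canned response standing in for the live call and a hypothetical id scheme:

```python
# Shape of a successful /api/embed response (vector truncated for illustration;
# real nomic-embed-text vectors have 768 dimensions).
fake_response = {
    "model": "nomic-embed-text",
    "embeddings": [[0.013, -0.087, 0.142]],
}

def extract_embedding(response):
    """Pull the first embedding vector out of an /api/embed response."""
    return response["embeddings"][0]

vector = extract_embedding(fake_response)

# Record stored in ChromaDB alongside the chunk text and its metadata.
# The "filename:chunk-N" id scheme is an illustrative choice, not prescribed.
record = {
    "id": "HR-Policy-2026.pdf:chunk-0",
    "embedding": vector,
    "document": "Employees receive 23 working days of paid vacation per year.",
    "metadata": {"filename": "HR-Policy-2026.pdf", "page": 12, "chunk_index": 0},
}

# With a live ChromaDB collection, the insert would be:
# collection.add(ids=[record["id"]], embeddings=[record["embedding"]],
#                documents=[record["document"]], metadatas=[record["metadata"]])
```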

Step 2: Query Pipeline

When a user asks a question, the query pipeline embeds the question, searches ChromaDB for the top 5 most similar chunks, and passes them as context to Ollama for answer generation.

{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "What is our vacation policy?"
  }
}

Query ChromaDB with the question embedding to retrieve the most relevant document chunks:

import chromadb

# Connect to the persistent store created by the ingestion workflow
client = chromadb.PersistentClient(path="/data/chromadb")
collection = client.get_collection("company_docs")

# question_embedding is the vector returned by the /api/embed call above
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5,
    include=["documents", "metadatas", "distances"]
)

Generate Answer with Ollama

Build the prompt with retrieved context and send it to Ollama:

{
  "url": "http://localhost:11434/api/chat",
  "method": "POST",
  "body": {
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "system",
        "content": "Answer using ONLY the provided context. If the context does not contain the answer, say so. Cite the source document."
      },
      {
        "role": "user",
        "content": "Context:\n{{ $json.retrieved_chunks }}\n\nQuestion: What is our vacation policy?"
      }
    ],
    "stream": false
  }
}
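The same request can be issued from any Python script, not only from an n8n HTTP Request node. This sketch builds the payload shown above; the actual POST is commented out because it assumes an Ollama server listening on localhost:11434:

```python
import json
import urllib.request

def make_chat_payload(context, question):
    """Build the /api/chat request body used in the HTTP Request node above."""
    return {
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system",
             "content": "Answer using ONLY the provided context. If the context "
                        "does not contain the answer, say so. Cite the source document."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        "stream": False,
    }

payload = make_chat_payload("Section 4.2: 23 working days of paid vacation.",
                            "What is our vacation policy?")

# Uncomment when Ollama is running locally:
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# answer = json.loads(urllib.request.urlopen(req).read())["message"]["content"]
```

With stream set to False, the endpoint returns one JSON object whose message.content field holds the full answer.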

Step 3: Practical Example

An employee asks: “What is our vacation policy?”

The pipeline:

  1. Embeds the question using nomic-embed-text (2ms)
  2. Searches ChromaDB and finds 5 relevant chunks from HR-Policy-2026.pdf (8ms)
  3. Builds a prompt with those chunks as context
  4. Ollama generates: “According to the HR Policy document (Section 4.2), employees receive 23 working days of paid vacation per year. Requests must be submitted 15 days in advance through the HR portal. Unused days can be carried over to Q1 of the following year.”

Total response time on a Mac Mini M4: under 3 seconds. Cost per query: zero.

Comparison: Local RAG vs Cloud RAG

| Factor | Local (Ollama + ChromaDB) | Cloud (OpenAI + Pinecone) |
| --- | --- | --- |
| Cost per query | EUR 0.00 | EUR 0.01-0.05 |
| Monthly cost (1,000 queries/day) | EUR 19 (electricity) | EUR 300-1,500 |
| Latency | 1-3 seconds | 2-5 seconds |
| Data privacy | Full (never leaves network) | Requires DPA + trust |
| GDPR compliance | Built-in | Requires processor agreement |
| Setup complexity | Medium | Low |
| Model quality (general) | Good (8B models) | Excellent (GPT-4o) |
| Model quality (domain) | Excellent after fine-tuning | Good with prompt engineering |

For a deeper cost analysis, see our cloud vs local AI cost breakdown.

Performance Tuning

Three settings that make the biggest difference:

  1. Chunk size: 500 tokens works for most documents. Use 300 for dense technical manuals, 800 for conversational content.
  2. Overlap: 10% of chunk size prevents information loss at boundaries.
  3. Top-K retrieval: Start with 5. Increase to 8-10 for complex questions that span multiple document sections.

For model selection guidance, our best local LLM models comparison covers the tradeoffs between Llama, Mistral, and Qwen for different use cases.

Automation with n8n

The real power comes from connecting this pipeline to your existing tools. n8n can trigger the RAG query from:

  • A Slack message in a #ask-hr channel
  • A form submission on your intranet
  • An email to a designated address
  • A scheduled digest that answers the top 10 unanswered questions from the week

For more n8n automation patterns, see our n8n AI automation tutorial.

Sources

  1. n8n Documentation: AI Workflows --- Official guide for building AI-powered workflows in n8n
  2. Ollama API Reference --- Complete API documentation for embeddings and chat endpoints
  3. LangChain RAG Tutorial --- Reference architecture for RAG pipeline design patterns

A RAG pipeline turns your static company documents into an interactive knowledge base that any employee can query in natural language. With n8n orchestrating the workflow and Ollama handling inference locally, the entire system runs on a single Mac Mini with no recurring costs and no data leaving your network. If you want help deploying a RAG system for your organization, reach out to us. We build these pipelines for Spanish SMEs every week.
