# Build a Local RAG Pipeline with n8n and Ollama: Query Your Company Documents with AI
Every company has a knowledge problem. Policies live in PDFs nobody reads. Process documentation sits in shared drives. New employees ask the same questions that were answered in a Confluence page three years ago. RAG --- Retrieval-Augmented Generation --- solves this by letting an AI model answer questions using your actual documents as source material, not its training data.
This tutorial shows you how to build a complete RAG pipeline that runs entirely on local hardware. No cloud API keys. No per-query costs. No data leaving your network.
## What RAG Is in Plain Terms
A large language model knows what it learned during training. It does not know your vacation policy, your deployment checklist, or your client onboarding process. RAG fixes this by adding a retrieval step before generation:
- Your question arrives
- The system searches your documents for relevant passages
- Those passages are injected into the prompt as context
- The LLM generates an answer grounded in your actual documentation
The result: accurate, source-backed answers instead of hallucinated generalities.
## Architecture Overview
Here is the full pipeline from document ingestion to answer generation:
```mermaid
graph LR
    A[Documents<br/>PDF, DOCX, TXT] --> B[Chunker<br/>Split into passages]
    B --> C[Embedding Model<br/>nomic-embed-text]
    C --> D[Vector Database<br/>ChromaDB]
    E[User Question] --> F[Embed Question]
    F --> G[Vector Search<br/>Top 5 matches]
    G --> H[Build Prompt<br/>Context + Question]
    D --> G
    H --> I[Ollama LLM<br/>Llama 3.1 8B]
    I --> J[Answer with Sources]
    style A fill:#0B1628,color:#FAFAFA
    style J fill:#F5A623,color:#0B1628
```
## What You Need

| Component | Purpose | Install |
|---|---|---|
| n8n | Workflow orchestration | `docker run -d -p 5678:5678 n8nio/n8n` |
| Ollama | Local LLM + embedding inference | `brew install ollama` |
| ChromaDB | Vector database | `pip install chromadb` |
| Llama 3.1 8B | Answer generation model | `ollama pull llama3.1:8b` |
| nomic-embed-text | Embedding model | `ollama pull nomic-embed-text` |
Hardware requirement: a Mac Mini M4 with 16GB+ RAM handles all of this comfortably. For the full hardware guide, see our edge AI hardware recommendations.
## Step 1: Ingest and Chunk Documents
The ingestion workflow in n8n watches a folder for new documents, splits them into chunks, embeds each chunk, and stores the vectors in ChromaDB.
Create an n8n workflow with a Schedule Trigger that runs every 15 minutes:
```json
{
  "nodes": [
    {
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 15 }]
        }
      }
    },
    {
      "name": "Read Files",
      "type": "n8n-nodes-base.readWriteFile",
      "parameters": {
        "operation": "list",
        "folderPath": "/data/company-docs/"
      }
    }
  ]
}
```
For each document, split the text into overlapping chunks of 500 tokens with 50-token overlap. The overlap ensures that a sentence straddling a chunk boundary appears intact in at least one chunk, so no information is lost at the edges.
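The chunking step can be sketched as a small helper. This is a hypothetical function, not an n8n built-in: it splits on whitespace as a rough proxy for tokens, whereas a production pipeline would count tokens with the embedding model's actual tokenizer.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Whitespace words stand in for tokens here; swap in a real tokenizer
    for exact counts.
    """
    words = text.split()
    step = chunk_size - overlap  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        # Stop once the current chunk reaches the end of the document
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=500` and `overlap=50`, consecutive chunks share their last and first 50 words, matching the 10% overlap rule used later in the tuning section.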
### Embed Each Chunk via Ollama
Use an HTTP Request node to call the Ollama embeddings endpoint:
```json
{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "{{ $json.chunk_text }}"
  }
}
```
The response's `embeddings` field contains one 768-dimensional vector per input. Store each vector in ChromaDB along with the original text and metadata (filename, page number, chunk index).
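Packaging chunks for storage can be sketched as follows. The helper name and the exact metadata fields are assumptions for illustration; it builds the parallel lists that ChromaDB's `collection.add()` expects (a page number could be added to each metadata dict when the loader provides one).

```python
def build_add_payload(filename, chunks, embeddings):
    """Package embedded chunks into the parallel lists ChromaDB expects.

    Each chunk gets a deterministic id ("<filename>:<index>") so re-ingesting
    the same document overwrites rather than duplicates its entries.
    """
    return {
        "ids": [f"{filename}:{i}" for i in range(len(chunks))],
        "documents": chunks,
        "embeddings": embeddings,
        "metadatas": [
            {"filename": filename, "chunk_index": i}
            for i in range(len(chunks))
        ],
    }

# Usage against a real collection (not executed here):
# collection.add(**build_add_payload("HR-Policy-2026.pdf", chunks, vectors))
```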
## Step 2: Query Pipeline
When a user asks a question, the query pipeline embeds the question, searches ChromaDB for the top 5 most similar chunks, and passes them as context to Ollama for answer generation.
First, embed the question with the same model used during ingestion:

```json
{
  "url": "http://localhost:11434/api/embed",
  "method": "POST",
  "body": {
    "model": "nomic-embed-text",
    "input": "What is our vacation policy?"
  }
}
```
### Vector Search
Query ChromaDB with the question embedding to retrieve the most relevant document chunks:
```python
import chromadb

# Connect to the persistent store written by the ingestion workflow
client = chromadb.PersistentClient(path="/data/chromadb")
collection = client.get_collection("company_docs")

# Retrieve the 5 chunks closest to the question embedding
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5,
    include=["documents", "metadatas", "distances"],
)
```
### Generate Answer with Ollama
Build the prompt with retrieved context and send it to Ollama:
```json
{
  "url": "http://localhost:11434/api/chat",
  "method": "POST",
  "body": {
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "system",
        "content": "Answer using ONLY the provided context. If the context does not contain the answer, say so. Cite the source document."
      },
      {
        "role": "user",
        "content": "Context:\n{{ $json.retrieved_chunks }}\n\nQuestion: What is our vacation policy?"
      }
    ],
    "stream": false
  }
}
```
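With `"stream": false`, Ollama returns a single JSON object: the assistant reply sits under `message`, and nanosecond timing counters allow a rough tokens-per-second estimate. A small parsing sketch (the helper name is ours):

```python
def parse_chat_response(resp):
    """Extract the answer and a tokens/sec estimate from /api/chat output.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    answer = resp["message"]["content"]
    tok_per_s = None
    if resp.get("eval_count") and resp.get("eval_duration"):
        tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return answer, tok_per_s
```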
## Step 3: Practical Example
An employee asks: “What is our vacation policy?”
The pipeline:
- Embeds the question using `nomic-embed-text` (2ms)
- Searches ChromaDB and finds 5 relevant chunks from `HR-Policy-2026.pdf` (8ms)
- Builds a prompt with those chunks as context
- Ollama generates: “According to the HR Policy document (Section 4.2), employees receive 23 working days of paid vacation per year. Requests must be submitted 15 days in advance through the HR portal. Unused days can be carried over to Q1 of the following year.”
Total response time on a Mac Mini M4: under 3 seconds. Cost per query: zero.
## Comparison: Local RAG vs Cloud RAG
| Factor | Local (Ollama + ChromaDB) | Cloud (OpenAI + Pinecone) |
|---|---|---|
| Cost per query | EUR 0.00 | EUR 0.01-0.05 |
| Monthly cost (1000 queries/day) | EUR 19 electricity | EUR 300-1,500 |
| Latency | 1-3 seconds | 2-5 seconds |
| Data privacy | Full --- never leaves network | Requires DPA + trust |
| GDPR compliance | Built-in | Requires processor agreement |
| Setup complexity | Medium | Low |
| Model quality (general) | Good (8B models) | Excellent (GPT-4o) |
| Model quality (domain) | Excellent after fine-tuning | Good with prompt engineering |
For a deeper cost analysis, see our cloud vs local AI cost breakdown.
## Performance Tuning
Three settings that make the biggest difference:
- Chunk size: 500 tokens works for most documents. Use 300 for dense technical manuals, 800 for conversational content.
- Overlap: 10% of chunk size prevents information loss at boundaries.
- Top-K retrieval: Start with 5. Increase to 8-10 for complex questions that span multiple document sections.
For model selection guidance, our best local LLM models comparison covers the tradeoffs between Llama, Mistral, and Qwen for different use cases.
## Automation with n8n
The real power comes from connecting this pipeline to your existing tools. n8n can trigger the RAG query from:
- A Slack message in a `#ask-hr` channel
- A form submission on your intranet
- An email to a designated address
- A scheduled digest that answers the top 10 unanswered questions from the week
For more n8n automation patterns, see our n8n AI automation tutorial.
## Related reading
- Automate Code Reviews with AI: n8n + Ollama Workflow Tutorial
- Google Gemma 3: The First Multimodal Open Model That Fits on a Mac Mini
- Google Gemma 4: The Open Model Family That Changed Our Entire Stack
## Sources
- n8n Documentation: AI Workflows --- Official guide for building AI-powered workflows in n8n
- Ollama API Reference --- Complete API documentation for embeddings and chat endpoints
- LangChain RAG Tutorial --- Reference architecture for RAG pipeline design patterns
A RAG pipeline turns your static company documents into an interactive knowledge base that any employee can query in natural language. With n8n orchestrating the workflow and Ollama handling inference locally, the entire system runs on a single Mac Mini with no recurring costs and no data leaving your network. If you want help deploying a RAG system for your organization, reach out to us. We build these pipelines for Spanish SMEs every week.