Building LocalRAG — a fully local AI document search

AI Agents & RAG, AI Engineering, Lessons from Production

LocalRAG is a fully local Retrieval-Augmented Generation application I built to answer one question: how much of a useful enterprise RAG system can you run without sending a single byte to a cloud LLM?

The problem

Most “build a chatbot over your documents” tutorials assume an OpenAI key, a managed vector database and a cloud orchestrator. That’s fine for prototypes — and a dead end the moment you talk to a customer in regulated banking, healthcare or government. They want answers on their data, on their hardware, with no egress.

The shape of the solution

LocalRAG uses local Ollama models for both embeddings and generation, FAISS for the vector index, and a content-type-aware ingestion pipeline that handles PDF, DOCX, CSV, Excel, XML and images. Everything runs on a laptop. The full demo is on YouTube.

Ingestion: multi-format extractors that preserve enough structure to chunk intelligently — tables stay together, lists stay together, headings become metadata.
Indexing: FAISS index with content-type tags so retrieval can prefer the right shape of content for the question.
Retrieval: semantic top-k with rate-limited retries and a simple fallback when a model is overloaded.
Generation: a local Ollama model with grounded prompts and source citations.
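To make the ingestion step concrete, here is a minimal sketch of structure-aware chunking. It is not the code from the repo. It assumes the extractors have already produced markdown-like text in which headings start with "#", table rows start with "|" and list items start with "- " or "* ". The names Chunk, classify and chunk_blocks and the tag values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    content_type: str  # "table", "list" or "prose" (hypothetical tag names)
    metadata: dict = field(default_factory=dict)


def classify(line):
    """Guess the content type of one line (assumes markdown-like extractor output)."""
    stripped = line.strip()
    if stripped.startswith("|"):
        return "table"
    if stripped.startswith(("- ", "* ")):
        return "list"
    return "prose"


def chunk_blocks(lines):
    """Group consecutive lines of the same type; headings become metadata."""
    chunks = []
    heading = None
    buffer, buffer_type = [], None

    def flush():
        # Emit the buffered block as one chunk under the current heading.
        nonlocal buffer, buffer_type
        if buffer:
            chunks.append(Chunk("\n".join(buffer), buffer_type, {"heading": heading}))
        buffer, buffer_type = [], None

    for line in lines:
        if not line.strip():
            flush()  # a blank line ends the current block
            continue
        if line.startswith("#"):
            flush()  # close the previous block before switching heading
            heading = line.lstrip("#").strip()
            continue
        kind = classify(line)
        if kind != buffer_type:
            flush()  # a type change starts a new chunk, so tables and lists stay whole
            buffer_type = kind
        buffer.append(line)
    flush()
    return chunks
```

Fed a heading, a sentence, a two-row table and a two-item list, this returns three chunks: one prose, one table and one list, each carrying the heading as metadata. A real pipeline would also cap chunk size, which this sketch leaves out.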
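The indexing and retrieval steps can be sketched in a similar way. This is an illustration under stated assumptions, not LocalRAG's implementation: a brute-force cosine search in pure Python stands in for the FAISS index, the content-type preference is modelled as a small additive score boost, and an overloaded model is modelled as a RuntimeError. The names search, call_with_fallback, prefer and boost are hypothetical.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    content_type: str  # tag stored next to the vector, e.g. "table" or "prose"
    source: str
    vector: list


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vector, k=3, prefer=None, boost=0.1):
    """Return the top-k chunks by similarity, nudging the preferred type upward."""
    scored = []
    for chunk in chunks:
        score = cosine(query_vector, chunk.vector)
        if prefer is not None and chunk.content_type == prefer:
            score += boost  # assumed scoring rule: flat bonus for the preferred type
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def call_with_fallback(primary, fallback, retries=3, delay=0.5):
    """Retry primary() with exponential backoff, then fall back if it keeps failing."""
    for attempt in range(retries):
        try:
            return primary()
        except RuntimeError:  # assumption: overload surfaces as RuntimeError
            time.sleep(delay * (2 ** attempt))
    return fallback()
```

A question that asks for figures could call search(..., prefer="table"), while a how-to question could prefer "list". The fallback passed to call_with_fallback might be a smaller local model or a plain "try again later" answer; which one suits depends on the deployment.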

What I’d do differently next time

Two things. First, evaluation should be a first-class subsystem from day one, not bolted on later — even a small golden-question set saves you from regression panic during refactors. Second, content-type awareness is more important than fancy reranking; a boring extractor that respects document structure beats a clever reranker that received bad chunks.
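A golden-question set does not need much machinery. As a minimal sketch, assuming each golden question records the one source document that should show up in the results, and that the retriever is a function returning a ranked list of source names, a top-k hit rate looks like this. GoldenQuestion, hit_rate and the retrieve callable are hypothetical names, not part of LocalRAG.

```python
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    question: str
    expected_source: str  # document that should appear in the retrieved results


def hit_rate(golden_set, retrieve, k=3):
    """Fraction of questions whose expected source appears in the top-k results."""
    if not golden_set:
        return 0.0
    hits = 0
    for item in golden_set:
        sources = retrieve(item.question)[:k]  # retrieve returns ranked source names
        if item.expected_source in sources:
            hits += 1
    return hits / len(golden_set)
```

Running this before and after a refactor and comparing the two numbers is enough to catch a retrieval regression early. Answer quality needs a separate check, since this only measures whether the right document was retrieved.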

Repo: github.com/PowerAI-Labs/LocalRAG. Feedback and PRs welcome.

