Why do most enterprise RAG deployments underperform?

The most common failure modes in enterprise RAG are: (1) poor chunking strategy — documents split at arbitrary token boundaries rather than semantic boundaries, fragmenting the context the model needs; (2) single-vector retrieval without re-ranking — embedding similarity alone fails on multi-hop questions and synonym mismatches; (3) no query rewriting or HyDE — embedding the raw user query rather than a hypothetical ideal answer, reducing retrieval precision; (4) missing metadata filters — retrieving from the full corpus rather than scoping by document type, date range, or business unit; (5) no evaluation harness — deploying without measuring retrieval recall, answer faithfulness, or citation accuracy on a held-out test set.

How long does it take to build an enterprise RAG system?

A focused enterprise RAG pipeline — single knowledge domain, one document type, one LLM backend — typically takes 4 to 6 weeks from kickoff to production deployment. Multi-domain systems with hybrid search, custom re-rankers, and integrations into multiple internal data sources take 8 to 12 weeks. The timeline is most sensitive to data ingestion complexity: structured databases and clean PDFs ingest quickly; legacy document repositories with inconsistent formatting or embedded tables require more preprocessing engineering.

How do you measure RAG accuracy?

We measure four dimensions: retrieval recall (does the correct document appear in the top-k results for a given question?), answer faithfulness (is the generated answer supported by the retrieved context, with no hallucinated additions?), answer relevance (does the answer actually address the question?), and citation accuracy (do the cited source chunks contain the information attributed to them?). We build a domain-specific golden dataset of question-answer-citation triples at project start and run automated evaluation using RAGAS or an equivalent framework on every pipeline change before deployment.

What is agentic RAG?

Agentic RAG extends the basic retrieve-then-generate pattern by giving the LLM the ability to issue multiple retrieval calls, decide when it has enough context, reformulate queries when initial retrieval is insufficient, and route different sub-questions to different knowledge sources. This is implemented as a ReAct or tool-calling loop where the model orchestrates its own search. Agentic RAG significantly improves performance on complex multi-hop questions at the cost of higher latency and more complex observability. We design it in for use cases where multi-step reasoning over the knowledge base is required.

Enterprise RAG Development | Retrieval-Augmented Generation for Business

Can RAG work with my existing document repository?

Yes. Modulus builds ingestion pipelines for the most common enterprise document sources: SharePoint, Confluence, Notion, Google Drive, S3 and Azure Blob Storage, SQL databases, Salesforce, and direct API feeds. For unstructured documents (PDFs, Word files, scanned images), we apply OCR and layout-aware parsing to extract clean text while preserving table structure and section hierarchy. The ingestion pipeline is built as a scheduled job, so your vector index stays current as documents are created or updated.

Definition

What is enterprise RAG?

Enterprise RAG (retrieval-augmented generation) is an architecture that connects a large language model to your organisation's private knowledge base at query time, so the model answers questions using retrieved documents from your own corpus — with source citations — rather than generating from training data alone.

Documents are ingested, cleaned, chunked, and stored as vector embeddings in a searchable index that updates continuously as your knowledge base changes.
At query time, relevant chunks are retrieved using hybrid search (semantic vectors plus keyword matching), re-ranked by a cross-encoder, and injected into the model's context window alongside the user's question.
The model generates an answer grounded in the retrieved content, with each claim traceable to a specific source document and chunk.
Unlike fine-tuning, the knowledge base stays separate from the model — it can be updated in minutes without any retraining.
Enterprise RAG dramatically reduces hallucination on factual questions compared to prompting a base model without retrieval, and produces auditable answer trails required by regulated industries.

Architecture

How a production RAG pipeline works.

Every component in the pipeline is an opportunity to improve accuracy — or introduce a failure mode if built carelessly.

Data flow: ingestion path

Source: Documents
Stage 1: Parse + Clean
Stage 2: Semantic Chunk
Stage 3: Embed + Index
Store: Vector DB

Data flow: query path

Input: User Query
Stage 1: Query Rewrite
Stage 2: Hybrid Search
Stage 3: Re-rank
Stage 4: Generate + Cite

Market reality

Why enterprise RAG is no longer optional.

Three figures that explain the scale of the hallucination and knowledge-access problem in enterprise AI deployments.

76%

of enterprise AI deployments that rely solely on base LLMs without retrieval report unacceptable hallucination rates in production use

Source: Gartner AI Implementation Survey, 2025

3.4×

improvement in answer faithfulness scores when a well-engineered RAG pipeline replaces direct LLM prompting on enterprise knowledge-base tasks

Source: Stanford CRFM RAG Benchmark Study, 2025

$2.9T

in knowledge worker productivity uplift projected from enterprise AI assistants with reliable document grounding by 2028, per IDC forecasts

Source: IDC Enterprise AI Forecast, 2025

Failure modes

Why most enterprise RAG deployments underperform.

These are the six most common engineering mistakes that cause enterprise RAG systems to produce inaccurate, incomplete, or unusable answers — and how we address each one.

Failure 01

Arbitrary token chunking

Splitting documents at a fixed token count breaks sentences, paragraphs, and tables mid-thought. The retrieved chunk no longer contains the complete context the model needs. We use semantic boundary detection and hierarchical chunking to preserve document structure.
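As a rough illustration of the difference, the sketch below packs whole sentences into chunks instead of cutting at a fixed count. It is a minimal example, not the production chunker: the function name `chunk_by_sentences`, the word-based budget, and the regex sentence splitter are all assumptions made for the example, and a real pipeline would also need to handle tables, headings, and abbreviations.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 120) -> list[str]:
    """Pack whole sentences into chunks of roughly max_words words each."""
    # Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk instead of cutting a sentence in half.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than the budget is kept whole as its own oversized chunk, which is usually preferable to breaking it mid-thought.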

Failure 02

Single-vector retrieval only

Dense embeddings miss exact-match queries on codes, identifiers, and proper nouns. Sparse BM25 misses semantic and paraphrastic matches. Hybrid search that merges both result lists with reciprocal rank fusion (RRF) consistently outperforms either approach alone on enterprise corpora.
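Reciprocal rank fusion (RRF) is simple enough to sketch in a few lines. Each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The document IDs below are invented for illustration, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list contribute a larger share.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only rank positions, it needs no score normalisation between the dense and sparse retrievers.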

Failure 03

No query rewriting

Embedding the raw user query and searching directly gives poor recall on conversational or ambiguous questions. Multi-query expansion improves recall by searching several rewrites of the question, and HyDE (hypothetical document embeddings) improves precision by embedding a hypothetical answer instead of the raw query, both before the re-ranker even runs.

Failure 04

Missing metadata filters

Searching across the entire corpus for every query introduces noise and slows retrieval. Metadata-filtered search — by document type, date range, department, or regulatory category — reduces the search space and improves precision simultaneously.
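The idea can be shown in plain Python, as a sketch only. In a real deployment the vector database would typically apply the filter inside the index rather than in application code. The `Chunk` fields and the `filter_chunks` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_type: str
    department: str
    published: date


def filter_chunks(
    chunks: list[Chunk],
    doc_type: Optional[str] = None,
    department: Optional[str] = None,
    published_after: Optional[date] = None,
) -> list[Chunk]:
    """Keep only chunks that match every filter that was supplied."""
    selected = []
    for chunk in chunks:
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        if department is not None and chunk.department != department:
            continue
        if published_after is not None and chunk.published <= published_after:
            continue
        selected.append(chunk)
    return selected
```

Similarity scoring then runs only over the chunks that survive the filter, which is where both the speed and precision gains come from.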

Failure 05

No re-ranking step

Bi-encoder vector search is fast but coarse. A cross-encoder re-ranker reads both the query and each retrieved chunk together, dramatically improving the relevance ordering of the top-k results before they reach the generator.

Failure 06

No evaluation harness

Deploying without measuring retrieval recall, answer faithfulness, and citation accuracy means you have no signal when the system regresses. We build a RAGAS-based evaluation suite at project start and run it on every pipeline change before deployment.

Methodology

How we build enterprise RAG.

Five stages, each with a defined deliverable. Evaluation gates are enforced at stages 3 and 5.

01

Knowledge base audit and ingestion architecture

We begin by mapping your knowledge estate: document types, storage systems, update frequency, access controls, and volume. This shapes every subsequent design decision — chunking strategy, index design, metadata schema, and the freshness requirement for the synchronisation pipeline. We then build the ingestion layer: parsers for each source (SharePoint, Confluence, GDrive, S3, SQL, Salesforce, custom APIs), a document-normalisation pipeline that produces clean text while preserving structure (headers, tables, lists, section hierarchy), and a deduplication pass to avoid indexing the same content at multiple versions simultaneously. The ingestion job is designed as an incremental sync — full reindexing is expensive at scale, so only changed or new documents are processed on subsequent runs.
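One common way to implement an incremental sync is to compare a content hash of each document against the hash recorded at the last run. The sketch below assumes that approach. `plan_sync` and its dictionary-based inputs are illustrative names, not a description of any particular connector.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's normalised text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to (re)index, skip, or delete from the index.

    current_docs maps document ID to text as it exists in the source today.
    indexed_hashes maps document ID to the hash stored at the previous sync.
    """
    to_index, unchanged = [], []
    for doc_id, text in current_docs.items():
        if indexed_hashes.get(doc_id) == content_hash(text):
            unchanged.append(doc_id)   # identical content, nothing to do
        else:
            to_index.append(doc_id)    # new or modified document
    # Anything indexed earlier but missing from the source should be removed.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return {
        "index": sorted(to_index),
        "skip": sorted(unchanged),
        "delete": sorted(to_delete),
    }
```

The same hash can double as a deduplication key when identical content appears under several paths.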

02

Chunking strategy and embedding pipeline

Chunking is the most under-engineered layer in most production RAG systems. We evaluate three strategies for your corpus: fixed-size with sentence-boundary respect, semantic chunking using embedding similarity to detect topic shifts, and hierarchical chunking that preserves parent-child document relationships for context enrichment. The choice is informed by your document characteristics — dense technical manuals favour smaller chunks with parent retrieval; long-form policies favour larger semantic segments. Each chunk is enriched with metadata at index time: document title, section header, date, author, document type, and a hypothetical questions field generated by an LLM to improve retrieval on paraphrastic queries. Embeddings are generated using a model selected for your domain — typically a fine-tuned E5 or BGE variant for enterprise text.
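The parent-retrieval idea can be sketched as follows: search runs over small child chunks, and the generator is then handed the larger parent section each hit belongs to. The `ChildChunk` type and `expand_to_parents` helper are hypothetical, and the example assumes the hits are already ordered by relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildChunk:
    chunk_id: str
    parent_id: str   # ID of the section this small chunk was cut from
    text: str


def expand_to_parents(
    hits: list[ChildChunk],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Swap matched child chunks for their parent sections, without duplicates."""
    seen: set[str] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id in seen:
            continue  # several children of one section should add it only once
        seen.add(hit.parent_id)
        context.append(parents[hit.parent_id])
        if len(context) == max_parents:
            break
    return context
```

Small chunks keep the match precise, while the parent text restores the surrounding context the model needs to answer.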

03

Hybrid search, re-ranking, and retrieval evaluation

The retrieval layer combines dense vector search (cosine similarity against your embedding index) with sparse keyword search (BM25 via a separate inverted index) merged using reciprocal rank fusion. Before the merged results reach the generator, a cross-encoder re-ranker scores each query-chunk pair jointly, reordering the shortlist by true relevance rather than approximate embedding distance. We also implement query rewriting: a small LLM rewrites the user's query into two to four diverse versions to improve recall on conversational inputs, and applies HyDE on question-type queries to boost precision. At this stage we run the first evaluation gate: retrieval recall at K=5 and K=10 must clear pre-agreed thresholds on a held-out golden dataset of question-document pairs before we proceed to the generation layer.
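A retrieval gate of this kind reduces to a small calculation. The sketch below computes recall at k per question, averages it over the golden dataset, and compares it with a threshold for each k. The function names and the threshold values in the usage note are placeholders, since the real thresholds are agreed per project.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]],
    golden: dict[str, set[str]],
    k: int,
) -> float:
    """Average recall at k over every question in the golden dataset."""
    return sum(recall_at_k(results[q], golden[q], k) for q in golden) / len(golden)


def passes_gate(
    results: dict[str, list[str]],
    golden: dict[str, set[str]],
    thresholds: dict[int, float],
) -> bool:
    """True only if mean recall clears the threshold at every configured k."""
    return all(mean_recall_at_k(results, golden, k) >= t for k, t in thresholds.items())
```

For example, `passes_gate(results, golden, {5: 0.85, 10: 0.95})` would block the move to the generation layer unless both cut-offs are met. With one relevant document per question, recall at k is simply whether that document appears in the top k.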

04

Generation, citation grounding, and API layer

The generation layer takes the re-ranked chunks, constructs a structured prompt with explicit grounding instructions, and calls your chosen LLM backend — whether a commercial API (OpenAI, Anthropic) or a self-hosted model. The system prompt instructs the model to answer only from retrieved context, cite sources inline, and return a structured refusal when the knowledge base contains insufficient evidence rather than guessing. The response includes source metadata (document title, section, URL or path, retrieval score) alongside the generated text, enabling downstream UIs to render citations as clickable references. The full pipeline is exposed as a REST API compatible with your application layer, with streaming support for progressive response rendering.
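A grounding prompt along these lines might be assembled as shown below. This is a sketch under assumptions: the exact wording, the `INSUFFICIENT_EVIDENCE` refusal token, and the `RetrievedChunk` fields are invented for illustration, and the LLM call itself is deliberately left out.

```python
from dataclasses import dataclass

REFUSAL = "INSUFFICIENT_EVIDENCE"  # hypothetical token the caller can detect


@dataclass(frozen=True)
class RetrievedChunk:
    title: str
    section: str
    text: str


def build_grounded_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Build a prompt that restricts the model to numbered, citable sources."""
    sources = "\n".join(
        f"[{i}] {chunk.title} / {chunk.section}\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below.\n"
        "Cite sources inline as [n] after each claim.\n"
        f"If the sources do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )
```

Numbering the sources lets the application map each `[n]` in the answer back to a document title and section when rendering citations, and a fixed refusal token makes "no evidence" responses easy to detect and log.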

05

Evaluation, observability, and deployment

Before production, the full pipeline runs through an automated evaluation suite covering four dimensions: retrieval recall (are the right documents in the top-k?), answer faithfulness (is every claim supported by the retrieved context?), answer relevance (does the answer address the question?), and citation accuracy (does the cited source contain the attributed information?). We use RAGAS or an equivalent framework on your domain golden dataset. Scores below the agreed threshold trigger a root-cause investigation: retrieval issues are addressed at the search layer, faithfulness issues at the prompt or model layer. Deployment is containerised (Docker/Kubernetes) with a full observability stack: Prometheus metrics (query latency p50/p95, retrieval time, generation time, cache hit rate), Grafana dashboards, and alerting on answer quality degradation detected by the continuous eval monitor. Index freshness is also monitored, with an alert raised when sync lag exceeds your agreed threshold.

Decision framework

Enterprise RAG vs. the alternatives.

Dimension	Base LLM (no retrieval)	LLM + fine-tuning only	Naive RAG (basic chunking)	Modulus Enterprise RAG
Factual accuracy on proprietary knowledge	✗ Hallucination-prone	~ Static, retrains to update	~ Mediocre retrieval	✓ High — hybrid search + re-rank
Knowledge base updates	✗ Model retraining required	✗ Retraining required	✓ Re-index only	✓ Incremental sync, real-time
Source citations	✗ None	✗ None	~ Chunk-level only	✓ Document + section + score
Audit trail for regulated industries	✗ No traceability	✗ No traceability	~ Partial	✓ Full — logged, traceable
Multi-hop reasoning	~ From training only	~ Improved by fine-tune	✗ Single retrieval step fails	✓ Agentic multi-step retrieval
Measurement and quality gates	✗ No eval framework	~ Training metrics only	✗ Rarely measured	✓ RAGAS eval harness, gated deploy

Case study

Numbers from a live RAG deployment.

Internal compliance assistant — financial services, 340K documents

A mid-market financial institution needed to replace a legacy keyword-search system for compliance queries. Analysts spent an average of 38 minutes per query navigating regulatory documents, internal policies, and historical precedent files across three disconnected repositories. An initial prompt-only deployment using GPT-4 Turbo produced hallucinated regulatory references on 31% of test queries — unacceptable for a compliance context.

Modulus built a hybrid RAG pipeline ingesting SharePoint and a legacy DMS, with semantic chunking, BGE-M3 embeddings, pgvector storage, a Cohere re-ranker, and a structured citation-grounding prompt layer. A golden dataset of 420 compliance Q&A pairs drove evaluation across all pipeline iterations.

94%

answer faithfulness score on the 420-question golden eval set at launch (up from 69% with a naive RAG baseline)

4 min

median analyst query resolution time post-deployment, down from 38 minutes with the legacy keyword system

0

hallucinated regulatory citations in production in the first 90 days of monitored deployment

8 wks

kickoff to production deployment, including a full data-quality remediation pass on the legacy DMS export

Technology

The stack behind every RAG deployment.

Document parsing

Unstructured.io

PDF, Word, HTML, tables, scanned docs via OCR

Embeddings

BGE-M3 / E5-Mistral

Multilingual, long-context embeddings for enterprise text

Embeddings (alt)

OpenAI text-embedding-3

When OpenAI backend is already in use

Vector database

Qdrant

Self-hostable, fast filtering, payload storage

Vector database (alt)

pgvector

When PostgreSQL is already the primary store

Sparse search

Elasticsearch / BM25

Keyword retrieval layer for hybrid search RRF merge

Re-ranking

Cohere Rerank / BGE re-ranker

Cross-encoder precision layer on top of retrieval

Orchestration

LangChain / LlamaIndex

Pipeline orchestration, agentic routing, tool use

LLM backend

OpenAI / Anthropic / self-hosted

Model-agnostic pipeline — works with any backend

Evaluation

RAGAS

Automated faithfulness, recall, relevance scoring

Observability

Prometheus + Grafana

Query latency, retrieval time, cache hit, quality drift

Deployment

Docker / Kubernetes

Containerised, cloud or on-prem, auto-scaling ready

Investment

Three fixed-fee engagement tiers.

Scoped after a free discovery call. Fixed-price proposal within 48 hours.

Starter

Single Domain RAG

$12K

fixed fee, from

Best for

Single knowledge domain, one document type
Semantic chunking and embedding pipeline
Dense vector search (Qdrant or pgvector)
Citation-grounded generation with your LLM backend
REST API with OpenAI-compatible interface
RAGAS evaluation baseline
30-day post-launch support

Most common

Production RAG

$38K

fixed fee, from

Best for

Multi-source ingestion (SharePoint, Confluence, S3, SQL)
Hybrid search: dense + BM25 + RRF merge
Cross-encoder re-ranking
Query rewriting and HyDE
Metadata filtering by doc type, date, department
Full RAGAS evaluation harness + golden dataset
Observability stack and quality drift monitoring
60-day support + one pipeline iteration cycle

Enterprise

Agentic RAG + Custom Model

Custom

scoped after discovery

Best for

Agentic multi-step retrieval for complex queries
Combined custom LLM + RAG deployment
Fully air-gapped on-prem stack
Regulated-industry audit trail and explainability
Multi-tenant knowledge base isolation
Custom fine-tuned re-ranker for your domain
Ongoing maintenance retainer available

FAQ

Questions about enterprise RAG.

What is enterprise RAG?

Enterprise RAG (retrieval-augmented generation) is an architecture that connects a large language model to your organisation's private knowledge base at query time, so the model answers questions using retrieved documents from your own corpus, with citations, rather than relying solely on its training data. This dramatically reduces hallucination on factual questions, enables knowledge-base updates without retraining the model, and produces auditable answer trails with traceable sources.

How is RAG different from fine-tuning?

Fine-tuning bakes domain knowledge into model weights during a training run, so the knowledge is static and requires retraining to update. RAG retrieves knowledge at inference time from a live index, so the knowledge base can be updated without touching the model. Fine-tuning is better for learning style, tone, and format patterns. RAG is better for factual accuracy, traceability, and knowledge that changes frequently. Many production deployments combine both: a domain-fine-tuned model that also uses RAG for current, cited, auditable answers.

Why do most enterprise RAG deployments underperform?

The most common failure modes are: arbitrary token chunking that breaks documents mid-thought; single-vector retrieval without hybrid search; no query rewriting or HyDE; missing metadata filters that force search across the full corpus for every query; no re-ranking step after initial retrieval; and no evaluation harness to measure accuracy before deployment. Any one of these will materially degrade answer quality. All of them together produce a system that feels unreliable in production, which is why so many enterprise RAG pilots never become platforms.

How long does it take to build an enterprise RAG system?

A focused pipeline for a single domain and document type typically takes 4 to 6 weeks from kickoff to production deployment. Multi-domain systems with hybrid search, custom re-rankers, and integrations into multiple data sources take 8 to 12 weeks. Timeline is most sensitive to data ingestion complexity: clean, structured document stores ingest quickly; legacy repositories with inconsistent formatting, scanned PDFs, or embedded tables require more engineering.

What is hybrid search?

Hybrid search combines dense vector retrieval (embedding-based semantic similarity) with sparse keyword retrieval (BM25 or TF-IDF). Dense retrieval excels at semantic and paraphrastic matches. Sparse retrieval excels at exact term matches, product codes, and proper nouns. Combining both with a reciprocal rank fusion merge step consistently outperforms either approach alone, particularly on enterprise corpora that contain precise codes, identifiers, and domain jargon that embedding models may not encode reliably.

Can RAG work with my existing document repository?

Yes. Modulus builds ingestion pipelines for SharePoint, Confluence, Notion, Google Drive, S3 and Azure Blob Storage, SQL databases, Salesforce, and direct API feeds. For unstructured documents (PDFs, Word files, scanned images), we apply OCR and layout-aware parsing to extract clean text while preserving table structure and section hierarchy. The ingestion pipeline runs as a scheduled incremental sync so your index stays current as documents are created or updated.

How do you measure RAG accuracy?

We measure four dimensions: retrieval recall (does the correct document appear in the top-k for a given question?), answer faithfulness (is the generated answer supported by the retrieved context?), answer relevance (does the answer address the question?), and citation accuracy (does the cited source chunk contain the attributed information?). We build a domain-specific golden dataset of question-answer-citation triples at project start and run automated evaluation using RAGAS on every pipeline change before deployment.

Which vector database should we use?

Database selection depends on your scale, infrastructure, and operational constraints. For most enterprise projects we recommend Qdrant (excellent performance, self-hostable, strong metadata filtering), pgvector (if you already run PostgreSQL and want to avoid a new infrastructure component), or Weaviate (strong for multi-tenancy and hybrid search). For very large-scale deployments we evaluate Pinecone or Milvus. The selection is made during the architecture phase based on your specific constraints, not applied as a fixed default.

Does RAG require a specific LLM?

No. RAG works with any LLM backend: commercial APIs (OpenAI, Anthropic, Gemini) or self-hosted open-weight models (Llama 3, Mistral, Qwen). The retrieval pipeline is model-agnostic. For enterprises with strict data residency requirements, a self-hosted open-weight model combined with an on-premises RAG stack produces a fully air-gapped system where no query or document reaches an external server. We design for either scenario depending on your compliance requirements.

What is agentic RAG?

Agentic RAG extends the basic retrieve-then-generate pattern by giving the LLM the ability to issue multiple retrieval calls, decide when it has enough context, reformulate queries when initial retrieval is insufficient, and route different sub-questions to different knowledge sources. Implemented as a ReAct or tool-calling loop, agentic RAG significantly improves performance on complex multi-hop questions at the cost of higher latency and more complex observability. We design it in for use cases where multi-step reasoning over the knowledge base is required, typically compliance research, legal analysis, and complex technical support.
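The control flow of such a loop can be sketched without any model at all. In the example below, `search`, `has_enough`, and `reformulate` are placeholders passed in by the caller. In a real system the last two would be LLM calls, and the step cap is an assumed safeguard against unbounded latency.

```python
from typing import Callable


def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    has_enough: Callable[[str, list[str]], bool],
    reformulate: Callable[[str, list[str]], str],
    max_steps: int = 3,
) -> tuple[list[str], int]:
    """Retrieve, check sufficiency, and reformulate until done or out of steps.

    Returns the gathered passages and the number of retrieval steps used.
    """
    context: list[str] = []
    query = question
    for step in range(1, max_steps + 1):
        for passage in search(query):
            if passage not in context:   # avoid feeding duplicates to the model
                context.append(passage)
        if has_enough(question, context):
            return context, step
        # Not enough evidence yet: ask for a follow-up query and try again.
        query = reformulate(question, context)
    return context, max_steps
```

Logging the query issued at each step is what makes this pattern observable, since the final answer may depend on several intermediate retrievals.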

Use cases

Where enterprise RAG delivers measurable ROI.

The highest-value enterprise RAG deployments share a common pattern: large volumes of proprietary documents, high cost of wrong answers, and knowledge that changes faster than retraining allows.

01

Legal research and contract analysis

Legal teams spend a disproportionate share of billable hours searching precedent archives, reviewing contracts for non-standard clauses, and cross-referencing regulatory requirements. An enterprise RAG system indexed across case files, contracts, regulatory filings, and internal playbooks allows attorneys and paralegals to ask natural-language questions and receive cited answers in seconds rather than hours. Critically, every answer is traceable to the source document — a requirement for any legal use case where the answer itself may be used to advise clients or inform decisions. The knowledge base updates incrementally as new cases, contracts, and regulatory guidance are filed, without any model retraining. Modulus has deployed legal RAG systems achieving over 90% answer faithfulness on clause-extraction tasks, benchmarked against a held-out golden dataset of 800+ question-document pairs built with practitioner input.

02

Financial compliance and regulatory research

Compliance teams in financial services institutions face an ever-expanding volume of regulatory publications, internal policy documents, and historical correspondence that must be cross-referenced to answer complex compliance questions. Enterprise RAG systems indexed across regulatory databases, internal policies, and historical precedent files dramatically reduce the manual research burden while producing auditable answers with source citations — a requirement for compliance use cases subject to regulatory examination. Because RAG retrieves at inference time rather than baking knowledge into model weights, the system adapts immediately as new regulatory guidance is published or internal policies are updated, with no model retraining cycle. The citation trail also serves as documentation of the research basis for compliance decisions, which regulators increasingly expect.

03

Internal knowledge base and employee support

Enterprise organisations with large internal knowledge bases — HR policies, IT documentation, onboarding guides, engineering runbooks, procurement procedures — spend significant support staff time answering questions that are already answered in existing documents. An enterprise RAG system gives employees a natural-language interface to the entire knowledge base, returning cited answers that link directly to the authoritative source document. Unlike a traditional search interface, RAG handles multi-part questions, conversational follow-up, and queries that span multiple policies simultaneously. The system remains current with zero retraining as documents are updated through normal authoring workflows — the incremental sync job re-indexes updated documents within minutes of publication. Typical outcomes include 60–80% deflection of tier-1 support queries to the self-service RAG assistant.

04

Technical documentation and engineering support

Software engineering and product teams in companies with large proprietary codebases or complex technical stacks spend significant time searching internal wikis, API documentation, architecture decision records, and engineering runbooks. An enterprise RAG system indexed across internal documentation (including code comments, API specs, architecture diagrams converted to text, and historical Slack threads exported to documents) gives engineers a single interface for technical questions, with answers traced to specific documentation pages, code files, or decision records. This is particularly high-value in organisations where documentation is extensive but fragmented across multiple platforms: Confluence, GitHub wikis, Notion, and SharePoint are commonly consolidated into a single retrieval index. The citation requirement also drives documentation quality over time: engineers discover gaps in the knowledge base through retrieval failures and fill them.

05

Customer-facing product support and sales assistance

Customer-facing enterprise RAG applications require the highest reliability standards because errors are visible to clients. We design these systems with explicit knowledge-boundary enforcement — the model is instructed to return a structured refusal rather than hallucinating an answer when the retrieved context is insufficient. The knowledge base typically comprises product documentation, FAQs, troubleshooting guides, and historical support ticket resolutions. For sales assistance, the knowledge base extends to product specs, pricing sheets, competitive positioning documents, and proposal templates. Citation grounding is essential in sales contexts: a sales assistant that cites specific datasheet pages or case study documents gives the sales representative immediate access to the source material for follow-up, rather than creating a black-box answer that cannot be verified or expanded.