Custom LLM Development

Your model. Your data. Your competitive edge.

Custom LLM development from first principles — architecture, data curation, training, RLHF alignment, quantization, and on-prem deployment. Fixed fee. You own the weights, the training code, and the evaluation suite.

LLM development services
8–16
weeks to production
100%
IP transfer on close
On-prem
or private cloud
Fixed
fee, no surprises
Capabilities
Architecture Design Pre-training RLHF / DPO Quantization RAG Integration vLLM Deployment
Definition

What is custom LLM development?

AEO — Direct Answer

Custom LLM development is the end-to-end process of designing, training, aligning, and deploying a large language model built specifically for your organisation's data, vocabulary, and tasks — rather than accessing a shared commercial model through an API.

Market reality

Why custom LLMs matter now.

Three data points that explain the shift from API wrappers to owned, domain-trained models across enterprise teams.

92%
of enterprise AI leaders cite data privacy as the top barrier to adopting commercial LLM APIs in production
Source: McKinsey Global AI Survey, 2025
68%
accuracy improvement observed in domain-specific tasks when models are trained on proprietary corpora vs. prompted general models
Source: Stanford HAI AI Index Report, 2025
$4.1T
estimated value at stake for organisations that embed domain-specific AI into core workflows by 2030, according to Gartner forecasts
Source: Gartner AI Business Value Forecast, 2025
Methodology

How we build custom LLMs.

Six stages, each with a defined deliverable and a clear handoff gate. No work begins on stage N+1 until you approve stage N.

01

Discovery and data readiness audit

Before any training begins, we spend one to two weeks understanding your use case, success criteria, and data estate. We audit your corpus for volume, quality, coverage gaps, and sensitive-data exposure. The output is a written data readiness report that grades your corpus across five dimensions — cleanliness, diversity, relevance, format consistency, and size — and specifies the minimum preprocessing required before training. This audit also produces the first draft of your domain-specific eval harness, so success metrics are locked before a single training step runs. Projects that skip this stage are the ones that fail expensive training runs. We do not skip it.

02

Architecture selection and base model scoping

We determine whether your requirements call for continual pre-training on an open-weight base (the path for most projects), a full from-scratch training run (reserved for very large proprietary corpora or strict sovereignty requirements), or a hybrid approach. Base model selection is driven by six factors: inference hardware budget, required context window, licensing terms (MIT, Apache 2.0, commercial-friendly Llama license), domain distance from existing model knowledge, parameter budget for your latency target, and whether the model needs to support multi-lingual inference. We typically evaluate two to three candidate bases on a sample task before committing. This stage produces an architecture decision record documenting the choice and the trade-offs considered.

03

Data curation, formatting, and pipeline construction

Raw documents rarely enter a training loop directly. We build a reproducible data pipeline that handles ingestion (PDFs, HTML, SQL exports, APIs), deduplication using MinHash LSH at scale, quality filtering with perplexity scoring and rule-based heuristics, format conversion to instruction-following or causal language modeling format, and train/validation/test splitting with stratification by document type and time period. For instruction tuning datasets, we draft, review, and refine the instruction templates in collaboration with your subject-matter experts before finalising. All pipeline code is versioned, documented, and handed over at project close.

04

Training, instruction tuning, and alignment

Training is executed on GPU clusters using distributed frameworks (DeepSpeed ZeRO, FSDP) sized to your parameter count and data volume. We instrument every run with experiment tracking (Weights and Biases or MLflow) so you can observe loss curves, gradient norms, and learning rate schedules in real time. After the base training run, we apply supervised fine-tuning (SFT) on your instruction dataset to align output format and domain behavior. Where your use case requires preference alignment — for example, to enforce citation accuracy, tone consistency, or refusal behaviour — we apply Direct Preference Optimization (DPO) or a full RLHF pipeline with a trained reward model. Each stage is checkpointed and evaluated against your domain eval harness before proceeding.

05

Quantization and inference optimisation

A 70B parameter model in bfloat16 requires roughly 140 GB of GPU memory — impractical for most on-prem inference budgets. We apply GPTQ or AWQ quantization (4-bit or 8-bit, depending on accuracy tolerance) to produce a model that fits the hardware you have, while maintaining accuracy within the agreed tolerance on your eval set. We also apply speculative decoding where throughput is critical, and configure continuous batching in the inference server to maximise tokens-per-second under your expected query load. All quantization decisions are documented and the final accuracy delta versus the full-precision checkpoint is reported explicitly before deployment.

06

Deployment, observability, and handover

We deploy the final model using vLLM, Text Generation Inference (TGI), or Ollama depending on your infrastructure and team's operational preferences. Deployment includes an OpenAI-compatible REST API so your existing application code typically requires zero changes. We instrument the inference layer with Prometheus metrics (latency p50/p95/p99, tokens per second, error rates) fed into a Grafana dashboard, and configure alerting on model drift using output distribution monitoring. The 30-day post-launch support window covers live traffic tuning and any prompt engineering adjustments. At handover, you receive model weights, training code, eval harness, data pipeline, deployment configs, and full documentation — the complete package to retrain or adapt independently.

Decision framework

Custom LLM vs. the alternatives.

Where each approach wins, and what it costs in the dimensions that matter most for enterprise deployments.

Dimension Commercial API (GPT-4o, Claude) Open-weight + prompt engineering Fine-tuned adapter (LoRA) Custom LLM (Modulus)
Data stays on-premises Sent to vendor servers If self-hosted If self-hosted Air-gapped by design
Domain accuracy ~ Generalist ceiling ~ Generalist ceiling ~ Surface-level adaptation Trained on your corpus
Inference cost at scale Per-token, compounds fast Fixed infra cost Fixed infra cost Fixed infra cost
IP ownership None ~ Weights only, no training IP ~ Adapter weights only Full — weights + code + evals
Regulatory compliance Vendor-dependent ~ Depends on host ~ Depends on host Full control of data flow
Custom vocabulary / terminology Prompt injection only Prompt injection only ~ Partial Baked into weights
Vendor dependency risk High — pricing, API changes None Minimal Zero after delivery
Case study

Numbers from a live deployment.

Legal document intelligence model — 8B parameter, on-prem

A regional legal services group needed to extract structured data from contracts, flag non-standard clauses, and generate compliance summaries across a proprietary 14-year document archive. Commercial APIs introduced unacceptable data residency risk. An existing fine-tuned adapter produced acceptable accuracy on simple extraction tasks but failed on complex multi-clause reasoning.

We ran continual pre-training on 22 GB of curated legal text, applied SFT on 4,200 hand-verified instruction examples, and deployed via vLLM on a 2× A100 on-premises server. The domain eval harness ran across 800 held-out documents spanning six contract types.

91%
clause extraction accuracy on held-out eval (vs. 61% for prompted GPT-4o on the same set)
140ms
median first-token latency on the production vLLM stack, well within the 300ms SLA requirement
$0
per-token cost post-deployment — all inference runs on owned hardware with no external API dependency
11 wks
discovery to production handover, within the 12-week contract timeline
Technology

The stack behind every custom LLM.

We select specific tools based on your project's requirements — not a fixed template. These are the frameworks we work with across every layer of the pipeline.

Base models
Llama 3 / 3.1
8B, 70B, 405B — Meta's flagship open-weight family
Base models
Mistral / Mixtral
Efficient dense and MoE architectures with strong reasoning
Base models
Qwen 2.5
Strong multilingual and code capabilities
Training framework
DeepSpeed / FSDP
Distributed training with ZeRO memory optimisation
Fine-tuning
TRL + PEFT
SFT, DPO, RLHF, and LoRA adapter training
Data pipeline
Apache Spark + dbt
Large-scale deduplication, filtering, and formatting
Quantization
GPTQ / AWQ / GGUF
4-bit and 8-bit quantization with minimal accuracy loss
Inference
vLLM / TGI / Ollama
OpenAI-compatible serving with continuous batching
Observability
W&B / MLflow
Training experiment tracking and artifact versioning
Observability
Prometheus + Grafana
Inference monitoring — latency, throughput, drift
Evaluation
LM Evaluation Harness
Custom domain eval suites built on EleutherAI's framework
RAG layer
Qdrant / pgvector
Vector storage for retrieval-augmented deployment
Investment

Three fixed-fee engagement tiers.

Every project is scoped individually after discovery. These tiers reflect the most common project shapes. Final pricing is confirmed in a fixed-price proposal within 48 hours of your first call.

Starter
Domain Adaptation
$18K
fixed fee, from
Best for
  • Single-domain use case on an existing open-weight base
  • Instruction tuning on 2,000–10,000 curated examples
  • 4-bit quantized deployment on your hardware
  • Custom domain eval harness (200+ test cases)
  • OpenAI-compatible inference API
  • 30-day post-launch support
  • Full IP transfer — weights, code, evals
Enterprise
Full Development
Custom
scoped after discovery
Best for
  • From-scratch pre-training or very large continual runs
  • Multi-stage RLHF with human preference data collection
  • Regulatory-grade documentation and audit trail
  • Managed GPU cluster procurement and setup
  • Multi-model ensemble or mixture-of-experts architecture
  • Ongoing model maintenance retainer available
  • Full IP transfer with source-code escrow option
FAQ

Questions about custom LLM development.

Custom LLM development is the end-to-end process of designing, training, aligning, and deploying a large language model built specifically for your organisation's data, domain, and tasks. Unlike accessing a commercial model through a shared API, a custom-developed LLM is trained on your proprietary corpus, aligned to your quality standards, and deployed within your infrastructure. Your organisation owns all model weights, training code, and evaluation assets upon delivery.
Most projects run 8 to 16 weeks from signed scope to production deployment. A focused domain-adaptation project on a strong open-weight base typically completes in 8–10 weeks. Projects requiring substantial continual pre-training on large proprietary corpora, or multi-stage RLHF pipelines, run 12–16 weeks. The discovery and data-readiness audit in weeks 1–2 has the largest bearing on the final timeline — a clean, well-structured corpus can compress the schedule materially.
Fine-tuning updates a small subset of an existing model's parameters — typically using techniques like LoRA or QLoRA — to shift behaviour toward your domain without altering its core world knowledge. Custom LLM development may encompass continual pre-training on large proprietary corpora, which reshapes the model's knowledge at a deeper level. The right choice depends on your data volume and the distance between your domain and what existing models already know. For most B2B use cases with under 10 GB of domain text, a fine-tuned adapter delivers excellent results faster and cheaper. For specialised domains where existing models have little prior knowledge — regulatory filings, proprietary technical manuals, specialist clinical literature — custom development is the more reliable path to high accuracy.
Custom LLM development at Modulus starts at approximately $18,000 USD for a domain-adapted model with instruction tuning on an open-weight base. Mid-range projects involving continual pre-training and DPO alignment typically fall between $45,000 and $80,000. Large-scale projects with multi-billion-parameter training runs or specialised hardware procurement are scoped individually and typically exceed $100,000. All projects are fixed-fee with full IP transfer — there are no per-token ongoing costs. A fixed-price proposal is produced within 48 hours of your first call, after a free discovery session.
Yes. On-premises deployment is one of the primary reasons organisations choose custom LLM development over commercial APIs. Modulus delivers the final model weights and a complete inference stack — using vLLM, TGI, or Ollama depending on your hardware — that runs entirely within your network perimeter. No data leaves your environment after deployment. We also support deployment in private VPCs on AWS, Azure, and GCP if a cloud-isolated environment is preferred over on-site hardware.
You do, entirely. Modulus transfers full intellectual property ownership of all trained weights, adapter files, training scripts, eval harnesses, data pipeline code, and deployment configurations at project close. There are no licensing fees, no model-as-a-service lock-in, and no ongoing royalties. The only ongoing costs you incur are your own infrastructure and any optional maintenance retainer you choose to add.
It depends on the project type. For instruction tuning on a domain-adapted base, a few thousand high-quality instruction-response pairs are often sufficient for meaningful accuracy gains. For continual pre-training, 1–50 GB of clean domain text is a practical range. For full pre-training from scratch, hundreds of gigabytes to terabytes are needed. Modulus conducts a data readiness audit at project start to assess quality, coverage, and formatting — and provides data curation services if your raw corpus needs cleaning or structuring. You do not need perfectly formatted training data before reaching out.
Modulus works across the major open-weight model families: Meta Llama 3 and Llama 3.1 (8B, 70B, 405B), Mistral and Mixtral, Qwen 2.5, Falcon, and Gemma. Base model selection is driven by your domain, target inference hardware, required context length, and licensing constraints. For regulated industries with strict data sovereignty requirements, we can also work with fully from-scratch architectures, though this is rare and substantially increases cost and timeline.
We build a domain-specific evaluation harness at project start — before any training begins. Generic benchmarks like MMLU or HellaSwag are irrelevant to your production use case. Our eval sets include held-out examples from your own data, adversarial prompts designed to expose edge cases, format-compliance checks, and where applicable, retrieval-grounding accuracy tests. The agreed accuracy and latency thresholds are documented in the project scope. We ship only when the model clears those thresholds on the custom eval harness. If training runs underperform, we iterate at our cost within the fixed-fee scope.
Modulus has built or adapted LLMs for legal (contract analysis, clause extraction, case research), financial services (compliance document parsing, report generation, regulatory filing review), healthcare (clinical note summarisation, ICD coding assistance, clinical trial data extraction), industrial manufacturing (maintenance documentation, fault diagnosis, supplier communication), and enterprise software (internal knowledge bases, code generation for proprietary frameworks). The common thread is proprietary data that general-purpose commercial models have never seen and cannot approximate well.
Decision guide

When custom LLM development is the right choice.

Custom LLM development is not always the right answer. Fine-tuning is faster and cheaper for many use cases. Here is the framework for determining which path fits your situation.

Your corpus is large and domain-distant
When your proprietary corpus exceeds 5 GB of clean domain text and covers terminology, reasoning patterns, or document formats that are sparsely represented in general pre-training data, continual pre-training will produce materially better domain accuracy than instruction tuning alone. Legal filings, clinical literature, industrial maintenance records, and proprietary scientific corpora are the clearest cases. If a general model already understands your domain vocabulary well — for example, broadly available programming languages or common business English — fine-tuning is likely sufficient.
Data sovereignty is non-negotiable
When your legal, regulatory, or contractual obligations require that inference queries never leave your network perimeter — GDPR Article 28 processor restrictions, HIPAA's minimum necessary standard, financial services data residency requirements, or classified information handling obligations — a custom LLM deployed on-premises is the only technically sound solution. Commercial APIs, by definition, process data on vendor infrastructure. Open-weight models fine-tuned with adapters but hosted on commercial inference infrastructure still expose inference data to that provider. Only a custom model deployed inside your own environment is genuinely air-gapped.
Inference volume makes API economics unviable
At modest volumes, commercial API inference is cheap. At enterprise scale — millions of tokens per day — the per-token cost compounds into a significant recurring expense that grows with usage and is subject to vendor price changes. A custom LLM deployed on dedicated hardware converts a variable, vendor-controlled operating expense into a fixed infrastructure cost that you control entirely. The break-even point depends on your specific usage pattern, but for most enterprise deployments processing over five million tokens per day, the infrastructure cost of a custom LLM is lower than the API cost within twelve months.
You need IP ownership as a business asset
When the AI capability itself is a core differentiator — a proprietary model trained on your data and processes that competitors cannot replicate by subscribing to the same API — the model weights are a business asset analogous to proprietary software or a patent. Fine-tuned adapters transfer ownership of the adapter but not the base model. A fully custom-developed model transfers ownership of everything. For companies building AI-native products or enterprise tools where the model's domain accuracy is a selling point, full IP ownership is a strategic requirement, not just a preference.
Latency requirements are below API SLAs
Commercial API latency is governed by shared infrastructure load, network round-trip time, and provider-side rate limits. For applications requiring sub-100ms first-token latency — real-time document processing, high-frequency automated workflows, latency-sensitive customer-facing interfaces — dedicated on-premises inference hardware running a quantized custom model consistently outperforms shared API infrastructure. You control the hardware configuration, the quantization level, and the serving parameters, which means latency can be tuned to your specific SLA rather than accepted as a given.
When fine-tuning is the better path
If your domain vocabulary is well-covered by existing open-weight models, your corpus is under 5 GB of clean text, your use case requires format and style adaptation rather than deep knowledge injection, and your latency and cost requirements are compatible with a hosted or lightly quantized open-weight model, LoRA or QLoRA fine-tuning is the faster and more economical choice. Fine-tuning typically takes 4 to 6 weeks versus 8 to 16 weeks for full custom development, at a proportionally lower cost. Both paths deliver full IP ownership of the trained artifacts. The right path is determined during the discovery audit — we will recommend fine-tuning where it is genuinely the better answer.
Buyer's guide

How to evaluate a custom LLM development partner.

Most organisations engage a custom LLM development vendor once, which means there is limited institutional experience with what separates credible vendors from well-marketed ones. These are the five questions that separate them.

Q1

Can you show us accuracy and latency numbers from a similar project?

Any credible custom LLM development vendor can describe completed projects with specific, named metrics: the domain, the eval methodology, the accuracy achieved, and the inference latency on specified hardware. Vague answers ("we achieved strong results") or references only to internal projects that cannot be discussed are red flags. The numbers may be anonymised, but the specificity of the description should make clear that the vendor has shipped production systems, not just run training experiments.

Q2

What is your evaluation methodology, and when do you define success metrics?

Success metrics must be locked before training begins — not assessed post-hoc on whatever the model happens to produce. Ask the vendor to describe their eval harness construction process. Acceptable answers reference domain-specific golden datasets, held-out test sets with real examples from your corpus, adversarial probing, and automated evaluation frameworks like LM Evaluation Harness or RAGAS. An answer that references generic benchmarks like MMLU or HellaSwag as the primary success measure is a strong signal that the vendor does not build domain-specific evaluation infrastructure.

Q3

Who owns the model weights, training code, and evaluation harness at project close?

The answer should be unambiguous: you own everything. Full intellectual property transfer at project close — covering all trained checkpoints, adapter files, training scripts, data pipeline code, evaluation harnesses, deployment configuration, and documentation — with no licensing fees, no royalties, and no ongoing commercial dependency on the vendor for model operation. If the vendor retains any rights to the trained artifacts, or structures the arrangement as a model-as-a-service that requires ongoing payment, you do not have a custom LLM — you have a hosted fine-tune with exit risk.

Q4

Is the engagement fixed-fee or time-and-materials?

Time-and-materials billing on ML training projects transfers all timeline and compute cost risk to the client. Training runs can take longer than estimated; data quality issues discovered mid-project can require additional preprocessing iterations; alignment stages sometimes require multiple passes. A vendor offering fixed-fee pricing has priced those risks into the engagement and is incentivised to work efficiently. A vendor billing hourly is incentivised in the opposite direction. Fixed-fee engagements with clearly scoped deliverables are the professional standard for production custom LLM development work.

Q5

Can the model be deployed entirely on-premises with no data leaving our network?

If your organisation has data residency requirements — and most regulated industries do — the vendor must be able to describe a complete on-premises deployment stack that runs with no external network calls after initial setup. This means open-weight base models (not API-gated commercial models), a self-hosted inference server (vLLM, TGI, or Ollama), and a deployment architecture where the only persistent external dependency is infrastructure-level (your own cloud VPC or data centre), not model-level. Ask the vendor to describe specifically what network calls are made during inference in their standard deployment — there should be none to any external service.

Related reading

Further context on LLM development.

AI Engineering
Why most enterprise RAG deployments underperform
The retrieval-augmentation mistakes that erode accuracy before the model even generates a token — and how to diagnose them.
Modulus Insights
Strategy
Your AI strategy fails at the data layer
Why the most common point of failure for enterprise AI deployments is not the model — it's the data pipeline feeding it.
Modulus Insights
Strategy
The AI implementation gap: why strategy comes first
The gap between AI strategy and AI implementation is where most enterprise projects stall. A framework for closing it.
Modulus Insights

Build the model your business actually needs.

Free discovery call. Fixed-price proposal within 48 hours. You own the weights, the code, and the eval harness — entirely.

View LLM development services