What is the difference between custom LLM development and fine-tuning?

Fine-tuning updates a small subset of an existing model's parameters — using techniques like LoRA or QLoRA — to shift behaviour toward your domain without altering core world knowledge. Custom LLM development may encompass full pre-training from scratch, continual pre-training on large proprietary corpora, or extensive instruction tuning across the full model. The key difference is depth: fine-tuning is a targeted adaptation; custom development can reshape the model's knowledge, vocabulary, and reasoning patterns at a more fundamental level. Which is appropriate depends on data volume, domain distance from existing models, and sovereignty requirements.

How much does it cost to develop a custom LLM?

Custom LLM development at Modulus starts at approximately $18,000 USD for a domain-adapted model built on an open-weight base with instruction tuning and quantized deployment. Mid-range projects involving continual pre-training and a full RLHF pipeline typically fall between $45,000 and $80,000. Large-scale projects with multi-billion-parameter training runs or specialised hardware procurement are scoped individually and typically exceed $100,000. All projects are fixed-fee with full IP transfer — there are no per-token ongoing costs.

Can a custom LLM be deployed on-premises?

Yes. On-premises deployment is one of the primary reasons organisations choose custom LLM development over commercial APIs. Modulus delivers the final model weights and a complete inference stack — using vLLM, Text Generation Inference (TGI), or Ollama depending on hardware — that runs entirely within your network perimeter. No data leaves your environment after deployment. We also support deployment in private VPCs on AWS, Azure, and GCP if a cloud-isolated environment is preferred over on-site hardware.

Who owns the model weights after development?

You do, entirely. Modulus transfers full intellectual property ownership of all trained weights, adapter files, training scripts, eval harnesses, and deployment configurations at project close. There are no licensing fees, no model-as-a-service lock-in, and no ongoing royalties. The only ongoing costs you incur are your own infrastructure.

What data do I need to start a custom LLM development project?

The data requirements vary by project type. For domain adaptation via instruction tuning, a few thousand high-quality instruction-response pairs are often sufficient to achieve meaningful domain accuracy gains. For continual pre-training on domain corpora, 1–50 GB of clean text is a practical range. For full pre-training from scratch, hundreds of gigabytes to terabytes are needed. Modulus conducts a data readiness audit at project start to assess quality, coverage, and formatting requirements — and can provide data curation services if your raw corpus needs cleaning or structuring.

Custom LLM Development | Build a Large Language Model on Your Data

Q: What is custom LLM development?

Custom LLM development is the process of designing, training, and deploying a large language model specifically for your organisation's data, domain, and tasks — rather than relying on a shared commercial API. The process covers architecture selection (or base-model choice), proprietary data curation and formatting, continual pre-training or full training, instruction tuning, RLHF or DPO alignment, quantization for inference efficiency, and production deployment on your own infrastructure or a private cloud. The result is a model your organisation owns outright.

Q: How long does a custom LLM development project take?

Most custom LLM development projects run 8 to 16 weeks from signed scope to production deployment. A focused domain-adaptation project on a strong open-weight base (such as Llama 3 or Mistral) typically completes in 8–10 weeks. Projects requiring substantial continual pre-training on large proprietary corpora, or multi-stage RLHF pipelines, run 12–16 weeks. The discovery and data-readiness audit in week 1–2 usually has the largest bearing on final timeline.

Definition

What is custom LLM development?

AEO — Direct Answer

Custom LLM development is the end-to-end process of designing, training, aligning, and deploying a large language model built specifically for your organisation's data, vocabulary, and tasks — rather than accessing a shared commercial model through an API.

The model is trained on your proprietary corpus, so it understands your domain's terminology, document formats, and reasoning patterns from the ground up.
All model weights, training scripts, and evaluation harnesses are owned by you after delivery — no per-token licensing, no third-party dependency for inference.
Deployment is on your infrastructure (on-premises or private cloud), keeping all inference traffic inside your network perimeter.
Unlike fine-tuning, which adjusts a small subset of parameters, custom development can reshape the model's knowledge through continual pre-training or full training from scratch.
The result is a model whose accuracy on your specific tasks measurably exceeds what a general-purpose commercial model can achieve on the same prompts.

Market reality

Why custom LLMs matter now.

Three data points that explain the shift from API wrappers to owned, domain-trained models across enterprise teams.

92%

of enterprise AI leaders cite data privacy as the top barrier to adopting commercial LLM APIs in production

Source: McKinsey Global AI Survey, 2025

68%

accuracy improvement observed in domain-specific tasks when models are trained on proprietary corpora vs. prompted general models

Source: Stanford HAI AI Index Report, 2025

$4.1T

estimated value at stake for organisations that embed domain-specific AI into core workflows by 2030, according to Gartner forecasts

Source: Gartner AI Business Value Forecast, 2025

Methodology

How we build custom LLMs.

Six stages, each with a defined deliverable and a clear handoff gate. No work begins on stage N+1 until you approve stage N.

01

Discovery and data readiness audit

Before any training begins, we spend one to two weeks understanding your use case, success criteria, and data estate. We audit your corpus for volume, quality, coverage gaps, and sensitive-data exposure. The output is a written data readiness report that grades your corpus across five dimensions — cleanliness, diversity, relevance, format consistency, and size — and specifies the minimum preprocessing required before training. This audit also produces the first draft of your domain-specific eval harness, so success metrics are locked before a single training step runs. Projects that skip this stage are the ones that fail expensive training runs. We do not skip it.

02

Architecture selection and base model scoping

We determine whether your requirements call for continual pre-training on an open-weight base (the path for most projects), a full from-scratch training run (reserved for very large proprietary corpora or strict sovereignty requirements), or a hybrid approach. Base model selection is driven by six factors: inference hardware budget, required context window, licensing terms (MIT, Apache 2.0, commercial-friendly Llama license), domain distance from existing model knowledge, parameter budget for your latency target, and whether the model needs to support multi-lingual inference. We typically evaluate two to three candidate bases on a sample task before committing. This stage produces an architecture decision record documenting the choice and the trade-offs considered.

03

Data curation, formatting, and pipeline construction

Raw documents rarely enter a training loop directly. We build a reproducible data pipeline that handles ingestion (PDFs, HTML, SQL exports, APIs), deduplication using MinHash LSH at scale, quality filtering with perplexity scoring and rule-based heuristics, format conversion to instruction-following or causal language modeling format, and train/validation/test splitting with stratification by document type and time period. For instruction tuning datasets, we draft, review, and refine the instruction templates in collaboration with your subject-matter experts before finalising. All pipeline code is versioned, documented, and handed over at project close.

04

Training, instruction tuning, and alignment

Training is executed on GPU clusters using distributed frameworks (DeepSpeed ZeRO, FSDP) sized to your parameter count and data volume. We instrument every run with experiment tracking (Weights and Biases or MLflow) so you can observe loss curves, gradient norms, and learning rate schedules in real time. After the base training run, we apply supervised fine-tuning (SFT) on your instruction dataset to align output format and domain behavior. Where your use case requires preference alignment — for example, to enforce citation accuracy, tone consistency, or refusal behaviour — we apply Direct Preference Optimization (DPO) or a full RLHF pipeline with a trained reward model. Each stage is checkpointed and evaluated against your domain eval harness before proceeding.

05

Quantization and inference optimisation

A 70B parameter model in bfloat16 requires roughly 140 GB of GPU memory — impractical for most on-prem inference budgets. We apply GPTQ or AWQ quantization (4-bit or 8-bit, depending on accuracy tolerance) to produce a model that fits the hardware you have, while maintaining accuracy within the agreed tolerance on your eval set. We also apply speculative decoding where throughput is critical, and configure continuous batching in the inference server to maximise tokens-per-second under your expected query load. All quantization decisions are documented and the final accuracy delta versus the full-precision checkpoint is reported explicitly before deployment.

06

Deployment, observability, and handover

We deploy the final model using vLLM, Text Generation Inference (TGI), or Ollama depending on your infrastructure and team's operational preferences. Deployment includes an OpenAI-compatible REST API so your existing application code typically requires zero changes. We instrument the inference layer with Prometheus metrics (latency p50/p95/p99, tokens per second, error rates) fed into a Grafana dashboard, and configure alerting on model drift using output distribution monitoring. The 30-day post-launch support window covers live traffic tuning and any prompt engineering adjustments. At handover, you receive model weights, training code, eval harness, data pipeline, deployment configs, and full documentation — the complete package to retrain or adapt independently.

Decision framework

Custom LLM vs. the alternatives.

Where each approach wins, and what it costs in the dimensions that matter most for enterprise deployments.

Dimension	Commercial API (GPT-4o, Claude)	Open-weight + prompt engineering	Fine-tuned adapter (LoRA)	Custom LLM (Modulus)
Data stays on-premises	✗ Sent to vendor servers	✓ If self-hosted	✓ If self-hosted	✓ Air-gapped by design
Domain accuracy	~ Generalist ceiling	~ Generalist ceiling	~ Surface-level adaptation	✓ Trained on your corpus
Inference cost at scale	✗ Per-token, compounds fast	✓ Fixed infra cost	✓ Fixed infra cost	✓ Fixed infra cost
IP ownership	✗ None	~ Weights only, no training IP	~ Adapter weights only	✓ Full — weights + code + evals
Regulatory compliance	✗ Vendor-dependent	~ Depends on host	~ Depends on host	✓ Full control of data flow
Custom vocabulary / terminology	✗ Prompt injection only	✗ Prompt injection only	~ Partial	✓ Baked into weights
Vendor dependency risk	✗ High — pricing, API changes	✓ None	✓ Minimal	✓ Zero after delivery

Case study

Numbers from a live deployment.

Legal document intelligence model — 8B parameter, on-prem

A regional legal services group needed to extract structured data from contracts, flag non-standard clauses, and generate compliance summaries across a proprietary 14-year document archive. Commercial APIs introduced unacceptable data residency risk. An existing fine-tuned adapter produced acceptable accuracy on simple extraction tasks but failed on complex multi-clause reasoning.

We ran continual pre-training on 22 GB of curated legal text, applied SFT on 4,200 hand-verified instruction examples, and deployed via vLLM on a 2× A100 on-premises server. The domain eval harness ran across 800 held-out documents spanning six contract types.

91%

clause extraction accuracy on held-out eval (vs. 61% for prompted GPT-4o on the same set)

140ms

median first-token latency on the production vLLM stack, well within the 300ms SLA requirement

$0

per-token cost post-deployment — all inference runs on owned hardware with no external API dependency

11 wks

discovery to production handover, within the 12-week contract timeline

Technology

The stack behind every custom LLM.

We select specific tools based on your project's requirements — not a fixed template. These are the frameworks we work with across every layer of the pipeline.

Base models

Llama 3 / 3.1

8B, 70B, 405B — Meta's flagship open-weight family

Base models

Mistral / Mixtral

Efficient dense and MoE architectures with strong reasoning

Base models

Qwen 2.5

Strong multilingual and code capabilities

Training framework

DeepSpeed / FSDP

Distributed training with ZeRO memory optimisation

Fine-tuning

TRL + PEFT

SFT, DPO, RLHF, and LoRA adapter training

Data pipeline

Apache Spark + dbt

Large-scale deduplication, filtering, and formatting

Quantization

GPTQ / AWQ / GGUF

4-bit and 8-bit quantization with minimal accuracy loss

Inference

vLLM / TGI / Ollama

OpenAI-compatible serving with continuous batching

Observability

W&B / MLflow

Training experiment tracking and artifact versioning

Observability

Prometheus + Grafana

Inference monitoring — latency, throughput, drift

Evaluation

LM Evaluation Harness

Custom domain eval suites built on EleutherAI's framework

RAG layer

Qdrant / pgvector

Vector storage for retrieval-augmented deployment

Investment

Three fixed-fee engagement tiers.

Every project is scoped individually after discovery. These tiers reflect the most common project shapes. Final pricing is confirmed in a fixed-price proposal within 48 hours of your first call.

Starter

Domain Adaptation

$18K

fixed fee, from

Best for

Single-domain use case on an existing open-weight base
Instruction tuning on 2,000–10,000 curated examples
4-bit quantized deployment on your hardware
Custom domain eval harness (200+ test cases)
OpenAI-compatible inference API
30-day post-launch support
Full IP transfer — weights, code, evals

Most common

Continual Pre-training

$48K

fixed fee, from

Best for

Large proprietary corpus (5–50 GB clean text)
Continual pre-training + instruction tuning + DPO alignment
Multi-task evaluation across 3–5 use cases
vLLM deployment with observability stack
RLHF reward model (optional add-on)
RAG integration with your vector store
60-day post-launch support and one re-train cycle

Enterprise

Full Development

Custom

scoped after discovery

Best for

From-scratch pre-training or very large continual runs
Multi-stage RLHF with human preference data collection
Regulatory-grade documentation and audit trail
Managed GPU cluster procurement and setup
Multi-model ensemble or mixture-of-experts architecture
Ongoing model maintenance retainer available
Full IP transfer with source-code escrow option

FAQ

Questions about custom LLM development.

Custom LLM development is the end-to-end process of designing, training, aligning, and deploying a large language model built specifically for your organisation's data, domain, and tasks. Unlike accessing a commercial model through a shared API, a custom-developed LLM is trained on your proprietary corpus, aligned to your quality standards, and deployed within your infrastructure. Your organisation owns all model weights, training code, and evaluation assets upon delivery.

Most projects run 8 to 16 weeks from signed scope to production deployment. A focused domain-adaptation project on a strong open-weight base typically completes in 8–10 weeks. Projects requiring substantial continual pre-training on large proprietary corpora, or multi-stage RLHF pipelines, run 12–16 weeks. The discovery and data-readiness audit in weeks 1–2 has the largest bearing on the final timeline — a clean, well-structured corpus can compress the schedule materially.

Fine-tuning updates a small subset of an existing model's parameters — typically using techniques like LoRA or QLoRA — to shift behaviour toward your domain without altering its core world knowledge. Custom LLM development may encompass continual pre-training on large proprietary corpora, which reshapes the model's knowledge at a deeper level. The right choice depends on your data volume and the distance between your domain and what existing models already know. For most B2B use cases with under 10 GB of domain text, a fine-tuned adapter delivers excellent results faster and cheaper. For specialised domains where existing models have little prior knowledge — regulatory filings, proprietary technical manuals, specialist clinical literature — custom development is the more reliable path to high accuracy.

Custom LLM development at Modulus starts at approximately $18,000 USD for a domain-adapted model with instruction tuning on an open-weight base. Mid-range projects involving continual pre-training and DPO alignment typically fall between $45,000 and $80,000. Large-scale projects with multi-billion-parameter training runs or specialised hardware procurement are scoped individually and typically exceed $100,000. All projects are fixed-fee with full IP transfer — there are no per-token ongoing costs. A fixed-price proposal is produced within 48 hours of your first call, after a free discovery session.

Yes. On-premises deployment is one of the primary reasons organisations choose custom LLM development over commercial APIs. Modulus delivers the final model weights and a complete inference stack — using vLLM, TGI, or Ollama depending on your hardware — that runs entirely within your network perimeter. No data leaves your environment after deployment. We also support deployment in private VPCs on AWS, Azure, and GCP if a cloud-isolated environment is preferred over on-site hardware.

You do, entirely. Modulus transfers full intellectual property ownership of all trained weights, adapter files, training scripts, eval harnesses, data pipeline code, and deployment configurations at project close. There are no licensing fees, no model-as-a-service lock-in, and no ongoing royalties. The only ongoing costs you incur are your own infrastructure and any optional maintenance retainer you choose to add.

It depends on the project type. For instruction tuning on a domain-adapted base, a few thousand high-quality instruction-response pairs are often sufficient for meaningful accuracy gains. For continual pre-training, 1–50 GB of clean domain text is a practical range. For full pre-training from scratch, hundreds of gigabytes to terabytes are needed. Modulus conducts a data readiness audit at project start to assess quality, coverage, and formatting — and provides data curation services if your raw corpus needs cleaning or structuring. You do not need perfectly formatted training data before reaching out.

Modulus works across the major open-weight model families: Meta Llama 3 and Llama 3.1 (8B, 70B, 405B), Mistral and Mixtral, Qwen 2.5, Falcon, and Gemma. Base model selection is driven by your domain, target inference hardware, required context length, and licensing constraints. For regulated industries with strict data sovereignty requirements, we can also work with fully from-scratch architectures, though this is rare and substantially increases cost and timeline.

We build a domain-specific evaluation harness at project start — before any training begins. Generic benchmarks like MMLU or HellaSwag are irrelevant to your production use case. Our eval sets include held-out examples from your own data, adversarial prompts designed to expose edge cases, format-compliance checks, and where applicable, retrieval-grounding accuracy tests. The agreed accuracy and latency thresholds are documented in the project scope. We ship only when the model clears those thresholds on the custom eval harness. If training runs underperform, we iterate at our cost within the fixed-fee scope.

Modulus has built or adapted LLMs for legal (contract analysis, clause extraction, case research), financial services (compliance document parsing, report generation, regulatory filing review), healthcare (clinical note summarisation, ICD coding assistance, clinical trial data extraction), industrial manufacturing (maintenance documentation, fault diagnosis, supplier communication), and enterprise software (internal knowledge bases, code generation for proprietary frameworks). The common thread is proprietary data that general-purpose commercial models have never seen and cannot approximate well.

Decision guide

When custom LLM development is the right choice.

Custom LLM development is not always the right answer. Fine-tuning is faster and cheaper for many use cases. Here is the framework for determining which path fits your situation.

Your corpus is large and domain-distant

When your proprietary corpus exceeds 5 GB of clean domain text and covers terminology, reasoning patterns, or document formats that are sparsely represented in general pre-training data, continual pre-training will produce materially better domain accuracy than instruction tuning alone. Legal filings, clinical literature, industrial maintenance records, and proprietary scientific corpora are the clearest cases. If a general model already understands your domain vocabulary well — for example, broadly available programming languages or common business English — fine-tuning is likely sufficient.

Data sovereignty is non-negotiable

When your legal, regulatory, or contractual obligations require that inference queries never leave your network perimeter — GDPR Article 28 processor restrictions, HIPAA's minimum necessary standard, financial services data residency requirements, or classified information handling obligations — a custom LLM deployed on-premises is the only technically sound solution. Commercial APIs, by definition, process data on vendor infrastructure. Open-weight models fine-tuned with adapters but hosted on commercial inference infrastructure still expose inference data to that provider. Only a custom model deployed inside your own environment is genuinely air-gapped.

Inference volume makes API economics unviable

At modest volumes, commercial API inference is cheap. At enterprise scale — millions of tokens per day — the per-token cost compounds into a significant recurring expense that grows with usage and is subject to vendor price changes. A custom LLM deployed on dedicated hardware converts a variable, vendor-controlled operating expense into a fixed infrastructure cost that you control entirely. The break-even point depends on your specific usage pattern, but for most enterprise deployments processing over five million tokens per day, the infrastructure cost of a custom LLM is lower than the API cost within twelve months.

You need IP ownership as a business asset

When the AI capability itself is a core differentiator — a proprietary model trained on your data and processes that competitors cannot replicate by subscribing to the same API — the model weights are a business asset analogous to proprietary software or a patent. Fine-tuned adapters transfer ownership of the adapter but not the base model. A fully custom-developed model transfers ownership of everything. For companies building AI-native products or enterprise tools where the model's domain accuracy is a selling point, full IP ownership is a strategic requirement, not just a preference.

Latency requirements are below API SLAs

Commercial API latency is governed by shared infrastructure load, network round-trip time, and provider-side rate limits. For applications requiring sub-100ms first-token latency — real-time document processing, high-frequency automated workflows, latency-sensitive customer-facing interfaces — dedicated on-premises inference hardware running a quantized custom model consistently outperforms shared API infrastructure. You control the hardware configuration, the quantization level, and the serving parameters, which means latency can be tuned to your specific SLA rather than accepted as a given.

When fine-tuning is the better path

If your domain vocabulary is well-covered by existing open-weight models, your corpus is under 5 GB of clean text, your use case requires format and style adaptation rather than deep knowledge injection, and your latency and cost requirements are compatible with a hosted or lightly quantized open-weight model, LoRA or QLoRA fine-tuning is the faster and more economical choice. Fine-tuning typically takes 4 to 6 weeks versus 8 to 16 weeks for full custom development, at a proportionally lower cost. Both paths deliver full IP ownership of the trained artifacts. The right path is determined during the discovery audit — we will recommend fine-tuning where it is genuinely the better answer.

Buyer's guide

How to evaluate a custom LLM development partner.

Most organisations engage a custom LLM development vendor once, which means there is limited institutional experience with what separates credible vendors from well-marketed ones. These are the five questions that separate them.

Q1

Can you show us accuracy and latency numbers from a similar project?

Any credible custom LLM development vendor can describe completed projects with specific, named metrics: the domain, the eval methodology, the accuracy achieved, and the inference latency on specified hardware. Vague answers ("we achieved strong results") or references only to internal projects that cannot be discussed are red flags. The numbers may be anonymised, but the specificity of the description should make clear that the vendor has shipped production systems, not just run training experiments.

Q2

What is your evaluation methodology, and when do you define success metrics?

Success metrics must be locked before training begins — not assessed post-hoc on whatever the model happens to produce. Ask the vendor to describe their eval harness construction process. Acceptable answers reference domain-specific golden datasets, held-out test sets with real examples from your corpus, adversarial probing, and automated evaluation frameworks like LM Evaluation Harness or RAGAS. An answer that references generic benchmarks like MMLU or HellaSwag as the primary success measure is a strong signal that the vendor does not build domain-specific evaluation infrastructure.

Q3

Who owns the model weights, training code, and evaluation harness at project close?

The answer should be unambiguous: you own everything. Full intellectual property transfer at project close — covering all trained checkpoints, adapter files, training scripts, data pipeline code, evaluation harnesses, deployment configuration, and documentation — with no licensing fees, no royalties, and no ongoing commercial dependency on the vendor for model operation. If the vendor retains any rights to the trained artifacts, or structures the arrangement as a model-as-a-service that requires ongoing payment, you do not have a custom LLM — you have a hosted fine-tune with exit risk.

Q4

Is the engagement fixed-fee or time-and-materials?

Time-and-materials billing on ML training projects transfers all timeline and compute cost risk to the client. Training runs can take longer than estimated; data quality issues discovered mid-project can require additional preprocessing iterations; alignment stages sometimes require multiple passes. A vendor offering fixed-fee pricing has priced those risks into the engagement and is incentivised to work efficiently. A vendor billing hourly is incentivised in the opposite direction. Fixed-fee engagements with clearly scoped deliverables are the professional standard for production custom LLM development work.

Q5

Can the model be deployed entirely on-premises with no data leaving our network?

If your organisation has data residency requirements — and most regulated industries do — the vendor must be able to describe a complete on-premises deployment stack that runs with no external network calls after initial setup. This means open-weight base models (not API-gated commercial models), a self-hosted inference server (vLLM, TGI, or Ollama), and a deployment architecture where the only persistent external dependency is infrastructure-level (your own cloud VPC or data centre), not model-level. Ask the vendor to describe specifically what network calls are made during inference in their standard deployment — there should be none to any external service.