Custom LLM development from first principles — architecture, data curation, training, RLHF alignment, quantization, and on-prem deployment. Fixed fee. You own the weights, the training code, and the evaluation suite.
Custom LLM development is the end-to-end process of designing, training, aligning, and deploying a large language model built specifically for your organisation's data, vocabulary, and tasks — rather than accessing a shared commercial model through an API.
Three data points that explain the shift from API wrappers to owned, domain-trained models across enterprise teams.
Six stages, each with a defined deliverable and a clear handoff gate. No work begins on stage N+1 until you approve stage N.
Before any training begins, we spend one to two weeks understanding your use case, success criteria, and data estate. We audit your corpus for volume, quality, coverage gaps, and sensitive-data exposure. The output is a written data readiness report that grades your corpus across five dimensions — cleanliness, diversity, relevance, format consistency, and size — and specifies the minimum preprocessing required before training. This audit also produces the first draft of your domain-specific eval harness, so success metrics are locked before a single training step runs. Projects that skip this stage are the ones that fail expensive training runs. We do not skip it.
We determine whether your requirements call for continual pre-training on an open-weight base (the path for most projects), a full from-scratch training run (reserved for very large proprietary corpora or strict sovereignty requirements), or a hybrid approach. Base model selection is driven by six factors: inference hardware budget, required context window, licensing terms (MIT, Apache 2.0, commercial-friendly Llama license), domain distance from existing model knowledge, parameter budget for your latency target, and whether the model needs to support multi-lingual inference. We typically evaluate two to three candidate bases on a sample task before committing. This stage produces an architecture decision record documenting the choice and the trade-offs considered.
Raw documents rarely enter a training loop directly. We build a reproducible data pipeline that handles ingestion (PDFs, HTML, SQL exports, APIs), deduplication using MinHash LSH at scale, quality filtering with perplexity scoring and rule-based heuristics, format conversion to instruction-following or causal language modeling format, and train/validation/test splitting with stratification by document type and time period. For instruction tuning datasets, we draft, review, and refine the instruction templates in collaboration with your subject-matter experts before finalising. All pipeline code is versioned, documented, and handed over at project close.
Training is executed on GPU clusters using distributed frameworks (DeepSpeed ZeRO, FSDP) sized to your parameter count and data volume. We instrument every run with experiment tracking (Weights and Biases or MLflow) so you can observe loss curves, gradient norms, and learning rate schedules in real time. After the base training run, we apply supervised fine-tuning (SFT) on your instruction dataset to align output format and domain behavior. Where your use case requires preference alignment — for example, to enforce citation accuracy, tone consistency, or refusal behaviour — we apply Direct Preference Optimization (DPO) or a full RLHF pipeline with a trained reward model. Each stage is checkpointed and evaluated against your domain eval harness before proceeding.
A 70B parameter model in bfloat16 requires roughly 140 GB of GPU memory — impractical for most on-prem inference budgets. We apply GPTQ or AWQ quantization (4-bit or 8-bit, depending on accuracy tolerance) to produce a model that fits the hardware you have, while maintaining accuracy within the agreed tolerance on your eval set. We also apply speculative decoding where throughput is critical, and configure continuous batching in the inference server to maximise tokens-per-second under your expected query load. All quantization decisions are documented and the final accuracy delta versus the full-precision checkpoint is reported explicitly before deployment.
We deploy the final model using vLLM, Text Generation Inference (TGI), or Ollama depending on your infrastructure and team's operational preferences. Deployment includes an OpenAI-compatible REST API so your existing application code typically requires zero changes. We instrument the inference layer with Prometheus metrics (latency p50/p95/p99, tokens per second, error rates) fed into a Grafana dashboard, and configure alerting on model drift using output distribution monitoring. The 30-day post-launch support window covers live traffic tuning and any prompt engineering adjustments. At handover, you receive model weights, training code, eval harness, data pipeline, deployment configs, and full documentation — the complete package to retrain or adapt independently.
Where each approach wins, and what it costs in the dimensions that matter most for enterprise deployments.
| Dimension | Commercial API (GPT-4o, Claude) | Open-weight + prompt engineering | Fine-tuned adapter (LoRA) | Custom LLM (Modulus) |
|---|---|---|---|---|
| Data stays on-premises | ✗ Sent to vendor servers | ✓ If self-hosted | ✓ If self-hosted | ✓ Air-gapped by design |
| Domain accuracy | ~ Generalist ceiling | ~ Generalist ceiling | ~ Surface-level adaptation | ✓ Trained on your corpus |
| Inference cost at scale | ✗ Per-token, compounds fast | ✓ Fixed infra cost | ✓ Fixed infra cost | ✓ Fixed infra cost |
| IP ownership | ✗ None | ~ Weights only, no training IP | ~ Adapter weights only | ✓ Full — weights + code + evals |
| Regulatory compliance | ✗ Vendor-dependent | ~ Depends on host | ~ Depends on host | ✓ Full control of data flow |
| Custom vocabulary / terminology | ✗ Prompt injection only | ✗ Prompt injection only | ~ Partial | ✓ Baked into weights |
| Vendor dependency risk | ✗ High — pricing, API changes | ✓ None | ✓ Minimal | ✓ Zero after delivery |
A regional legal services group needed to extract structured data from contracts, flag non-standard clauses, and generate compliance summaries across a proprietary 14-year document archive. Commercial APIs introduced unacceptable data residency risk. An existing fine-tuned adapter produced acceptable accuracy on simple extraction tasks but failed on complex multi-clause reasoning.
We ran continual pre-training on 22 GB of curated legal text, applied SFT on 4,200 hand-verified instruction examples, and deployed via vLLM on a 2× A100 on-premises server. The domain eval harness ran across 800 held-out documents spanning six contract types.
We select specific tools based on your project's requirements — not a fixed template. These are the frameworks we work with across every layer of the pipeline.
Every project is scoped individually after discovery. These tiers reflect the most common project shapes. Final pricing is confirmed in a fixed-price proposal within 48 hours of your first call.
Custom LLM development is not always the right answer. Fine-tuning is faster and cheaper for many use cases. Here is the framework for determining which path fits your situation.
Most organisations engage a custom LLM development vendor once, which means there is limited institutional experience with what separates credible vendors from well-marketed ones. These are the five questions that separate them.
Any credible custom LLM development vendor can describe completed projects with specific, named metrics: the domain, the eval methodology, the accuracy achieved, and the inference latency on specified hardware. Vague answers ("we achieved strong results") or references only to internal projects that cannot be discussed are red flags. The numbers may be anonymised, but the specificity of the description should make clear that the vendor has shipped production systems, not just run training experiments.
Success metrics must be locked before training begins — not assessed post-hoc on whatever the model happens to produce. Ask the vendor to describe their eval harness construction process. Acceptable answers reference domain-specific golden datasets, held-out test sets with real examples from your corpus, adversarial probing, and automated evaluation frameworks like LM Evaluation Harness or RAGAS. An answer that references generic benchmarks like MMLU or HellaSwag as the primary success measure is a strong signal that the vendor does not build domain-specific evaluation infrastructure.
The answer should be unambiguous: you own everything. Full intellectual property transfer at project close — covering all trained checkpoints, adapter files, training scripts, data pipeline code, evaluation harnesses, deployment configuration, and documentation — with no licensing fees, no royalties, and no ongoing commercial dependency on the vendor for model operation. If the vendor retains any rights to the trained artifacts, or structures the arrangement as a model-as-a-service that requires ongoing payment, you do not have a custom LLM — you have a hosted fine-tune with exit risk.
Time-and-materials billing on ML training projects transfers all timeline and compute cost risk to the client. Training runs can take longer than estimated; data quality issues discovered mid-project can require additional preprocessing iterations; alignment stages sometimes require multiple passes. A vendor offering fixed-fee pricing has priced those risks into the engagement and is incentivised to work efficiently. A vendor billing hourly is incentivised in the opposite direction. Fixed-fee engagements with clearly scoped deliverables are the professional standard for production custom LLM development work.
If your organisation has data residency requirements — and most regulated industries do — the vendor must be able to describe a complete on-premises deployment stack that runs with no external network calls after initial setup. This means open-weight base models (not API-gated commercial models), a self-hosted inference server (vLLM, TGI, or Ollama), and a deployment architecture where the only persistent external dependency is infrastructure-level (your own cloud VPC or data centre), not model-level. Ask the vendor to describe specifically what network calls are made during inference in their standard deployment — there should be none to any external service.
Free discovery call. Fixed-price proposal within 48 hours. You own the weights, the code, and the eval harness — entirely.
Tell us your use case and data situation. Fixed-price proposal within 48 hours.