Self-hosted vs managed LLM: TCO breakdown

Modulus May 15, 2026 6 min read

The self-hosted vs managed LLM debate is almost always framed as a cost question. It should be framed as a total cost of ownership question, because the headline difference between paying $0.01 per 1k tokens on a managed API and running your own GPU cluster is rarely the whole story. Staffing, compliance, latency, vendor lock-in, and operational overhead shift the math in ways that surprise both sides of the argument.

This breakdown covers every cost category that matters when making the self-hosted vs managed decision, and the conditions under which each choice wins on pure economics.

TL;DR
  • Managed APIs (OpenAI, Anthropic, Google) win on cost up to roughly 2–5 million tokens per day, depending on model size.
  • Self-hosted open-source models break even or win above that threshold — but require 1–2 dedicated ML infrastructure engineers.
  • Compliance requirements (HIPAA, GDPR data residency, air-gap) often override cost math entirely — certain industries have no choice but to self-host.
  • Hidden costs in self-hosting: GPU procurement delays, driver and CUDA versioning, model update cycles, and security patching.
  • Hybrid routing — cheap self-hosted model for high-volume simple tasks, managed API for complex tasks — often delivers the best economics.

What "managed" actually means in 2026

Managed LLM means you are calling a vendor's API: OpenAI, Anthropic, Google Vertex AI, or one of the inference-as-a-service providers like Together AI, Fireworks AI, or Groq. You pay per token, you get SLAs on uptime and latency, you do not provision or maintain infrastructure, and you benefit from continuous model improvements without retraining. Your data passes through the vendor's infrastructure unless you have a private deployment agreement.

Self-hosted means you run the model yourself — either on cloud VMs with GPU instances (AWS, GCP, Azure), on a managed Kubernetes cluster, or on dedicated on-premise hardware. You use open-source models: Llama 3.x, Mistral, Qwen, Falcon, or a commercially licensed variant. You own the full stack: model weights, inference runtime, scaling, monitoring, and security.

Direct cost comparison at different scale tiers

Daily token volume | Managed API cost/month (at ~$30 per 1M tokens) | Self-hosted cost/month               | Winner
-------------------|-----------------------------------------------|--------------------------------------|-----------------------------
100k tokens/day    | ~$90                                          | ~$800 (cheapest GPU instance + ops)  | Managed API
1M tokens/day      | ~$900                                         | ~$1,200 (single A10G, amortized)     | Near parity
5M tokens/day      | ~$4,500                                       | ~$2,500 (A100 instance, shared)      | Self-hosted
50M tokens/day     | ~$45,000                                      | ~$12,000 (dedicated cluster)         | Self-hosted by a wide margin

Note: these figures use a commodity open-source model (Llama 3.3 70B) for self-hosted, not a custom-trained model. Custom model costs are additive. A 70B model does not fit in the 24 GB of a single A10G even at 4-bit quantization, so the single-A10G figure is realistic only for a smaller model. Managed API rates vary significantly by model tier: frontier models (GPT-4o, Claude 3 Opus) are 5–10x more expensive than mid-tier models, which changes the crossover point substantially.
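The crossover in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and a flat rate of $30 per 1M tokens, which is the rate the managed column implies ($90/month at 100k tokens/day); the function names are illustrative, and real bills split input and output tokens at different prices.

```python
DAYS_PER_MONTH = 30  # assumption: the table's monthly figures use a 30-day month


def managed_monthly_cost(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token bill: grows linearly with volume."""
    tokens_per_month = tokens_per_day * DAYS_PER_MONTH
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_day(self_hosted_monthly_usd: float,
                              usd_per_million_tokens: float) -> float:
    """Daily volume at which a fixed self-hosted bill equals the managed bill."""
    tokens_per_month = self_hosted_monthly_usd / usd_per_million_tokens * 1_000_000
    return tokens_per_month / DAYS_PER_MONTH


if __name__ == "__main__":
    rate = 30.0  # USD per 1M tokens, implied by ~$90/month at 100k tokens/day
    tiers = [(100_000, 800), (1_000_000, 1_200), (5_000_000, 2_500), (50_000_000, 12_000)]
    for tokens_per_day, self_hosted_usd in tiers:
        managed_usd = managed_monthly_cost(tokens_per_day, rate)
        crossover = break_even_tokens_per_day(self_hosted_usd, rate)
        print(f"{tokens_per_day:>11,} tokens/day: managed ${managed_usd:,.0f}, "
              f"self-hosted ${self_hosted_usd:,}, "
              f"break-even at {crossover:,.0f} tokens/day")
```

Under these assumptions a fixed $1,200/month self-hosted bill breaks even at roughly 1.33 million tokens per day, which is why the 1M tokens/day tier reads as near parity. A cheaper managed model moves the crossover up proportionally.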

The costs managed API pricing hides

The per-token price on a managed API is real, but it excludes several costs that accrue regardless of which provider you choose:

Integration engineering: Every managed API requires prompt engineering, retry logic, rate limiting, error handling, context management, and output parsing. This is non-trivial engineering work whether you are self-hosting or not.
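As one example of that integration work, here is a minimal sketch of retry logic with exponential backoff and jitter. `TransientAPIError` and `call_with_retries` are hypothetical names, not part of any vendor SDK; in a real client you would catch the SDK's own rate-limit and server-error exceptions instead.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or 5xx error."""


def call_with_retries(call: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Backoff ceiling doubles each attempt; jitter spreads out
            # clients that were throttled at the same moment.
            ceiling = base_delay_s * 2 ** (attempt - 1)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the function can be tested without real delays. The same wrapper applies unchanged to a self-hosted inference endpoint, which is the point of the paragraph above.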

Vendor lock-in remediation: If you build deeply integrated workflows around GPT-4o and OpenAI changes pricing, deprecates the model, or restricts your use case, migration is expensive. Architecting for portability from day one — abstraction layers, prompt format standardization — has a cost that is not reflected in API pricing.
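A thin abstraction layer is one way to pay that portability cost up front. The sketch below is illustrative only: `ChatBackend`, `Completion`, and `EchoBackend` are hypothetical names, and a real adapter would wrap a vendor SDK or a self-hosted inference server behind the same method.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    provider: str


class ChatBackend(Protocol):
    """Provider-neutral interface that application code depends on."""

    def complete(self, system: str, user: str) -> Completion: ...


class EchoBackend:
    """Stand-in backend for the sketch; it echoes the prompt instead of calling a model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, system: str, user: str) -> Completion:
        return Completion(text=f"[{self.name}] {user}", provider=self.name)


def summarize(backend: ChatBackend, document: str) -> Completion:
    # Application code sees only the interface, so swapping providers
    # means writing one new adapter rather than touching every call site.
    return backend.complete(system="Summarize the document in one sentence.",
                            user=document)
```

Keeping prompts in the neutral `system`/`user` shape, rather than in one vendor's message format, is the prompt format standardization the paragraph refers to.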

Data egress and privacy: Every token you send to a managed API is a token of potentially sensitive data passing through a third-party system. SOC 2, HIPAA, and GDPR compliance requirements for this data flow have real cost: legal review, BAAs with providers, audit trails, and in some cases prohibitive constraints that rule out managed APIs entirely.

The costs self-hosting hides

Self-hosting advocates consistently underestimate the operational overhead. The full cost includes:

GPU procurement and provisioning: Cloud GPU instances are available but often subject to capacity constraints. Dedicated on-premise hardware requires capital expenditure, procurement lead times (6–16 weeks for high-end hardware in normal market conditions), power and cooling infrastructure, and physical security.

ML infrastructure engineering: Running inference at production scale requires expertise in CUDA optimization, quantization, batching strategies, KV cache management, and distributed inference frameworks (vLLM, TensorRT-LLM, TGI). This is a specialized skill set. A senior ML infrastructure engineer costs $180k–$280k per year in US markets. If you do not have one, you hire a contractor or accept degraded performance.
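To see how much staffing moves the comparison, the sketch below adds salary to the infrastructure bill. It assumes the engineer's cost is not already included in the self-hosted infrastructure figure, which this article does not state either way; the function name and the `share_of_time` parameter are illustrative.

```python
def loaded_self_hosted_monthly(infra_usd_per_month: float,
                               engineers: float,
                               salary_usd_per_year: float,
                               share_of_time: float = 1.0) -> float:
    """Infrastructure plus the share of engineering salary spent on the deployment."""
    if not 0.0 <= share_of_time <= 1.0:
        raise ValueError("share_of_time must be between 0 and 1")
    staffing = engineers * salary_usd_per_year / 12 * share_of_time
    return infra_usd_per_month + staffing


if __name__ == "__main__":
    # 50M tokens/day tier: ~$12,000/month infrastructure vs ~$45,000/month managed.
    full_time = loaded_self_hosted_monthly(12_000, engineers=1, salary_usd_per_year=180_000)
    quarter_time = loaded_self_hosted_monthly(12_000, 1, 180_000, share_of_time=0.25)
    print(f"one full-time engineer:    ${full_time:,.0f}/month")
    print(f"a quarter of one engineer: ${quarter_time:,.0f}/month")
```

At the low end of the salary range, one full-time engineer adds $15,000 per month. Under that assumption the 50M tokens/day tier still favors self-hosting ($27,000 vs $45,000), while the 5M tokens/day tier does not unless the engineer's time is shared with other work.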

Model maintenance: Open-source models do not update themselves. When a newer model generation ships and outperforms your self-hosted Llama 3.3 deployment, you need to evaluate, fine-tune, test, and migrate. Each model update cycle is a meaningful engineering effort. Managed API providers absorb this cost for you.

Security and patching: Inference servers have CVEs. Container images need updating. GPU drivers need patching. These are not catastrophic tasks, but they require dedicated attention. A self-hosted LLM deployment that is not actively maintained is a security liability.

When compliance forces the decision

For many regulated industries, the cost math is secondary to regulatory requirements:

  • Healthcare (HIPAA): Patient data may not pass through vendor APIs without a signed Business Associate Agreement (BAA). Most managed LLM providers offer BAAs at enterprise tier pricing, but not all use cases qualify. Self-hosting on compliant infrastructure is often required for clinical workflows.
  • Finance: PII and transaction data have strict data residency requirements in many jurisdictions. Some financial regulators require the ability to audit the model itself — which is impossible with a closed managed API.
  • Government / defense: Air-gapped deployments are non-negotiable. Managed APIs are categorically excluded. Self-hosting on certified infrastructure is the only path.
  • GDPR data residency: EU data residency requirements constrain which regions you can use, even within managed API providers. Self-hosting in a compliant EU data center eliminates this constraint.

The hybrid routing model: best of both

The most cost-effective architecture for organizations with diverse LLM workloads is not a binary choice. Hybrid routing — using a fast, cheap self-hosted model for high-volume routine tasks and a managed API for complex, low-volume tasks — often delivers 40–60% total cost reduction versus using a managed API for everything.
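A rough way to sanity-check the savings claim is to model the hybrid bill as a fixed self-hosted cost plus the managed cost of whatever traffic is not offloaded. This is a simplification with hypothetical function names: it assumes the remaining managed traffic is billed at the same per-token rate as before, whereas routing only hard tasks to a frontier model would raise that rate and shrink the saving.

```python
def hybrid_monthly_cost(managed_only_usd: float,
                        self_hosted_fixed_usd: float,
                        share_self_hosted: float) -> float:
    """Fixed self-hosted bill plus the managed bill for the traffic that stays on the API."""
    if not 0.0 <= share_self_hosted <= 1.0:
        raise ValueError("share_self_hosted must be between 0 and 1")
    return self_hosted_fixed_usd + (1.0 - share_self_hosted) * managed_only_usd


def saving_vs_managed(managed_only_usd: float, hybrid_usd: float) -> float:
    """Fractional reduction relative to sending everything to the managed API."""
    return 1.0 - hybrid_usd / managed_only_usd


if __name__ == "__main__":
    managed_only = 4_500.0  # illustrative: 5M tokens/day at ~$30 per 1M tokens
    hybrid = hybrid_monthly_cost(managed_only, self_hosted_fixed_usd=1_200.0,
                                 share_self_hosted=0.8)
    print(f"hybrid ${hybrid:,.0f}/month, "
          f"saving {saving_vs_managed(managed_only, hybrid):.0%}")
```

With 80% of a $4,500/month workload moved to a $1,200/month instance, the saving comes out at about 53%, inside the 40–60% range quoted above. The result is sensitive to the offloaded share: at 50% the same inputs save only about 23%.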

A practical hybrid stack: a small open model such as Llama 3.1 8B, self-hosted on a single A10G instance, handles classification, extraction, and simple generation tasks at a fraction of API cost. Claude 3.7 Sonnet via managed API handles complex reasoning, document analysis, and tasks requiring frontier capability. A routing layer sends each query to the appropriate model based on complexity signals.
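The routing layer can start out very simple. The sketch below uses two hypothetical complexity signals, the declared task type and the prompt length; the task names, the threshold, and the backend labels are assumptions for illustration, and production routers often add a classifier or a confidence-based fallback.

```python
from dataclasses import dataclass

# Assumption: callers tag each query with a task type.
SIMPLE_TASKS = frozenset({"classification", "extraction", "simple_generation"})


@dataclass(frozen=True)
class Query:
    task: str
    prompt: str


def route(query: Query, max_simple_chars: int = 4_000) -> str:
    """Return which backend should serve the query."""
    # Routine task types with short prompts go to the cheap self-hosted model.
    if query.task in SIMPLE_TASKS and len(query.prompt) <= max_simple_chars:
        return "self_hosted"
    # Everything else (reasoning, long documents, unknown tasks) goes to the managed API.
    return "managed"
```

Defaulting unknown task types to the managed API trades some cost for quality: a misrouted hard query on the small model is usually more expensive than a misrouted easy query on the large one.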

Decision checklist: self-hosted vs managed

  • Do you process more than 2 million tokens per day? Self-hosting economics start to become favorable.
  • Do you have a dedicated ML infrastructure engineer on staff or budget to hire one? If not, managed API has lower operational risk.
  • Are you in a regulated industry with data residency or air-gap requirements? Self-hosting may be mandatory.
  • Do you need to audit or modify the model weights? Self-hosting only.
  • Is your use case sensitive to vendor pricing changes or model deprecation? Self-hosting provides pricing stability.
  • Do you need sub-50ms inference latency that managed APIs cannot reliably deliver? Self-hosted with dedicated hardware.
  • Are you in early validation stage with uncertain volume? Managed API — switch later when scale justifies the operational overhead.
  • Do you have GPU procurement relationships or existing cloud reserved instance commitments? Self-hosting may be cheaper than headline on-demand pricing suggests.
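The checklist can be encoded as a small decision function to make the precedence between questions explicit. The ordering below is an assumption, not something the checklist prescribes: hard constraints first, then operational readiness, then volume. Field names are hypothetical, and the latency, pricing-stability, and procurement questions are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    tokens_per_day: int
    has_infra_engineer: bool
    must_self_host_for_compliance: bool = False  # air-gap or strict residency
    needs_weight_access: bool = False            # audit or modify model weights
    early_validation: bool = False               # volume still uncertain


def recommend(c: Constraints, volume_threshold: int = 2_000_000) -> str:
    """Return 'self-hosted' or 'managed' for a set of constraints."""
    # Hard constraints override cost math entirely.
    if c.must_self_host_for_compliance or c.needs_weight_access:
        return "self-hosted"
    # Without staff, or before volume is known, operational risk dominates.
    if c.early_validation or not c.has_infra_engineer:
        return "managed"
    # Otherwise fall back to the volume threshold from the checklist.
    return "self-hosted" if c.tokens_per_day > volume_threshold else "managed"
```

For example, `recommend(Constraints(tokens_per_day=10_000_000, has_infra_engineer=False))` returns `"managed"`: high volume alone does not justify self-hosting when nobody is available to run it.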

The decision between self-hosted and managed is one of the foundational architectural choices in production LLM engineering. Getting it right before you build avoids a painful migration later. Our companion piece on picking the right model for production covers the model selection layer once the hosting decision is made.
