How to Evaluate an LLM Development Vendor (RFP Template)

The LLM development vendor market has two distinct populations. The first builds and ships production systems: they have opinions on evaluation harnesses, they ask about your data before they quote a price, and they can describe the last three projects that ran over budget and why. The second population can build a compelling demo. Distinguishing between the two before you sign a contract is worth more than any technical due diligence you could do after.

This guide gives you a structured RFP template, a scoring rubric, and the specific questions, along with the answers to listen for, that reveal which population a vendor belongs to.

TL;DR

Evaluate on four dimensions: technical credibility, project management maturity, data handling practices, and post-deployment commitment.
The most revealing questions are about past failures, not past successes.
Any vendor who quotes a timeline without asking about your data quality is not trustworthy.
Require a reference call with a past client whose use case resembles yours — not a testimonial.
Fixed-price proposals with explicit scope boundaries are safer than time-and-materials for LLM projects with defined deliverables.

Why most vendor evaluations fail

The standard vendor evaluation process — RFP out, proposals in, lowest price or best slide deck wins — is poorly designed for LLM development. The failure modes are structural. LLM projects require domain expertise that is genuinely rare and difficult to assess from a written proposal. Timeline estimates depend almost entirely on data quality, which the vendor cannot evaluate from a brief. And the systems that look most impressive in a demo are often the most poorly architected for production.

The evaluation framework below is designed to surface operational maturity rather than sales capability. A vendor who struggles to answer these questions fluently in a first meeting is telling you something important.

The RFP structure: what to send before the first call

Your RFP document should be short. Its purpose is to pre-qualify vendors, not to exhaustively specify the system. Include:

A two-paragraph description of the use case and the user who will interact with the system.
A description of the data you have available: volume, format, freshness, and who controls access.
Your existing tech stack and any non-negotiable infrastructure constraints (cloud provider, data residency, compliance standards).
Your success criteria in plain language: what does "working" look like, and how would you know?
Your timeline expectations and any hard deadlines.
Your budget range. Withholding budget from LLM vendors is counterproductive — the architecture decision between a $30k RAG pipeline and a $150k fine-tuned model depends on knowing what you can spend.

Ask for a written response covering: their proposed architecture and why, their timeline estimate and what it depends on, their data assessment process, their evaluation methodology, and three references with contact details.
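If you keep the brief in a structured form, a small check can confirm that none of the six items above is left blank before it goes out. The sketch below is illustrative only: the `RfpBrief` class, its field names, and the sample values are hypothetical, not part of any standard RFP format.

```python
from dataclasses import dataclass, fields


@dataclass
class RfpBrief:
    """Hypothetical container for the six items an RFP brief should include."""
    use_case: str = ""
    data_description: str = ""
    tech_stack_and_constraints: str = ""
    success_criteria: str = ""
    timeline: str = ""
    budget_range: str = ""


def missing_sections(brief: RfpBrief) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name).strip()]


# Made-up draft with three sections still unwritten.
draft = RfpBrief(
    use_case="Support agents ask questions over internal policy documents.",
    data_description="Roughly 12,000 PDFs, updated monthly, owned by compliance.",
    success_criteria="Agents accept the suggested answer without edits.",
)
print(missing_sections(draft))
```

For this draft the check reports the tech stack, timeline, and budget sections as missing, which are exactly the items a vendor needs in order to propose an architecture and a price.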

Section 1: Technical credibility evaluation

The technical evaluation should happen in a working session, not a presentation. Ask the vendor to walk through how they would approach your specific use case architecturally. A credible team will immediately ask clarifying questions about your data. A less credible team will present a generic architecture slide.

Questions to ask:

Walk me through the architecture you would propose for this use case and the tradeoffs of the alternatives you considered.
How would you evaluate whether the system is performing well enough to ship? What metrics, what thresholds?
What would make you recommend fine-tuning over RAG for this use case, or vice versa?
How do you handle hallucination in production? What is your strategy for cases where the model makes a factual error?
What model would you use and why? What would change your recommendation?
How do you instrument a production LLM system? What do you monitor after launch?

Red flag answers: Proposing the architecture before hearing about your data. Recommending a specific model without asking about your volume or compliance constraints. Using the phrase "the model will learn" without specifying from what. Describing evaluation as "we test it until it works."
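A strong answer to the "what metrics, what thresholds" question usually reduces to something you could write down as a release gate. The sketch below shows one possible shape, assuming metrics have already been measured on a held-out test set. The metric names and threshold values are hypothetical placeholders, not recommendations; a credible vendor would propose their own after seeing your data.

```python
# Hypothetical ship criteria: metric name -> (direction, threshold).
# "min" means the measured value must be at least the threshold,
# "max" means it must be at most the threshold.
SHIP_CRITERIA = {
    "answer_accuracy": ("min", 0.90),
    "hallucination_rate": ("max", 0.02),
    "p95_latency_seconds": ("max", 3.0),
}


def failed_criteria(measured: dict[str, float]) -> list[str]:
    """Return a description of every criterion that is unmet or unmeasured."""
    failures = []
    for name, (direction, threshold) in SHIP_CRITERIA.items():
        value = measured.get(name)
        if value is None:
            failures.append(f"{name}: not measured")
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value} is below {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value} is above {threshold}")
    return failures


# Made-up evaluation results: accuracy passes, hallucination rate fails,
# and latency was never measured.
results = {"answer_accuracy": 0.93, "hallucination_rate": 0.04}
for line in failed_criteria(results):
    print(line)
```

Treating an unmeasured metric as a failure is a deliberate choice in this sketch: it prevents a system from shipping simply because nobody ran the test.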

Section 2: Project management and delivery track record

LLM projects have specific failure modes around scope creep, data surprises, and evaluation loops. The questions here are designed to surface whether the vendor has managed these failure modes before.

Questions to ask:

Tell me about an LLM project that ran significantly over the original timeline. What caused it and what did you change?
How do you handle a mid-project discovery that the client's data is significantly worse quality than expected?
What is your process for managing scope changes? Show me an example of a change order from a past project.
Who on your team would work day-to-day on this project? Can I meet them before signing?
What are the client responsibilities on a typical project? What do you need from us, and when?
What does your project end look like? What is handed over, in what format, with what documentation?

Red flag answers: Projects that "always deliver on time" without any nuance. Vague answers about client responsibilities. Proposals where you would not meet the actual team until after signing. No clear description of handover deliverables.

Section 3: Data handling and security practices

Your data is both the most valuable input to the project and the most significant risk surface. This section is non-negotiable regardless of the vendor's technical capability.

Questions to ask:

How do you handle client data during training and development? Where does it live, who has access, how is it deleted?
Do you have a data processing agreement template? Can you sign our DPA?
What is your security posture for client credentials and API keys during the engagement?
Have you completed a SOC 2 audit? If not, what is your equivalent assurance process?
How do you handle a situation where training data includes PII or sensitive business information?
What happens to client data if our engagement ends unexpectedly?

For regulated industries, add questions specific to HIPAA, GDPR, or your applicable framework. A vendor who cannot fluently answer data handling questions is not ready to work with enterprise data.

Section 4: Post-deployment commitment

Production LLM systems require ongoing attention. Model drift, prompt injection vulnerabilities, knowledge base staleness, and performance degradation recur throughout the life of the system; they are not one-time problems. Many vendors treat the go-live date as the end of the engagement. That is a mistake, and it becomes your problem.

Questions to ask:

What is included in your post-deployment support period, and how long does it last?
How do you handle a production incident where the model begins producing harmful or incorrect outputs at scale?
Do you offer ongoing model maintenance? What does that look like contractually?
What monitoring and alerting do you set up, and who receives the alerts after handover?
How do you handle model updates from the underlying provider that affect your system's behavior?

RFP scoring rubric

Dimension	Weight	Score 1–5	What a 5 looks like
Technical credibility	30%	—	Architecture tailored to your specific use case, tradeoffs clearly articulated, evaluation methodology defined upfront
Delivery track record	25%	—	Honest account of past failures, verifiable references, clear client responsibility documentation
Data & security practices	25%	—	DPA ready to sign, SOC 2 or equivalent, explicit data lifecycle policy
Post-deployment commitment	20%	—	Explicit support period, monitoring setup, incident response process documented
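To turn the rubric into a single number per vendor, multiply each 1–5 score by its weight and add the results. The sketch below uses the weights from the table; the dictionary keys and the sample scores are illustrative names and values chosen for this example.

```python
# Weights taken from the rubric above; the key names are illustrative.
WEIGHTS = {
    "technical_credibility": 0.30,
    "delivery_track_record": 0.25,
    "data_security_practices": 0.25,
    "post_deployment_commitment": 0.20,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Combine 1-5 scores per dimension into a weighted total between 1 and 5."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every rubric dimension exactly once")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be scored 1-5, got {score}")
    total = sum(WEIGHTS[name] * score for name, score in scores.items())
    return round(total, 2)


# Made-up scores for one vendor.
vendor_a = {
    "technical_credibility": 4,
    "delivery_track_record": 3,
    "data_security_practices": 5,
    "post_deployment_commitment": 2,
}
print(weighted_score(vendor_a))
```

For these sample scores the total is 0.30 × 4 + 0.25 × 3 + 0.25 × 5 + 0.20 × 2 = 3.6. A weighted total can mask a weak dimension, so it is worth reading the individual scores alongside it, particularly for data and security practices.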

Contract terms that protect you

Beyond the evaluation, the contract structure matters. Terms that are non-negotiable in a credible LLM engagement:

IP assignment: you own the model weights, fine-tuned artifacts, prompts, and code. Not a license — ownership.
Code escrow or source delivery: you receive the full source code, not a deployed black box.
Explicit acceptance criteria: the project is "done" only when defined metrics are met, not when the vendor says it is.
Data deletion clause: all client data is deleted within 30 days of project completion or termination.
Change order process: any scope change requires written agreement before work begins.
Subcontractor disclosure: you know who is actually doing the work, not just who is billing you.
Key personnel clause: the team members you evaluated are the team members who work on the project.

The vendor evaluation process is the most leveraged step in an LLM development services engagement. The decisions made before contract signature shape most of what follows. For the parallel question of how long the project will take, see our guide on custom LLM project timelines. For the security dimension of production LLM systems, see our piece on defending against prompt injection. Our custom LLM development page describes how we structure our own engagements against these criteria. Browse our full insights library for more buyer-stage guidance.

We answer every question on this list. Honestly.
