The first AI feature you ship is not about the model. It is about deciding fast enough that you learn something real.

This guide is for technical founders, CTOs, and startup leads integrating an LLM into a product for the first time. It covers the six decisions that determine whether your first AI project ships and scales, or stalls in prototyping. By the end, you will have a clear technology stack, a realistic cost model, and a checklist that separates a prototype from a production system.

This is not a survey of everything possible. It is a direct path through the decisions that matter at the start.


The six phases of a founder AI project

Step 1: Define the task. Write the exact input the model receives and the exact output format you expect. Test with 20 real examples before touching code.
Step 2: Choose your model. API or open-weight? Which provider fits your volume, latency, and data residency requirements?
Step 3: Build the prototype. Minimal integration using real data and real users. Ship the loop fast enough to learn something.
Step 4: Measure what matters. Track latency, output accuracy, and cost per call. Build a small eval set from real user requests.
Step 5: Harden for production. Add error handling, model fallbacks, rate limiting, and observability before scaling traffic.
Step 6: Optimize cost. Cache repeated context, route cheap requests to smaller models, and batch where latency allows.

Define the task first

LLM projects fail when the task is vague. “Make it smarter” is not a task. “Extract the invoice total and due date from a PDF and return them as JSON” is a task.

Before you write a line of code, write out:

  1. The exact input. What text, documents, or data does the model receive? What is the maximum length? Where does it come from?
  2. The exact output format. Is it free-form prose, structured JSON, a classification label, or a short summary? Define the schema.
  3. The quality bar. What does “correct” mean? How will you know when the model gets it wrong?

Then collect 20 real examples from your domain: not synthetic examples you made up, but real ones from your data or your users. Run each through the model manually in a playground (Claude.ai, ChatGPT, or the provider console). If you cannot get a useful result from 20 real examples by hand, adding a backend will not fix the problem.

This step takes one to three days. Skipping it costs weeks.

Write your system prompt as a specification

Your system prompt is not a suggestion. It is the specification for what the model does. Treat it like a function signature:

You are an invoice parser. You receive the full text of a PDF invoice.
Return a JSON object with these fields:
- total_amount: number (in euros, no currency symbol)
- due_date: string (ISO 8601 format, e.g. 2026-07-15)
- vendor_name: string

If any field cannot be found, return null for that field.
Return only the JSON object. No explanation.

Version this prompt. Store it in a file or a database row, not hardcoded in your application. You will iterate on it.
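
A specification is only useful if you check replies against it. The following is a minimal sketch, assuming the three fields from the example prompt above; the function name `validate_invoice_output` is illustrative and not part of any SDK. It rejects replies that are not JSON, that miss a field, or that use the wrong type.

```python
import json
from datetime import date

REQUIRED_FIELDS = ("total_amount", "due_date", "vendor_name")


def validate_invoice_output(raw: str) -> dict:
    """Parse a model reply and check it against the prompt's contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    total = data["total_amount"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if total is not None and (isinstance(total, bool) or not isinstance(total, (int, float))):
        raise ValueError("total_amount must be a number or null")

    due = data["due_date"]
    if due is not None:
        if not isinstance(due, str):
            raise ValueError("due_date must be a string or null")
        date.fromisoformat(due)  # raises ValueError on malformed dates

    vendor = data["vendor_name"]
    if vendor is not None and not isinstance(vendor, str):
        raise ValueError("vendor_name must be a string or null")

    return data
```

Running this check on your 20 hand-collected examples gives you a first, crude measure of how often the model honors the format.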


Choosing your first model

Use this table to pick a starting point. Start with the cheapest model that meets your quality bar. You can always switch up.

Situation | Recommended model | Rough monthly cost at 10k calls/day
Budget under €200/month | GPT-4o mini or Claude Haiku 3.5 | €30 to €80
Need 200K context window | Claude Sonnet 4 | €150 to €400
Need an open-weight model you can self-host later | Llama 3.3 70B, hosted on Groq or Together AI to start | Usage-based (hosted API) or infra cost (self-hosted)
Data must stay in AWS | Amazon Bedrock with Nova or Claude | Usage-based, same model pricing
Need open-source with compliance | IBM Granite on Hugging Face | €0 model, infra cost only
Need lowest latency | Groq with Llama or Mixtral | €20 to €60
Best for most founders | Claude Haiku 3.5 to start, Claude Sonnet 4 when you need it | €30 to €150

A few rules that hold in almost every case:

  • Start with an API model. Do not self-host until you have a cost problem that justifies the operational overhead.
  • Use Claude Haiku 3.5 or GPT-4o mini for high-volume, structured tasks. At the per-token prices listed in the cost section, they cost roughly 4 to 15 times less than Claude Sonnet 4 or GPT-4o, and they are fast.
  • Move to Claude Sonnet 4 or GPT-4o when the task requires reasoning, long context, or nuanced generation.
  • Use Amazon Bedrock if you are already on AWS and need data residency controls, auditability, or enterprise procurement.
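
The rules above can be sketched as a small routing function. This is an illustration under stated assumptions: the model names are placeholders rather than real API identifiers, and the `needs_reasoning` flag and the 50,000-token threshold are hypothetical values you would tune for your own task.

```python
from dataclasses import dataclass

# Placeholder identifiers (assumption): substitute the model IDs you actually use.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


@dataclass
class Task:
    input_tokens: int
    needs_reasoning: bool = False


def choose_model(task: Task, long_context_threshold: int = 50_000) -> str:
    """Default to the small model; escalate only for reasoning or long inputs."""
    if task.needs_reasoning or task.input_tokens > long_context_threshold:
        return LARGE_MODEL
    return SMALL_MODEL
```

Keeping this decision in one function makes it cheap to change later, when your eval set tells you where the small model actually falls short.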

See the LLM landscape comparison for a full breakdown of models by capability, context window, and pricing.


The prototype stack

Keep the prototype stack boring. Every layer you add is a problem you have to debug at 2 AM before launch.

Frontend: Your existing UI. Add AI as a feature to what you already have. Do not build a new frontend for the prototype.
API layer: FastAPI (Python) or Express (Node). One endpoint per AI task. Keep it thin.
LLM access: Anthropic Python SDK or OpenAI Python SDK. Both SDKs are two-line integrations. Pick the one matching your model.
Database: Supabase (Postgres). Includes pgvector. When you add RAG, the vector store is already there.
Deployment: Railway or Render. Push to deploy. No infrastructure to manage for the MVP.
Observability: Structured logging to file; Langfuse (add later). Log every request, response, latency, and token count from day one.

A minimal FastAPI integration

This is the pattern you need. Everything else is wiring:

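```python
from pathlib import Path

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# The client reads ANTHROPIC_API_KEY from the environment.
client = AsyncAnthropic()

# Load the versioned prompt once at startup.
SYSTEM_PROMPT = Path("prompts/invoice_parser_v1.txt").read_text(encoding="utf-8")


class InvoiceRequest(BaseModel):
    invoice_text: str


@app.post("/parse-invoice")
async def parse_invoice(body: InvoiceRequest):
    # Await the async client so the call does not block the event loop.
    message = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": body.invoice_text}
        ],
    )
    # The first content block holds the model's text reply.
    return {"result": message.content[0].text}
```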

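Notice max_tokens=512. The Anthropic API requires max_tokens on every call, so choose it deliberately rather than copying a large default. On APIs where the parameter is optional, leaving it unset lets the model return thousands of tokens on a task that should return 50. That cost adds up across thousands of calls.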

Notice the system prompt loads from a file. Version that file. Commit it to git. When the prompt changes, the filename or a version tag changes with it.


LLM cost reality check

Run this calculation before you commit to a model:

Daily volume:         1,000 users x 5 requests = 5,000 requests/day
Tokens per request:   1,500 input + 500 output = 2,000 tokens
Monthly token volume: 5,000 x 30 x 2,000 = 300,000,000 tokens

At current pricing (June 2026), with the monthly volume split into 225 million input tokens and 75 million output tokens:

Model | Input price | Output price | Monthly estimate
Claude Haiku 3.5 | €0.00075 per 1K tokens | €0.00375 per 1K tokens | ~€450/month
GPT-4o mini | €0.00015 per 1K tokens | €0.0006 per 1K tokens | ~€79/month
Claude Sonnet 4 | €0.0028 per 1K tokens | €0.014 per 1K tokens | ~€1,680/month
GPT-4o | €0.0023 per 1K tokens | €0.009 per 1K tokens | ~€1,190/month

These numbers change. Always check the provider pricing page before committing. The ratios are more stable: at the prices above, Haiku is roughly 4 times cheaper than Sonnet 4, and 4o mini is roughly 15 times cheaper than GPT-4o.
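
To rerun the estimate whenever prices or volumes change, a short script helps. This sketch recomputes the monthly figure from the daily volume and the per-1K prices listed above, pricing input and output tokens separately. The function name and the `PRICES` dictionary are illustrative; replace the prices with whatever the provider pricing page shows today.

```python
def monthly_cost_eur(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
    days: int = 30,
) -> float:
    """Monthly cost in euros, with input and output tokens priced separately."""
    requests = requests_per_day * days
    input_cost = requests * input_tokens / 1000 * input_price_per_1k
    output_cost = requests * output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# (input price, output price) in euros per 1K tokens, as listed in this article.
PRICES = {
    "Claude Haiku 3.5": (0.00075, 0.00375),
    "GPT-4o mini": (0.00015, 0.0006),
    "Claude Sonnet 4": (0.0028, 0.014),
    "GPT-4o": (0.0023, 0.009),
}

if __name__ == "__main__":
    for name, (input_price, output_price) in PRICES.items():
        cost = monthly_cost_eur(5_000, 1_500, 500, input_price, output_price)
        print(f"{name}: ~EUR {cost:,.0f}/month")
```

Output tokens usually cost several times more than input tokens, so a blended "price per token" hides most of the bill on generation-heavy tasks.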

Three cost traps to avoid

Trap 1: Conversation context compounds. If you pass the full conversation history on every turn, your input tokens grow with each message. In a 10-turn conversation at 200 tokens per turn, the 10th request carries about 2,000 input tokens, and the conversation as a whole is billed for roughly 11,000 input tokens (200 + 400 + ... + 2,000) instead of 2,000. Cap conversation context or summarize it.
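
One simple way to cap context is to keep only the most recent messages that fit a token budget. The sketch below assumes messages are dictionaries with `role` and `content` keys, and it estimates tokens at about four characters each, which is a rough heuristic and not a real tokenizer. Use the provider's token counting for exact budgets.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def cap_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order

    # Some chat APIs expect the first message to come from the user,
    # so drop any leading assistant messages left after trimming.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Dropping old turns loses information. If earlier context matters, replace the dropped turns with a short summary instead of discarding them.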

Trap 2: Not setting max_tokens. A model asked to “summarize this article” can return 4,000 tokens if you do not constrain it. Set max_tokens to the maximum useful output for your task, not the model maximum.

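Trap 3: Not caching static context. If every request includes the same 2,000-token system prompt and reference document, use prompt caching. Anthropic and OpenAI both support it. The first call pays full price (or slightly more, where cache writes carry a surcharge), and subsequent calls with the same prefix pay a fraction for the cached tokens. The discount applies to the cached input only, so savings are largest on high-volume tasks with long static prefixes and short outputs, where it can cut input costs by 70 to 90 percent depending on the provider. Providers also set a minimum cacheable prefix length, so check that your static context is long enough to qualify.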
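
With the Anthropic Messages API, caching is requested by marking the static block with `cache_control`. The sketch below only builds the request arguments, so it runs without network access. The helper name `build_cached_request` is illustrative, and you should confirm the current caching rules (minimum prefix length, cache lifetime, pricing) in the Anthropic documentation before relying on it.

```python
def build_cached_request(
    static_context: str,
    user_text: str,
    model: str,
    max_tokens: int = 512,
) -> dict:
    """Build Messages API arguments with the static system block marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is passed as a list of content blocks so that
        # cache_control can be attached to the static part.
        "system": [
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the per-request part changes between calls.
        "messages": [{"role": "user", "content": user_text}],
    }
```

You would pass the result to the SDK as `client.messages.create(**build_cached_request(...))`. The response's `usage` object reports cache writes and cache reads separately, which is how you verify that the cache is actually being hit.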

See Reducing LLM Inference Costs in Production for the full optimization playbook.


When RAG changes everything

Prompting alone works when the knowledge your product needs is already in the model’s training data. It breaks when your product needs to answer questions from:

  • Your internal documents (contracts, SOPs, product specs)
  • A knowledge base that updates frequently
  • Content the model has never seen (proprietary research, customer data, recent events)

In these cases, retrieval-augmented generation (RAG) is the right architecture. You store your documents as vector embeddings in a database, retrieve the most relevant chunks at query time, and inject them into the prompt as context. The model answers from that context, not from training memory.
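
The retrieve-then-inject loop described above can be sketched in a few lines. This example uses hand-written three-dimensional vectors as stand-ins for real embeddings, which is an assumption made to keep it self-contained. In a real system the vectors come from an embedding model and the ranking happens inside the vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy store: (chunk text, stand-in embedding vector).
STORE = [
    ("Refunds are issued within 14 days.", [1.0, 0.0, 0.0]),
    ("Standard shipping takes 3 to 5 days.", [0.0, 1.0, 0.0]),
    ("Warranty claims require a receipt.", [0.9, 0.1, 0.0]),
]
```

The failure modes mentioned below live mostly in the parts this sketch skips: how documents are chunked, which embedding model is used, and how many chunks are injected.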

RAG is no more complex than it sounds, but it is a distinct system with its own failure modes. Read the Building RAG Systems guide before adding document retrieval to your product. The key decisions are your embedding model and your vector store. Supabase with pgvector covers the prototype and the first production deployment.

Do not try to implement RAG in the same sprint as your first LLM integration. Get the base integration working and measured first.


The production checklist

A prototype becomes a production system when it handles failure gracefully, costs are visible, and personal data is protected. Work through this list before you open the feature to all users.

Prototype to production: what changes
Area | Prototype | Production
Error handling | Crashes on API error | Catches errors, returns user-friendly message
Retries | None | Exponential backoff on 429 and 5xx
Fallback | One model | Falls back to cheaper model on failure
Rate limiting | None | Per-user limits to prevent runaway costs
Prompt versioning | Hardcoded string | File with version tag, committed to git
Prompt testing | Manual | Eval set of 50 real examples, run on each change
Cost alerts | None | Budget alert at €X/day in provider console
Token logging | None | Log input tokens, output tokens, latency per request
PII handling | Not considered | Strip or pseudonymize before sending to external API
max_tokens | Not set | Set to the maximum useful output for the task
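
The retry and fallback rows can be combined in one wrapper. This is a sketch under stated assumptions: `ApiError` is a hypothetical stand-in for whatever exception your SDK raises with an HTTP status, and `call` is any function that takes a model name and returns a result. Note that the official SDKs already retry some errors on their own, so check their defaults before layering this on top.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status."""

    def __init__(self, status_code: int):
        super().__init__(f"API error {status_code}")
        self.status_code = status_code


def is_retryable(error: ApiError) -> bool:
    # Rate limits (429) and server errors (5xx) are worth retrying.
    return error.status_code == 429 or 500 <= error.status_code < 600


def call_with_retries(call, models, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try each model in order, retrying retryable errors with exponential backoff."""
    if not models:
        raise ValueError("at least one model is required")
    last_error = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model)
            except ApiError as error:
                last_error = error
                if not is_retryable(error):
                    raise  # other 4xx errors: retrying will not help
                if attempt < max_attempts - 1:
                    delay = base_delay * (2 ** attempt)
                    # Jitter spreads out retries from concurrent clients.
                    sleep(delay + random.uniform(0, delay / 2))
    raise last_error
```

The `sleep` parameter exists so tests can run without real delays. In the endpoint, catch the final exception and return a user-friendly message instead of a stack trace.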

PII handling is not optional

If your product processes user data, do not send personal information to an external LLM API without a legal basis, such as user consent. This applies to names, email addresses, phone numbers, financial data, and health information. Check your jurisdiction’s data protection rules (GDPR applies across the EU). If you cannot strip PII before sending, use Amazon Bedrock or Azure OpenAI with a data processing agreement in place.
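
As a starting point for pseudonymization, the sketch below replaces email addresses and phone-like numbers with placeholders and keeps a mapping so you can restore them in the model's reply. The regular expressions are deliberately simple assumptions: they miss names and addresses entirely, and the phone pattern also matches other long digit runs such as dates or invoice numbers. Treat it as an illustration, not a compliance control.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern (assumption): an optional "+", then at least nine digits
# and separators in total, starting and ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and phone-like numbers with placeholders.

    Returns the redacted text and a placeholder -> original mapping.
    """
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    def substitute(kind: str, match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            placeholder = f"<{kind}_{len(seen) + 1}>"
            seen[value] = placeholder
            mapping[placeholder] = value
        return seen[value]

    text = EMAIL_RE.sub(lambda m: substitute("EMAIL", m), text)
    text = PHONE_RE.sub(lambda m: substitute("PHONE", m), text)
    return text, mapping
```

Store the mapping on your side only. It never needs to leave your infrastructure.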


Common mistakes

These are the four mistakes that add the most time and cost to a first AI project.

Starting with the most expensive model

The temptation is to start with GPT-4o or Claude Sonnet because they produce the best results in demos. Resist it. Start with GPT-4o mini or Claude Haiku 3.5. If the cheap model cannot do the task at all, moving up to the frontier model is a real decision. If it can do it with a better prompt, you saved €200 per month.

Not testing with real, messy data

Demos use clean examples. Real user data is messy: incorrect spelling, mixed languages, unexpected formats, edge cases the prompt does not handle. Build your eval set from actual user requests or documents from your domain. A model that works on 20 synthetic examples and breaks on real ones is not ready to ship.
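
An eval set does not need a framework to be useful. The sketch below assumes each case is a dictionary with `input` and `expected` keys and that `predict` is any function wrapping your model call; both names are illustrative. It uses exact-match comparison, which suits structured outputs and is too strict for free-form prose.

```python
def run_eval(cases: list[dict], predict) -> tuple[float, list[dict]]:
    """Run every case through predict and return (accuracy, failures)."""
    failures = []
    for case in cases:
        actual = predict(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures
```

Read the failures, not just the accuracy number. The failing inputs tell you whether to fix the prompt, clean the input, or move to a larger model.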

Hardcoding prompts

The system prompt is your most important artifact. It determines output quality more than model choice does. If it lives inline in your code:

  • You cannot A/B test two versions without a deployment
  • You cannot roll back a bad prompt change without reverting code
  • You cannot see the history of what changed and when

Store prompts as versioned files or database rows. Track which version produced which outputs.
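
A small loader is enough to get versioned prompts out of the code. This sketch assumes the file naming used earlier in this guide (`invoice_parser_v1.txt` inside a prompts directory); the function name and the returned fields are illustrative. The content hash lets you detect a prompt file that was edited without bumping its version.

```python
import hashlib
from pathlib import Path


def load_prompt(directory: str, name: str, version: int) -> dict:
    """Load <name>_v<version>.txt and return its text with identifying metadata."""
    path = Path(directory) / f"{name}_v{version}.txt"
    text = path.read_text(encoding="utf-8")
    # A short content hash catches edits made without a version bump.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return {"name": name, "version": version, "sha256": digest, "text": text}
```

Log the `name`, `version`, and `sha256` fields with every request so each output can be traced to the exact prompt that produced it.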

Ignoring latency until users complain

LLM calls take 500 ms to 5 seconds depending on model and output length. If your product calls the API synchronously on a user action, that latency is visible. Measure it from day one. For tasks where the result does not need to be instant (document analysis, nightly reports, background enrichment), move to an async queue. For real-time tasks, use the fastest model that meets your quality bar and stream the response so users see output as it generates.
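
Measuring from day one can be as simple as one structured log line per call. In this sketch, `call` is assumed to be a zero-argument function that returns the reply text together with input and output token counts; that return shape is an assumption, so adapt it to the usage fields your SDK reports.

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")


def log_llm_call(task: str, model: str, call):
    """Time one LLM call and emit a JSON log line with latency and token counts."""
    start = time.perf_counter()
    text, input_tokens, output_tokens = call()
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "task": task,
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    logger.info(json.dumps(record))
    return text, record
```

One JSON object per line is easy to grep today and easy to ship to an observability tool later.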


Further reading

  • LLM Landscape 2026: Full comparison of models by capability, context window, speed, and pricing. Use this before committing to a provider.
  • Reducing LLM Inference Costs in Production: Caching, batching, model routing, and prompt optimization strategies for when your costs grow.
  • Building RAG Systems: Step-by-step guide to document ingestion, chunking, embedding, and retrieval for products that need private knowledge.
  • Amazon Bedrock: AWS-native LLM gateway supporting Claude, Nova, and other models with enterprise data controls.
  • FastAPI: The Python framework used in the prototype stack above. Fast to build, production-ready from day one.
  • Railway: Push-to-deploy hosting for your backend API. Covers the MVP without infrastructure work.
  • Anthropic API documentation: Official reference for the Claude API including prompt caching, tool use, and streaming.
  • OpenAI API documentation: Official reference for GPT-4o and o-series models with pricing calculator.