A circular five-node cycle in dark grey and red, representing the repeating lifecycle of building, testing, and releasing an AI model.
Model evaluation is a loop, not a finish line. Each release feeds the next round of testing.

You type a question, you get an answer. That chat box is the last centimetre of a very long pipeline. Behind it sits a lifecycle of training, tuning, attacking, scoring, and watching that decides whether a model is fit to ship at all. Most of this work never reaches the public, yet it shapes every response you read.

This guide walks through that lifecycle stage by stage. It explains what a benchmark actually is, why red teaming borrows its playbook from the military, and why the current habit of each company grading its own homework is a problem worth naming.

The lifecycle at a glance

A model does not go from raw data to your screen in one step. It moves through a repeating loop. When a new test exposes a weakness, the loop starts again with a fresh version.

Stage 1 Training The model learns patterns from a large corpus of text, code, and other data.
Stage 2 Alignment Tuning steers the model toward helpful, honest, harmless behaviour.
Stage 3 Internal QA Engineers check quality, regressions, and basic behaviour before wider testing.
Stage 4 Red teaming Testers try to break the model on purpose to find failures first.
Stage 5 Safety benchmarks Structured test sets measure harm, bias, and refusal behaviour.
Stage 6 External evaluation Outside researchers or auditors test the model, sometimes under contract.
Stage 7 Release The model ships to users, often behind a gradual rollout.
Stage 8 Continuous monitoring Live traffic reveals new failure modes that testing missed.
Stage 9 New benchmark failures Fresh tests and real incidents expose gaps the old model cannot fix.
Stage 10 Next model version Findings feed the next training run, and the loop repeats.

Stage by stage

Training

Training is where the model learns. It reads a large body of text, code, and other data, and it adjusts billions of internal weights to predict what comes next. At the end of training you have a raw model that knows a lot but has no manners. It will answer a dangerous question as readily as a harmless one.

Alignment

Alignment shapes behaviour. Techniques such as instruction tuning and reinforcement learning from human feedback push the model toward being helpful, honest, and harmless. This is where the model learns to refuse clearly harmful requests and to follow instructions rather than ramble.

Internal QA

Before anyone tries to break the model, engineers check that it works. They look for regressions against the previous version, obvious quality drops, and broken behaviour on everyday tasks. QA answers a friendly question: does the model do what we intended?

Red teaming

Red teaming asks the opposite question: how would someone break this? The term comes from the military and from cybersecurity, where a red team plays the attacker so the defenders learn their weak points before a real adversary finds them. You can read more in the red teaming glossary entry and on the role of an AI red team .

An AI red team tries to make a model misbehave on purpose. Common goals include:

  • Make it hallucinate: state confident falsehoods as fact.
  • Make it leak secrets: reveal training data, system prompts, or private information.
  • Make it give dangerous advice: produce content that could cause real harm.
  • Make it discriminate: treat people differently based on protected characteristics.
  • Make it fall for prompt injection : follow hidden instructions smuggled into input.

The point is to find these failures before real users do. Red teaming sits inside the broader field of adversarial machine learning and directly serves AI safety .

Safety benchmarks

A benchmark is not a vibe check. It is a designed experiment. A good safety benchmark contains many test cases, and each case carries structured metadata:

  • A category, such as self-harm or misinformation.
  • A persona and sometimes an age, so the test reflects a real user.
  • A language, so the model is tested beyond English.
  • An expected behaviour, describing the correct response.
  • A scoring rubric, describing how to grade the actual response.

Automated scorers handle some of this work, but humans still score many outputs by hand. Judging whether a refusal was appropriate, or whether an answer was subtly biased, often needs a person. That human labour is a large, hidden cost of evaluation.

External evaluation

Internal teams have blind spots. External evaluation brings in outside researchers, auditors, or specialist firms to test the model independently. Sometimes this happens under contract before release. Sometimes it happens through public challenges where many researchers probe a system at once.

Release

Release is rarely a single switch. Models often ship through a gradual rollout, exposed to a small slice of traffic first so problems surface at low blast radius. Guides such as from zero to production cover the same staged mindset for shipping software.

Continuous monitoring

Live users do things no test anticipated. Monitoring watches real traffic for new failure modes, abuse patterns, and drift in behaviour. This is where the build-measure-learn loop applies to models: you ship, you measure, and you feed what you learn back into the next version.

New benchmark failures and the next version

Over time, new benchmarks and real incidents expose gaps the current model cannot close through patching alone. Those findings define the goals for the next training run. The loop closes, and a new version begins the lifecycle again.

What red teaming looks like in practice

Recent research has moved red teaming from ad hoc probing toward reproducible, dynamic benchmarks. Four examples are worth knowing because they test different surfaces.

  • RIFT-Bench evaluates the security of agentic AI systems using a broad set of dynamically adaptable adversarial probes across diverse attack vectors. It works in two automated phases, discovery then scanning, and the authors report testing it across 45 different agentic systems (arXiv 2606.23927).
  • REALM is a unified red-teaming benchmark for vision language models in physical-world contexts. It probes how models handle adversarial images and instructions tied to real-world and robotics tasks, across multiple attack categories (arXiv 2606.23892).
  • AIRTBench measures whether language models can perform autonomous red teaming: independently finding and exploiting vulnerabilities without a human driving each step. The code is open source (arXiv 2506.14682).
  • The Agent Red Teaming benchmark comes from a large-scale public competition run by Gray Swan AI with the UK AI Safety Institute. Researchers competed to break deployed AI agents , surfacing novel attack patterns and showing that current defences often fall short when agents have tool access (arXiv 2507.20526).

These benchmarks matter because agents raise the stakes. A model that only chats can give a bad answer. A model that can call tools and act can be turned against the systems around it.

Model, system, and agent evaluation

A benchmark score on a model in isolation is no longer the whole story. Modern AI ships as a system: a model wired to retrieval, tools, browsers, APIs, and other agents. Failures increasingly happen outside the model itself.

  • Model evaluation tests the model alone, its capabilities and its refusals on fixed inputs.
  • System evaluation tests the model plus its retrieval, prompts, and guardrails as one product. A strong model with a weak retrieval layer is still a weak system.
  • Agent evaluation tests a model that can plan and call tools, where a single wrong tool call can cause real-world harm.
  • Workflow evaluation tests a chain of steps and hand-offs end to end, where small errors compound across stages.

The practical rule is to evaluate the thing you actually ship. If you ship a RAG system or an agent , a model-only benchmark will miss most of your real failure modes. This gap between model quality and system quality is one of the biggest themes in current AI engineering.

The crash test problem

Cars are tested by independent bodies under identical, public conditions. You can compare two cars because the crash test was the same for both. AI has no equivalent for safety.

Today, each company largely tests its own models behind closed doors and publishes the numbers it chooses to publish. There is no universally trusted body that continuously tests all major systems under identical conditions. That gap creates a trust problem for everyone downstream: researchers cannot reproduce claims, regulators cannot compare systems, startups cannot prove parity with incumbents, and users cannot verify marketing.

Other industries solved coordination across competitors. The Linux Foundation hosts shared infrastructure that rival companies all depend on. The CVE program gives security flaws common identifiers so the whole industry tracks the same vulnerability by the same name. An independent, open, reproducible AI benchmark would play a similar role: a common yardstick that serves researchers, developers, regulators, startups, and users at once.

Secret testing versus open testing

The Meta covert-benchmarking story shows why secret testing draws fire even when the method is familiar. Reports describe Meta running competitor models through internal benchmarks without disclosing it, and the controversy was less about the technique than the secrecy. Testing rivals is normal. Grading them privately and shaping the narrative around undisclosed results is what breaks trust. See our coverage in Meta and competitor AI benchmarking .

Secret in-house benchmarkingOpen, reproducible benchmarking
Who runs itThe vendor, privatelyIndependent body or coalition
Test set visibilityHidden, chosen by vendorPublished and inspectable
ReproducibleNo, outsiders cannot rerun itYes, anyone can rerun it
Comparable across vendorsRarely, conditions differYes, identical conditions
Trust modelTake our word for itVerify it yourself
Best forFast internal iterationPublic accountability

Both modes have a place. Internal benchmarks let a team iterate quickly and keep hard test cases out of training data. Open benchmarks let the outside world hold everyone to the same standard. The problem is not that private testing exists. The problem is that no strong public counterpart exists to check it.

Who runs each stage

Evaluation is not one team. Different groups own different stages, and their incentives differ too.

StageWho typically runs itMain question
TrainingResearch and infrastructure teamsDid the model learn?
AlignmentAlignment and safety teamsIs it helpful and harmless?
Internal QAEngineeringDoes it work as intended?
Red teamingInternal and external red teamsHow would someone break it?
Safety benchmarksEvaluation teams plus human ratersHow bad are the failures?
External evaluationIndependent researchers, auditorsDoes it hold up to outsiders?
MonitoringOperations and trust teamsWhat breaks in the wild?

Further reading