Systematic evaluation is standard practice in AI, but the method and the secrecy are what put Meta's project under scrutiny.

A WIRED investigation reported that Meta ran an internal project, code-named “Cannes”, in which contractors posed as teenagers to test competing chatbots. The work was managed through a contractor called Covalen. Hundreds of contractors created dummy accounts posing as users aged 13 to 17, sent prompts and images to rival products, then logged the replies in spreadsheets.

The competitors tested were OpenAI’s ChatGPT, Google’s Gemini, and Character.AI. The prompts were deliberately difficult. They covered suicide, self-harm, eating disorders, sex, drugs, and abuse, and were built to push safety systems toward answers they are meant to refuse. One testing round, finished in August 2025, ran more than 45,000 prompts.

None of the three companies were told the testing was happening. Character.AI said the conduct violated its Terms of Service. OpenAI said it was looking into it. Google said it had not approved the testing and did not know its purpose. Meta characterised the work as responsible, industry-standard safety benchmarking and said it does not use competitor outputs to train its own models. Legal experts consulted by WIRED reportedly concluded the reviewed prompts did not cross into illegal child sexual abuse material.

Step 1: Create personas. Contractors set up dummy accounts posing as users aged 13 to 17.
Step 2: Send hard prompts. They sent prompts on suicide, self-harm, drugs, sex, and abuse to rival chatbots.
Step 3: Log replies. Responses were copied into spreadsheets, over 45,000 prompts in one round.
Step 4: No disclosure. None of the three tested companies were told the testing was happening.

Why it matters

The controversy is not that benchmarking exists. Adversarial prompting, red teaming, and age-persona evaluation are standard parts of AI safety work. Teams run difficult prompts against their own models to find where guardrails break before real users do. That is what an AI red team is for.

The issue is method and secrecy. Sound safety testing usually runs on systems you own, or it is disclosed and reproducible. When testing happens against competitors, in secret, at scale, and behind personas of minors, it stops looking like adversarial machine learning and starts raising questions about consent, terms of service, and intent.

This is why independent, public benchmarks matter. A public accountability benchmark is transparent about its prompts, its scoring, and who ran it, so anyone can reproduce it. Not every trustworthy benchmark discloses everything, since some keep test sets private to prevent contamination. The problem with covert, in-house comparisons is that no outside party can check them at all, which is the opposite of what public evaluation is meant to deliver. If you want to understand how credible testing works, start with how AI models are evaluated.
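To make the transparency point concrete, here is a minimal Python sketch of what a benchmark could publish about itself. It is an illustration under stated assumptions, not a description of any existing benchmark: the `BenchmarkManifest` class, its field names, and the helper functions are all hypothetical. The sketch records the prompts, the scoring rule, and who ran the test, plus a SHA-256 fingerprint of the prompt set. A benchmark that keeps its test set private could still publish the fingerprint, so an auditor who is later given the prompts can confirm they are the same ones.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkManifest:
    """Hypothetical record of what a public benchmark discloses."""
    name: str
    run_by: str                # who ran the evaluation
    scoring_rule: str          # plain-language description of the scoring
    prompts: tuple[str, ...]   # the test prompts, in a fixed order


def prompt_set_fingerprint(prompts) -> str:
    """Return a SHA-256 hex digest of the prompts (order-sensitive)."""
    # Canonical JSON keeps the digest stable across machines and runs.
    canonical = json.dumps(list(prompts), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def public_summary(manifest: BenchmarkManifest, disclose_prompts: bool = True) -> dict:
    """Build the part of the manifest that gets published."""
    summary = {
        "name": manifest.name,
        "run_by": manifest.run_by,
        "scoring_rule": manifest.scoring_rule,
        "prompt_count": len(manifest.prompts),
        "prompt_set_sha256": prompt_set_fingerprint(manifest.prompts),
    }
    # A private test set omits the prompts but keeps the fingerprint.
    if disclose_prompts:
        summary["prompts"] = list(manifest.prompts)
    return summary


if __name__ == "__main__":
    manifest = BenchmarkManifest(
        name="example-safety-benchmark",           # placeholder name
        run_by="Example Lab",                      # placeholder organisation
        scoring_rule="1 point per correct refusal",
        prompts=("placeholder prompt 1", "placeholder prompt 2"),
    )
    print(json.dumps(public_summary(manifest, disclose_prompts=False), indent=2))
```

Either way, the published summary names the operator and the scoring rule, which is the information a covert in-house comparison leaves out.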

For product teams, the lesson is practical. Benchmark your own systems openly, document your methods, and treat rival products the way you would want yours treated. Comparisons like Claude vs ChatGPT hold up because they are reproducible, not because they were run in the dark.
