AI in the Real World: Successes, Failures, and What They Teach

The biggest real AI deployments of 2024 to 2026, wins and failures, biggest companies first. Every number is labeled vendor-reported or independently verified and tied to a primary source, followed by the patterns that separate success from failure.

Added 23 Jun 2026 8 min read Updated 23 Jun 2026

#use-cases #case-studies #enterprise-ai #failures #roi #trends #risk #intermediate


Most AI writing is either a press release or a panic. This guide is neither. It collects the biggest real AI deployments from 2024 to 2026, the wins and the failures, and grades each one by how trustworthy the evidence is. The single most useful habit when reading any AI claim is to ask two questions: who is the source, and was it independently checked? Throughout this page, every number is tagged (vendor-reported) when it comes from the company that benefits from it, or (independently verified) when it comes from a court, a regulator, a peer-reviewed study, or an audited filing. Treat the second kind as far stronger than the first.
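The tagging rule above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Evidence`, `Claim`, and `strongest_first` are hypothetical, the sketch covers just the two tags defined in this paragraph, and the two sample claims are paraphrased from entries further down this page.

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence grades from this page; a higher value means stronger evidence."""
    VENDOR_REPORTED = 1         # published by the company that benefits from it
    INDEPENDENTLY_VERIFIED = 2  # court, regulator, peer-reviewed study, audited filing


@dataclass(frozen=True)
class Claim:
    headline: str
    source: str
    evidence: Evidence


def strongest_first(claims):
    """Return the claims sorted so the most trustworthy evidence comes first."""
    return sorted(claims, key=lambda c: c.evidence, reverse=True)


# Hypothetical sample data, paraphrased from this page.
claims = [
    Claim("AI assistant did the work of 700 agents",
          "Klarna press release", Evidence.VENDOR_REPORTED),
    Claim("AI-supported screening found ~29% more cancers",
          "MASAI randomized trial", Evidence.INDEPENDENTLY_VERIFIED),
]

for claim in strongest_first(claims):
    print(f"[{claim.evidence.name}] {claim.headline} ({claim.source})")
```

Making the grade an ordered value rather than a free-text label is a design choice for the sketch: it lets a reader sort or filter claims by evidence strength before looking at the headline number.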

Figure: silhouettes at a railing watching a glowing red AI system below. Real AI outcomes are judged by the people who depend on them, not by the launch announcement. An AI system is judged by what it does in production, not by what its launch post claims. This page separates the two.

The hype-to-reality arc

Nearly every big AI story follows the same shape. Learn the shape and you can place any new headline on it.

Stage 1: Announcement. A company publishes an impressive metric. It is vendor-reported and rarely audited.

Stage 2: Adoption. Others deploy on the strength of the claim, often before evidence exists.

Stage 3: Measurement. Independent studies, regulators, or real use test the claim.

Stage 4: Outcome. The win holds, gets revised down, or reverses. This is the part that matters.

The biggest successes

Ordered with the most independently verified evidence first, because that is the evidence you can trust.

Independently verified wins

DeepMind AlphaFold predicted 200M+ protein structures (independently verified). AlphaFold’s structure database expanded known protein structures roughly 200-fold to over 200 million, and is used by more than 2 million researchers. Demis Hassabis and John Jumper won the 2024 Nobel Prize in Chemistry for it. This is the clearest large-scale scientific AI win, validated by citations, patents, and a Nobel rather than a marketing page.

AI-supported mammography found ~29% more cancers (independently verified). The Swedish MASAI randomized trial of about 106,000 women found that AI-supported screening detected roughly 29% more cancers while cutting radiologist reading workload by about 44%, published in The Lancet Digital Health. It kept a human radiologist as the final reader, which is a recurring feature of durable healthcare AI.

Vendor-reported wins (strong, but unaudited)

Klarna’s AI assistant did “the work of 700 agents” (vendor-reported, later partly reversed). Klarna and OpenAI said the assistant handled 2.3 million chats in its first month, cut resolution time from 11 minutes to under 2, and was on track for about $40M in profit improvement, per Klarna’s press release. Read this one with its sequel below: in 2025 Klarna said it cut too far and began rehiring humans.

Salesforce Agentforce resolves ~75% of internal support autonomously (vendor-reported). Salesforce reports its own help portal handled over 1 million conversations a year at roughly a 75% autonomous resolution rate, per the Salesforce blog. It is a credible “dogfooding” example, but the numbers are the vendor’s own.

Anthropic’s Claude Code grew from $0 to about $8B run-rate in under a year (vendor-reported). Anthropic reports Claude Code reached roughly $1B annualized within about six months of launch and around $8B by mid-2026, and that 80%+ of code merged to production at Anthropic in May 2025 was Claude-authored, per Anthropic’s Series G announcement. Treat the 80% figure as a best case from a company using its own tool.

GitHub Copilot made developers 55% faster on a benchmark task (vendor-reported, scope caveat). A randomized study by GitHub authors found Copilot users completed a specific task (building an HTTP server) about 55.8% faster, published on arXiv. The design was randomized, which is good, but the task was a greenfield toy, not maintenance of a mature codebase. See the contradicting independent study below.

Other notable vendor-reported wins. JPMorgan’s COIN reportedly reviews commercial loan contracts and saves around 360,000 lawyer hours a year (a widely cited but aging 2017 figure). Mastercard says generative AI lifts fraud detection by about 20% on average, scoring transactions in under 50 milliseconds, per CNBC. Insilico Medicine reported a positive Phase IIa topline for an AI-designed drug for idiopathic pulmonary fibrosis, per Insilico (early-stage, vendor topline). Moderna deployed 750+ custom GPTs across the company, per OpenAI’s case study (adoption, not audited P&L).

The biggest failures and reversals

This is the half of the story your competitors leave out. Each one is a lesson.

Klarna and Commonwealth Bank reversed AI-for-headcount cuts (independently verified). Klarna’s CEO told Bloomberg the company “went too far,” quality dropped, and it began rehiring human agents, per Fortune. Commonwealth Bank of Australia cut 45 support roles citing a bot that “reduced calls by 2,000 a week,” then admitted an “error” and reinstated the workers after the union showed call volumes were actually rising, per Bloomberg. Lesson: cutting humans before quality is proven gets reversed in public.

AI tools made experienced developers 19% slower (independently verified). In a randomized trial by the nonprofit METR, 16 experienced open-source developers working in mature repositories with Cursor and Claude were 19% slower with AI, even though they predicted a 24% speedup. The time went into verifying unreliable output. The study is published on arXiv. Lesson: the productivity feeling and the productivity fact can point in opposite directions.
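To make the gap between the forecast and the measurement concrete, here is a small worked calculation. It is a sketch under assumptions, not the study's own method: it reads both percentages as changes in completion time relative to a no-AI baseline, the 60-minute baseline is invented for illustration, and the function name `perceived_vs_measured` is hypothetical.

```python
def perceived_vs_measured(baseline_minutes, predicted_change, measured_change):
    """Compare expected and measured task time against a no-AI baseline.

    Changes are fractions of the baseline: negative means faster,
    positive means slower.
    """
    expected = baseline_minutes * (1 + predicted_change)
    measured = baseline_minutes * (1 + measured_change)
    return expected, measured, measured - expected


# Assumed 60-minute task; -0.24 is the predicted speedup, +0.19 the measured slowdown.
expected, measured, gap = perceived_vs_measured(60, -0.24, 0.19)
print(f"expected with AI: {expected:.1f} min")
print(f"measured with AI: {measured:.1f} min")
print(f"gap between feeling and fact: {gap:.1f} min")
```

On that assumed baseline, developers would expect to finish in roughly 46 minutes and actually take roughly 71, so the error in perception is larger than either headline percentage suggests on its own.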

Builder.ai collapsed; its “AI” allegedly relied on about 700 human engineers (alleged, under investigation). The Microsoft-backed startup once valued near $1.5B went insolvent in 2025 amid reports of revenue round-tripping and human-powered “AI,” per TechCrunch. The fraud allegations are not yet adjudicated. Lesson: “AI” branding can hide outsourced labor.

Zillow’s home-buying algorithm lost over $500M (independently verified). Zillow’s pricing model overpaid as the market turned, forcing a $304M inventory write-down, the shutdown of Zillow Offers, and roughly 2,000 job cuts, per Zillow’s own SEC filing. Lesson: a model trained on one market regime breaks when the regime shifts.

IBM Watson for Oncology was abandoned after $62M (independently verified). A partnership with MD Anderson cost over $62M, never integrated with the hospital’s records system, and was shelved without treating a patient outside Houston, per IEEE Spectrum. Lesson: the canonical case of AI overpromising in healthcare.

Amazon’s “Just Walk Out” and McDonald’s drive-thru AI were pulled (independently reported). Amazon’s cashierless checkout reportedly relied on around 1,000 reviewers in India and was dropped from Fresh grocery stores, per the Washington Times (Amazon disputes the framing). McDonald’s ended its IBM voice-ordering test after viral errors, per CNBC. Lesson: “autonomous” often hides a human, and narrow tasks still fail below a high accuracy bar.

Air Canada was held liable for its chatbot’s wrong advice (independently verified). A tribunal rejected the airline’s argument that its chatbot was a separate entity and ordered it to honor a policy the bot invented, per McCarthy Tetrault. Lesson: you own whatever your AI tells a customer.

Deloitte refunded a government over AI-fabricated citations (independently reported). A paid report for the Australian government contained nonexistent references and a fabricated court quote produced with a language model; Deloitte refunded part of the fee, per Fortune. Lesson: hallucinations reach paid professional deliverables. Verify every citation.

Cruise hid a pedestrian-dragging incident and shut down (independently verified). After its robotaxi dragged a pedestrian about 20 feet, Cruise filed a misleading report, paid a $500K federal penalty, lost its permits, and was folded into GM, per the US Department of Justice. Lesson: it was the concealment, not the crash alone, that ended the company.

A deepfake video call cost engineering firm Arup about $25.6M (independently reported). An employee joined a video call where every other participant was an AI deepfake of company executives, then made 15 transfers, per CNN. Lesson: live video and voice are no longer proof of identity. Verify large transfers out of band.

The systemic reality: most pilots do not pay off

The single most important number for anyone planning an AI project: an MIT report found that about 95% of enterprise generative-AI pilots showed no measurable profit-and-loss impact, despite $30B to $40B invested (independent, though its method has been debated). The same study found that buying or partnering succeeded roughly twice as often as building in-house, and that the biggest returns came from back-office automation, not the customer-facing projects that attract the most budget. This is the backdrop against which every success story above should be read.

What separates a win from a failure

The pattern across all of this is consistent.

Factor	AI wins look like	AI failures look like
Task type	Narrow, with a clear correct answer	Open-ended, no ground truth
Human role	Human in the loop, AI assists	Full replacement, no fallback
Where value lands	Back-office and operations	Customer-facing flash
Build vs buy	Buy or partner, proven tools	Big in-house bet, unproven
Validation	Independent or peer-reviewed	Vendor metric, unaudited
Failure handling	Reversible, monitored	Concealed, no rollback

How to use this page

When you read the next big AI announcement, run it through the arc. Is the number vendor-reported or independently verified? Is the task narrow with a clear answer, or open-ended? Is a human still in the loop? Is there a fallback if it fails? The companies that got burned skipped these questions. The ones that succeeded, quietly, almost always answered yes to the boring ones.
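The four questions above can be written down as a short checklist. This is a minimal sketch, not a scoring standard: the `Announcement` fields, the `open_questions` function, and the sample announcement are all hypothetical, and the wording of each warning is taken from the questions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    """Answers to the four questions for one AI announcement (hypothetical model)."""
    independently_verified: bool  # False means the number is vendor-reported
    narrow_task: bool             # narrow task with a clear correct answer
    human_in_loop: bool           # a human still reviews or decides
    has_fallback: bool            # there is a rollback if the system fails


def open_questions(announcement):
    """Return the warnings for every question the announcement fails."""
    checks = [
        (announcement.independently_verified, "number is vendor-reported, not independently verified"),
        (announcement.narrow_task, "task is open-ended, with no clear correct answer"),
        (announcement.human_in_loop, "no human is left in the loop"),
        (announcement.has_fallback, "no fallback if it fails"),
    ]
    return [warning for passed, warning in checks if not passed]


# Invented example: a vendor-reported metric for a narrow task with full replacement.
example = Announcement(
    independently_verified=False,
    narrow_task=True,
    human_in_loop=False,
    has_fallback=False,
)

for warning in open_questions(example):
    print("-", warning)
```

An empty list does not prove the deployment will succeed; it only means none of the warning signs described on this page are present.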

This page is a living snapshot and will be updated as outcomes change. For the longer arc of why technologies rise and fade, see the history of IT and how technology dies. For where the claims come from in the first place, see how to read technology trends.