Quick Answer
Computer vision is the field of AI that enables software to understand images and video: recognising objects, detecting faces, reading text, tracking motion, and interpreting what is happening in a visual scene. Modern computer vision uses deep learning (convolutional neural networks and vision transformers) and is embedded in products ranging from smartphone cameras and medical imaging tools to warehouse robots and self-driving cars.
Computer vision acts like a precision optical instrument. A camera captures raw light; a vision model focuses that signal into structured understanding of which objects are where and what they are doing.

The core problem computer vision solves

Humans recognise objects effortlessly. You glance at a photo and instantly know it contains two people, a dog, a coffee table, and a window. Yet this task, which takes your brain under a second, has no simple programmatic solution.

You cannot write code that says “if pixels at position (100, 200) through (300, 400) are brown-ish, it is a dog.” Dogs vary wildly in colour, size, breed, pose, and lighting. Computer vision solves this by training neural networks on millions of labelled images until they learn the features that define each class.
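To make the brittleness concrete, here is a minimal sketch of such a hand-written rule. The rule, the thresholds, and the three sample regions are all invented for illustration; nothing here comes from a real dataset.

```python
def mean_rgb(pixels):
    """Average (r, g, b) over a list of RGB tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def looks_like_dog(pixels):
    """Hypothetical rule: a region is a dog if its average colour is brown-ish."""
    r, g, b = mean_rgb(pixels)
    # "Brown-ish" here means red clearly above green, and green clearly above blue.
    return r > g + 20 and g > b + 20


# Made-up sample regions, four pixels each.
brown_dog = [(140, 90, 50), (150, 100, 60), (135, 85, 45), (145, 95, 55)]
white_dog = [(240, 238, 235), (250, 250, 248), (245, 244, 240), (235, 233, 230)]
coffee_table = [(120, 75, 40), (130, 80, 45), (125, 78, 42), (118, 72, 38)]

for name, region in [
    ("brown dog", brown_dog),
    ("white dog", white_dog),
    ("coffee table", coffee_table),
]:
    print(f"{name}: rule says dog = {looks_like_dog(region)}")
```

The rule accepts the brown dog, misses the white dog, and mistakes the coffee table for a dog. Adding more hand-written conditions only moves the failures around, which is why learned features took over.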

Core computer vision tasks

Classification
What is in this image? Output: a single label. Used in: product categorisation, quality control pass/fail, content moderation.
Object detection
What objects are where? Output: bounding boxes plus labels. Used in: retail shelf analysis, security cameras, autonomous driving, defect detection.
Segmentation
Where exactly are the boundaries? Output: pixel-level boundaries, either semantic (a class per pixel) or instance (each object separately). Used in: medical imaging, satellite imagery analysis, augmented reality.
OCR and document understanding
Covers optical character recognition, table extraction, and form field detection. Used in: invoice processing, contract extraction, identity document verification.
Video analysis
Covers action recognition, object tracking, and anomaly detection. Used in: sports analysis, security monitoring, manufacturing process control.
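One way to see how the first three tasks differ is to look at the shape of their outputs. The sketch below is illustrative only: the `Classification` and `Detection` classes, the tiny masks, and all values are assumptions made up for this example, not the output format of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    label: str
    confidence: float  # 0.0 to 1.0


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Semantic segmentation: one class id per pixel (0 = background, 1 = dog).
semantic_mask = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
]

# Instance segmentation: one object id per pixel, so the two dogs stay separate.
instance_mask = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
]


def count_ids(mask):
    """Number of distinct non-background ids in a mask."""
    return len({value for row in mask for value in row if value != 0})


classification = Classification("dog", 0.97)
detections = [
    Detection("dog", 0.95, (10, 20, 110, 140)),
    Detection("dog", 0.91, (150, 25, 250, 145)),
]

print(classification)
print(len(detections), "bounding boxes")
print("semantic classes:", count_ids(semantic_mask))
print("instances:", count_ids(instance_mask))
```

The same two-dog image yields one label for classification, two boxes for detection, one class in the semantic mask, and two separate objects in the instance mask.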

How computer vision works

The dominant architecture for computer vision is the convolutional neural network (CNN), with vision transformers (ViTs) increasingly replacing CNNs for large-scale tasks.

How a CNN sees an image:

  1. The first layers detect simple features: edges, corners, gradients
  2. Middle layers combine simple features into shapes and textures
  3. Deeper layers combine shapes into object parts (wheel, window, door)
  4. The final layers combine parts into objects (car)
  5. The output layer produces a probability score for each class
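Step 1 can be sketched in a few lines of plain Python. The kernel below is hand-picked (Sobel-like) purely for illustration; in a trained CNN the kernel values are learned from data. The toy image and the `conv2d` helper are assumptions of this sketch.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Toy 4x6 greyscale image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-picked vertical-edge kernel; a real network learns its own.
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is zero over the flat regions and large where the dark-to-bright boundary sits, which is what "detecting an edge" means in practice. The pre-trained ResNet-50 example that follows stacks many layers of such learned filters.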
python
import torch
import torchvision.transforms as transforms
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# Load a pre-trained ResNet-50 model (trained on 1.2M ImageNet images)
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Preprocess the image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

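# Convert to RGB so greyscale or RGBA files still yield the 3 channels the model expects
image = Image.open("product_photo.jpg").convert("RGB")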
input_tensor = transform(image).unsqueeze(0)

# Run inference
with torch.no_grad():
    output = model(input_tensor)
    probabilities = torch.softmax(output, dim=1)
    top5 = torch.topk(probabilities, 5)

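# Print the top 5 predictions with confidence scores
categories = ResNet50_Weights.IMAGENET1K_V2.meta["categories"]
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx.item()]}: {prob.item():.1%}")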
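The softmax and top-k steps in the inference code can be written out by hand to show what they compute. This is a plain-Python sketch with made-up logits and four hypothetical class names; it is not how PyTorch implements these operations internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up raw scores for four hypothetical classes.
classes = ["car", "truck", "bicycle", "dog"]
logits = [4.0, 2.5, 0.5, -1.0]

probs = softmax(logits)
for idx, p in top_k(probs, 2):
    print(f"{classes[idx]}: {p:.1%}")
```

Softmax preserves the ranking of the raw scores, so top-k on the probabilities picks the same classes as top-k on the logits; the conversion matters when you want to apply a confidence threshold.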

Using computer vision via API

For most production use cases, cloud APIs are faster to deploy than building and hosting your own models.

python
import boto3

client = boto3.client('rekognition', region_name='eu-west-1')

with open('product.jpg', 'rb') as image_file:
    image_bytes = image_file.read()

# Detect objects and labels
response = client.detect_labels(
    Image={'Bytes': image_bytes},
    MaxLabels=10,
    MinConfidence=70
)

for label in response['Labels']:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
    # Output examples:
    # Person: 99.8%
    # Clothing: 94.2%
    # Laptop: 87.3%
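For labels that correspond to located objects, the `detect_labels` response also carries an `Instances` list whose `BoundingBox` values are ratios of the image width and height rather than pixels. The sketch below converts them to pixel coordinates. The sample response is hand-written with made-up values, the image size is assumed, and `to_pixel_box` is a helper invented for this example.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox dict to (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


# Hand-written sample in the shape of a detect_labels response; values are made up.
sample_response = {
    "Labels": [
        {
            "Name": "Laptop",
            "Confidence": 87.3,
            "Instances": [
                {
                    "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25},
                    "Confidence": 85.0,
                }
            ],
        },
        # Scene-level labels typically come back with no instances.
        {"Name": "Clothing", "Confidence": 94.2, "Instances": []},
    ]
}

image_width, image_height = 1200, 800  # assumed image size in pixels

for label in sample_response["Labels"]:
    for instance in label["Instances"]:
        box = to_pixel_box(instance["BoundingBox"], image_width, image_height)
        print(label["Name"], box)
```

Keeping the conversion in one helper makes it easy to draw boxes or crop regions later without scattering width and height arithmetic through the code.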

Multimodal LLMs: computer vision plus language

Modern multimodal LLMs combine vision understanding with language generation. Instead of getting a label list, you ask questions about the image in plain language:

python
from anthropic import Anthropic
import base64

client = Anthropic()

with open("invoice.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all line items from this invoice and return as JSON with fields: description, quantity, unit_price, total."
                }
            ],
        }
    ],
)

print(response.content[0].text)
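The reply is free text, so it is worth parsing and sanity-checking before anything downstream relies on it. The following is a minimal sketch, assuming the model returns a JSON array possibly wrapped in extra prose. The `reply` string is made up and stands in for `response.content[0].text`; both helper functions are invented for this example.

```python
import json


def extract_json_array(text):
    """Pull the first JSON array out of a reply that may include extra prose."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    return json.loads(text[start:end + 1])


def check_line_items(items, tolerance=0.01):
    """Return descriptions of items where quantity * unit_price does not match total."""
    return [
        item["description"]
        for item in items
        if abs(item["quantity"] * item["unit_price"] - item["total"]) > tolerance
    ]


# Made-up reply text standing in for response.content[0].text.
reply = """Here are the line items:
[
  {"description": "Widget", "quantity": 3, "unit_price": 4.50, "total": 13.50},
  {"description": "Gadget", "quantity": 2, "unit_price": 10.00, "total": 25.00}
]"""

items = extract_json_array(reply)
print(len(items), "items parsed")
print("needs review:", check_line_items(items))
```

An arithmetic check like this catches both misread digits and genuine invoice errors; either way, the flagged rows are the ones a person should look at.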

Step 1: Define the task. Decide between classification, detection, segmentation, OCR, or open-ended visual Q&A. The task determines the right model and API.
Step 2: Choose an access route. Use a cloud API (AWS Rekognition, Google Vision, Azure) for standard tasks, a multimodal LLM (GPT-4o, Claude) for complex visual reasoning, or an open-source model for custom tasks that need fine-tuning.
Step 3: Integrate and validate. Test on representative images from your actual use case. Accuracy on stock photos can differ significantly from accuracy on your own image types and conditions.
Step 4: Fine-tune if needed. If a pre-trained model's accuracy is insufficient on your specific images (unusual products, niche defect types), fine-tune with 50-500 labelled examples of your actual data.
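The validation in Step 3 can be as simple as comparing predictions against a small hand-labelled set and breaking accuracy down per class. This is a sketch under assumed data: the labels and results are made up, and `per_class_accuracy` is a helper written for this example.

```python
from collections import defaultdict


def per_class_accuracy(examples):
    """examples: list of (true_label, predicted_label). Returns {label: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_label, predicted in examples:
        total[true_label] += 1
        if predicted == true_label:
            correct[true_label] += 1
    return {label: correct[label] / total[label] for label in total}


# Made-up validation results for a hypothetical defect-detection task.
results = [
    ("scratch", "scratch"), ("scratch", "scratch"), ("scratch", "ok"), ("scratch", "scratch"),
    ("dent", "ok"), ("dent", "dent"),
    ("ok", "ok"), ("ok", "ok"), ("ok", "ok"), ("ok", "scratch"),
]

accuracy = per_class_accuracy(results)
overall = sum(true == pred for true, pred in results) / len(results)

print(f"overall: {overall:.0%}")
for label, acc in accuracy.items():
    print(f"{label}: {acc:.0%}")
```

The per-class view matters because an overall figure can hide a weak class, here the rare "dent" class, and that class is often the first candidate for the fine-tuning described in Step 4.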

Computer vision services comparison

Service | Best for | Pricing (approx.)
AWS Rekognition | Face detection, content moderation, label detection | €0.001-€0.004/image
Google Cloud Vision | OCR, logo detection, label detection | €0.0015-€0.006/image
Azure Computer Vision | Document analysis, OCR, spatial analysis | €0.001-€0.004/image
AWS Textract | Structured document extraction (forms, tables) | €0.015/page
GPT-4o / Claude | Complex visual reasoning, multimodal Q&A | €0.003-€0.015/image
YOLO (open source) | Real-time object detection, self-hosted | Free (self-host cost only)
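Per-image prices look small until they are multiplied by volume. The sketch below turns the approximate per-image ranges from the table into a monthly estimate. The volume of 100,000 images per month is an assumption chosen for illustration, and real bills depend on tiers, features used, and region.

```python
# Approximate per-image price ranges in euros, copied from the comparison table.
PRICE_RANGES = {
    "AWS Rekognition": (0.001, 0.004),
    "Google Cloud Vision": (0.0015, 0.006),
    "Azure Computer Vision": (0.001, 0.004),
    "GPT-4o / Claude": (0.003, 0.015),
}


def monthly_cost(images_per_month, price_range):
    """Return the (low, high) monthly cost for a given volume and price range."""
    low, high = price_range
    return images_per_month * low, images_per_month * high


images_per_month = 100_000  # assumed volume, for illustration only

for service, price_range in PRICE_RANGES.items():
    low, high = monthly_cost(images_per_month, price_range)
    print(f"{service}: €{low:,.0f} to €{high:,.0f} per month")
```

At this assumed volume the spread between the cheapest and most expensive option is roughly an order of magnitude, which is worth weighing against how much visual reasoning the task really needs.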

When not to use off-the-shelf computer vision

Highly specific domain images: A model trained on consumer photos will underperform on satellite imagery, manufacturing defect photos, or medical scans. These domains require fine-tuning on domain-specific training data.

Real-time on-device requirements: Cloud APIs typically add 100-500 ms of latency per request. For real-time applications on edge devices (cameras, robots, phones), you need lightweight models like MobileNet or YOLO that run locally.
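A quick calculation shows why that latency rules out cloud calls for live video. The sketch assumes a 30 fps camera and one request per frame with no batching or pipelining; both are simplifying assumptions for illustration.

```python
def frame_budget_ms(fps):
    """Time available to process each frame at a given frame rate."""
    return 1000.0 / fps


def max_sequential_fps(latency_ms):
    """Frames per second achievable if each frame waits for the previous reply."""
    return 1000.0 / latency_ms


budget = frame_budget_ms(30)  # assumed 30 fps camera
print(f"budget per frame at 30 fps: {budget:.1f} ms")

for latency in (100, 500):  # cloud latency range quoted in the text
    print(
        f"{latency} ms round trip: at most {max_sequential_fps(latency):.0f} fps, "
        f"{latency / budget:.0f}x over the frame budget"
    )
```

Even at the optimistic end of the range, a round trip costs about three frame budgets, so a local model that answers within tens of milliseconds is the practical option for frame-by-frame work.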

Highly regulated medical contexts: Medical AI in EU healthcare must comply with the Medical Device Regulation (MDR) and the EU AI Act’s high-risk provisions. Consumer-grade APIs are not certified for clinical diagnosis.
