What is Computer Vision?
Computer vision is the AI field that enables software to understand images and video. Plain-English guide covering how it works and where it is used in 2026.

The core problem computer vision solves
Humans recognise objects effortlessly. You glance at a photo and instantly know it contains two people, a dog, a coffee table, and a window. Your brain does this in under a second, yet the task has no simple rule-based programmatic solution.
You cannot write code that says “if pixels at position (100, 200) through (300, 400) are brown-ish, it is a dog.” Dogs vary wildly in colour, size, breed, pose, and lighting. Computer vision solves this by training neural networks on millions of labelled images until they learn the features that define each class.
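To make the failure of hand-written rules concrete, here is a minimal sketch of the kind of colour check described above. The function names, the "brown-ish" thresholds, and the tiny synthetic images are all hypothetical and chosen purely for illustration.

```python
# Hypothetical hand-written rule: "mostly brown-ish pixels means dog".
# Images are plain lists of (R, G, B) tuples; the thresholds are illustrative guesses.

def is_brownish(pixel):
    r, g, b = pixel
    return r > g > b and 90 <= r <= 200 and b < 100

def looks_like_dog(pixels, min_fraction=0.5):
    # Flag the image as "dog" if at least min_fraction of its pixels are brown-ish
    brown = sum(1 for p in pixels if is_brownish(p))
    return brown / len(pixels) >= min_fraction

# Synthetic stand-ins for real photos (assumed data, not real images)
brown_sofa = [(150, 100, 60)] * 90 + [(240, 240, 240)] * 10  # no dog, lots of brown
white_dog = [(235, 235, 230)] * 80 + [(40, 120, 40)] * 20    # white dog on grass

sofa_result = looks_like_dog(brown_sofa)
dog_result = looks_like_dog(white_dog)
print("brown sofa flagged as dog:", sofa_result)
print("white dog flagged as dog:", dog_result)
```

The rule fires on the sofa and misses the dog. Tuning the thresholds only moves the errors around, which is why learned features replaced rules of this kind.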
Core computer vision tasks
How computer vision works
The dominant architecture for computer vision has long been the Convolutional Neural Network (CNN), with Vision Transformers (ViTs) increasingly replacing CNNs for large-scale tasks.
How a CNN sees an image:
- The first layers detect simple features: edges, corners, gradients
- Middle layers combine simple features into shapes and textures
- Deeper layers combine shapes into object parts (wheel, window, door)
- The final layers combine parts into objects (car)
- The output layer produces a probability score for each class
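The first and last steps of that list can be sketched in plain Python without any framework. This is an illustration under stated assumptions, not how a production CNN is built: the vertical-edge kernel is hand-picked here (a real CNN learns its kernels during training), and the three class scores fed to `softmax` are made-up numbers.

```python
import math

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# 5x5 greyscale image: dark on the left (0), bright on the right (1)
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Sobel-style kernel that responds to vertical edges (hand-picked for this sketch)
vertical_edge = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)  # zero over the flat dark area, positive where dark meets bright

# Hypothetical raw scores for three classes, as an output layer might produce
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The feature map is zero where the image is uniform and positive where the dark-to-bright transition falls inside the kernel window. Stacking many such learned filters is what lets deeper layers respond to shapes and object parts.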
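
import torch
import torchvision.transforms as transforms
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# Load a pre-trained ResNet-50 model (trained on 1.2M ImageNet images)
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.eval()  # inference mode: batch norm uses its stored running statistics

# Preprocess the image: resize, centre-crop to 224x224, scale to [0, 1],
# then normalise with the ImageNet channel means and standard deviations
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = Image.open("product_photo.jpg").convert("RGB")  # force 3 channels (greyscale/RGBA files would otherwise fail)
input_tensor = transform(image).unsqueeze(0)  # add a batch dimension: (1, 3, 224, 224)

# Run inference
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.softmax(output, dim=1)
top5 = torch.topk(probabilities, 5)

# Top 5 predictions with confidence scores
categories = weights.meta["categories"]  # ImageNet class names bundled with the weights
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[int(idx)]}: {float(prob):.1%}")

Using computer vision via API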
For most production use cases, cloud APIs are faster to deploy than building and hosting your own models.

import boto3

client = boto3.client('rekognition', region_name='eu-west-1')

with open('product.jpg', 'rb') as image_file:
    image_bytes = image_file.read()

# Detect objects and labels
response = client.detect_labels(
    Image={'Bytes': image_bytes},
    MaxLabels=10,
    MinConfidence=70
)

for label in response['Labels']:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")

# Output examples:
# Person: 99.8%
# Clothing: 94.2%
# Laptop: 87.3%

Multimodal LLMs: computer vision plus language
Modern multimodal LLMs combine vision understanding with language generation. Instead of getting a label list, you ask questions about the image in plain language:

from anthropic import Anthropic
import base64

client = Anthropic()

# Read the image and encode it as base64 text for the request body
with open("invoice.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all line items from this invoice and return as JSON with fields: description, quantity, unit_price, total."
                }
            ],
        }
    ],
)
print(response.content[0].text)

Computer vision services comparison
| Service | Best for | Pricing (approx.) |
|---|---|---|
| AWS Rekognition | Face detection, content moderation, label detection | €0.001-€0.004/image |
| Google Cloud Vision | OCR, logo detection, label detection | €0.0015-€0.006/image |
| Azure Computer Vision | Document analysis, OCR, spatial analysis | €0.001-€0.004/image |
| AWS Textract | Structured document extraction (forms, tables) | €0.015/page |
| GPT-4o / Claude | Complex visual reasoning, multimodal Q&A | €0.003-€0.015/image |
| YOLO (open source) | Real-time object detection, self-hosted | Free (self-host cost only) |
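
As a rough way to compare these options, the sketch below turns the approximate per-unit prices from the table into a monthly cost range. The monthly volume of 500,000 images is a hypothetical figure, the helper name `monthly_cost_range` is made up for this example, and real bills depend on tiering, region, and the features used.

```python
# Approximate per-unit price ranges in EUR, copied from the table above.
PRICE_RANGES = {
    "AWS Rekognition": (0.001, 0.004),
    "Google Cloud Vision": (0.0015, 0.006),
    "Azure Computer Vision": (0.001, 0.004),
    "AWS Textract": (0.015, 0.015),  # priced per page, not per image
    "GPT-4o / Claude": (0.003, 0.015),
}

def monthly_cost_range(service, units_per_month):
    """Return the (low, high) estimated monthly cost in EUR for a service."""
    low, high = PRICE_RANGES[service]
    return low * units_per_month, high * units_per_month

volume = 500_000  # hypothetical number of images (or pages) per month

for service in PRICE_RANGES:
    low, high = monthly_cost_range(service, volume)
    print(f"{service}: €{low:,.0f} to €{high:,.0f} per month")
```

At high volumes the spread between the low and high ends of each range is large enough that self-hosting an open-source model such as YOLO can become worth costing out as well.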
When not to use off-the-shelf computer vision
Highly specific domain images: A model trained on consumer photos will underperform on satellite imagery, manufacturing defect photos, or medical scans. These domains require fine-tuning on domain-specific training data.
Real-time on-device requirements: Cloud APIs add 100-500ms of latency. For real-time applications on edge devices (cameras, robots, phones), you need lightweight models like MobileNet or YOLO that run locally.
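
The latency point is easy to check with simple arithmetic. The sketch below assumes a hypothetical 30 fps camera and uses the 100-500ms range quoted above; the function names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available to process one frame before the next one arrives."""
    return 1000.0 / fps

def frames_arriving_during(latency_ms, fps):
    """Whole frames captured while a single cloud request is in flight."""
    return latency_ms * fps // 1000  # integer arithmetic avoids float rounding

fps = 30  # hypothetical camera frame rate

print(f"Budget per frame at {fps} fps: {frame_budget_ms(fps):.1f} ms")
for latency_ms in (100, 500):
    missed = frames_arriving_during(latency_ms, fps)
    print(f"{latency_ms} ms round trip: {missed} new frames arrive before the answer does")
```

Even the best-case round trip is about three times the per-frame budget, so the result always describes a scene that is already several frames old.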
Highly regulated medical contexts: Medical AI in EU healthcare must comply with the Medical Device Regulation (MDR) and the EU AI Act’s high-risk provisions. Consumer-grade APIs are not certified for clinical diagnosis.
What’s next
- What is Machine Learning? : The training process behind computer vision models
- What is a Neural Network? : CNNs and ViTs are neural networks: understanding the architecture helps
- AWS Rekognition : Deep dive on the AWS managed computer vision service
Further reading
- AWS Rekognition documentation : API reference for face detection, label detection, and content moderation
- PyTorch Computer Vision Tutorial : How to train an object detection model with your own images
- Roboflow : Platform for labelling images and training custom object detection models, used by teams without ML expertise
- Papers With Code: Computer Vision : Benchmark results and state-of-the-art models for all CV tasks
Frequently asked questions