Training is a foundry. PyTorch is the machinery that pours gradients through a model until its weights take the shape the data demands.

PyTorch is an open-source deep learning framework that combines a NumPy-like tensor library with GPU acceleration, a reverse-mode automatic differentiation engine, and higher-level building blocks for defining and training neural networks. Its defining trait is define-by-run, also called eager execution: the computation graph is built dynamically as your Python runs, so a model is ordinary, debuggable Python rather than a static graph you compile first. It began at Meta AI and is now governed by the PyTorch Foundation under the Linux Foundation. It is the framework most new AI research and most open-weight large language models are written in.
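To make define-by-run concrete, here is a minimal sketch in which plain Python control flow decides how much computation runs on each call. The `DynamicDepth` module and its threshold are hypothetical, chosen only for illustration; the point is that the graph autograd differentiates is whichever path actually executed.

```python
import torch
from torch import nn


class DynamicDepth(nn.Module):
    """Hypothetical module whose depth depends on the input at run time."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Ordinary Python control flow: the number of layer applications
        # is decided per call from the data itself.
        steps = 1 if x.abs().mean() < 1.0 else 3
        for _ in range(steps):
            x = torch.tanh(self.layer(x))
        return x


torch.manual_seed(0)
model = DynamicDepth()
small = torch.full((4, 8), 0.1)   # takes the one-step path
large = torch.full((4, 8), 5.0)   # takes the three-step path

out = model(large)
out.sum().backward()              # gradients flow through the path that ran
grad_shape = model.layer.weight.grad.shape
```

Because `forward` is ordinary Python, you can set a breakpoint or add a `print` inside the loop and inspect real tensor values mid-computation.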

Where PyTorch sits

PyTorch is the layer between your model code and the hardware. You describe a network and a training step in Python; PyTorch records the operations, computes gradients, and dispatches the math to a CPU, GPU, or other accelerator. Most people use it through higher-level libraries that build on top of it.

The stack, from top to bottom:

  • Ecosystem (high-level training and model libraries): Hugging Face Transformers, PyTorch Lightning, torchvision
  • Core API (layers, gradients, optimisers, distributed training): torch.nn, autograd, torch.optim, DDP / FSDP2
  • Compiler and runtime (graph capture and kernel generation for speed and edge): torch.compile, TorchInductor, ExecuTorch
  • Hardware (same code runs across backends): NVIDIA CUDA, AMD ROCm, Apple MPS, CPU

How autograd works

The heart of PyTorch is autograd, its automatic differentiation engine. When a tensor is marked requires_grad=True, every operation on it is recorded into a directed graph that also stores each operation’s local derivative. The forward pass builds this graph on the fly. Calling .backward() on the loss walks the graph in reverse, applies the chain rule at each node, and fills in each parameter’s gradient. PyTorch never builds full Jacobian matrices; it computes vector-Jacobian products, which is why reverse-mode is cheap for the common case of many parameters mapping to a single scalar loss. This is the mechanism every training loop relies on, and it is worth understanding before you rely on gradient descent in practice.
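The record-then-reverse mechanism can be sketched with a toy scalar engine in pure Python. This is not PyTorch's implementation: the `Value` class and its fields are hypothetical names invented for illustration, and it supports only addition and multiplication. It does show the three ingredients described above: recording parents and local derivatives in the forward pass, walking the graph in reverse, and accumulating gradients with the chain rule.

```python
class Value:
    """Toy scalar that records how it was computed (illustration only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # inputs to the op that produced this value
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the recorded graph, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                  # d(loss)/d(loss)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated


w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = w * x + b * b                     # loss = w*x + b^2
loss.backward()
print(loss.data, w.grad, x.grad, b.grad)  # 7.0 2.0 3.0 2.0
```

Note that `b.grad` accumulates two contributions because `b` is used twice. The same accumulation into `.grad` is why a PyTorch training loop clears old gradients with `optimizer.zero_grad()` before each backward pass.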

Installing PyTorch

Generate the exact command for your hardware from the official selector, since the index URL tracks your CUDA or ROCm version.

bash
# CPU only
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# NVIDIA GPU (CUDA 12.8 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Apple Silicon and default builds
pip install torch torchvision torchaudio

PyTorch follows a roughly quarterly release cadence on the 2.x line (2.12 as of mid-2026). On Apple Silicon the default build includes the Metal (MPS) backend, but it is not used implicitly: tensors and models still have to be moved to the "mps" device.
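After installing, a quick check like the following sketch confirms which backend the build can actually use. The `pick_device` helper is a hypothetical name, not part of PyTorch; it relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
import torch


def pick_device() -> str:
    """Return the best available backend name: 'cuda', 'mps', or 'cpu'."""
    if torch.cuda.is_available():           # also True on ROCm builds
        return "cuda"
    if torch.backends.mps.is_available():   # Apple Silicon with Metal
        return "mps"
    return "cpu"


device = pick_device()
print("torch", torch.__version__, "on", device)

# A tiny matmul confirms the chosen backend actually executes work.
a = torch.ones(2, 3, device=device)
b = torch.ones(3, 2, device=device)
result = (a @ b).cpu()                      # every entry should be 3.0
```

If this prints `cpu` on a machine with a GPU, the usual cause is a wheel built for a different CUDA or ROCm version than the installed driver supports.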

A real training loop

This is the canonical shape of PyTorch: define a module, run a forward pass, compute a loss, backpropagate, and step the optimiser.

python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

X = torch.randn(2048, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(2048, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

class MLP(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
    def forward(self, x):
        return self.net(x)

model = MLP(20).to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

model.train()
for epoch in range(10):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()     # clear old gradients
        loss = loss_fn(model(xb), yb)
        loss.backward()           # autograd fills every .grad
        optimizer.step()          # update the weights

Transfer learning with torch.compile

A common production pattern is transfer learning: freeze a pretrained backbone, train a small new head, and wrap the model in torch.compile for a graph-optimised speedup.

python
import torch
from torch import nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                       # freeze the backbone

model.fc = nn.Linear(model.fc.in_features, 10)    # new trainable head
model = model.to(device)

compiled = torch.compile(model, mode="max-autotune")  # graph capture + kernel gen
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

imgs = torch.randn(16, 3, 224, 224, device=device)
labels = torch.randint(0, 10, (16,), device=device)

optimizer.zero_grad()
loss = loss_fn(compiled(imgs), labels)
loss.backward()
optimizer.step()

Introduced in PyTorch 2.0, torch.compile captures the model into a graph with TorchDynamo, traces the backward pass ahead of time with AOTAutograd, and generates fused kernels with TorchInductor (Triton on GPUs, C++ on CPUs). The first call pays a compilation cost; later calls with compatible inputs reuse the optimised graph. It is opt-in and backward-compatible, so you add it without rewriting model code.
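A small sketch can show that a compiled function is a drop-in replacement for the eager one. The `gelu_mlp` function is hypothetical, and `backend="aot_eager"` is an assumption made so the example runs without a compiler toolchain: it exercises TorchDynamo capture and AOTAutograd but skips TorchInductor code generation, so it demonstrates correctness rather than speed. Dropping the `backend` argument selects the default TorchInductor path.

```python
import torch


def gelu_mlp(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Hypothetical two-layer function used only to exercise torch.compile."""
    return torch.nn.functional.gelu(x @ w1) @ w2


# "aot_eager" captures the graph and traces backward, but generates no kernels.
compiled_mlp = torch.compile(gelu_mlp, backend="aot_eager")

torch.manual_seed(0)
x = torch.randn(32, 16)
w1 = torch.randn(16, 64, requires_grad=True)
w2 = torch.randn(64, 4, requires_grad=True)

eager_out = gelu_mlp(x, w1, w2)
compiled_out = compiled_mlp(x, w1, w2)   # first call triggers graph capture
compiled_out.sum().backward()            # backward runs through the traced graph

matches = torch.allclose(eager_out, compiled_out, rtol=1e-4, atol=1e-4)
```

Comparing compiled and eager outputs on a fixed input like this is a cheap sanity check to run before measuring any speedup.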

The typical path from idea to model

Step 1: Define. Write a network as an nn.Module and load data with DataLoader.
Step 2: Train. Loop forward, loss, backward, step, using autograd for gradients.
Step 3: Scale. Shard across GPUs with FSDP2 and compile with torch.compile.
Step 4: Deploy. Serve with vLLM or TorchServe, or export to edge with ExecuTorch.

For very large models, PyTorch provides distributed training primitives: DistributedDataParallel replicates the model and all-reduces gradients across GPUs, while FSDP2 shards parameters, gradients, and optimiser state to fit models that no single GPU can hold. On the deployment side, ExecuTorch (GA in late 2025) exports and runs the same model on phones and embedded devices, closing PyTorch’s historical edge gap.
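The DistributedDataParallel half of this can be sketched in a single process. The example below is an illustration under stated assumptions, not a production launch script: it uses a world size of 1, the CPU-friendly `gloo` backend, and a temporary file for rendezvous so that it runs anywhere, whereas real jobs are normally started with one process per GPU. The function name `run_single_process_ddp_step` is hypothetical. FSDP2 sharding is not shown.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_single_process_ddp_step() -> float:
    """One DDP training step in a world of size 1 (illustration only)."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo",                     # CPU-friendly; NCCL is usual on NVIDIA GPUs
        init_method=f"file://{init_file}",  # file-based rendezvous, assumed for the demo
        rank=0,
        world_size=1,
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(20, 1))       # wrapper that keeps replicas in sync
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        xb, yb = torch.randn(64, 20), torch.randn(64, 1)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(xb), yb)
        loss.backward()                     # gradients are all-reduced during backward
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


step_loss = run_single_process_ddp_step()
```

The training step itself is unchanged from the single-device loop; the wrapper hooks into `backward()` to synchronise gradients, which is why moving from one GPU to many requires so little model code.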

How it compares

| | PyTorch | TensorFlow | JAX | Keras 3 |
| --- | --- | --- | --- | --- |
| Execution | Eager, compile optional | Eager, graphs via tf.function | Functional transforms | API over a backend |
| Autodiff | autograd | GradientTape | jax.grad | Delegates to backend |
| Research use | Dominant | Declining | Rising at scale | Sits on top |
| Edge | ExecuTorch | LiteRT (mature) | Limited | Via backend |
| Best for | New research, LLMs, fast iteration | Production, mobile, TPU | TPU-scale training | Portable multi-backend code |

PyTorch and TensorFlow are both eager-first with optional graph compilation. JAX is functional and shines on TPUs. Keras 3 is not a rival engine but a high-level API that runs on PyTorch, TensorFlow, or JAX.

When not to use PyTorch

  • You run an existing TensorFlow production estate. If you already depend on TF Serving, TFX, or TensorFlow.js, rewriting into PyTorch adds risk. Keras 3 is often the better bridge.
  • You need TPU-scale functional training. For very large training runs on TPUs, JAX with XLA is generally the stronger fit.
  • You target deeply embedded devices with a mature toolchain. ExecuTorch narrows this, but LiteRT still has broader coverage for constrained microcontrollers.
  • You only need to serve, not train. For pure inference, a dedicated runtime such as vLLM, TensorRT-LLM, or ONNX Runtime often beats plain eager PyTorch.
  • You want a no-code path. PyTorch is a code-first library. Non-engineers are better served by managed platforms.
