Added 1 Jul 2026 · Last updated 1 Jul 2026 · Read time 6 min

PyTorch

PyTorch is the open-source deep learning framework behind most modern AI research and models, combining a GPU tensor library, automatic differentiation, and an eager, Python-native programming model.

Tags: deep-learning, training, framework, open-source, gpu, python

At a glance

Openness: Open source

Self-host: Yes

Image: Molten metal pouring in a dark furnace, representing the training process that shapes a model's weights. Training is a foundry. PyTorch is the machinery that pours gradients through a model until its weights take the shape the data demands.

PyTorch is an open-source deep learning framework that combines a NumPy-like tensor library with GPU acceleration, a reverse-mode automatic differentiation engine, and higher-level building blocks for defining and training neural networks. Its defining trait is define-by-run, also called eager execution: the computation graph is built dynamically as your Python runs, so a model is ordinary, debuggable Python rather than a static graph you compile first. It began at Meta AI and is now governed by the PyTorch Foundation under the Linux Foundation. It is the framework most new AI research and most open-weight large language models are written in.
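To make define-by-run concrete, here is a minimal sketch in which the number of layer applications depends on the values in the input tensor. The DynamicDepth module, its sizes, and the norm threshold are hypothetical, chosen only for illustration.

```python
import torch
from torch import nn

class DynamicDepth(nn.Module):
    """Hypothetical module: reapplies one layer a data-dependent number of times."""
    def __init__(self, dim=4):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x, max_steps=5):
        steps = 0
        # Ordinary Python control flow: the loop count depends on tensor values.
        while steps < max_steps and x.norm() > 1.0:
            x = torch.tanh(self.layer(x)) * 0.5
            steps += 1
        return x, steps

torch.manual_seed(0)
model = DynamicDepth()
small = torch.full((4,), 0.1)   # norm 0.2, so the loop body never runs
large = torch.full((4,), 3.0)   # norm 6.0, so the loop body runs

_, steps_small = model(small)
out_large, steps_large = model(large)

# Autograd recorded only the operations that actually executed for this input.
out_large.sum().backward()
print(steps_small, steps_large, model.layer.weight.grad.shape)
```

Because the graph is rebuilt on every call, you can set a breakpoint or add a print inside forward and inspect real tensor values, exactly as with any other Python function.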

Where PyTorch sits

PyTorch is the layer between your model code and the hardware. You describe a network and a training step in Python; PyTorch records the operations, computes gradients, and dispatches the math to a CPU, GPU, or other accelerator. Most people use it through higher-level libraries that build on top of it.

Ecosystem: Hugging Face Transformers, PyTorch Lightning, torchvision. High-level training and model libraries.

Core API: torch.nn, autograd, torch.optim, DDP / FSDP2. Layers, gradients, optimisers, distributed training.

Compiler and runtime: torch.compile, TorchInductor, ExecuTorch. Graph capture and kernel generation for speed and edge.

Hardware: NVIDIA CUDA, AMD ROCm, Apple MPS, CPU. Same code runs across backends.

How autograd works

The heart of PyTorch is autograd, its automatic differentiation engine. When a tensor is marked requires_grad=True, every operation on it is recorded as a node in a directed acyclic graph, together with the function needed to compute that operation’s local derivative. The forward pass builds this graph on the fly. Calling .backward() on the loss walks the graph in reverse, applies the chain rule at each node, and accumulates the result into each parameter’s .grad. PyTorch never builds full Jacobian matrices; it computes vector-Jacobian products, which is why reverse mode is cheap for the common case of many parameters mapping to a single scalar loss. This is the mechanism every training loop relies on, and it is worth understanding before you rely on gradient descent in practice.
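The two ideas above can be checked by hand on a toy problem. This sketch first compares .backward() against a gradient derived on paper, then computes a vector-Jacobian product with torch.autograd.grad. The tensors and the function f are made up for illustration, and the full Jacobian is built only to verify the result.

```python
import torch

# Scalar loss: L(w) = sum((w * x - y)^2), so dL/dw = 2 * (w * x - y) * x.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)

loss = ((w * x - y) ** 2).sum()   # forward pass records the graph
loss.backward()                   # reverse pass fills w.grad

analytic = 2 * (w.detach() * x - y) * x
grads_match = torch.allclose(w.grad, analytic)

# Vector-Jacobian product: for a vector-valued f, autograd computes v^T J
# directly instead of materialising J.
def f(t):
    return torch.stack([t[0] * t[1], t[1] ** 2, t.sum()])

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])
(vjp,) = torch.autograd.grad(f(t), t, grad_outputs=v)

# Build the full Jacobian only as a check; a training loop never needs it.
J = torch.autograd.functional.jacobian(f, t.detach())
vjp_matches = torch.allclose(vjp, v @ J)
print(grads_match, vjp_matches)
```

For a scalar loss the vector v is simply 1, which is why loss.backward() needs no argument.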

Installing PyTorch

Generate the exact command for your hardware from the official selector, since the index URL tracks your CUDA or ROCm version.

bash

# CPU only
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# NVIDIA GPU (CUDA 12.8 build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Apple Silicon and default builds
pip install torch torchvision torchaudio

PyTorch follows a roughly quarterly release cadence on the 2.x line (2.12 as of mid-2026). On Apple Silicon the default build includes the Metal (MPS) backend, but tensors and modules still have to be moved to the "mps" device to use it.
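After installing, a quick smoke test confirms which build you have and which accelerator it can see. This is a minimal sketch: pick_device is a hypothetical helper, not part of PyTorch, and the preference order (CUDA, then MPS, then CPU) is an assumption. ROCm builds report through torch.cuda as well.

```python
import torch

def pick_device() -> torch.device:
    """Hypothetical helper: prefer CUDA (or ROCm), then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print("torch", torch.__version__, "on", device)

# Tiny smoke test: run a matmul on the selected device and copy it back.
a = torch.ones(2, 3, device=device)
b = torch.ones(3, 2, device=device)
result = (a @ b).cpu()
print(result)
```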

A real training loop

This is the canonical shape of PyTorch: define a module, run a forward pass, compute a loss, backpropagate, and step the optimiser.

python

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

X = torch.randn(2048, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(2048, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

class MLP(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
    def forward(self, x):
        return self.net(x)

model = MLP(20).to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

model.train()
for epoch in range(10):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()     # clear old gradients
        loss = loss_fn(model(xb), yb)
        loss.backward()           # autograd fills every .grad
        optimizer.step()          # update the weights
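The zero_grad() call in the loop matters because autograd adds new gradients to whatever is already stored in .grad. A minimal sketch on a single scalar parameter shows the accumulation. Resetting with w.grad = None mirrors what optimizer.zero_grad() does by default in PyTorch 2.x.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()         # 3.0

(3 * w).backward()            # no reset in between
accumulated = w.grad.item()   # 6.0: the new gradient was added to the old one

w.grad = None                 # reset, as optimizer.zero_grad() would
(3 * w).backward()
after_reset = w.grad.item()   # 3.0 again

print(first, accumulated, after_reset)
```

The same behaviour is what makes gradient accumulation possible: calling backward() over several small batches before a single optimizer.step() sums their gradients.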

Transfer learning with torch.compile

A common pattern is transfer learning: freeze a pretrained backbone, retrain a small head, and wrap the model in torch.compile for a graph-optimised speedup.

python

import torch
from torch import nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                       # freeze the backbone

model.fc = nn.Linear(model.fc.in_features, 10)    # new trainable head
model = model.to(device)

compiled = torch.compile(model, mode="max-autotune")  # graph capture + kernel gen
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

imgs = torch.randn(16, 3, 224, 224, device=device)
labels = torch.randint(0, 10, (16,), device=device)

optimizer.zero_grad()
loss = loss_fn(compiled(imgs), labels)
loss.backward()
optimizer.step()

Introduced in PyTorch 2.0, torch.compile captures the model into a graph with TorchDynamo, traces the backward pass with AOTAutograd, and generates fused kernels with TorchInductor (Triton kernels on GPU, C++ on CPU). The first call pays a compilation cost; later calls run the optimised graph. It is opt-in and backward-compatible, so you add it without rewriting model code.
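A compiled function should return the same values as its eager counterpart, which is easy to check on a small example. In this sketch gelu_mlp and its shapes are made up. Passing backend="eager" is an assumption made for portability: it exercises TorchDynamo's graph capture but skips TorchInductor, so it needs no compiler toolchain and gives no speedup.

```python
import torch

def gelu_mlp(x, w1, w2):
    # A small chain of matmul and pointwise ops, the kind a compiler can fuse.
    return torch.nn.functional.gelu(x @ w1) @ w2

torch.manual_seed(0)
x, w1, w2 = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32, 4)

# Assumption: backend="eager" keeps the example runnable without a C++ or
# Triton toolchain. Drop the argument to use the default TorchInductor backend.
compiled = torch.compile(gelu_mlp, backend="eager")

eager_out = gelu_mlp(x, w1, w2)
compiled_out = compiled(x, w1, w2)   # first call triggers graph capture
same = torch.allclose(eager_out, compiled_out, atol=1e-6)
print(same)
```

With the default backend, time the second and later calls rather than the first, since the first call includes compilation.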

The typical path from idea to model

Step 1: Define. Write a network as an nn.Module and load data with DataLoader.

Step 2: Train. Loop forward, loss, backward, step, using autograd for gradients.

Step 3: Scale. Shard across GPUs with FSDP2 and compile with torch.compile.

Step 4: Deploy. Serve with vLLM or TorchServe, or export to edge with ExecuTorch.

For very large models, PyTorch provides distributed training primitives: DistributedDataParallel replicates the model and all-reduces gradients across GPUs, while FSDP2 shards parameters, gradients, and optimiser state to fit models that no single GPU can hold. On the deployment side, ExecuTorch (GA in late 2025) exports and runs the same model on phones and embedded devices, closing PyTorch’s historical edge gap.
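The gradient all-reduce in data-parallel training can be illustrated without any distributed setup. The sketch below is a single-process simulation, not DistributedDataParallel itself: two copies of a made-up linear model each see half the batch, and averaging their gradients reproduces the full-batch gradient. Equal shard sizes and a mean-reduced loss are assumed.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
X, y = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = nn.MSELoss()

# Reference: one process sees the full batch.
model.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Simulated data parallelism: two replicas, each with half the batch.
shard_grads = []
for X_shard, y_shard in zip(X.chunk(2), y.chunk(2)):
    replica = copy.deepcopy(model)
    replica.zero_grad()
    loss_fn(replica(X_shard), y_shard).backward()
    shard_grads.append(replica.weight.grad)

# Stand-in for the all-reduce: average the per-replica gradients.
averaged = torch.stack(shard_grads).mean(dim=0)
matches = torch.allclose(averaged, full_grad, atol=1e-6)
print(matches)
```

Because every replica ends up with the same averaged gradient, every replica takes the same optimiser step and the copies stay in sync. FSDP2 goes further by also splitting the parameters and optimiser state across devices.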

How it compares

	PyTorch	TensorFlow	JAX	Keras 3
Execution	Eager, compile optional	Eager, graphs via tf.function	Functional transforms	API over a backend
Autodiff	autograd	GradientTape	jax.grad	Delegates to backend
Research use	Dominant	Declining	Rising at scale	Sits on top
Edge	ExecuTorch	LiteRT (mature)	Limited	Via backend
Best for	New research, LLMs, fast iteration	Production, mobile, TPU	TPU-scale training	Portable multi-backend code

PyTorch and TensorFlow are both eager-first with optional graph compilation. JAX is functional and shines on TPUs. Keras 3 is not a rival engine but a high-level API that runs on PyTorch, TensorFlow, or JAX.

When not to use PyTorch

You run an existing TensorFlow production estate. If you already depend on TF Serving, TFX, or TensorFlow.js, rewriting into PyTorch adds risk. Keras 3 is often the better bridge.
You need TPU-scale functional training. For very large training runs on TPUs, JAX with XLA is generally the stronger fit.
You target deeply embedded devices with a mature toolchain. ExecuTorch narrows this, but LiteRT still has broader coverage for constrained microcontrollers.
You only need to serve, not train. For pure inference, a dedicated runtime such as vLLM, TensorRT-LLM, or ONNX Runtime often beats plain eager PyTorch.
You want a no-code path. PyTorch is a code-first library. Non-engineers are better served by managed platforms.

Sources

Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019. https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library
PyTorch. torch.compile and the PyTorch 2.x stack. https://pytorch.org/get-started/pytorch-2-x/
PyTorch. Overview of the PyTorch autograd engine. https://pytorch.org/blog/overview-of-pytorch-autograd-engine/
PyTorch Foundation expands to an umbrella foundation (2025). https://pytorch.org/blog/pt-foundation-expands/
PyTorch. Introducing ExecuTorch 1.0 (2025). https://pytorch.org/blog/introducing-executorch-1-0/
PyTorch releases. https://github.com/pytorch/pytorch/releases
