[Figure: a data-center corridor lined with server racks. Caption: A GPU fleet is expensive and finite. Scheduling decides which job gets which GPU, when, and for how long.]

GPU scheduling is the problem of deciding which job runs on which GPU, and when. It matters because GPUs are the most expensive part of an AI stack and the easiest to waste. An idle H100 still costs money, and a cluster that hands whole GPUs to jobs that need a fraction of one can burn most of its budget on air. Good scheduling keeps utilization high while making sure large training jobs get the coordinated GPUs they need. This guide covers the two scheduler worlds, how jobs get placed, and how to share a single GPU safely.

Why GPUs are hard to schedule

CPUs are easy to share: the operating system slices time between processes thousands of times a second. GPUs are different. By default a GPU is allocated whole, one GPU to one container or process, because splitting it safely needs extra hardware or software. That coarse granularity creates three problems:

  • Fragmentation. One-GPU jobs scattered across eight-GPU nodes can leave seven GPUs free on every node yet no node able to host an eight-GPU job. Those free GPUs are stranded unless the scheduler packs carefully.
  • Gang requirements. A distributed training job needs all its GPUs at once or none of them. Starting half the workers is useless and wastes the GPUs they hold while they wait.
  • Bursty demand. Interactive development and inference need a GPU briefly and often, while training holds many GPUs for hours or days. The same cluster has to serve both.

Step 1: Submit. A job declares how many GPUs it needs and of what type.
Step 2: Queue. The scheduler holds it until enough GPUs are free, applying quotas and priority.
Step 3: Place. It bin-packs the job onto nodes, keeping multi-GPU jobs on fast interconnects.
Step 4: Run and release. The job runs, then frees its GPUs for the next in the queue.
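
The four steps above can be sketched as a toy scheduler. This is a minimal illustration under simplifying assumptions, not how Slurm or Kubernetes is implemented: one pool of identical GPUs, strict priority order, no GPU types, quotas, or node placement. The class name `ToyScheduler` and its methods are hypothetical.

```python
import heapq
from itertools import count


class ToyScheduler:
    """Single pool of identical GPUs; strict priority order, no backfill."""

    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = []       # heap of (-priority, arrival_seq, name, gpus)
        self.running = {}     # job name -> GPUs held
        self._seq = count()

    def submit(self, name, gpus, priority=0):
        # Step 1: the job declares its GPU count up front.
        heapq.heappush(self.queue, (-priority, next(self._seq), name, gpus))

    def schedule(self):
        # Steps 2-3: start jobs from the head of the queue while they fit.
        started = []
        while self.queue and self.queue[0][3] <= self.free:
            _, _, name, gpus = heapq.heappop(self.queue)
            self.free -= gpus
            self.running[name] = gpus
            started.append(name)
        return started

    def release(self, name):
        # Step 4: a finished job returns its GPUs to the pool.
        self.free += self.running.pop(name)


sched = ToyScheduler(total_gpus=8)
sched.submit("train", gpus=8)
sched.submit("notebook", gpus=1)
sched.submit("eval", gpus=4, priority=5)
first = sched.schedule()     # "eval" jumps the queue, then "train" blocks the head
sched.release("eval")
second = sched.schedule()    # "train" now fits; "notebook" keeps waiting
print(first, second)         # ['eval'] ['train']
```

Note that "notebook" waits behind "train" even while GPUs sit free in the first round. That is the cost of strict ordering, and it is the gap that backfill scheduling is meant to close.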

Two scheduler worlds: Slurm and Kubernetes

Most GPU clusters run one of two schedulers, from two different traditions.

|               | Slurm                                    | Kubernetes                             |
|---------------|------------------------------------------|----------------------------------------|
| Origin        | High-performance computing               | Cloud-native containers                |
| Job model     | Batch jobs, sbatch and srun              | Pods and controllers                   |
| GPU request   | --gres=gpu:2                             | nvidia.com/gpu: 2                      |
| Gang scheduling | Native                                 | Needs Kueue or Volcano                 |
| Strength      | Large batch training, fair-share queues  | Mixed training and serving, ecosystem  |
| Best for      | Research and HPC clusters                | Platform teams running many workloads  |

Slurm comes from supercomputing and is still the default for large training clusters. It models GPUs as generic resources (GRES) and supports partitions, fair-share accounting, and backfill scheduling, which slots small jobs into gaps without delaying big ones. A job asks for GPUs directly:

bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:8          # 8 GPUs per node (GRES counts are per node)
#SBATCH --nodes=4             # 4 nodes x 8 GPUs = 32 GPUs in total
srun python train.py
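
The backfill idea mentioned above can be sketched in a few lines. This is a deliberately conservative toy version, not Slurm's actual algorithm: it assumes the head job already has a reserved start time, and it only starts a waiting job if that job fits in the idle GPUs and its requested runtime ends before the reservation. The names `Job` and `backfill` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int
    runtime: int   # requested wall-clock limit, in minutes


def backfill(queue, free_now, head_start):
    """Pick jobs behind the queue head that can run without delaying it.

    queue      -- waiting jobs in priority order; queue[0] is the blocked head
    free_now   -- GPUs idle right now
    head_start -- minutes until the head job's reserved start time
    """
    started = []
    for job in queue[1:]:
        fits = job.gpus <= free_now
        done_in_time = job.runtime <= head_start
        if fits and done_in_time:
            started.append(job.name)
            free_now -= job.gpus
    return started


queue = [
    Job("big", gpus=8, runtime=600),    # blocked head, reserved to start in 60 min
    Job("sweep", gpus=2, runtime=30),
    Job("long", gpus=2, runtime=240),   # would still be running at the reservation
    Job("tiny", gpus=1, runtime=10),
]
print(backfill(queue, free_now=3, head_start=60))   # ['sweep', 'tiny']
```

The sketch depends on jobs declaring a runtime limit. Without one, the scheduler cannot tell whether a small job will be out of the way in time.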

Kubernetes schedules containers and now runs a large share of AI workloads because one cluster can host training, inference, and everything else. GPUs are exposed through the NVIDIA device plugin, which advertises the nvidia.com/gpu resource, and a pod requests them like any other resource:

yaml
resources:
  limits:
    nvidia.com/gpu: 2        # this pod gets 2 whole GPUs

The catch is gang scheduling. The default Kubernetes scheduler places pods one at a time, so a distributed job can get half its workers and deadlock. Batch systems layered on top, Kueue (the Kubernetes-native option) or Volcano, add all-or-nothing gang scheduling, queues, and quotas. On the operations side, the NVIDIA GPU Operator automates the driver, container toolkit, and device-plugin install so nodes are ready to serve GPUs without manual setup.
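
To make the deadlock concrete, the sketch below compares worker-by-worker placement with all-or-nothing admission. It is a toy model, not the Kubernetes scheduler, Kueue, or Volcano: the round-robin interleaving is an assumed worst case, each worker is assumed to need one GPU, and both function names are hypothetical.

```python
def place_one_at_a_time(free_gpus, jobs):
    """Grant workers individually, interleaved across jobs (no gang awareness)."""
    held = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and held[name] < needed:
                held[name] += 1
                free_gpus -= 1
                progress = True
    return held


def place_gang(free_gpus, jobs):
    """Admit a job only if every one of its workers fits right now."""
    held = {name: 0 for name in jobs}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            held[name] = needed
            free_gpus -= needed
    return held


jobs = {"job-a": 6, "job-b": 6}         # workers needed, one GPU each
naive = place_one_at_a_time(8, jobs)    # each job holds 4 of 6: neither can start
gang = place_gang(8, jobs)              # job-a runs in full; job-b holds nothing
print(naive, gang)
```

In the first case all eight GPUs are held and no job makes progress. In the second, six GPUs do useful work and two stay free for a job that fits.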

Sharing one GPU

Handing a whole GPU to a notebook that uses five percent of it is the biggest source of waste. Three mechanisms let several workloads share one physical GPU, with very different isolation.

[Figure: overlapping translucent spheres, a metaphor for workloads sharing one GPU. Caption: Sharing a GPU is a juggling act: how many jobs can occupy one device before they interfere depends on the mechanism you choose.]
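
|                  | MIG                          | Time-slicing             | MPS                       |
|------------------|------------------------------|--------------------------|---------------------------|
| Mechanism        | Hardware partitions          | Round-robin time sharing | Concurrent shared context |
| Memory isolation | Yes, dedicated per instance  | No                       | No                        |
| Fault isolation  | Strong                       | Weak                     | Weak                      |
| Overhead         | None, true partitions        | Context-switch cost      | Low                       |
| Best for         | Multi-tenant inference       | Dev and bursty jobs      | Co-located small jobs     |
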
  • MIG (Multi-Instance GPU) partitions a data-center GPU (A100, H100, and Blackwell generations) in hardware into as many as seven isolated instances, each with its own memory and compute slice. It gives the strongest isolation of the three, which is what you want for multi-tenant inference where one tenant must not affect another.
  • Time-slicing oversubscribes a GPU by letting processes take turns on it. It is simple and needs no special hardware, but there is no memory isolation and a context-switch cost, so it suits development and bursty, low-stakes workloads rather than production training.
  • MPS (Multi-Process Service) lets several processes run on one GPU concurrently through a shared context, packing small jobs together more efficiently than time-slicing, with weaker isolation than MIG.
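
One possible reading of the comparison above, written as a decision helper. The function name, its three boolean inputs, and the fallback to a whole GPU when isolation is required but MIG is unavailable are assumptions made for illustration, not rules taken from NVIDIA documentation.

```python
def pick_sharing(needs_isolation, mig_capable, runs_concurrently):
    """Suggest a sharing mechanism for one workload (illustrative heuristic only)."""
    if needs_isolation:
        # Only MIG offers memory and fault isolation; without it, do not share.
        return "MIG" if mig_capable else "whole GPU"
    if runs_concurrently:
        # Several small jobs active at the same time: share one context.
        return "MPS"
    # Bursty, low-stakes use: let processes take turns.
    return "time-slicing"


print(pick_sharing(needs_isolation=True, mig_capable=True, runs_concurrently=True))     # MIG
print(pick_sharing(needs_isolation=False, mig_capable=True, runs_concurrently=True))    # MPS
print(pick_sharing(needs_isolation=False, mig_capable=False, runs_concurrently=False))  # time-slicing
```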

Keeping utilization high

Beyond placement, a few practices decide whether an expensive cluster earns its keep:

  • Bin-pack, do not spread. Fill nodes before starting new ones so multi-GPU jobs can find contiguous space and idle nodes can scale down.
  • Be topology-aware. Keep the GPUs of one job on the same node or NVLink domain, because distributed training is sensitive to interconnect speed.
  • Use quotas and priorities. Fair-share and preemption stop one team monopolising the cluster and let urgent jobs jump the queue.
  • Separate training from serving. Long training jobs and latency-sensitive inference have opposite needs; MIG or dedicated node pools keep them from interfering.
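
The difference between bin-packing and spreading is easy to see in a small simulation. The sketch below is a toy under stated assumptions: every job fits on one node, nodes are identical, and the hypothetical `place` function uses a simple best-fit rule for packing and a worst-fit rule for spreading. Real schedulers also weigh topology, priority, and quotas.

```python
def place(jobs, nodes, gpus_per_node, strategy):
    """Place single-node jobs; return the free GPUs left on each node."""
    free = [gpus_per_node] * nodes
    for need in jobs:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            continue  # no node fits; the job would stay queued
        if strategy == "binpack":
            target = min(candidates, key=lambda i: free[i])   # fullest node that fits
        else:
            target = max(candidates, key=lambda i: free[i])   # emptiest node
        free[target] -= need
    return free


jobs = [1, 1, 2, 1, 2, 1]   # GPUs requested by six small jobs (8 in total)
spread = place(jobs, nodes=4, gpus_per_node=8, strategy="spread")
packed = place(jobs, nodes=4, gpus_per_node=8, strategy="binpack")
print(spread)   # [5, 6, 6, 7]
print(packed)   # [0, 8, 8, 8]
```

Both placements leave 24 GPUs free. Spreading leaves no node that can host an eight-GPU job, while packing leaves three whole nodes available for large jobs or for scale-down.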

When not to build GPU scheduling

  • You have one GPU and one user. Run the job directly. A scheduler adds pure overhead.
  • You rent per-job cloud GPUs. If you spin up a node per training run and tear it down after, the cloud provider is your scheduler. See GPU clouds and neoclouds.
  • Your utilization is already high with whole-GPU allocation. Do not add MIG or time-slicing complexity until measurement shows GPUs sitting idle.
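
A rough way to take that measurement on a single host is to sample `nvidia-smi`. The sketch below is a starting point under assumptions: the 10 percent idle threshold is arbitrary, the sample output is made up, and the helper names are hypothetical. It also assumes the utilization field is numeric, which may not hold on every configuration (some report `[N/A]`). One sample proves little; a real decision needs readings collected over days.

```python
import subprocess

# Real nvidia-smi query flags; output is one "index, utilization" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Map GPU index -> utilization percent from nvidia-smi CSV output."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (part.strip() for part in line.split(","))
        result[int(index)] = int(util)
    return result


def idle_gpus(utilization, threshold=10):
    """Return indices of GPUs below the (assumed) idle threshold, in percent."""
    return sorted(i for i, u in utilization.items() if u < threshold)


def sample_once():
    """Run nvidia-smi on this host (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_utilization(out)


sample = "0, 97\n1, 0\n2, 3\n3, 64\n"   # made-up output, for illustration only
print(idle_gpus(parse_utilization(sample)))   # [1, 2]
```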
