Added 29 Jun 2026. Last updated 29 Jun 2026. Read time 5 min.

Modal

Modal is a serverless cloud platform for running Python and AI workloads on GPUs without managing servers. It bills per second of usage.

Tags: serverless, gpu, inference, python, infrastructure

Figure: A dark floor with a red neon grid, representing serverless cloud infrastructure for AI workloads. Modal is the grid beneath your code: you write Python functions, and the platform provisions the GPUs to run them.

Modal is a serverless cloud platform for running Python and AI workloads on GPUs. You write ordinary Python functions, add a decorator that declares the hardware you want, and Modal provisions containers in the cloud to run them. There are no servers to configure, no Kubernetes clusters to manage, and no idle machines to pay for. Modal bills per second of compute and scales to zero when nothing is running.

The problem it solves is the gap between a working model on your laptop and a scalable service in production. Renting a GPU virtual machine means paying for it around the clock, patching the operating system, and writing autoscaling logic yourself. Modal removes that operational layer. It handles container builds, cold starts, autoscaling, and scheduling, so you keep your attention on the Python code that does the work.

Where it sits in the stack

Modal is a compute layer. It sits below your application logic and above the raw GPU hardware, turning a Python function into a scheduled, autoscaling cloud job.

Your code: Python functions and @app.function decorators, with hardware and dependencies declared in code.

Modal platform: container builds, autoscaling, and scheduling, with scale to zero and per-second billing.

GPU hardware: H100, A100, and A10, in a fleet spread across multiple cloud providers.

How to access it and how it fits

You define infrastructure inside your Python file. A decorator tells Modal what image to build and which GPU to attach, and the Modal command line tool deploys or runs the function.

bash

pip install modal
modal setup

The modal setup step links your account through the browser. From there you write a function and declare its hardware in the decorator.

python

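import modal

app = modal.App("llm-inference")

# Container image with the inference dependencies baked in.
image = modal.Image.debian_slim().pip_install("vllm", "torch")

@app.function(gpu="A100", image=image)
def generate(prompt: str) -> str:
    # Import inside the function: vLLM lives in the container image, not necessarily on your machine.
    from vllm import LLM
    # Loaded on every call to keep the example short; repeated requests would pay this cost each time.
    model = LLM("meta-llama/Llama-3.1-8B-Instruct")
    output = model.generate(prompt)
    # generate() returns one result per prompt; take the first completion's text.
    return output[0].outputs[0].text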

You run it from the command line, and Modal builds the container, starts a GPU instance, executes the function, and shuts the instance down afterward.

bash

modal run inference.py

To expose the same function as a live HTTP endpoint, you wrap it in a function with a web decorator and deploy it with modal deploy. Modal keeps containers warm while traffic arrives and scales back to zero when it stops.

python

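# Lightweight CPU container for the web layer; fastapi_endpoint needs FastAPI in the image.
web_image = modal.Image.debian_slim().pip_install("fastapi[standard]")

@app.function(image=web_image)
@modal.fastapi_endpoint(method="POST")
def api(prompt: str) -> dict:
    # .remote() runs generate in its own GPU container, where vLLM is installed.
    return {"text": generate.remote(prompt)}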

This execution model suits three workload shapes well. Serve a model behind an endpoint for real-time inference. Run a batch of thousands of parallel jobs for embeddings or data processing. Launch a fine-tuning run that finishes in minutes and stops billing the moment it completes. The teams that reach for Modal are usually ML engineers and application developers who want production GPU compute without standing up their own platform team.

Step 1: Write function. Add a decorator declaring the GPU and container image.

Step 2: Deploy. Modal builds the image and registers the app.

Step 3: Autoscale. Containers start on demand and scale with traffic.

Step 4: Scale to zero. Idle containers stop, and billing stops with them.

Pricing model

Modal bills per second of compute. You pay for the GPU, CPU, and memory a container uses while it runs, and nothing while it sits idle. According to Modal’s pricing page, the Starter plan carries no platform fee and includes free monthly compute credits, the Team plan adds a fixed monthly fee with more credits, and an Enterprise tier offers custom volume pricing. GPU rates are published per second for hardware such as the H100, A100, and A10. Because billing stops when a job finishes, a fine-tuning run that completes in 47 minutes costs 47 minutes, not a full rounded-up hour.
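The 47-minute example works out as follows. The rate in this sketch is a made-up placeholder, not a Modal price, and the function names are illustrative; substitute the per-second rate from the pricing page for the GPU you use.

```python
import math

# HYPOTHETICAL rate for illustration only; real per-second rates are on Modal's pricing page.
RATE_PER_SECOND_USD = 0.001


def per_second_cost(seconds: int, rate: float = RATE_PER_SECOND_USD) -> float:
    """Cost when billing stops the moment the job finishes."""
    return seconds * rate


def hourly_rounded_cost(seconds: int, rate: float = RATE_PER_SECOND_USD) -> float:
    """Cost under a scheme that rounds usage up to whole hours."""
    billed_hours = math.ceil(seconds / 3600)
    return billed_hours * 3600 * rate


run_seconds = 47 * 60  # the 47-minute fine-tuning run from the text
print(f"per-second billing: ${per_second_cost(run_seconds):.2f}")    # $2.82
print(f"rounded-up hour:    ${hourly_rounded_cost(run_seconds):.2f}")  # $3.60
```

At the placeholder rate, the run is billed for 2,820 seconds instead of 3,600. The gap grows with the number of short jobs, since every job would otherwise be rounded up on its own.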

How it compares

Modal is a serverless abstraction. It differs from raw GPU clouds, where you rent a machine and manage it yourself, and from inference-only endpoints, where you deploy a model but cannot run arbitrary code.

	Modal	RunPod	Baseten	CoreWeave
Service model	Serverless Python	GPU pods and serverless	Model serving	GPU cloud infrastructure
Runs arbitrary code	Yes	Yes	Model focused	Yes
Scales to zero	Yes	Serverless tier	Yes	No, you rent capacity
Billing	Per second	Per second or hourly	Per usage	Reserved and on demand
Best for	Batch, inference, training in Python	Cheap raw GPU access	Production model endpoints	Large reserved GPU fleets

For a wider survey of GPU providers and how serverless platforms sit against dedicated clouds, see the GPU clouds and neoclouds comparison.

When not to use it

Modal is not the right tool for every job.

Steady, always-on high load. If a GPU runs at full utilization around the clock, a reserved instance on a GPU cloud is often cheaper than per-second serverless rates.
Non-Python stacks. Modal centers on Python. A service written in Go, Rust, or Java fits a general container platform better.
Bare-metal control. Workloads that need a specific kernel, custom drivers, or direct hardware tuning are constrained by the managed abstraction.
Strict data residency in one region. Modal spreads its fleet across providers. If regulation pins you to a single region or cloud account, a dedicated deployment gives firmer control.
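The first point in the list above, steady load, comes down to a break-even calculation. The sketch below assumes a simplified model in which serverless costs accrue only while the GPU is busy and a reserved instance costs the same every hour. Both hourly rates are invented for illustration and do not come from any provider's price list.

```python
def break_even_utilization(serverless_hourly: float, reserved_hourly: float) -> float:
    """Fraction of time a GPU must be busy before reserving it becomes cheaper."""
    return reserved_hourly / serverless_hourly


def cheaper_option(utilization: float, serverless_hourly: float, reserved_hourly: float) -> str:
    """Compare the average cost per wall-clock hour at a given utilization (0.0 to 1.0)."""
    serverless_cost = utilization * serverless_hourly  # pay only for busy time
    reserved_cost = reserved_hourly  # pay for every hour, busy or idle
    return "serverless" if serverless_cost < reserved_cost else "reserved"


# HYPOTHETICAL rates in USD per GPU-hour, for illustration only.
SERVERLESS_HOURLY = 4.0
RESERVED_HOURLY = 2.5

print(break_even_utilization(SERVERLESS_HOURLY, RESERVED_HOURLY))  # 0.625
print(cheaper_option(0.20, SERVERLESS_HOURLY, RESERVED_HOURLY))    # serverless
print(cheaper_option(0.95, SERVERLESS_HOURLY, RESERVED_HOURLY))    # reserved
```

With these placeholder numbers, a GPU that is busy less than 62.5% of the time is cheaper on serverless billing. The model ignores cold starts, minimum commitments, and engineering time, which can move the threshold in either direction.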

For teams shipping their first GPU service, the "from zero to production" guide covers the surrounding deployment decisions.

Sources

Modal homepage: platform overview, workload types, and GPU support
Modal pricing page: plan structure, free credits, and per-second GPU pricing
Modal documentation: Python SDK, decorators, and deployment workflow
