Terraform and Ansible are usually framed as rivals. In practice they are two gears that mesh: one builds the machine, the other configures it.

Terraform and Ansible are two of the most common tools for automating infrastructure, and they are constantly compared as if you must pick one. You usually do not. They solve different halves of the same problem: Terraform provisions infrastructure (creating the servers, networks, and clusters), and Ansible configures it (installing and setting up what runs on those servers). For AI teams standing up GPU clusters, understanding the split saves a lot of wasted effort.

Comparison table

|             | Terraform                          | Ansible                              |
|-------------|------------------------------------|--------------------------------------|
| Primary job | Provision infrastructure           | Configure and deploy                 |
| Paradigm    | Declarative (desired state)        | Procedural, but idempotent           |
| State       | Tracks a state file                | Stateless, checks live system        |
| Agent       | Agentless (cloud APIs)             | Agentless (SSH or WinRM)             |
| Language    | HCL                                | YAML playbooks                       |
| Strength    | Cloud resource lifecycle           | OS, packages, app config             |
| Weakness    | In-server configuration            | Full lifecycle and dependency graphs |
| Best for    | Creating GPU nodes, VPCs, clusters | Installing drivers, CUDA, services   |

What each one is built for

Terraform: provisioning

Terraform is declarative. You describe the infrastructure you want (GPU instances, a VPC, a Kubernetes cluster, object storage), and Terraform works out the create, update, and delete actions needed to reach that state. It records what it has built in a state file, so it can detect drift and tear everything down cleanly. This lifecycle management is its core strength, and it pairs naturally with an immutable-infrastructure style, where you replace servers rather than patch them.
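The reconciliation idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Terraform's actual planner: `desired` stands in for what the configuration declares, `recorded` for what the state file says exists, and the `plan` function and every resource name are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]  # attribute name -> value (toy model)


def plan(desired: Dict[str, Resource],
         recorded: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, resource_name) pairs that move `recorded` to `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in recorded:
            actions.append(("create", name))   # declared but not built yet
        elif desired[name] != recorded[name]:
            actions.append(("update", name))   # built, but attributes differ
    for name in sorted(recorded):
        if name not in desired:
            actions.append(("delete", name))   # built, but no longer declared
    return actions


# Hypothetical resources, named only for illustration.
desired = {
    "gpu_node_a": {"type": "h100", "count": "8"},
    "vpc_main": {"cidr": "10.0.0.0/16"},
}
recorded = {
    "gpu_node_a": {"type": "h100", "count": "4"},
    "old_bucket": {"region": "eu-west-1"},
}

print(plan(desired, recorded))
```

Passing the same mapping as both arguments yields an empty plan, which is the toy equivalent of a run that finds nothing to change.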

Its main limits are inside the machine. Terraform provisions a GPU node, but it is awkward at the follow-on work of installing a specific driver version, compiling a kernel module, or restarting a service in the right order. One note on licensing: HashiCorp moved Terraform to the Business Source License in 2023, which prompted the MPL-licensed OpenTofu fork, now a drop-in alternative worth knowing about.

Ansible: configuration

Ansible is agentless and push-based: it connects over SSH (or WinRM for Windows hosts) and runs tasks on the target machines, so there is no agent to install on them first, although most modules do expect Python on the target. Most of its modules are idempotent, meaning that running the same playbook twice leaves the system in the same state; this is what makes configuration management safe to repeat. Raw command and shell tasks are the usual exception and need explicit guards. Playbooks are ordered lists of tasks, so Ansible shines at “do these steps, in this sequence” work: install packages, render config files, manage services, deploy an application.
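Idempotency is easiest to see in a check-then-change function. The sketch below is plain Python, not Ansible code: `ensure_line` is a hypothetical helper, and the file name is made up. It only mirrors the general pattern of inspecting the live system, acting when needed, and reporting whether anything changed.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if we changed it."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "modules.conf"      # hypothetical config file
    first = ensure_line(conf, "nvidia")    # file is missing the line: writes it
    second = ensure_line(conf, "nvidia")   # line already there: no write
    final_text = conf.read_text()

print(first, second)
```

The second call reports no change and leaves the file untouched, which is why re-running the same steps against an already configured node is safe.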

Its weak spot is the mirror image of Terraform’s strength. Ansible has no built-in model of your full cloud estate or its dependency graph, so using it as your primary provisioner means reinventing state tracking and lifecycle management that Terraform gives you for free.
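To make "dependency graph" concrete, here is a minimal sketch of the ordering problem a provisioner has to solve. It uses Python's standard `graphlib` module (Python 3.9 or later) and is not how Terraform is implemented; the resource names and their dependencies are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the names in its set.
depends_on = {
    "vpc": set(),
    "subnet": {"vpc"},
    "iam_role": set(),
    "gpu_node": {"subnet", "iam_role"},
    "k8s_cluster": {"subnet", "iam_role"},
}

# Dependencies come first when creating, last when destroying.
create_order = list(TopologicalSorter(depends_on).static_order())
destroy_order = list(reversed(create_order))

print("create: ", create_order)
print("destroy:", destroy_order)
```

With a playbook you would have to encode this ordering by hand and keep it correct as the estate grows, which is the bookkeeping the paragraph above warns about.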

The AI infrastructure split

On a real GPU cluster the division of labour is clean, and it maps directly onto the two tools:

  • Terraform provisions the GPU instances (for example H100 or Blackwell nodes), the VPC and subnets, the managed Kubernetes cluster, shared storage, and IAM roles.
  • Ansible configures each node once it exists: the matching NVIDIA driver, the CUDA toolkit, the container toolkit, Slurm or the Kubernetes agent, monitoring, and any model-serving runtime.

That is why the two so often appear together in AI platform setups. The GPU driver and CUDA version are exactly the kind of in-server, order-sensitive configuration that Terraform handles poorly and Ansible handles well.

Using them together

The standard pattern runs Terraform first, then hands off to Ansible.

Step 1: Provision. Terraform creates the GPU nodes, network, and cluster, and outputs their addresses.
Step 2: Configure. Ansible installs drivers, CUDA, the container toolkit, and services on each node.
Step 3: Deploy. Ship the training or serving workload onto the ready cluster.
Step 4: Maintain. Terraform manages the estate lifecycle; Ansible re-runs to keep config in line.

Ansible can consume Terraform’s outputs as its inventory, so the addresses of freshly provisioned nodes flow straight into the configuration step with no manual copying.
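One way to wire that handoff is a small script that reshapes the JSON printed by `terraform output -json` into the JSON layout Ansible accepts from an inventory script. The sketch below is a minimal illustration rather than a recommended production setup: the output name `gpu_node_ips`, the group name `gpu_nodes`, the addresses, and the `build_inventory` helper are all assumptions, and the Terraform output is embedded as a sample string so the example runs without Terraform installed.

```python
import json
from typing import Any, Dict

# Sample of what `terraform output -json` prints; the output name
# `gpu_node_ips` and the addresses are hypothetical.
SAMPLE_TF_OUTPUT = """
{
  "gpu_node_ips": {
    "sensitive": false,
    "type": ["list", "string"],
    "value": ["10.0.1.10", "10.0.1.11"]
  }
}
"""


def build_inventory(tf_output_json: str, output_name: str, group: str) -> Dict[str, Any]:
    """Turn one list-valued Terraform output into Ansible's JSON inventory layout."""
    outputs = json.loads(tf_output_json)
    hosts = list(outputs[output_name]["value"])
    return {
        group: {"hosts": hosts},
        # An empty hostvars map tells Ansible it need not query per-host variables.
        "_meta": {"hostvars": {}},
    }


inventory = build_inventory(SAMPLE_TF_OUTPUT, "gpu_node_ips", "gpu_nodes")
print(json.dumps(inventory, indent=2))
```

In a real pipeline the sample string would be replaced by the captured output of `terraform output -json`, and a playbook would then target the `gpu_nodes` group.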

When to choose which

Choose Terraform alone when your work is pure cloud provisioning and the machines are configured from prebuilt images, so there is little in-server setup to do.

Choose Ansible alone when the infrastructure already exists (on-premise servers, for example) and you only need to configure and deploy onto it.

Choose both for anything involving custom GPU nodes: Terraform to build the fleet, Ansible to make each node ready to train or serve. This is the default for serious AI infrastructure.

If your question is really Terraform versus a cloud-native provisioner rather than versus Ansible, see Terraform vs CDK.

Sources

  • HashiCorp. “Terraform Documentation.” https://developer.hashicorp.com/terraform/docs . Declarative provisioning, state, and the resource lifecycle.
  • Red Hat. “Ansible Documentation.” https://docs.ansible.com/ . Agentless architecture, idempotent modules, and playbooks.
  • OpenTofu. “OpenTofu: An open-source, community fork of Terraform.” https://opentofu.org/ . The MPL-licensed fork created after the 2023 license change.
  • Morris, K. Infrastructure as Code: Dynamic Systems for the Cloud Age, 2nd ed. O’Reilly (2020). The provisioning-versus-configuration distinction and IaC patterns.