A multi-layer server block with red strips, representing a widely deployed open-weight model family.
Llama weights are downloadable, so the model runs on your own servers rather than only behind a vendor API.

Meta Llama is Meta’s family of open-weight large language models . The weights are published for download, so you can run the model on your own hardware, fine-tune it, and serve it through the platform of your choice. This solves a problem that closed model APIs cannot: full control over where the model runs, what data it sees, and how it is customised, without sending every request to a third-party endpoint. Llama became one of the most widely deployed open-weight model ecosystems since its first release on 24 February 2023.

The current generation, Llama 4, is Meta’s first family built on a mixture-of-experts architecture and its first that is natively multimodal, meaning a single model handles text and images together.

Where Llama sits

Llama is a set of downloadable foundation models . It occupies the model layer of a stack: you supply the serving infrastructure and application code around it.

Application
Chat and agents RAG pipelines Your product logic and prompts
Serving
vLLM Ollama Hosted inference APIs Self-hosted or via a provider
Model weights
Llama 4 Scout Llama 4 Maverick Downloadable, open-weight
Hardware
GPU servers Cloud GPU rental You own or rent the compute

The Llama 4 family

Meta released two open-weight Llama 4 models and announced a third, larger model that was still in training at the last public update.

  • Llama 4 Scout: a 17 billion active parameter model with 16 experts, 109 billion total parameters, and a stated context window of 10 million tokens. Natively multimodal.
  • Llama 4 Maverick: a 17 billion active parameter model with 128 experts and 400 billion total parameters. Natively multimodal.
  • Llama 4 Behemoth: a 288 billion active parameter model with 16 experts and nearly two trillion total parameters. Announced as still in training and not released as of Meta’s April 2026 update.

Scout and Maverick use a mixture-of-experts design. Only a fraction of the total parameters activate for any given token, which lowers the compute cost of running a large model.

How to access it

There are two main paths, and you can mix them.

Step 1 Get the weights Download from llama.com or Hugging Face after accepting the license.
Step 2 Choose serving Self-host with vLLM or Ollama, or use a hosted inference provider.
Step 3 Customise Fine-tune on your data or wire the model into a RAG or agent pipeline.

Self-hosting. Download Scout or Maverick from llama.com or Hugging Face, then serve the weights on your own GPU servers or rented cloud GPUs. This gives you data residency, offline operation, and the ability to fine-tune freely.

Hosted APIs. Many providers serve Llama behind an API so you skip infrastructure work. That includes cloud model catalogues and dedicated inference vendors. The trade-off is that you no longer control where the model runs.

Both paths are governed by the Llama 4 Community License Agreement and the Llama 4 Acceptable Use Policy. This license permits commercial use but is not an OSI-approved open-source license. It carries conditions, including a threshold clause that has historically required a separate license for the largest deployers, and use restrictions defined by the acceptable use policy. Read the license before shipping to production.

How it compares

Llama competes with other open-weight families and with closed model APIs. The main axis is control versus convenience.

Meta LlamaAlibaba QwenMistralClosed API (Claude, Gemini)
WeightsDownloadableDownloadableDownloadableNot released
Self-hostYesYesYesNo
License typeCommunity license, use limitsApache 2.0 on many modelsApache 2.0 on open modelsProprietary API only
MultimodalYes (Llama 4)Yes (several models)Yes (several models)Yes
Best forControl and fine-tuningMultilingual, permissive termsEuropean stack, efficiencyNo infra, fastest to start

See Alibaba Qwen , Mistral AI , and DeepSeek for the other major open-weight options, and the LLM landscape 2026 for the full picture including closed providers.

When not to use it

  • You want zero infrastructure. If you have no wish to manage GPUs or a serving stack, a closed API removes that burden. Hosted Llama providers narrow the gap but you still pick and manage a vendor.
  • The license clashes with your case. The Llama Community License is not a permissive open-source license. If you need Apache 2.0 style freedom, a Qwen or Mistral open model may fit better.
  • You need the single strongest general model regardless of openness. Frontier closed models may lead on specific tasks. Benchmark against your own workload before committing.
  • Your deployment crosses the license thresholds. Very large-scale deployers face extra conditions. Confirm your obligations with legal counsel first.

Further reading

Sources