Xinity
Xinity is open-source sovereign AI infrastructure software: an OpenAI-compatible engine that runs large language models entirely on your own hardware, with zero data egress.

Xinity is open-source software for running generative AI entirely on your own infrastructure. It provides an OpenAI-compatible API in front of models hosted on your own GPUs, so existing applications keep working after you change one thing: the endpoint URL. The point is sovereignty. Data, models, and compute all stay inside your premises and your jurisdiction, with zero data egress to an external provider. Xinity targets regulated European enterprises (media, manufacturing, and public institutions) that cannot send prompts or documents to a foreign cloud. It ships as two layers: an open-source engine and a paid enterprise platform.
Where Xinity sits
Xinity is the control and serving layer between your applications and the GPUs in your building. Your app calls the gateway; the gateway routes to model runtimes on your GPU nodes.
What it is made of
Xinity is a set of components, most of them Apache 2.0 licensed:
- Gateway: the OpenAI-compatible API proxy that handles routing and rate limiting.
- Daemon: the model runtime that runs on GPU nodes, backed by Ollama or vLLM.
- Infoserver: the model registry and configuration server.
- Database layer: PostgreSQL and Redis for state.
- Dashboard: a management interface with role-based access and single sign-on. This component uses the Elastic License v2 with a free tier for one organisation and one node, while the gateway, daemon, CLI, infoserver, and database schema are Apache 2.0.
Because it exposes an OpenAI-compatible API and can run open-weight or your own models, Xinity acts as a drop-in replacement for a hosted API without the data leaving your control.
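To make the request path concrete, here is a toy model of the routing step: a client-facing model name is looked up and mapped to a runtime on a GPU node. This is an illustrative sketch only, not Xinity's implementation. The `Deployment` and `ToyGateway` classes, their fields, and the node address are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record linking a client-facing model name to a runtime."""
    public_specifier: str   # name clients put in the "model" field
    model_specifier: str    # name the runtime (for example Ollama) uses
    node_url: str           # hypothetical address of a daemon on a GPU node
    enabled: bool = True


class ToyGateway:
    """Illustrative stand-in for the routing step; not Xinity's code."""

    def __init__(self, deployments):
        self._by_name = {d.public_specifier: d for d in deployments}

    def resolve(self, requested_model: str) -> Deployment:
        # Reject unknown or disabled models before any GPU node is contacted.
        deployment = self._by_name.get(requested_model)
        if deployment is None or not deployment.enabled:
            raise LookupError(f"no enabled deployment for model {requested_model!r}")
        return deployment


gateway = ToyGateway([
    Deployment("phi-3-mini", "phi3:mini", "http://gpu-node-1:11434"),
])
target = gateway.resolve("phi-3-mini")
print(target.model_specifier, "on", target.node_url)
```

The design point is that applications only ever see the public name, so the runtime or node behind it can change without touching client code.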
Installing and running Xinity
Install the CLI, then bring the platform up.

```bash
# Install the CLI
curl -fsSL https://get.xinity.ai/install.sh | bash
# Bring up all services (gateway, daemon, infoserver, dashboard, DB)
xinity up all
```

Deploy a model so the gateway can serve it:

```bash
xinity act deployment.create '{
  "name": "Phi-3 Mini",
  "publicSpecifier": "phi-3-mini",
  "modelSpecifier": "phi3:mini",
  "enabled": true
}'
```

Running the platform needs Docker and Docker Compose on the host. Model runtimes need local GPUs for anything beyond small models.
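If you script deployments, building the JSON payload in code avoids hand-quoting mistakes. The sketch below assembles the same `xinity act deployment.create` call shown above. The `build_deploy_command` helper is hypothetical, and the payload fields are taken only from the example, so check the CLI documentation for the full schema.

```python
import json
import shlex


def build_deploy_command(name, public_specifier, model_specifier, enabled=True):
    """Return the argv list for `xinity act deployment.create` with a JSON payload."""
    payload = {
        "name": name,
        "publicSpecifier": public_specifier,
        "modelSpecifier": model_specifier,
        "enabled": enabled,
    }
    return ["xinity", "act", "deployment.create", json.dumps(payload)]


command = build_deploy_command("Phi-3 Mini", "phi-3-mini", "phi3:mini")
# shlex.join produces a copy-pasteable shell line with the JSON safely quoted.
print(shlex.join(command))
# To actually run it (requires the xinity CLI on PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` sidesteps shell quoting entirely; the joined string is only for display or logging.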
Calling it like OpenAI
The whole value proposition is that your existing code barely changes. Point any OpenAI client at the local gateway:

```python
from openai import OpenAI

# The only change from a hosted provider is the base_url
client = OpenAI(base_url="http://localhost:3000/v1", api_key="sk_...")
resp = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "Summarise this contract clause."}],
)
print(resp.choices[0].message.content)
```

The same request over plain HTTP:

```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk_..." \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3-mini", "messages": [{"role": "user", "content": "Hello"}]}'
```

The path to a sovereign endpoint
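Because the only change is the endpoint URL, one way to make the move gradual is to read that URL from configuration, so the same code can target a hosted provider or the local gateway. The sketch below uses only the Python standard library and builds the request without sending it. The environment variable names `LLM_BASE_URL` and `LLM_API_KEY` and the helper names are assumptions for illustration, not anything Xinity defines.

```python
import json
import os
import urllib.request

# Default taken from the examples above; override it through the environment.
DEFAULT_BASE_URL = "http://localhost:3000/v1"


def build_chat_request(prompt, model="phi-3-mini", env=None):
    """Build (but do not send) a chat completions request for the configured endpoint."""
    env = os.environ if env is None else env
    # LLM_BASE_URL and LLM_API_KEY are hypothetical variable names.
    base_url = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    api_key = env.get("LLM_API_KEY", "sk_...")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the first choice's text (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


request = build_chat_request("Hello", env={})
print(request.get_method(), request.full_url)
```

Calling `send(request)` performs the actual call once a gateway is running. Switching targets is then a configuration change rather than a code change.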
Xinity uses capacity-based pricing (you pay for the GPU resources you run, not per token) with published tiers that run from a free Community tier up to enterprise plans. It positions its audit trails as mapped to EU AI Act articles, and its compliance story around GDPR, the EU AI Act, and NIS2. Treat those as vendor claims to verify against your own obligations.
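To compare capacity-based pricing with per-token billing, it helps to find the monthly token volume at which the two cost the same. The sketch below is plain arithmetic. The figures are placeholders chosen for illustration, not Xinity's or any provider's prices, and it ignores hardware, power, and staffing costs.

```python
def break_even_tokens_per_month(monthly_capacity_cost, price_per_million_tokens):
    """Monthly token volume at which a flat capacity cost equals per-token billing."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return monthly_capacity_cost / price_per_million_tokens * 1_000_000


# Placeholder figures for illustration only; substitute your own quotes.
capacity_cost = 2_000.0   # flat monthly cost of the capacity you run
per_million = 4.0         # blended hosted price per million tokens

tokens = break_even_tokens_per_month(capacity_cost, per_million)
print(f"Break-even at about {tokens / 1e6:.0f} million tokens per month")
```

Below the break-even volume, per-token billing is cheaper on this simplified view; above it, the flat capacity cost wins.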
How it compares
| | Xinity | Ollama | vLLM | Hosted cloud API |
|---|---|---|---|---|
| Deployment | On-premise platform | Local runtime | Serving framework | Fully managed |
| Sovereignty | Full, zero egress | Full (local) | Full (self-hosted) | Provider-controlled |
| OpenAI-compatible | Yes | Yes | Yes | Native |
| Management | Dashboard, RBAC, SSO, audit | Minimal | Minimal | Provider console |
| Best for | Regulated enterprise fleets | Single-machine local use | High-throughput serving | Fastest path, least control |
Xinity sits above runtimes like Ollama and vLLM rather than replacing them; it uses them underneath and adds the gateway, governance, and multi-node management an enterprise needs.
When not to use Xinity
- You have no sovereignty requirement. If your data can go to a cloud, a hosted API such as Azure OpenAI or Amazon Bedrock is faster to adopt and needs no operations.
- You have no GPUs to run it on. Xinity serves models on hardware you own or control. For rented compute, see GPU clouds and neoclouds.
- You only need one model on one machine. For a single local model with no governance needs, Ollama alone is simpler.
- You want peak single-endpoint throughput and nothing else. A tuned vLLM or SGLang deployment may serve raw throughput needs more directly.
- You need a large, proven ecosystem. Xinity is a newer, focused product. If you need a wide partner and support ecosystem today, weigh the larger sovereign stacks from established vendors.
Further reading
- What is sovereign AI?: the concept and the 2026 landscape Xinity fits into.
- What is data sovereignty?: the narrower data-control idea.
- On-premise vs cloud AI: the core trade-off Xinity addresses.
- Hybrid and multicloud AI: splitting sensitive and non-sensitive workloads.
- Ollama: a local runtime Xinity can use underneath.
- Xinity website: official product, pricing, and documentation.
- Xinity on GitHub: the open-source engine and install instructions.
Sources
- Xinity. Sovereign AI Infrastructure Software for European Enterprises. https://xinity.ai/
- Xinity. Open-source engine (GitHub, components and install). https://github.com/xinity-ai/xinity-ai
- Xinity. Pricing. https://xinity.ai/sovereign-ai-pricing