Xinity is a gateway you own. Apps point at a local endpoint, and prompts never cross your boundary.

Xinity is open-source software for running generative AI entirely on your own infrastructure. It provides an OpenAI-compatible API in front of models hosted on your own GPUs, so existing applications keep working after you change one thing: the endpoint URL. The point is sovereignty. Data, models, and compute all stay inside your premises and your jurisdiction, with zero data egress to an external provider. Xinity targets regulated European enterprises (media, manufacturing, and public institutions) that cannot send prompts or documents to a foreign cloud. It ships as two layers: an open-source engine and a paid enterprise platform.

Where Xinity sits

Xinity is the control and serving layer between your applications and the GPUs in your building. Your app calls the gateway; the gateway routes to model runtimes on your GPU nodes.

  • Application: existing apps and agents. Point OpenAI SDKs at a local endpoint.
  • Gateway and control: OpenAI-compatible gateway, dashboard (RBAC, SSO), and infoserver (model registry). Routing, rate limiting, audit trails.
  • Runtime: daemon on GPU nodes, backed by Ollama / vLLM. Runs the actual model weights.
  • Your hardware: local GPUs in an on-premise datacentre. From a single workstation to a cluster.

What it is made of

Xinity is a set of components, most of them Apache 2.0 licensed:

  • Gateway: the OpenAI-compatible API proxy that handles routing and rate limiting.
  • Daemon: the model runtime that runs on GPU nodes, backed by Ollama or vLLM.
  • Infoserver: the model registry and configuration server.
  • Database layer: PostgreSQL and Redis for state.
  • Dashboard: a management interface with role-based access and single sign-on. This component uses the Elastic License v2 with a free tier for one organisation and one node, while the gateway, daemon, CLI, infoserver, and database schema are Apache 2.0.

Because it exposes an OpenAI-compatible API and can run open-weight or your own models, Xinity acts as a drop-in replacement for a hosted API without the data leaving your control.

Installing and running Xinity

Install the CLI, then bring the platform up.

bash
# Install the CLI
curl -fsSL https://get.xinity.ai/install.sh | bash

# Bring up all services (gateway, daemon, infoserver, dashboard, DB)
xinity up all

Deploy a model so the gateway can serve it:

bash
xinity act deployment.create '{
  "name": "Phi-3 Mini",
  "publicSpecifier": "phi-3-mini",
  "modelSpecifier": "phi3:mini",
  "enabled": true
}'
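When deployments are created from automation rather than typed by hand, the JSON payload is easier to get right if it is built programmatically. The following is a minimal sketch that wraps the `xinity act deployment.create` command shown above; the helper names are hypothetical, and the field comments reflect how the examples in this article use the values (`publicSpecifier` matches the `model` name clients send, `modelSpecifier` appears to be the runtime-side identifier), not a documented schema.

```python
import json
import subprocess


def build_deployment_command(name, public_specifier, model_specifier, enabled=True):
    """Return the argv list for `xinity act deployment.create`."""
    payload = {
        "name": name,
        "publicSpecifier": public_specifier,  # name clients pass as `model` (assumed)
        "modelSpecifier": model_specifier,    # runtime-side identifier (assumed)
        "enabled": enabled,
    }
    # Passing argv as a list avoids shell quoting problems inside the JSON.
    return ["xinity", "act", "deployment.create", json.dumps(payload)]


def create_deployment(**kwargs):
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_deployment_command(**kwargs), check=True)


command = build_deployment_command("Phi-3 Mini", "phi-3-mini", "phi3:mini")
print(command[-1])
```

Calling `create_deployment(name=..., public_specifier=..., model_specifier=...)` runs the CLI, so it only works on a host where `xinity` is installed and the services are up.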

Running the platform needs Docker and Docker Compose on the host. Model runtimes need local GPUs for anything beyond small models.
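A quick preflight check can confirm these prerequisites before running the installer. This is a sketch rather than part of Xinity: the function name is hypothetical, and using `nvidia-smi` as the GPU signal assumes NVIDIA hardware.

```python
import shutil
import subprocess


def check_prerequisites(which=shutil.which, run=subprocess.run):
    """Report which host prerequisites appear to be present."""
    docker = which("docker") is not None
    compose = False
    if docker:
        # `docker compose version` exits 0 when the Compose plugin is installed.
        try:
            result = run(["docker", "compose", "version"],
                         capture_output=True, check=False)
            compose = result.returncode == 0
        except OSError:
            compose = False
    return {
        "docker": docker,
        "docker_compose": compose,
        # Assumption: NVIDIA GPUs, detected by the presence of nvidia-smi on PATH.
        "nvidia_gpu_tooling": which("nvidia-smi") is not None,
    }


def report(results):
    """Format the check results as one line per prerequisite."""
    return [f"{item}: {'found' if ok else 'missing'}" for item, ok in results.items()]
```

Printing `report(check_prerequisites())` on the target host gives a short found/missing summary. The `which` and `run` parameters exist only so the check can be exercised without Docker installed.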

Calling it like OpenAI

The whole value proposition is that your existing code barely changes. Point any OpenAI client at the local gateway:

python
from openai import OpenAI

# The only change from a hosted provider is the base_url
client = OpenAI(base_url="http://localhost:3000/v1", api_key="sk_...")

resp = client.chat.completions.create(
    model="phi-3-mini",
    messages=[{"role": "user", "content": "Summarise this contract clause."}],
)
print(resp.choices[0].message.content)

The same request over plain HTTP:

bash
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk_..." \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-3-mini", "messages": [{"role": "user", "content": "Hello"}]}'
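Where installing the OpenAI SDK is not an option, the same call can be made with the Python standard library. The sketch below mirrors the curl request above; the function names are hypothetical, and the response parsing assumes the gateway returns the standard OpenAI chat completion shape.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a POST to the chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, api_key, model, prompt, timeout=60):
    """Send the request and return the assistant's reply as a string."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the gateway running, `chat("http://localhost:3000/v1", api_key, "phi-3-mini", "Hello")` sends the request; building the request separately makes the URL and headers easy to inspect without network access.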

The path to a sovereign endpoint

Step 1: Install. Run the CLI installer on a host with GPUs and Docker.
Step 2: Deploy a model. Register open-weight or your own models with the infoserver.
Step 3: Repoint apps. Change the base URL in your OpenAI SDK to the local gateway.
Step 4: Govern. Manage access and audit trails from the dashboard.
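The repointing step is simplest when the base URL is configuration rather than code. The sketch below resolves the endpoint from environment variables; `client_settings` is a hypothetical helper, while `OPENAI_BASE_URL` and `OPENAI_API_KEY` are the variable names that recent versions of the official OpenAI Python SDK read on their own.

```python
import os

# Default used by the OpenAI SDK when no base URL is configured.
OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"


def client_settings(env=None):
    """Resolve the endpoint and key from the environment (or a dict for testing)."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OPENAI_BASE_URL", OPENAI_DEFAULT_BASE_URL),
        "api_key": env.get("OPENAI_API_KEY", ""),
    }


# Repointing an app then means changing one deployment variable, for example:
#   OPENAI_BASE_URL=http://localhost:3000/v1
local = client_settings({
    "OPENAI_BASE_URL": "http://localhost:3000/v1",
    "OPENAI_API_KEY": "sk_...",
})
print(local["base_url"])
```

The resulting dictionary can be passed straight to the client as `OpenAI(**client_settings())`, so the same build runs against a hosted provider or the local gateway.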

Xinity uses capacity-based pricing (you pay for the GPU resources you run, not per token) with published tiers that run from a free Community tier up to enterprise plans. It positions its audit trails as mapped to EU AI Act articles and frames its compliance story around GDPR, the EU AI Act, and NIS2. Treat those as vendor claims to verify against your own obligations.
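One way to reason about capacity-based versus per-token pricing is a break-even volume: the monthly token count at which a flat capacity cost equals what a per-token provider would charge. The sketch below shows only that arithmetic. The function is hypothetical, and the inputs are placeholders for illustration, not prices published by Xinity or any provider.

```python
def break_even_tokens_per_month(capacity_cost_per_month, hosted_price_per_million_tokens):
    """Monthly token volume above which a flat capacity cost is the cheaper option."""
    if hosted_price_per_million_tokens <= 0:
        raise ValueError("hosted price must be positive")
    return capacity_cost_per_month / hosted_price_per_million_tokens * 1_000_000


# Placeholder inputs for illustration only; substitute your own quotes.
example = break_even_tokens_per_month(
    capacity_cost_per_month=1000.0,
    hosted_price_per_million_tokens=2.0,
)
print(f"{example:,.0f} tokens per month")
```

This ignores hardware, power, and operations costs, which belong on the capacity side of the comparison when the GPUs are your own.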

How it compares

| | Xinity | Ollama | vLLM | Hosted cloud API |
|---|---|---|---|---|
| Deployment | On-premise platform | Local runtime | Serving framework | Fully managed |
| Sovereignty | Full, zero egress | Full (local) | Full (self-hosted) | Provider-controlled |
| OpenAI-compatible | Yes | Yes | Yes | Native |
| Management | Dashboard, RBAC, SSO, audit | Minimal | Minimal | Provider console |
| Best for | Regulated enterprise fleets | Single-machine local use | High-throughput serving | Fastest path, least control |

Xinity sits above runtimes like Ollama and vLLM rather than replacing them; it uses them underneath and adds the gateway, governance, and multi-node management an enterprise needs.

When not to use Xinity

  • You have no sovereignty requirement. If your data can go to a cloud, a hosted API such as Azure OpenAI or Amazon Bedrock is faster to adopt and needs no operations.
  • You have no GPUs to run it on. Xinity serves models on hardware you own or control. For rented compute, see GPU clouds and neoclouds.
  • You only need one model on one machine. For a single local model with no governance needs, Ollama alone is simpler.
  • You want peak single-endpoint throughput and nothing else. A tuned vLLM or SGLang deployment may serve raw throughput needs more directly.
  • You need a large, proven ecosystem. Xinity is a newer, focused product. If you need a wide partner and support ecosystem today, weigh the larger sovereign stacks from established vendors.
