Batch Inference Patterns for AI Workloads
Processing large volumes of AI inference requests efficiently. Queue design, throughput optimization, error handling, and cost management …