AI Systems Are Software Systems
Why production AI requires the same engineering discipline as any distributed system, and how this wiki covers the full stack of AI …
Step-by-step guides for implementing AI systems, from prototyping to production.
In-depth implementation guides covering every stage of the AI development lifecycle.
A comprehensive guide to GitHub Actions security vulnerabilities, common exploit patterns, and how to audit and harden your CI/CD pipelines …
The principle of defining infrastructure, configuration, documentation, policy, video, and design as version-controlled code artifacts - and …
How to design and run A/B tests for AI models and features, covering experiment design, traffic splitting, metrics selection, and …
How to apply Agile principles to AI and ML projects, addressing the unique challenges of experimentation, data dependencies, and uncertain …
A practical guide to preparing your organization and AI systems for internal and external audits, covering documentation, evidence …
How to implement cost tracking, allocation, and chargeback models for AI workloads including token-based billing, GPU hour accounting, and …
How to use AI to accelerate legacy system modernization, covering code analysis, documentation generation, migration assistance, and …
Using AI for code generation, bug detection, test generation, code review, and other software engineering tasks.
Launch playbooks for AI products covering positioning, early adopter programs, managing expectations, and scaling from beta to general …
How to implement model governance for production AI systems, covering model registries, approval workflows, audit trails, and lifecycle …
Usage-based pricing, credit systems, freemium models, and other monetization approaches for AI-powered products.
How product management changes for AI-powered products, covering requirements definition, success metrics, user experience design, and …
How to track both product metrics and model metrics for AI products, bridging the gap between business outcomes and technical performance.
A cross-regulation compliance checklist covering GDPR, EU AI Act, NIS2, DORA, and key standards for organizations deploying AI systems in …
Security considerations for AI systems, covering prompt injection, data poisoning, model theft, access control, and building …
How to structure AI teams within an organization, covering centralized vs embedded models, role definitions, reporting structures, and …
Full lifecycle cost modeling for AI platforms covering compute, data, personnel, and hidden costs that affect AI project budgets.
Guide to transparency requirements for AI systems under the EU AI Act, GDPR, and related regulations, covering disclosure, explainability, …
User research methods for AI products including Wizard-of-Oz testing, measuring user trust in AI, and designing studies for probabilistic …
Best practices for designing APIs that serve AI workloads, covering streaming responses, versioning, error handling for probabilistic …
How to version AI APIs as models evolve: URL path versioning, header versioning, model version pinning, backward compatibility, and …
Practical guide for implementing cloud governance on AWS for AI and ML workloads, covering Organizations, SCPs, tagging, cost management, …
Frameworks and techniques for prioritizing AI project backlogs, balancing business value, technical risk, data readiness, and research …
A practical guide to building production AI chatbots, covering architecture, conversation design, context management, guardrails, and …
A practical guide to establishing an AI ethics review board, from composition and charter to review processes and decision-making …
How to design and build a shared platform that enables ML teams to develop, deploy, and operate models without reinventing infrastructure …
How to build an internal developer platform for AI/ML teams: service catalogs, golden paths for model deployment, self-service GPU …
How to implement a feature store that serves consistent features for both training and inference, reducing duplication and preventing …
How to build gRPC-based microservices for ML inference: proto definitions, streaming token delivery, load balancing, health checks, and …
How to design, populate, and query knowledge graphs that enhance AI systems with structured relational knowledge.
How to right-size GPU and TPU clusters, configure autoscaling for inference workloads, manage GPU memory, and plan capacity for variable AI …
How to manage organizational change when introducing AI systems, addressing resistance, training needs, process redesign, and cultural …
Chaos engineering for AI: injecting model API latency, simulating provider outages, degraded embeddings, corrupted indexes, and verifying …
Which tests to run at each CI/CD stage: PR-level unit tests, merge-level eval suites, scheduled regression and drift detection, cost …
Guide to implementing CSPM for AI and ML workloads, covering misconfigurations, compliance monitoring, and security automation in cloud AI …
Practical guide to code review for ML projects, covering what to look for in training code, data pipelines, serving code, and experiment …
How to evaluate ML models holistically, covering performance metrics, fairness analysis, robustness testing, and business impact assessment.
A practical guide to implementing computer vision in enterprise settings, covering use cases, model selection, data requirements, and …
A structured methodology for identifying, evaluating, and mitigating risks in AI systems before and after deployment.
Step-by-step guide for conducting Data Protection Impact Assessments for AI and machine learning systems, with templates and practical …
Contract testing between AI services: defining input/output contracts, latency SLAs, Pact for AI services, provider vs consumer-driven …
How to estimate and manage costs for AI workloads on AWS, covering Bedrock, SageMaker, compute, storage, and strategies for cost …
Guide to managing international data transfers for AI systems under GDPR, covering transfer mechanisms, cloud considerations, and practical …
A guide to data anonymization techniques for AI including k-anonymity, l-diversity, t-closeness, differential privacy, and practical methods …
How to design labeling workflows, choose tools, manage annotators, and ensure label quality for ML training data.
How to implement data quality validation for AI workloads using Great Expectations and Deequ: profiling, expectation suites, pipeline …
A practical guide to designing and implementing a data lakehouse architecture optimized for AI and machine learning workloads.
Practical approaches to monitoring for data drift, concept drift, and model performance degradation, with strategies for automated response.
How to assess, prepare, and govern your organization's data assets to support AI projects effectively.
How to plan disaster recovery for AI systems: RTO/RPO targets, multi-region model serving, model artifact backup, and failover strategies …
What to document for AI systems, how to structure it, and how to keep documentation current as models and data evolve.
Practical guide for implementing DORA requirements in financial services organizations that deploy AI systems for trading, risk management, …
How to deploy AI models on edge devices, covering hardware selection, model optimization, deployment strategies, and managing edge AI at …
How to choose embedding models for semantic search, RAG, and similarity tasks, comparing popular models across quality, speed, cost, and …
How to E2E test AI applications: browser automation for chatbot UIs, testing streaming responses, handling non-deterministic outputs, visual …
Practical steps for achieving compliance with the EU AI Act, covering risk classification, conformity assessment, documentation, and …
How to measure and improve both retrieval quality and generation quality in RAG systems, with practical metrics and evaluation frameworks.
Systematic approaches to feature creation, selection, and transformation for building effective machine learning models.
What feature stores are, why they matter, how to choose one, and practical implementation guidance for ML feature management.
A practical guide to federated learning, covering how it works, when to use it, implementation approaches, and challenges for enterprise …
When and how to fine-tune large language models, covering data preparation, training approaches (full fine-tuning, LoRA, QLoRA), evaluation, …
A structured approach to evaluating AI vendors covering technical capabilities, data handling, compliance, pricing, and long-term viability.
How to navigate the journey from AI proof of concept to production deployment, covering the common pitfalls, decision gates, and engineering …
How to implement comprehensive observability for AI applications covering traces, evaluations, metrics, and alerting across the entire …
A practical guide for AI and machine learning teams on meeting GDPR requirements across the ML lifecycle, from data collection through model …
A practical guide to adopting MLOps practices, moving ML models from experimental notebooks to reliable, automated production systems.
Strategies for building effective classifiers on skewed datasets, from sampling techniques to algorithm-level adjustments and evaluation …
How to hire AI and ML engineers effectively, covering role definition, sourcing, technical evaluation, and common hiring mistakes in the AI …
Practical guide to grid search, random search, and Bayesian optimization for finding optimal model configurations.
How to implement metadata management with DataHub or OpenMetadata: automated ingestion, data lineage, ownership, classification, and …
A framework for establishing AI governance structures, policies, and processes that balance innovation velocity with risk management.
How to set up automated retraining pipelines that keep ML models current as data distributions and business conditions change.
A practical guide to applying data mesh principles for decentralized data ownership and governance in organizations scaling AI across …
A practical guide to implementing the four core functions of the NIST AI RMF: Govern, Map, Measure, and Manage across your AI portfolio.
How to handle incidents in AI systems: on-call rotations, escalation policies, AI-specific runbooks, and post-incident reviews for model and …
A structured approach to detecting, triaging, mitigating, and learning from AI system failures in production.
How to integration test AI systems: testing RAG retrieval pipelines, model inference chains, tool-call sequences, and contract testing …
A practical guide to implementing an AI management system and achieving ISO/IEC 42001 certification for responsible AI governance.
Implementing Kanban for AI operations teams managing model deployments, monitoring, retraining, and incident response in production ML …
A comprehensive guide to evaluating large language models, covering automated metrics (BLEU, ROUGE, BERTScore), LLM-as-judge, human …
How to design a centralized LLM access layer that handles routing, rate limiting, cost tracking, caching, and logging across multiple model …
How to lead AI adoption efforts that succeed by addressing the human side: stakeholder alignment, workforce readiness, and cultural change.
How to treat prompts as first-class software artifacts with version control, testing, review processes, and safe deployment practices.
How to identify and manage technical debt specific to machine learning systems, covering data debt, pipeline debt, configuration debt, …
Test environment strategies for AI: local dev with mocked models, staging with real models, Docker Compose for local AI stacks, cost …
A practical guide for migrating on-premise AI and ML workloads to cloud platforms, covering assessment, planning, execution, and …
A clear comparison of ML Engineer and Data Scientist roles, covering responsibilities, skills, career paths, and guidance on which to hire …
How to automate machine learning pipelines for training, evaluation, and deployment, moving from manual notebook workflows to production …
Strategies for mocking LLM APIs, embedding services, and vector databases in tests: fixture responses, VCR pattern, deterministic stubs, and …
Practical guide to SHAP, LIME, feature importance, partial dependence plots, and other techniques for understanding ML model behavior.
A comprehensive guide to monitoring production AI systems, covering model quality, data drift, infrastructure health, and alerting …
How to design and implement a multi-cloud AI strategy covering portable ML pipelines, abstraction layers, vendor lock-in avoidance, and data …
A practical guide to building multi-modal AI applications that process text, images, audio, and video, covering architectures, use cases, …
Step-by-step guide for implementing NIS2 Directive compliance, covering risk assessment, security measures, incident reporting, and supply …
How to design and build NLP pipelines for enterprise applications, covering text processing, entity extraction, classification, and …
Practical guide to the OWASP Top 10 vulnerabilities for LLM applications, covering prompt injection, data leakage, supply chain risks, and …
A comprehensive guide to latency optimization, GPU memory management, throughput engineering, and model acceleration techniques for …
Comprehensive Playwright guide: setup, page objects, selectors, assertions, network interception for mocking AI APIs, visual comparison, …
What the EU AI Act requires, which of your AI systems are affected, and concrete steps to achieve and maintain compliance.
A concrete checklist covering model quality, infrastructure, security, monitoring, documentation, compliance, and rollback planning for …
Techniques for estimating AI project timelines, budgets, and resource requirements, accounting for the inherent uncertainty of machine …
How to design and implement prompt chains for complex AI tasks, covering chain architecture, error handling, optimization, and practical …
How to build RAG systems that handle documents containing images, tables, charts, and mixed content alongside text.
How to implement rate limiting for AI API endpoints: token bucket and sliding window algorithms, per-user and per-model limits, token-based …
Implementation guide for real-time streaming data pipelines: four-layer architecture, Flink feature computation, late-arriving data handling …
How to plan and execute red team exercises that systematically probe AI systems for vulnerabilities, biases, and failure modes before …
Practical strategies for reducing LLM API and hosting costs without sacrificing quality, from caching and routing to model selection and …
Release strategies for AI model deployments including canary releases, shadow mode, A/B testing, and rollback procedures for ML systems.
Practical guide to gathering, documenting, and managing requirements for AI projects where outputs are probabilistic and data availability …
How to implement responsible AI practices including fairness, transparency, accountability, and privacy in enterprise AI systems.
Identifying, assessing, and mitigating risks specific to AI and ML projects, from data quality to model failure to organizational …
How to scale AI infrastructure for growing workloads, covering compute scaling, model serving at scale, data infrastructure, and cost …
How to implement Scrum in ML teams, covering sprint cadence, role adaptations, backlog structure, and ceremony modifications for data …
How to manage API keys, credentials, and sensitive configuration in AI pipelines using vault integration, rotation policies, and secure …
How to integrate security scanning into AI/ML CI/CD pipelines: dependency scanning, container image analysis, model file validation, secrets …
A practical guide to establishing an AI ethics board including composition, charter development, review processes, and escalation procedures …
How to implement a model registry that tracks model versions, metadata, lineage, and approval status across the ML lifecycle.
Snapshot and golden file testing for AI: capturing expected outputs, managing updates, structural snapshots, semantic similarity assertions, …
Architecture decisions, ADRs, and trade-offs for AI systems covering serving patterns, training infrastructure, and system decomposition.
How to apply software quality practices to ML projects: code coverage for non-model code, quality gates in CI/CD, static analysis, testing …
How to run effective sprint planning sessions for AI and ML teams, covering estimation techniques, capacity planning, and handling research …
How to manage stakeholder expectations, communicate uncertainty, and build trust throughout AI project delivery from proof of concept to …
How to generate and use synthetic data for AI training, covering techniques, quality validation, privacy considerations, and practical use …
How to set up experiment tracking that makes ML research reproducible, comparable, and auditable across your team.
Understanding and managing technical debt specific to AI and ML systems, covering data debt, model debt, pipeline debt, and strategies for …
Writing API documentation, model cards, design documents, and runbooks for AI/ML systems.
Managing test data for AI: synthetic data generation, fixture design, golden datasets for regression, data versioning, anonymization, and …
How to test AI agents that use tools: mocking tool responses, testing tool selection logic, error handling, multi-step workflows, sandboxed …
Frameworks for evaluating AI agents that plan, use tools, and take actions, covering correctness, reliability, safety, and cost efficiency.
LLM-specific testing strategies: prompt template testing, structured output validation, guardrail verification, token limit testing, model …
Strategies for testing AI systems where the same input produces different outputs: statistical assertions, distribution testing, confidence …
How to test Retrieval-Augmented Generation systems: unit testing chunking, integration testing retrieval quality, testing citation accuracy, …
Comprehensive guide to time series forecasting methods including ARIMA, SARIMA, Prophet, seasonal decomposition, and practical …
A practical guide to time series forecasting for business applications, covering classical methods, machine learning approaches, deep …
How to unit test AI codebases effectively: testing prompt templates, output parsers, data validation, chunking functions, and embedding …
How to conduct UAT for probabilistic AI outputs, including test design, success criteria, and managing stakeholder expectations around error …
Strategies for driving AI adoption through structured change management, effective training programs, trust-building, and measurable …
How to choose the right vector database for your AI application, covering performance requirements, managed vs self-hosted options, and …
How to build voice-enabled AI applications, covering speech-to-text, text-to-speech, voice assistants, and real-time voice processing …
A practical comparison of waterfall and agile methodologies for AI and ML projects, including hybrid approaches and decision criteria for …
A practical guide to the three languages used across a modern AI stack: Python for agents and models, TypeScript for frontends and video …
Practical prompt engineering patterns for production AI systems: system prompts, few-shot examples, chain-of-thought, structured output, …
How sorting and search algorithms underpin AI pipeline design: complexity trade-offs, partial sorting for top-k selection, tiered analysis …
How AI system architecture evolves from monolithic single-model deployments through microservices to collaborative multi-agent systems, with …
How the four cloud deployment models apply to AI workloads: when to use managed models, platform endpoints, GPU instances, or serverless …
A detailed walkthrough of a CI/CD pipeline for AI: source control, Docker builds, model evaluation, staged deployment, and drift monitoring …
Building reliable CI/CD pipelines for AI projects: model artifact management, automated evaluation gates, GitHub Actions workflows, and …
Why IaC matters for AI reproducibility, multi-environment consistency, and cost tracking. Terraform and CDK patterns for Bedrock agents, …
Applying Open Practice Library practices to AI: Event Storming for AI use case discovery, Impact Mapping for AI value, User Story Mapping …
A practical testing strategy for AI systems: property-based testing, integration testing with mocked models, evaluation frameworks, and …
How AWS shared responsibility applies to AI and ML workloads: data, model, and infrastructure responsibilities across Bedrock and SageMaker.
How each of the original twelve-factor app principles applies to AI and LLM-based systems: model configuration, artifact management, vector …
End-to-end document automation covering intake, classification, extraction, validation, routing, and archive. AWS services at each stage.
Low-cost AI tools, quick wins in email automation and document processing, and guidance on when to invest in custom solutions.
Common fraud signals, anomaly detection approaches, rule-based versus ML-based detection, and human review workflow design for insurance and …
A practical cost breakdown for enterprise AI projects - from prototype to production - covering model inference, infrastructure, data, …
How to design AI assistants that are genuinely useful rather than technically impressive but frustrating to use. Intake design, context …
Document ingestion, chunking strategies, embedding models, vector stores, retrieval tuning, and generation with context for production RAG …
How the discipline of preparing conference talks produces better AI prototypes, clarifies system design, and accelerates learning. Covers …
How to prepare data for AI projects: assessing what you have, cleaning and normalizing it, building evaluation datasets, and setting up …
A practical introduction to Amazon Bedrock: what it is, which models are available, how pricing works, and how to get your first use case …
A practical framework for selecting the right first AI use case - prioritizing for quick wins, avoiding common traps, and setting up for a …
Preparation, agenda design, stakeholder management, use case brainstorming techniques, prioritization exercises, and gap management between …
A practical guide to AWS PoC funding (up to 10,000 EUR) and migration funding (up to 400,000 EUR) - eligibility, application process, and …
A practical introduction to multi-agent AI architectures: when to use them, how they work, and which frameworks are production-ready.
The difference between prompting and grounding. Five stages from zero context to production-ready assets. The Personal Inference Pack …