AI systems require documentation that traditional software does not: model cards, experiment reports, data dictionaries, and fairness assessments. At the same time, the standard documentation (API docs, design docs, runbooks) needs adaptation for probabilistic systems. This guide covers how to write effective documentation for AI/ML systems.

API Documentation

AI serving APIs need documentation beyond standard endpoint references. Include:

Input specification with constraints. Not just “accepts a JSON object with a text field” but “accepts a text field containing 1-5000 UTF-8 characters in English, French, or German. Inputs exceeding 5000 characters are truncated. Non-supported languages return a 422 error.”

Output specification with confidence scores. Document what confidence scores mean and how consumers should use them. “The confidence field ranges from 0.0 to 1.0. Values above 0.85 indicate high confidence suitable for automated processing. Values between 0.5 and 0.85 should be routed for human review. Values below 0.5 indicate the model is uncertain and the prediction should not be used without human verification.”

Error responses specific to ML. Beyond standard HTTP errors, document model-specific failure modes: “If the model cannot produce a prediction within the latency SLA, the endpoint returns a 503 with a fallback_prediction field containing a rule-based default.”

Rate limits and costs. AI inference is expensive. Document rate limits, quota management, and per-request costs so consumers can budget their usage.

Versioning. Document which model version is serving, how to pin to a specific version, and the deprecation policy for old versions.

Model Cards

Model cards, introduced by Mitchell et al. (2019), are standardized documentation for trained models. A model card should include:

Model details - Architecture, training algorithm, framework, version, owner, and date trained.

Intended use - Primary use cases, out-of-scope uses (explicitly state what the model should not be used for), and target users.

Training data - Data sources, size, date range, preprocessing steps, and known limitations (biases in the data, underrepresented populations, labeling methodology).

Evaluation data - Test set description, how it was constructed, and its relationship to production data.

Performance metrics - Primary metrics with confidence intervals, broken down by relevant subgroups (demographics, input categories, difficulty levels). Include both aggregate and disaggregated metrics.

Ethical considerations - Known biases, potential harms, and mitigation strategies. Be honest. A model card that claims no ethical concerns is not credible.

Limitations and recommendations - Known failure modes, conditions under which performance degrades, and recommendations for monitoring in production.

Design Documents

AI design documents follow the standard structure (context, goals, non-goals, design, alternatives considered) with ML-specific additions:

Data strategy section. Describe the data sources, data pipeline, feature engineering approach, and data quality requirements. This section often determines feasibility and should be reviewed as carefully as the system design.

Experiment plan. Before building the full system, describe the experiments that will validate the approach. What baselines will be compared? What metrics will determine success? What is the timeline for experimentation?

ML-specific trade-offs. Document the accuracy/latency/cost trade-off space and where the proposed design sits within it. A design doc that proposes a transformer model should explain why the accuracy gain justifies the latency and cost increase over a simpler model.

Runbooks

AI system runbooks must cover failure modes unique to ML:

Model performance degradation. Steps to diagnose whether degradation is caused by data drift, code changes, or infrastructure issues. Include commands to check input data distributions, compare against baseline metrics, and determine whether retraining is needed.

Feature pipeline failures. Steps to identify which features are stale or missing, assess the impact on predictions, and either fix the pipeline or activate fallback features.

Model rollback. Step-by-step instructions to revert to a previous model version, including any feature pipeline changes required to support the older model.

Data quality incidents. Steps to identify corrupted or anomalous data, quarantine affected records, assess whether the production model was affected, and determine whether retraining is necessary.

Writing Principles for AI Documentation

Be precise about uncertainty. Say “the model achieves 92% accuracy on the test set” not “the model is highly accurate.” Quantify everything that can be quantified.

Write for the operator, not just the developer. The person debugging the system at 3am needs runbooks with specific commands, not architectural prose.

Update documentation as part of the PR process. If a model change requires a documentation update, the PR should not be merged without it. Stale documentation is worse than no documentation because it creates false confidence.