Evaluation

25 articles Use search to find specific topics

Recent articles Showing 24 of 25

LLM-as-a-Judge Using a language model as an automated evaluator of another model's outputs: methodology, calibration with …

Testing RAG Systems How to test Retrieval-Augmented Generation systems: unit testing chunking, integration testing retrieval …

Testing LLM Applications LLM-specific testing strategies: prompt template testing, structured output validation, guardrail …

Testing and Evaluating AI Agent Performance Frameworks for evaluating AI agents that plan, use tools, and take actions, covering correctness, reliability, …

ai-agents evaluation

Red Teaming and Adversarial Testing for AI Systems How to plan and execute red team exercises that systematically probe AI systems for vulnerabilities, biases, …

red-teaming adversarial-testing

Red Teaming What red teaming is in AI, how adversarial testing discovers vulnerabilities and failure modes before …

red-teaming adversarial-testing

RAG Evaluation Methods and metrics for measuring the quality of Retrieval Augmented Generation systems, covering retrieval …

LLMOps Pipeline Production pipeline design for LLM-specific operations: prompt management, evaluation, deployment, monitoring, …

llmops pipeline

LLM Evaluation Methods - Measuring Language Model Quality A comprehensive guide to evaluating large language models, covering automated metrics (BLEU, ROUGE, …

Ground Truth What ground truth is in machine learning, how verified correct labels are obtained, and why ground truth …

ground-truth labels

Golden Dataset What a golden dataset is, how it serves as a curated evaluation benchmark for measuring AI model quality, and …

golden-dataset testing

Full-Stack Observability for AI Systems How to implement comprehensive observability for AI applications covering traces, evaluations, metrics, and …

observability monitoring

Framework for Evaluating and Selecting AI Vendors A structured approach to evaluating AI vendors covering technical capabilities, data handling, compliance, …

vendor-selection evaluation

Evaluator-Optimizer Pattern Automated evaluation loops where one model generates output and another evaluates it, driving iterative …

evaluation optimization

Evaluating RAG System Quality How to measure and improve both retrieval quality and generation quality in RAG systems, with practical …

Embedding Model Comparison and Selection Guide How to choose embedding models for semantic search, RAG, and similarity tasks, comparing popular models across …

embeddings models

DeepEval vs Promptfoo for LLM Evaluation in CI Comparing DeepEval and Promptfoo for automated LLM evaluation: metrics, CI integration, configuration, …

deepeval promptfoo

Architecture Decision Records and Evaluation Methods Using ADRs and architecture evaluation methods like ATAM to document and assess architecture decisions in …

architecture ADR

AI Spark: Smart Vendor Evaluation Scoring Use AI to standardize vendor evaluation by scoring proposals against weighted criteria and generating …

procurement vendor-management

AI Red Team A dedicated adversarial testing team that probes AI systems for vulnerabilities, biases, safety failures, and …

red-team ai-security

A/B Testing for AI Systems How to design and run A/B tests for AI models and features, covering experiment design, traffic splitting, …

a-b-testing experimentation

Testing AI Systems - Unit Tests to Production Monitoring A practical testing strategy for AI systems: property-based testing, integration testing with mocked models, …

software-engineering intermediate

The Use Case Scoring Framework - From 57 Ideas to 3 Prototypes A structured WSJF-inspired scoring methodology to cut through workshop noise and identify the AI use cases …

project-management intermediate

Model Cards - AI Transparency Documentation What model cards document, why they matter for AI governance, and how to create one.

25 articles in this section. Search for a specific topic.

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session