Evaluation

25 articles (24 listed below)
LLM-as-a-Judge
Using a language model as an automated evaluator of another model's outputs: methodology, calibration with …

Testing RAG Systems
How to test Retrieval-Augmented Generation systems: unit testing chunking, integration testing retrieval …

Testing LLM Applications
LLM-specific testing strategies: prompt template testing, structured output validation, guardrail …

Testing and Evaluating AI Agent Performance
Frameworks for evaluating AI agents that plan, use tools, and take actions, covering correctness, reliability, …

Red Teaming and Adversarial Testing for AI Systems
How to plan and execute red team exercises that systematically probe AI systems for vulnerabilities, biases, …

Red Teaming
What red teaming is in AI, how adversarial testing discovers vulnerabilities and failure modes before …

RAG Evaluation
Methods and metrics for measuring the quality of Retrieval Augmented Generation systems, covering retrieval …

LLMOps Pipeline
Production pipeline design for LLM-specific operations: prompt management, evaluation, deployment, monitoring, …

LLM Evaluation Methods - Measuring Language Model Quality
A comprehensive guide to evaluating large language models, covering automated metrics (BLEU, ROUGE, …

Ground Truth
What ground truth is in machine learning, how verified correct labels are obtained, and why ground truth …

Golden Dataset
What a golden dataset is, how it serves as a curated evaluation benchmark for measuring AI model quality, and …

Full-Stack Observability for AI Systems
How to implement comprehensive observability for AI applications covering traces, evaluations, metrics, and …

Framework for Evaluating and Selecting AI Vendors
A structured approach to evaluating AI vendors covering technical capabilities, data handling, compliance, …

Evaluator-Optimizer Pattern
Automated evaluation loops where one model generates output and another evaluates it, driving iterative …

Evaluating RAG System Quality
How to measure and improve both retrieval quality and generation quality in RAG systems, with practical …

Embedding Model Comparison and Selection Guide
How to choose embedding models for semantic search, RAG, and similarity tasks, comparing popular models across …

DeepEval vs Promptfoo for LLM Evaluation in CI
Comparing DeepEval and Promptfoo for automated LLM evaluation: metrics, CI integration, configuration, …

Architecture Decision Records and Evaluation Methods
Using ADRs and architecture evaluation methods like ATAM to document and assess architecture decisions in …

AI Spark: Smart Vendor Evaluation Scoring
Use AI to standardize vendor evaluation by scoring proposals against weighted criteria and generating …

AI Red Team
A dedicated adversarial testing team that probes AI systems for vulnerabilities, biases, safety failures, and …

A/B Testing for AI Systems
How to design and run A/B tests for AI models and features, covering experiment design, traffic splitting, …

Testing AI Systems - Unit Tests to Production Monitoring
A practical testing strategy for AI systems: property-based testing, integration testing with mocked models, …

The Use Case Scoring Framework - From 57 Ideas to 3 Prototypes
A structured WSJF-inspired scoring methodology to cut through workshop noise and identify the AI use cases …

Model Cards - AI Transparency Documentation
What model cards document, why they matter for AI governance, and how to create one.
