CHAI Agentic AI — Unified T&E Framework
Interactive view of the unified agentic AI testing and evaluation framework.
Metrics are mapped across 6 action types × 10 risk domains. The first three columns (Tool Calling, Computer Use, Code Gen & Exec) represent agentic action mechanisms; the last three (Data Interaction, Context & Hand-off, Task Execution) capture broader agent workflow capabilities. The risk domains span operational, technical, safety, and governance concerns, and metrics are re-mapped between domains where needed, each with a domain-specific rationale. A separate cross-cutting metrics section groups evaluations that apply across agentic systems rather than to a single action type. Each metric card carries a full definition, rationale, and benchmark; gap cards mark where T&E coverage is currently missing. Each risk domain is tagged Structural, Operating, or Hybrid to distinguish pre-deployment governance controls from runtime monitoring and oversight needs.
Tool Calling
Pre-defined API/MCP calls. Bounded & auditable.
Computer Use
Screen nav, clicks, typing. Broad action space.
Code Gen & Exec
Writes & runs novel code. Hardest to audit.
Data Interaction
How agents access & interpret healthcare data.
Context & Hand-off
How information is structured for humans & other agents.
Task Execution
How agents perform real workflows & deliverables.
Data Privacy (Structural)
Access scope, exposure, least-privilege
Liability & Irreversibility (Structural)
Accountability, irreversible actions, oversight
Prompt Injection (Operating)
Adversarial robustness, hijacked behavior
Malicious Use (Operating)
Harmful outputs, weaponizable capabilities
Third-Party & Supply Chain (Structural)
External tools, APIs, data source trust
Agent-to-Agent (Operating)
Multi-agent interactions, coordination, error propagation
Life & Patient Safety (Hybrid)
Proximity to care, harm severity, vulnerable populations, override availability
Technology & Data (Structural)
CIA triad, PHI exposure, data quality, security, lifecycle, AI traceability
Autonomy & Delegated Authority (Hybrid)
Action scope, escalation triggers, delegation depth, oversight boundaries, behavioral drift
Financial Harm & Market Integrity (Operating)
Payment errors, unauthorized transactions, discriminatory financial outcomes, market stability
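The grid described above can be pictured as a small data structure. The following is a minimal sketch, not part of the framework itself: the names `ACTION_TYPES`, `RISK_DOMAINS`, `Cell`, `empty_grid`, and `coverage_gaps` are hypothetical, and the cells are left empty because the individual metrics are not reproduced in this overview. Only the column labels, row labels, and Structural/Operating/Hybrid tags are taken from the listing above.

```python
from dataclasses import dataclass, field

# Column labels from the listing above: three action mechanisms,
# then three workflow capabilities.
ACTION_TYPES = [
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
]

# Row labels mapped to their tag, as given in the listing above.
RISK_DOMAINS = {
    "Data Privacy": "Structural",
    "Liability & Irreversibility": "Structural",
    "Prompt Injection": "Operating",
    "Malicious Use": "Operating",
    "Third-Party & Supply Chain": "Structural",
    "Agent-to-Agent": "Operating",
    "Life & Patient Safety": "Hybrid",
    "Technology & Data": "Structural",
    "Autonomy & Delegated Authority": "Hybrid",
    "Financial Harm & Market Integrity": "Operating",
}


@dataclass
class Cell:
    """One (risk domain, action type) cell; hypothetical representation."""
    metrics: list = field(default_factory=list)

    @property
    def is_gap(self):
        # Assumption: a cell with no mapped metrics is treated as a coverage gap.
        return not self.metrics


def empty_grid():
    """Build the 10 x 6 grid with every cell initially empty."""
    return {(domain, action): Cell()
            for domain in RISK_DOMAINS
            for action in ACTION_TYPES}


def coverage_gaps(grid):
    """Return the (domain, action) keys whose cells have no metrics."""
    return [key for key, cell in grid.items() if cell.is_gap]


grid = empty_grid()
print(len(grid), "cells;", len(coverage_gaps(grid)), "gaps before any metric is mapped")
```

Keying cells by `(domain, action)` keeps the gap check a simple scan; a real implementation would presumably load the metric cards from the framework's own data rather than start from an empty grid.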
Cross-cutting metrics — verified to apply across all 6 action types
Only metrics whose definition in the CHAI T&E framework is mechanism-agnostic are listed here. Metrics specific to multi-agent interactions, a single interface type, or a clinical sub-domain are placed in their respective cells above.
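The placement rule above can be sketched as a simple check. This is an illustration under stated assumptions: `is_cross_cutting` and the two placeholder metric names are hypothetical, and "mechanism-agnostic" is approximated here as "applies to all six action types", which may be narrower or broader than the framework's own verification.

```python
# The six action types from the column headers of the framework.
ALL_ACTION_TYPES = frozenset({
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
})


def is_cross_cutting(applies_to):
    """True if a metric applies to every action type (assumed criterion)."""
    return ALL_ACTION_TYPES <= frozenset(applies_to)


# Placeholder metrics for illustration only; not taken from the framework.
metrics = {
    "placeholder_metric_a": ALL_ACTION_TYPES,   # applies everywhere
    "placeholder_metric_b": {"Computer Use"},   # single interface type
}

cross_cutting = [name for name, scope in metrics.items() if is_cross_cutting(scope)]
cell_specific = [name for name in metrics if name not in cross_cutting]
print("cross-cutting:", cross_cutting)
print("placed in cells:", cell_specific)
```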