CHAI Agentic AI — Unified T&E Framework

Interactive view of the unified agentic AI testing and evaluation framework.

Metrics are mapped across 6 action types × 10 risk domains. The first three columns (Tool Calling, Computer Use, Code Gen & Exec) represent agentic action mechanisms; the last three (Data Interaction, Context & Hand-off, Task Execution) capture broader agent workflow capabilities. The risk domains combine operational, technical, safety, and governance concerns, and metrics are re-mapped where needed, each with a domain-specific rationale. A cross-cutting metrics section complements the matrix by grouping evaluations that apply across agentic systems rather than to a single action type.

Each risk domain is tagged as Structural, Operating, or Hybrid to distinguish pre-deployment governance controls from runtime monitoring and oversight needs. T&E coverage gaps are marked explicitly: gap cards identify the cells where evaluation coverage is currently missing. Click any card to open its full definition, rationale, and benchmark in a detail panel.

Coverage gap: evaluation coverage is currently missing

Structural: before agents run

Operating: while agents run

Hybrid: before and during use

Action types (columns)

Tool Calling: Pre-defined API/MCP calls. Bounded & auditable.

Computer Use: Screen nav, clicks, typing. Broad action space.

Code Gen & Exec: Writes & runs novel code. Hardest to audit.

Data Interaction: How agents access & interpret healthcare data.

Context & Hand-off: How information is structured for humans & other agents.

Task Execution: How agents perform real workflows & deliverables.

Risk domains (rows)

Data Privacy (Structural): Access scope, exposure, least-privilege

Liability & Irreversibility (Structural): Accountability, irreversible actions, oversight

Prompt Injection (Operating): Adversarial robustness, hijacked behavior

Malicious Use (Operating): Harmful outputs, weaponizable capabilities

Third-Party & Supply Chain (Structural): External tools, APIs, data source trust

Agent-to-Agent (Operating): Multi-agent interactions, coordination, error propagation

Life & Patient Safety (Hybrid): Proximity to care, harm severity, vulnerable populations, override availability

Technology & Data (Structural): CIA triad, PHI exposure, data quality, security, lifecycle, AI traceability

Autonomy & Delegated Authority (Hybrid): Action scope, escalation triggers, delegation depth, oversight boundaries, behavioral drift

Financial Harm & Market Integrity (Operating): Payment errors, unauthorized transactions, discriminatory financial outcomes, market stability
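
For readers who want to work with the matrix programmatically, the sketch below shows one possible way to encode it. The action types, risk domains, and Structural/Operating/Hybrid tags are taken from the lists above; everything else (the `Phase` enum, the `Matrix` class, its method names, and the placeholder metric name) is a hypothetical illustration, not part of the CHAI framework itself. A cell with no metrics is treated as a coverage gap.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """Risk-domain tags from the legend (enum name is an assumption)."""
    STRUCTURAL = "before agents run"
    OPERATING = "while agents run"
    HYBRID = "before and during use"


# The six action types (matrix columns), as named in the framework view.
ACTION_TYPES = [
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
]

# The ten risk domains (matrix rows) with their tags.
RISK_DOMAINS = {
    "Data Privacy": Phase.STRUCTURAL,
    "Liability & Irreversibility": Phase.STRUCTURAL,
    "Prompt Injection": Phase.OPERATING,
    "Malicious Use": Phase.OPERATING,
    "Third-Party & Supply Chain": Phase.STRUCTURAL,
    "Agent-to-Agent": Phase.OPERATING,
    "Life & Patient Safety": Phase.HYBRID,
    "Technology & Data": Phase.STRUCTURAL,
    "Autonomy & Delegated Authority": Phase.HYBRID,
    "Financial Harm & Market Integrity": Phase.OPERATING,
}


@dataclass
class Matrix:
    """Hypothetical container: (risk domain, action type) -> metric names."""
    cells: dict = field(default_factory=dict)

    def add_metric(self, domain: str, action: str, metric: str) -> None:
        # Reject anything that is not a known row or column.
        if domain not in RISK_DOMAINS or action not in ACTION_TYPES:
            raise KeyError((domain, action))
        self.cells.setdefault((domain, action), []).append(metric)

    def coverage_gaps(self) -> list:
        # A gap is any cell that has no metric mapped to it.
        return [
            (d, a)
            for d in RISK_DOMAINS
            for a in ACTION_TYPES
            if not self.cells.get((d, a))
        ]


def domains_by_phase(phase: Phase) -> list:
    """List risk domains carrying the given tag, in framework order."""
    return [d for d, p in RISK_DOMAINS.items() if p is phase]


if __name__ == "__main__":
    m = Matrix()
    # "example-metric" is a placeholder, not a real CHAI metric.
    m.add_metric("Data Privacy", "Tool Calling", "example-metric")
    print(len(m.coverage_gaps()), "cells without metrics")
    print(domains_by_phase(Phase.HYBRID))
```

An empty matrix reports all 60 cells as gaps; populating it with the real metric cards would leave only the cells the framework marks as coverage gaps.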

Cross-cutting metrics — verified to apply across all 6 action types

Only metrics whose definition in the CHAI T&E framework is mechanism-agnostic are listed here. Metrics specific to multi-agent interactions, a single interface type, or a clinical sub-domain are placed in their respective cells above.
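
The placement rule above can be sketched as a simple check, assuming each metric records the set of action types it applies to. The function name and the sample inputs are hypothetical; the framework itself decides placement from each metric's definition, which this sketch only approximates.

```python
# The six action types from the framework view.
ACTION_TYPES = frozenset({
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
})


def is_cross_cutting(applies_to) -> bool:
    """Return True only if the metric applies to every action type."""
    applies_to = set(applies_to)
    unknown = applies_to - ACTION_TYPES
    if unknown:
        raise ValueError(f"unknown action types: {sorted(unknown)}")
    return applies_to == ACTION_TYPES


if __name__ == "__main__":
    # Hypothetical inputs, not real CHAI metrics.
    print(is_cross_cutting(ACTION_TYPES))                      # applies everywhere
    print(is_cross_cutting({"Tool Calling", "Computer Use"}))  # belongs in specific cells
```

A metric that fails this check would be placed in the individual matrix cells it applies to rather than in the cross-cutting section.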