CHAI Agentic AI — Unified T&E Framework

Interactive view of the unified agentic AI testing and evaluation framework.

Metrics are mapped across 6 action types × 10 risk domains. The first three columns (Tool Calling, Computer Use, Code Gen & Exec) represent agentic action mechanisms, while the last three (Data Interaction, Context & Hand-off, Task Execution) capture broader agent workflow capabilities. The framework combines operational, technical, safety, and governance-oriented risk domains, and metrics are re-mapped where needed using domain-specific rationales. The cross-cutting metrics section complements this structure by grouping evaluations that apply broadly across agentic systems rather than to a single action type.

Each risk domain is tagged as Structural, Operating, or Hybrid to distinguish pre-deployment governance controls from runtime monitoring and oversight needs. T&E coverage gaps are marked explicitly throughout the framework: gap cards identify where evaluation coverage is currently missing. Click any card to open its full definition, rationale, and benchmark in a detail panel.

Structural: before agents run
Operating: while agents run
Hybrid: before and during use

Rows: risk domains. Columns: action types.

Action types (columns)

Tool Calling: Pre-defined API/MCP calls. Bounded & auditable.
Computer Use: Screen nav, clicks, typing. Broad action space.
Code Gen & Exec: Writes & runs novel code. Hardest to audit.
Data Interaction: How agents access & interpret healthcare data.
Context & Hand-off: How information is structured for humans & other agents.
Task Execution: How agents perform real workflows & deliverables.

Risk domains (rows)

Data Privacy (Structural): Access scope, exposure, least-privilege
Liability & Irreversibility (Structural): Accountability, irreversible actions, oversight
Prompt Injection (Operating): Adversarial robustness, hijacked behavior
Malicious Use (Operating): Harmful outputs, weaponizable capabilities
Third-Party & Supply Chain (Structural): External tools, APIs, data source trust
Agent-to-Agent (Operating): Multi-agent interactions, coordination, error propagation
Life & Patient Safety (Hybrid): Proximity to care, harm severity, vulnerable populations, override availability
Technology & Data (Structural): CIA triad, PHI exposure, data quality, security, lifecycle, AI traceability
Autonomy & Delegated Authority (Hybrid): Action scope, escalation triggers, delegation depth, oversight boundaries, behavioral drift
Financial Harm & Market Integrity (Operating): Payment errors, unauthorized transactions, discriminatory financial outcomes, market stability
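
For readers who want to work with the grid programmatically, the 6 × 10 layout above can be sketched as a small data structure. This is only an illustrative sketch, not the framework's own schema: the names `Tag`, `Cell`, and `build_grid` are assumptions, as is the convention that a cell with no metrics counts as a coverage gap. The metric placement in the example is hypothetical and is not taken from the framework.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Tag(Enum):
    """Risk-domain tags as given in the page legend."""
    STRUCTURAL = "before agents run"
    OPERATING = "while agents run"
    HYBRID = "before and during use"


# Column labels, copied from the framework's action types.
ACTION_TYPES = [
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
]

# Row labels with their tags, copied from the framework's risk domains.
RISK_DOMAINS = {
    "Data Privacy": Tag.STRUCTURAL,
    "Liability & Irreversibility": Tag.STRUCTURAL,
    "Prompt Injection": Tag.OPERATING,
    "Malicious Use": Tag.OPERATING,
    "Third-Party & Supply Chain": Tag.STRUCTURAL,
    "Agent-to-Agent": Tag.OPERATING,
    "Life & Patient Safety": Tag.HYBRID,
    "Technology & Data": Tag.STRUCTURAL,
    "Autonomy & Delegated Authority": Tag.HYBRID,
    "Financial Harm & Market Integrity": Tag.OPERATING,
}


@dataclass
class Cell:
    risk_domain: str
    action_type: str
    metrics: list[str]  # assumed convention: empty list means coverage gap

    @property
    def is_gap(self) -> bool:
        return not self.metrics


def build_grid(metrics_by_cell):
    """Return one Cell per (risk domain, action type) pair, 60 in total."""
    return [
        Cell(domain, action, list(metrics_by_cell.get((domain, action), [])))
        for domain, action in product(RISK_DOMAINS, ACTION_TYPES)
    ]


# Hypothetical placement for illustration only; not taken from the framework.
example = {("Prompt Injection", "Tool Calling"): ["example-metric"]}
grid = build_grid(example)
gaps = [cell for cell in grid if cell.is_gap]
print(len(grid), "cells,", len(gaps), "gaps")
```

Keeping the tag on the risk domain rather than on each cell mirrors the page, where Structural, Operating, and Hybrid are assigned per row.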

Cross-cutting metrics — verified to apply across all 6 action types

Only metrics whose definition in the CHAI T&E framework is mechanism-agnostic are listed here. Metrics specific to multi-agent interactions, a single interface type, or a clinical sub-domain are placed in their respective cells above.
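
The inclusion rule above can be expressed as a simple set check. The sketch below assumes each metric carries a set of the action types it applies to; the function name `is_cross_cutting` and the metric names are hypothetical and serve only to illustrate the rule.

```python
# The six action types named by the framework.
ACTION_TYPES = frozenset({
    "Tool Calling", "Computer Use", "Code Gen & Exec",
    "Data Interaction", "Context & Hand-off", "Task Execution",
})


def is_cross_cutting(applies_to):
    """True only if the metric applies to every one of the six action types."""
    return ACTION_TYPES <= set(applies_to)


# Hypothetical metrics for illustration only; not taken from the framework.
metrics = {
    "metric-a": ACTION_TYPES,                        # mechanism-agnostic
    "metric-b": {"Computer Use"},                    # single interface type
    "metric-c": {"Tool Calling", "Task Execution"},  # partial coverage
}

cross_cutting = sorted(
    name for name, types in metrics.items() if is_cross_cutting(types)
)
print(cross_cutting)
```

Under this rule, a metric that covers only some action types stays in its grid cells rather than moving to the cross-cutting section.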