Methodology for T&E Framework Development

Methodology for Literature Review

1. Objective and Scope

The objective of this literature review is to systematically identify, evaluate, and synthesize peer-reviewed and authoritative grey literature related to ambient artificial intelligence systems in healthcare. For the purposes of this review, ambient AI systems are defined as AI technologies that passively capture, process, and summarize clinical interactions or environmental data in real time or near real time, typically with minimal explicit user input. The review focuses on ambient AI applications such as:

Ambient clinical documentation and scribing
Passive speech recognition and summarization during clinical encounters
Context-aware clinical assistance integrated into workflows

The primary goal is to surface evidence relevant to responsible AI evaluation in real-world clinical environments, with particular emphasis on three dimensions: usefulness, usability, and efficacy; safety and reliability; and fairness and bias management. Systems intended solely as general consumer voice assistants or for non-clinical transcription were excluded unless they offered transferable evaluation insights relevant to clinical ambient AI.

2. Review Design

A structured narrative literature review methodology was employed. This approach was selected because of the applied, workflow-embedded nature of ambient AI systems and the heterogeneity of evaluation methods used in this domain. Unlike traditional clinical decision support tools, ambient AI systems are often evaluated using simulation studies, workflow analyses, clinician satisfaction studies, error audits, and deployment pilots, rather than randomized controlled trials. A narrative synthesis approach enables integration of these diverse evidence types while maintaining a consistent responsible AI evaluation lens. The review was conducted iteratively, allowing refinement of scope as recurring evaluation challenges and deployment risks emerged.

3. Information Sources

Literature was identified through searches across the following sources:

Biomedical and Health Informatics Databases
- PubMed / MEDLINE
- Embase
- Scopus
Interdisciplinary and Technical Databases
- Google Scholar
- IEEE Xplore
Grey Literature and Practice-Oriented Sources
- Health informatics and clinical AI journals (e.g., NEJM AI, JAMIA)
- Conference proceedings and preprints where peer-reviewed evidence was limited
- Health system pilot evaluations and white papers
- Responsible AI and health IT evaluation frameworks

Reference lists of high-impact studies were manually reviewed to identify additional relevant sources.

4. Search Strategy

Search strategies combined controlled vocabulary terms and free-text keywords across three core concept domains:

Ambient and Passive AI Concepts
- “ambient AI,” “ambient clinical documentation,” “digital scribe,” “passive speech recognition,” “clinical summarization”
Clinical Workflow and User Impact
- “clinical workflow,” “physician or clinician documentation burden,” “burnout,” “usability,” “human factors”
Evaluation and Risk Concepts
- “accuracy,” “error rates,” “safety,” “bias,” “fairness,” “reliability,” “workflow disruption”

Search strings were adapted for each database, and results were limited to English-language publications.
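As an illustration only, the three concept domains could be assembled into a single Boolean search string as sketched below. The review does not specify how terms were joined, so the composition used here (OR within a domain, AND across domains) is an assumption, as are the function name `build_query` and the dictionary keys. The source term "physician or clinician documentation burden" is split into two phrases for this sketch. Real database queries would also need database-specific field tags and controlled vocabulary, which are not shown.

```python
# Hypothetical sketch: the keywords come from the three concept domains listed
# above; the OR-within / AND-across composition is an assumption, not the
# review's documented search string.
CONCEPT_DOMAINS = {
    "ambient_and_passive_ai": [
        "ambient AI",
        "ambient clinical documentation",
        "digital scribe",
        "passive speech recognition",
        "clinical summarization",
    ],
    "clinical_workflow_and_user_impact": [
        "clinical workflow",
        # Assumption: the single source phrase is split into two variants.
        "physician documentation burden",
        "clinician documentation burden",
        "burnout",
        "usability",
        "human factors",
    ],
    "evaluation_and_risk": [
        "accuracy",
        "error rates",
        "safety",
        "bias",
        "fairness",
        "reliability",
        "workflow disruption",
    ],
}


def build_query(domains):
    """Join terms with OR inside each domain and AND between domains."""
    groups = []
    for terms in domains.values():
        # Quote each term so multi-word keywords are searched as phrases.
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(CONCEPT_DOMAINS)
print(query)
```

A generic string like this would still have to be adapted per database, as noted above, for example by replacing free-text phrases with that database's controlled vocabulary terms where they exist.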

5. Inclusion and Exclusion Criteria

Inclusion Criteria

Direct relevance to ambient AI systems used in clinical or care-delivery contexts
Empirical evaluations, simulation studies, pilot deployments, or structured audits
Assessment of documentation quality, workflow impact, safety, or bias
Peer-reviewed publications or authoritative grey literature
Preprints or postprints shared via arXiv.org or similar free, open-access repositories, which make research papers available immediately, often before formal peer-reviewed journal publication

Exclusion Criteria

Consumer voice assistants without healthcare relevance
Opinion pieces without analytic or empirical grounding
Studies focused solely on speech recognition benchmarks without clinical context

6. Screening and Selection Process

Titles and abstracts were screened for relevance to passive, workflow-embedded AI systems rather than general NLP or transcription tools. Full-text review was conducted for sources meeting initial inclusion criteria. Priority was given to studies that:

Evaluated real or simulated clinical encounters
Compared ambient AI outputs against clinician-generated documentation
Identified error patterns, omissions, or hallucinations
Examined clinician trust, usability, or workflow impact

Studies were retained even when clinical outcome data were limited, provided they offered meaningful insight into deployment risks or evaluation challenges.

7. Data Extraction and Synthesis

For each included source, the following information was extracted:

Description of the ambient AI system and deployment context
Clinical setting and user population
Evaluation methodology and outcome measures
Reported benefits, limitations, and failure modes
Identified safety, bias, or reliability concerns
Implications for clinician oversight and patient safety

Findings were synthesized thematically and mapped to responsible AI evaluation dimensions relevant to ambient AI systems.
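One way to picture the extraction and mapping step is as a structured record per source, tagged with the evaluation dimensions it informs. The sketch below is hypothetical: the class name `ExtractionRecord`, its field names, and the helper `group_by_dimension` are illustrative assumptions rather than the review's actual tooling, and the two example records are placeholders, not real sources. The dimension labels are taken from the emphasis stated in the Objective and Scope section.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Dimension labels taken from the review's stated emphasis.
DIMENSIONS = (
    "usefulness, usability, and efficacy",
    "safety and reliability",
    "fairness and bias management",
)


@dataclass
class ExtractionRecord:
    """Hypothetical record mirroring the six extraction items listed above."""
    source_id: str  # illustrative identifier, not part of the source list
    system_and_deployment_context: str = ""
    clinical_setting_and_users: str = ""
    evaluation_methodology_and_outcomes: str = ""
    benefits_limitations_failure_modes: str = ""
    safety_bias_reliability_concerns: str = ""
    oversight_and_patient_safety_implications: str = ""
    dimensions: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject tags that are not one of the known evaluation dimensions.
        unknown = [d for d in self.dimensions if d not in DIMENSIONS]
        if unknown:
            raise ValueError(f"Unknown dimension(s): {unknown}")


def group_by_dimension(records: List[ExtractionRecord]) -> Dict[str, List[str]]:
    """Map each evaluation dimension to the source IDs tagged with it."""
    grouped: Dict[str, List[str]] = {d: [] for d in DIMENSIONS}
    for record in records:
        for dimension in record.dimensions:
            grouped[dimension].append(record.source_id)
    return grouped


# Placeholder records for illustration only; they do not describe real studies.
records = [
    ExtractionRecord("source-A", dimensions=[DIMENSIONS[0], DIMENSIONS[1]]),
    ExtractionRecord("source-B", dimensions=[DIMENSIONS[1]]),
]
grouped = group_by_dimension(records)
print(grouped)
```

Keeping the dimensions as a fixed tuple means a mistyped tag fails immediately instead of silently creating a new theme during synthesis.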

8. Quality and Relevance Assessment

Studies were assessed using a pragmatic, relevance-focused framework. Evaluation criteria included:

Realism of clinical workflow representation
Transparency of evaluation methodology
Appropriateness of outcome measures for passive AI systems
Explicit acknowledgment of limitations and risks

Given the workflow-centric nature of ambient AI, clinical realism and usability relevance were weighted alongside technical performance.

9. Limitations

The literature on ambient AI is evolving rapidly, and many evaluations are based on short-term pilots, simulated encounters, or single-site deployments. Proprietary evaluations and internal health system audits are often inaccessible. Additionally, standardized benchmarks for ambient AI safety and effectiveness are still emerging, limiting cross-study comparability.

10. Output and Use

The findings of this literature review are intended to inform:

Selection of responsible AI methods and metrics for ambient AI systems
Identification of workflow, safety, and bias risks
Pre-deployment evaluation and post-deployment monitoring strategies
Cross-use-case comparison with other healthcare AI deployments

Methodology for CHAI Member Submissions

In addition to the literature review of published methods and metrics, the CHAI Program Management team asked members of the Ambient AI Work Group to submit additional methods and metrics. The submission process is described below. Work Group members:

Reviewed the use case charter and a standardized PowerPoint (PPT) template (developed by CHAI Program Management) to understand the scope of the Ambient AI Work Group.
Identified methods and metrics that can be used by Developers and/or Implementers to objectively evaluate AI solutions within this work group.
Populated the PPT template with:
- Methods and metrics currently used within the member organization, and/or
- Relevant methods and metrics identified through published literature, industry guidance, or other credible sources.
Followed the instructions provided within the PPT template for documenting each method and metric, including any supporting details, definitions, benchmarks, or references.
Submitted the completed PPT template for consolidation, generalization, and anonymization into the work group’s T&E Framework.