Methodology for T&E Framework Development

Methodology for Literature Review

1. Objective and Scope

The objective of this literature review is to systematically identify, evaluate, and synthesize peer-reviewed and authoritative grey literature related to the use of artificial intelligence systems in mental health care and mental health–adjacent applications. The review focuses on AI systems used for screening, risk detection, symptom monitoring, decision support, therapeutic assistance, and administrative or documentation support in mental health and wellness application contexts. The primary aim is to surface evidence that informs responsible AI evaluation, with emphasis on safety and reliability, fairness and bias management, usefulness and efficacy, usability, and risks unique to mental health and wellness populations, such as vulnerability, misclassification, over-reliance, and inappropriate automation.

2. Review Design

A structured narrative literature review methodology was employed. This approach was selected due to the heterogeneity of mental health and wellness AI applications, which span clinical decision support, conversational agents, risk prediction tools, and population-level screening systems. Given variability in study designs, outcome measures, and deployment contexts, a narrative synthesis approach enabled integration of empirical findings, implementation reports, and ethical analyses while maintaining a consistent evaluation lens. The review was conducted iteratively, allowing refinement of scope as key risk themes and evaluation gaps emerged.

3. Information Sources

Literature was identified through the following sources:

Biomedical and Mental Health Databases
- PubMed / MEDLINE
- PsycINFO
- Embase
Interdisciplinary and Technical Databases
- Google Scholar
- IEEE Xplore
Grey Literature and Policy Sources
- Professional association reports and guidelines
- Health AI evaluation frameworks and white papers
- Regulatory discussions and consensus statements
- Conference proceedings and preprints where peer-reviewed evidence was limited

Backward citation tracking was used to identify foundational and highly cited studies.

4. Search Strategy

Search strategies combined controlled vocabulary terms and free-text keywords across four core concept areas:

Mental Health Concepts
- “mental health,” “psychiatry,” “depression,” “anxiety,” “suicide risk,” “psychological assessment,” “behavioral health”
Wellness Concepts
- “wellness,” “wellness applications,” “distress,” “anxiety”
Artificial Intelligence Concepts
- “artificial intelligence,” “machine learning,” “natural language processing,” “large language models,” “chatbots,” “clinical decision support”
Evaluation and Risk Concepts
- “safety,” “bias,” “fairness,” “reliability,” “validation,” “harm,” “misclassification,” “human oversight”

Search strings were adapted per database and limited to English-language publications.
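As an illustration of how the concept areas above might be combined, the following minimal sketch joins terms within a concept area with OR and joins concept areas with AND. The helper names (`or_group`, `build_query`) are hypothetical, the Boolean syntax is generic rather than any specific database's grammar, and treating the mental health and wellness terms as a single OR block is an assumption; the review itself does not specify the exact strings used.

```python
# Concept terms are taken from the lists above; everything else is illustrative.
MENTAL_HEALTH = ["mental health", "psychiatry", "depression", "anxiety",
                 "suicide risk", "psychological assessment", "behavioral health"]
WELLNESS = ["wellness", "wellness applications", "distress", "anxiety"]
AI = ["artificial intelligence", "machine learning", "natural language processing",
      "large language models", "chatbots", "clinical decision support"]
EVALUATION_RISK = ["safety", "bias", "fairness", "reliability", "validation",
                   "harm", "misclassification", "human oversight"]


def or_group(terms):
    """Quote each term and join with OR, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(terms))
    return "(" + " OR ".join(f'"{t}"' for t in unique) + ")"


def build_query(groups):
    """Join each OR group with AND so a hit must match every concept area."""
    return " AND ".join(or_group(g) for g in groups)


# Assumption: mental health and wellness terms form one context block.
query = build_query([MENTAL_HEALTH + WELLNESS, AI, EVALUATION_RISK])
print(query)
```

In practice each database needs its own adaptation (for example, mapping free-text terms to controlled vocabulary and adding field tags), so a string like this would serve only as a starting template.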

5. Inclusion and Exclusion Criteria

Inclusion Criteria

Direct relevance to AI systems used in mental health, wellness or behavioral health contexts
Empirical evaluation, validation studies, or real-world deployment analyses
Explicit discussion of performance, limitations, or risks
Relevance to responsible AI dimensions such as safety, fairness, or clinical usefulness
Peer-reviewed publications or authoritative grey literature
Preprints or postprints shared via open-access repositories (e.g., arXiv.org or similar), which allow research papers to be shared immediately, often before formal peer-reviewed journal publication

Exclusion Criteria

Opinion pieces lacking empirical or analytic grounding
Tools without algorithmic inference or decision-support components

6. Screening and Selection Process

Titles and abstracts were screened for relevance to mental health and wellness AI use cases. Full-text review was conducted for sources meeting initial inclusion criteria. Given the sensitive nature of mental health and wellness applications, priority was given to studies that:

Evaluated clinical or behavioral impact rather than technical accuracy alone
Examined error patterns, false positives, or false negatives
Addressed vulnerable populations or high-risk scenarios
Discussed safeguards, escalation pathways, or human-in-the-loop mechanisms

Ambiguous sources were retained when they contributed insight into ethical, safety, or governance considerations.
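One way to make the prioritization step above concrete is to record the four priority criteria as flags and rank eligible sources by how many they satisfy. This is a sketch under stated assumptions: the `ScreenedSource` class, its field names, and the simple count-based ranking are hypothetical, and the review does not state that a numeric score was used. The source titles are placeholders, not real studies.

```python
from dataclasses import dataclass


@dataclass
class ScreenedSource:
    """Hypothetical screening record; flags mirror the four priority criteria."""
    title: str
    meets_inclusion: bool
    evaluates_impact: bool = False             # clinical or behavioral impact
    examines_errors: bool = False              # error patterns, false positives/negatives
    addresses_vulnerable_groups: bool = False  # vulnerable populations, high-risk scenarios
    discusses_safeguards: bool = False         # safeguards, escalation, human-in-the-loop

    def priority(self) -> int:
        """Count how many priority criteria the source satisfies."""
        return sum([self.evaluates_impact, self.examines_errors,
                    self.addresses_vulnerable_groups, self.discusses_safeguards])


def prioritize(sources):
    """Keep sources that meet inclusion criteria, highest priority first."""
    eligible = [s for s in sources if s.meets_inclusion]
    return sorted(eligible, key=lambda s: s.priority(), reverse=True)


sources = [
    ScreenedSource("Source A", meets_inclusion=True, examines_errors=True),
    ScreenedSource("Source B", meets_inclusion=False, evaluates_impact=True,
                   discusses_safeguards=True),
    ScreenedSource("Source C", meets_inclusion=True, evaluates_impact=True,
                   examines_errors=True, addresses_vulnerable_groups=True),
]
ranked = prioritize(sources)
print([s.title for s in ranked])
```

A count treats all four criteria as equally important; a reviewer could instead weight them or handle ambiguous sources through a separate manual flag.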

7. Data Extraction and Synthesis

For each included study, the following information was extracted:

Intended use and deployment context of the AI system
Target population and mental health and/or wellness condition or task
Evaluation design and outcome measures
Reported benefits, risks, and limitations
Identified concerns related to bias, safety, or misuse
Implications for clinical oversight and patient trust

Findings were synthesized thematically and mapped to responsible AI evaluation domains relevant to mental health and/or wellness use cases.
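The extraction fields and thematic mapping described above can be sketched as a simple record type plus a grouping step. The `ExtractionRecord` class, its field names, and the `group_by_domain` helper are hypothetical; the domain labels are taken from the evaluation emphases named in the objective, and the record contents are placeholders rather than extracted data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ExtractionRecord:
    """Hypothetical record with one field per extracted item listed above."""
    source_id: str
    intended_use: str                      # intended use and deployment context
    population_and_task: str               # target population and condition or task
    evaluation_design: str                 # evaluation design and outcome measures
    benefits_risks_limitations: str
    bias_safety_misuse_concerns: str
    oversight_and_trust_implications: str
    domains: list = field(default_factory=list)  # responsible AI domains assigned


def group_by_domain(records):
    """Map each responsible AI domain to the sources tagged with it."""
    themes = defaultdict(list)
    for record in records:
        for domain in record.domains:
            themes[domain].append(record.source_id)
    return dict(themes)


PLACEHOLDER = "(placeholder)"
records = [
    ExtractionRecord("R1", PLACEHOLDER, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER,
                     PLACEHOLDER, PLACEHOLDER,
                     domains=["safety and reliability", "usability"]),
    ExtractionRecord("R2", PLACEHOLDER, PLACEHOLDER, PLACEHOLDER, PLACEHOLDER,
                     PLACEHOLDER, PLACEHOLDER,
                     domains=["safety and reliability"]),
]
themes = group_by_domain(records)
print(themes)
```

Allowing a source to carry several domain tags reflects that a single study may inform more than one evaluation domain.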

8. Quality and Relevance Assessment

Studies were assessed using a pragmatic relevance-based framework rather than a single formal bias checklist. Evaluation criteria included:

Transparency of methodology and data sources
Appropriateness of evaluation metrics for mental health and/or wellness outcomes
Realism of deployment setting and user population
Explicit acknowledgment of risks and failure modes

9. Limitations

The review may underrepresent proprietary systems and unpublished internal evaluations. Additionally, rapid deployment of conversational and generative AI tools in mental health and wellness contexts has outpaced peer-reviewed validation, limiting available long-term outcome data.

10. Output and Use

The results of this literature review are intended to inform:

Selection of responsible AI evaluation metrics for mental health and/or wellness use cases
Identification of high-risk deployment scenarios
Comparison across healthcare AI use cases
Governance, oversight, and safeguard recommendations

This methodology supports transparency, reproducibility, and alignment with responsible AI principles in mental health and/or wellness contexts.

Methodology for CHAI Member Submissions

In addition to the literature review conducted to gather published methods and metrics, the CHAI Program Management team asked members of the Mental Health Chatbot Work Group to contribute additional methods and metrics. The approach is described below. Work Group members:

Reviewed the use case charter and a standardized PowerPoint (PPT) template (developed by CHAI Program Management) to understand the scope of the Mental Health Chatbot Work Group and the genAI-enabled wellness application use case.
Identified methods and metrics that can be used by Developers and/or Implementers to objectively evaluate AI solutions within this work group.
Populated the PPT template with:
- Methods and metrics currently used within the member organization, and/or
- Relevant methods and metrics identified through published literature, industry guidance, or other credible sources.
Followed the instructions provided within the PPT template for documenting each method and metric, including any supporting details, definitions, benchmarks, or references.
Submitted the completed PPT template for consolidation, generalization, and anonymization into the work group’s T&E Framework.