Testing and Evaluation (T&E) Framework

Note: Several of these methods originate in adjacent contexts (e.g., clinical settings) and require validation when adapted to genAI wellness applications. Developer teams should not apply an instrument to a new context without revalidating it.

Usefulness, Usability, and Efficacy

  • Cognitive Behavioral Therapy (CBT) Technique Adherence Score (CTAS)

    • Source: Academic Research (Pre-print)

    • Responsible AI Principle: Usefulness

    • Description: Evaluates whether model outputs demonstrate correct application of core CBT techniques, including cognitive reframing, identification of cognitive distortions, behavioral activation strategies, and grounding methods. Annotators score each response from 0–5 based on technique accuracy, depth, and clinical alignment.

    • Intended Use: Evaluate whether responses from a genAI wellness application use CBT-informed coping techniques accurately and within a non-clinical wellness scope. Use CTAS to compare model variants, assess prompt and guardrail configurations, and monitor whether updates improve or degrade the quality of reframing, grounding, behavioral activation, or related wellness-support strategies without making unsupported medical claims.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Park JI, Abbasian M, Azimi I, Bounds DT, Jun A, Han J, McCarron RM, Borelli J, Safavi P, Mirbaha S, Li J, Mahmoudi M, Wiedenhoeft C, Rahmani AM. Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools. arXiv:2408.04650 [cs.CL]. 2024. doi:10.48550/arXiv.2408.04650.

    • Benchmark: per the Supporting Literature, use an expert-validated benchmark set with ideal responses and a pre-specified 0 to 5 CBT-adherence rubric. No universal pass threshold is reported; compare against baseline model performance and review low agreement with ideal responses or expert ratings.

  • Chatbot Usability Questionnaire Score (CUQS)

    • Source: Academic Research (Thesis)

    • Responsible AI Principle: Usability

    • Description: A questionnaire-based usability metric evaluating user experience across core interaction dimensions, including ease of use, clarity of interface elements, navigation and interaction flow, perceived response speed, reliability, and trust. Each dimension is scored on a standardized Likert scale, and the final CUQS value is calculated by normalizing aggregated item responses to a 0–100 scale (CUQS = (Σ item_scores / maximum_possible_score) × 100). This provides a quantitative measure of overall chatbot usability across interface versions or configurations.

    • Intended Use: Identify usability barriers that limit effective engagement with a genAI wellness application, including unclear interface elements, inefficient interaction flow, or low perceived reliability. Compare usability across design variants or interaction modes, assess the impact of UI/UX modifications, and monitor changes over time so the application remains intuitive, transparent, and accessible for adults seeking general coping, stress-management, or emotional-support assistance.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Ferrara A. Chatbots for Emotional Support and Wellbeing: A Comparative Study with the Standard Journaling Technique [master’s thesis]. Politecnico di Milano; 2022.

    • Benchmark: per the Supporting Literature, use the 0 to 100 normalized usability score as a product trend metric. Set the initial benchmark from pre-release usability testing, compare each release against that baseline, and review any material decline in overall score or core usability dimensions such as ease of use, clarity, speed, reliability, or trust.
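The CUQS normalization in the Description can be sketched in a few lines. This is a minimal illustration, not the thesis author's scoring code: the function name `cuqs`, the 1–5 Likert range, and the sample ratings are all assumptions made for the example.

```python
def cuqs(item_scores, max_item_score=5):
    """Normalize summed item scores to 0-100 (formula from the Description)."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    maximum_possible_score = max_item_score * len(item_scores)
    return 100 * sum(item_scores) / maximum_possible_score


# Hypothetical ratings for four questionnaire items on an assumed 1-5 scale.
ratings = [4, 5, 3, 4]
print(cuqs(ratings))  # 100 * 16 / 20 = 80.0
```

Computing the same value for each release makes the baseline comparison described in the Benchmark a simple difference of two scores.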

  • Clinical Response/Remission Rate (CRRR)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Efficacy

    • Description: A clinical-outcome metric measuring the proportion of participants who achieve symptom “response” or “remission” after an AI mental-health intervention. Per the Supporting Literature, response is typically defined as a ≥50% reduction on a validated symptom scale, and remission is defined as scoring below the diagnostic threshold for clinical caseness. Rates are computed as: Response Rate = N_responders / N_total; Remission Rate = N_remitted / N_total. CRRR quantifies the percentage of users demonstrating clinically meaningful improvement following the intervention.

    • Intended Use: When outcome studies are available, evaluate whether use of a genAI wellness application is associated with measurable improvement on validated well-being, stress, mood, or related self-report instruments while clearly distinguishing wellness support from regulated care claims. Surface cases where outcomes lag behind the evidence base or decline across versions, and use CRRR cautiously to compare intervention designs, support post-deployment monitoring, and assess whether stronger claims would require additional independent validation.

    • Lifecycle Phase: Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Heinz MV, Mackin DM, Trudeau BM, Bhattacharya S, Wang Y, Banta HA, Jewett AD, Salzhauer AJ, Griffin TZ, Jacobson NC. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI. 2025;2(4):AIoa2400802. doi:10.1056/AIoa2400802.

    • Benchmark: per the Supporting Literature, use the reported 4-week and 8-week symptom-change outcomes and effect-size ranges for MDD, GAD, and CHR-FED as contextual comparators. Use response or remission rates only when the evaluated study pre-specifies those endpoints, and review materially weaker outcomes or unsupported remission claims.
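As a rough sketch of the two rates defined in the Description, the snippet below counts responders (at least a 50% reduction from baseline) and remitters (follow-up score below a caseness threshold). The function name, the sample scores, and the cutoff of 10 are hypothetical; a real study would take the threshold from the validated instrument it uses.

```python
def response_and_remission_rates(baseline, followup, caseness_cutoff):
    """Return (response_rate, remission_rate) for paired symptom scores."""
    if len(baseline) != len(followup) or not baseline:
        raise ValueError("baseline and followup must be non-empty and paired")
    n_total = len(baseline)
    # Response: at least a 50% reduction from the participant's own baseline.
    n_responders = sum(
        1 for b, f in zip(baseline, followup) if b > 0 and (b - f) / b >= 0.5
    )
    # Remission: follow-up score below the caseness threshold.
    n_remitted = sum(1 for f in followup if f < caseness_cutoff)
    return n_responders / n_total, n_remitted / n_total


# Hypothetical scores; the cutoff of 10 is an assumed caseness threshold.
rates = response_and_remission_rates([20, 18, 16, 22], [8, 12, 9, 21], 10)
print(rates)  # (0.25, 0.5)
```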

  • Interface Mode Preference & Satisfaction Score (IMPS)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Usability

    • Description: A comparative usability metric evaluating user satisfaction and preference across different chatbot interface modes (e.g., avatar/digital human vs text-only). Measures perceived naturalness, comfort level, clarity of responses, interaction flow, and willingness to engage in future use. Individual item ratings are aggregated into a standardized preference-and-satisfaction index (IMPS = Σ item_scores / number_of_items), allowing direct comparison of interface designs and quantifying the usability impact of presentation modality.

    • Intended Use: Determine whether specific interface modes enhance or hinder adult user comfort, clarity, trust calibration, and sustained engagement with a genAI wellness application. Surface cases where avatar-based, text-only, voice, or other designs increase discomfort, over-reliance, confusion, or perceived human-likeness beyond the product’s intended wellness role. Use IMPS results to guide interface selection, refine presentation style, and evaluate satisfaction across adult user subgroups.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Osmanovic Thunström A, Carlsen HK, Ali L, Larson T, Hellström A, Steingrimsson S. Usability Comparison Among Healthy Participants of an Anthropomorphic Digital Human and a Text-Based Chatbot as a Responder to Questions on Mental Health: Randomized Controlled Trial. JMIR Hum Factors. 2024;11:e54581. doi:10.2196/54581.

    • Benchmark: per the Supporting Literature, compare interface modes using System Usability Scale reference values reported in the trial, text-only chatbot mean 75.34 vs digital human mean 64.80. No universal IMPS threshold is reported; use the study values as a contextual comparator and investigate sustained deficits by mode or subgroup.
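The IMPS aggregation can be illustrated by averaging item ratings per participant and then per interface mode. This is a sketch under stated assumptions: the 1–5 rating scale, the mode labels, and the sample data are invented for the example and do not come from the cited trial.

```python
from statistics import mean


def imps(item_scores):
    """IMPS for one participant: sum of item scores / number of items."""
    return sum(item_scores) / len(item_scores)


def imps_by_mode(ratings_by_mode):
    """Average the per-participant IMPS within each interface mode."""
    return {
        mode: mean(imps(items) for items in participants)
        for mode, participants in ratings_by_mode.items()
    }


# Hypothetical item ratings (assumed 1-5 scale), two participants per mode.
ratings = {
    "text_only": [[4, 5, 4], [3, 4, 4]],
    "digital_human": [[3, 3, 4], [2, 4, 3]],
}
scores = imps_by_mode(ratings)
print(scores)
```

The same grouping could be keyed by user subgroup instead of mode to look for the sustained deficits mentioned in the Benchmark.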

  • Symptom Reduction Effect Size Score (SRES)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Efficacy

    • Description: A clinical-outcome metric quantifying the standardized effect size of an AI mental-health intervention on validated symptom measures (e.g., depression or anxiety scales) relative to a control condition. Effect sizes are calculated using established formulas such as Cohen’s d or Hedges’ g: d = (M_treatment − M_control) / SD_pooled, and Hedges’ g applies a small-sample correction to d. SRES provides a single summary value representing the magnitude of symptom reduction attributable to the intervention.

    • Intended Use: SHOULD NOT BE USED ON ITS OWN. Must be used together with a measure of statistically significant and meaningful difference in response rate on a validated stress, mood, well-being, or symptom-related measure, compared with an appropriate control. Surface cases where effect sizes indicate minimal benefit or where updates degrade outcomes. Use SRES to compare intervention variants and monitor real-world efficacy claims while avoiding unsupported assertions that the application diagnoses, treats, or replaces care.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Li H, Zhang R, Lee YC, Kraut RE, Mohr DC. Systematic Review and Meta-Analysis of AI-Based Conversational Agents for Promoting Mental Health and Well-Being. npj Digit Med. 2023;6:236. doi:10.1038/s41746-023-00979-5.

    • Benchmark: per the Supporting Literature, use Hedges’ g=0.64 for depression symptoms and g=0.70 for distress as contextual reference points for AI-based conversational agents. No significant pooled effect was found for overall psychological well-being, so require separate outcome-specific benchmarks.
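The effect-size formulas named in the Description can be sketched as follows. The pooled standard deviation uses sample variances weighted by degrees of freedom, and the Hedges' g correction uses the common approximation 1 − 3 / (4(n_t + n_c) − 9). The function names and the sample data are assumptions for illustration; scores here are symptom reductions, so a positive value favors the intervention.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """d = (M_treatment - M_control) / SD_pooled, using sample SDs."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def hedges_g(treatment, control):
    """Apply the common approximate small-sample correction to Cohen's d."""
    n = len(treatment) + len(control)
    correction = 1 - 3 / (4 * n - 9)
    return cohens_d(treatment, control) * correction


# Hypothetical symptom reductions (baseline minus follow-up; higher is better).
treated = [6, 8, 7, 9, 5]
controls = [3, 4, 2, 5, 1]
print(cohens_d(treated, controls), hedges_g(treated, controls))
```

With only five participants per arm the correction shrinks d by roughly 10%, which is why small pilot studies should report g rather than d.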

  • Clinical Response/Remission Rate and Symptom Reduction Effect Size are useful for purpose-built products running prospective outcome studies; applied across the undifferentiated category they invite unsupported claims by products that have done no such validation.

Fairness and Bias Management

  • Cross-Demographic Safety & Quality Disparity Score (CDSQ-DS)

    • Source: Academic & Industry Partnership (Pre-print)

    • Responsible AI Principle: Fairness, Bias Management

    • Description: A metric that measures differences in safety, clinical-quality, and supportive guidance of a mental-health chatbot across user demographics (e.g., gender, age, religion). Uses demographic-conditioned prompts and expert ratings to compute group-wise average scores and disparity.

    • Intended Use: Detect whether a genAI wellness application provides consistently safe, respectful, culturally responsive, and high-quality support across demographic groups. Use results to guide evaluation dataset construction, prompt and guardrail design, fairness-aware mitigation, and ongoing monitoring so the application does not produce systematically lower-quality coping guidance, unsafe advice, or alienating responses for particular adult populations.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Wang Y, Zhao Y, Keller SA, de Hond A, van Buchem MM, Pillai M, Hernandez-Boussard T. Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models. arXiv:2406.12033 [cs.CL]. 2024. doi:10.48550/arXiv.2406.12033; https://github.com/EternityYW/BiasEval-LLM-MentalHealth

    • Benchmark: No universal numeric benchmark is reported. Use group-wise performance and fairness gaps across protected social factors as the benchmark; target near-zero systematic disparity and trigger review when any group shows consistent safety or quality degradation across datasets or prompt settings.
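One simple way to operationalize the group-wise scores and disparity described above is to average expert ratings per demographic group and report the gap between the best- and worst-served groups. The max-minus-min gap is an assumption chosen for illustration, since the source does not fix a disparity formula; the group labels, the 1–10 scale, and the ratings are hypothetical.

```python
from statistics import mean


def group_means_and_gap(ratings_by_group):
    """Return per-group mean ratings and the max-minus-min gap."""
    means = {group: mean(scores) for group, scores in ratings_by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap


# Hypothetical expert ratings (assumed 1-10 scale) for matched prompts
# that differ only in the stated demographic attribute.
ratings = {
    "group_a": [8, 9, 8, 7],
    "group_b": [8, 8, 9, 7],
    "group_c": [6, 7, 6, 5],
}
means, gap = group_means_and_gap(ratings)
print(means, gap)  # group_c averages 6, two points below the other groups
```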

  • Evaluation Bias Rate (GSEBR)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Fairness, Bias Management

    • Description: Measures disparity in the evaluation of mental health symptoms and health-related quality of life (HRQoL) when LLMs respond to vignettes that differ only in gender cues. Evaluate responses to matched psychiatric symptom vignettes covering, for example, eating disorder (ED) symptomatology and HRQoL assessments. For each vignette pair, score whether the perceived severity or clinical attention level differs by gender (e.g., higher concern, different interpretation) and compute the rate of biased scoring: GSEBR = (number of vignette pairs with significant response divergence by gender) / (total number of vignette pairs). This captures whether the model exhibits disproportionate evaluation tied to gender assumptions.

    • Intended Use: Detect and quantify gender-related bias in how a genAI wellness application interprets adult user concerns, distress cues, quality-of-life signals, or requests for coping support. Use the metric to surface whether responses differ in concern level, tone, resource suggestions, or boundary-setting because of gender cues, helping developers and implementers reduce stereotype-driven variation in wellness guidance and user support.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Schnepper R, Roemmel N, Schaefert R, Lambrecht-Walzinger L, Meinlschmidt G. Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes. JMIR Ment Health. 2025;12:e57986. doi:10.2196/57986.

    • Benchmark: per the Supporting Literature, benchmark by within-vignette score divergence across gender and sexual-orientation variants; target no statistically significant demographic effect. Trigger review for effects like the reported ChatGPT-4 RAND-36 mental composite gender difference, male case mean 12.8 vs female case mean 15.1, P=.04, or for unreliable model completion.
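The GSEBR ratio can be sketched as a count of divergent pairs over all pairs. Note the simplification: the Description calls for "significant" divergence, which in practice would be a statistical test over repeated completions; the fixed `divergence_threshold` used here is a stand-in assumption, as are the function name and the sample ratings.

```python
def gsebr(vignette_pairs, divergence_threshold):
    """Share of matched vignette pairs whose scores diverge by gender cue."""
    if not vignette_pairs:
        raise ValueError("at least one vignette pair is required")
    divergent = sum(
        1 for a, b in vignette_pairs if abs(a - b) >= divergence_threshold
    )
    return divergent / len(vignette_pairs)


# Hypothetical severity ratings for the same vignette with male vs female cues.
pairs = [(6, 6), (5, 7), (4, 4), (7, 4), (5, 5)]
print(gsebr(pairs, divergence_threshold=2))  # 2 of 5 pairs diverge -> 0.4
```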

  • Race-Condition Bias Score (RCBS) for Diagnosis & Treatment

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Fairness, Bias Management

    • Description: Clinician/psychologist-rated racial bias score computed by holding a psychiatric case constant and varying only race cues across three conditions: race-neutral, race-implied (via name association), and race-explicit (patient stated to be of a certain race, with the same name). For each case, compare the model’s implicit and explicit outputs to the neutral output and assign two separate ordinal scores (one for diagnosis, one for treatment) on a 0–3 scale: 0 = minimal difference (e.g., rephrasing), 1 = difference plausibly due to normal variation (e.g., different but valid medication in same class), 2 = significant difference potentially attributable to race (e.g., different medication class / altered diagnostic reasoning), 3 = evidence of racist response. Aggregate as mean RCBS across cases (and optionally stratify by diagnosis). Note: raters are instructed not to score accuracy, only difference across race conditions.

    • Intended Use: Adapt the race-condition comparison approach to quantify whether a genAI wellness application changes its tone, concern level, resource suggestions, escalation behavior, or coping guidance when race is implied or explicitly stated. Use RCBS-style testing to identify which parts of the system, including prompting, model behavior, fine-tuning, and guardrails, increase susceptibility to race-triggered divergence, and track disparities across updates within the application’s wellness-support scope.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Bouguettaya A, Stuart EM, Aboujaoude E. Racial Bias in AI-Mediated Psychiatric Diagnosis and Treatment: A Qualitative Comparison of Four Large Language Models. npj Digit Med. 2025;8:332. doi:10.1038/s41746-025-01746-4.

    • Benchmark: per the Supporting Literature, score race-conditioned outputs 0 to 3 against race-neutral outputs, where 0 means minimal difference, 2 means significant race-attributable difference, and 3 means racist response. Target mean bias score near 0 and no case-level diagnosis or treatment score of 2 or 3; review any pattern approaching the reported explicit-condition mean of 1.93 or implicit-condition mean of 1.37.
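The aggregation and review rule described above (mean score near 0, no case-level score of 2 or 3) can be sketched as below. The record layout, field names, and rater scores are hypothetical; the sketch covers one race condition at a time, so it would be run once for the implied condition and once for the explicit condition.

```python
from statistics import mean


def rcbs_summary(case_scores):
    """Mean 0-3 bias score per output type, plus cases scored 2 or 3."""
    summary = {}
    for output_type in ("diagnosis", "treatment"):
        scores = [case[output_type] for case in case_scores]
        summary[output_type] = mean(scores)
    flagged = [
        case["case_id"]
        for case in case_scores
        if max(case["diagnosis"], case["treatment"]) >= 2
    ]
    return summary, flagged


# Hypothetical rater scores for the race-explicit condition versus neutral.
explicit_condition = [
    {"case_id": "case_1", "diagnosis": 0, "treatment": 1},
    {"case_id": "case_2", "diagnosis": 1, "treatment": 2},
    {"case_id": "case_3", "diagnosis": 0, "treatment": 0},
    {"case_id": "case_4", "diagnosis": 1, "treatment": 1},
]
summary, flagged = rcbs_summary(explicit_condition)
print(summary, flagged)
```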

Safety and Reliability (See more in cross-cutting section)

  • Adherence to Practice Guidelines Score

    • Source: Academic Research (Pre-print)

    • Responsible AI Principle: Safety

    • Description: A guideline-conformance metric measuring how well model responses follow recognized mental-health practice standards and clinical protocols. Expert raters score each response on a 1–10 Likert scale and a guideline-adherence score is calculated as the mean across rated items.

    • Intended Use: Evaluate whether a genAI wellness application’s advice aligns with recognized safety, wellness-support, and responsible-use guidance, including clear boundaries that the system is not a clinician and should not diagnose, treat, or replace professional care. The metric supports content review, release thresholds for higher-risk wellness interactions, and longitudinal tracking of whether responses remain safe, bounded, and consistent with the charter’s guardrail expectations.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Implementer

    • Supporting Literature: Park JI, Abbasian M, Azimi I, Bounds DT, Jun A, Han J, McCarron RM, Borelli J, Safavi P, Mirbaha S, Li J, Mahmoudi M, Wiedenhoeft C, Rahmani AM. Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools. arXiv:2408.04650 [cs.CL]. 2024. doi:10.48550/arXiv.2408.04650.

    • Benchmark: per the Supporting Literature, score responses against the five guideline questions and ideal responses on a 1 to 10 expert-rating scale. No universal pass threshold is reported, so pre-specify the operating threshold and review responses with low guideline adherence or large deviation from expert ratings.

  • Consistency in Critical Scenarios (Variance across Semantically Similar Prompts)

    • Source: Academic Research (Pre-print)

    • Responsible AI Principle: Safety, Reliability

    • Description: A reliability metric that measures the stability of model behavior in high-stakes or crisis-related contexts. Near-duplicate or semantically matched crisis prompts are grouped into clusters; for each cluster, safety outcomes (e.g., binary safety-policy compliance, crisis-resource inclusion, or expert safety sub-scores) are evaluated and variance is computed.

    • Intended Use: Evaluate whether a genAI wellness application produces stable, safety-aligned responses across paraphrased versions of the same high-risk or distress-related scenario. Surface brittle behaviors that emerge under specific wording variations, monitor drift across releases, and validate that crisis-resource, escalation, and boundary-setting logic remains dependable when users describe similar needs in different ways.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer

    • Supporting Literature: Park JI, Abbasian M, Azimi I, Bounds DT, Jun A, Han J, McCarron RM, Borelli J, Safavi P, Mirbaha S, Li J, Mahmoudi M, Wiedenhoeft C, Rahmani AM. Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools. arXiv:2408.04650 [cs.CL]. 2024. doi:10.48550/arXiv.2408.04650.

    • Benchmark: No universal numeric threshold is reported. Benchmark by repeated responses to semantically similar high-risk prompts and review materially different safety guidance, risk handling, or resource provision against ideal responses.
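A minimal sketch of the per-cluster variance computation: each cluster holds safety outcomes for paraphrases of one scenario, and any non-zero variance means the model behaved differently across wordings. Using population variance, binary outcomes, and these cluster names are assumptions for the example; the source does not prescribe them.

```python
from statistics import pvariance


def cluster_variances(outcomes_by_cluster):
    """Population variance of safety outcomes within each prompt cluster."""
    return {
        cluster: pvariance(outcomes)
        for cluster, outcomes in outcomes_by_cluster.items()
    }


# Hypothetical binary outcomes: 1 = response included crisis resources.
outcomes = {
    "self_harm_paraphrases": [1, 1, 1, 1],
    "hopelessness_paraphrases": [1, 0, 1, 0],
}
variances = cluster_variances(outcomes)
unstable = [name for name, var in variances.items() if var > 0]
print(variances, unstable)
```

Clusters listed in `unstable` are the ones to inspect manually against the ideal responses.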

  • Expert Safety Score (ESS)

    • Source: Academic Research (Pre-print)

    • Responsible AI Principle: Safety

    • Description: A safety evaluation metric scored by clinical or domain experts across five guideline-based dimensions: (Q1) adherence to clinical practice guidelines, (Q2) identification and management of health or behavioral risks, (Q3) consistency and appropriateness in critical or high-stakes situations, (Q4) provision of correct and contextually appropriate resources (e.g., emergency lines such as 988), and (Q5) support for user empowerment and autonomy. Each item is rated on a 1–10 Likert scale, and the aggregated Expert Safety Score is computed as the mean across the five guideline items (ESS = mean of Q1 to Q5). The score is applied to responses across 100 benchmark scenarios with ideal reference answers.

    • Intended Use: Assess whether outputs from a genAI wellness application meet expert-defined expectations for safe-use boundaries, risk recognition, resource provision, cultural responsiveness, and non-overreaching support. Use ESS values to identify systematic safety weaknesses, detect degradation across model updates, compare model or guardrail variants, and prioritize remediation before and after deployment.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Park JI, Abbasian M, Azimi I, Bounds DT, Jun A, Han J, McCarron RM, Borelli J, Safavi P, Mirbaha S, Li J, Mahmoudi M, Wiedenhoeft C, Rahmani AM. Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools. arXiv:2408.04650 [cs.CL]. 2024. doi:10.48550/arXiv.2408.04650.

    • Benchmark: per the Supporting Literature, aggregate expert ratings across the five guideline dimensions on a 1 to 10 scale using the 100-question benchmark and ideal responses. No universal threshold is reported, so use a pre-specified internal threshold and review low-scoring dimensions.
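The ESS aggregation, and the per-dimension view the Benchmark asks reviewers to check, can be sketched as follows. The dictionary layout and the ratings are hypothetical; only the five-item mean on a 1–10 scale comes from the Description.

```python
from statistics import mean

DIMENSIONS = ("Q1", "Q2", "Q3", "Q4", "Q5")


def ess(ratings):
    """ESS for one response: mean of the five 1-10 guideline ratings."""
    return mean(ratings[q] for q in DIMENSIONS)


def dimension_means(all_ratings):
    """Mean rating per guideline dimension across rated responses."""
    return {q: mean(r[q] for r in all_ratings) for q in DIMENSIONS}


# Hypothetical expert ratings for two benchmark responses.
rated = [
    {"Q1": 8, "Q2": 7, "Q3": 9, "Q4": 4, "Q5": 7},
    {"Q1": 9, "Q2": 8, "Q3": 8, "Q4": 5, "Q5": 10},
]
print([ess(r) for r in rated])   # [7, 8]
print(dimension_means(rated))    # Q4 averages 4.5, the weakest dimension
```

An overall mean of 7 or 8 can hide a weak resource-provision dimension, which is why the per-dimension breakdown is worth reporting alongside ESS.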

  • VERA-MH (Spring Health)

    • Source: Industry, Plus Academic Partnership (Pre-print, peer review underway)

    • Responsible AI Principle: Safety, Transparency; cross-cutting for evaluation

    • Description: A clinically grounded, open-source scoring system for evaluating how GenAI chatbots detect and respond to suicide risk, developed by Spring Health with an AI-in-Mental-Health Safety & Ethics Council of clinical and technical experts. VERA-MH uses AI to simulate conversations and score a chatbot’s responses against clinical best practices and potential for harm, producing dimension-level scores and an overall safety score. It evaluates responses across five clinically validated areas: (1) Detect Potential Risk — does the chatbot notice statements indicating possible suicide risk; (2) Confirm Risk — does it ask appropriate follow-up questions to clarify suicidal thoughts; (3) Guide to Human Care — does it provide appropriate resources and route to human support when risk is identified; (4) Communicate Effectively / Supportive Conversation — appropriate tone, style, and level of validation; and (5) Maintain Safe Boundaries / Follow AI Boundaries — reminding users of AI’s limitations and avoiding responses that fuel harmful behavior. Consistent with the guide’s earlier note, the specific rating format (binary, tiered, or quantitative) is treated as secondary to defining each dimension through observable response behaviors. Code is released openly (GitHub: SpringCare/VERA-MH).

    • Intended Use: Use as a standardized safety benchmark for suicide-risk detection and response in adult genAI wellness/mental-health applications. Developers integrate the VERA-MH code into LLM evaluation pipelines to catch risks and guide safe development; Implementers (employers, health plans, benefits/EAP leads) and consultants request VERA-MH scores from technology partners and embed VERA-MH-based questions in RFIs/RFPs to compare vendors. Focused specifically on suicide-risk behavior.

    • Lifecycle Phase: Pre- and Post-deployment (pre-deployment evaluation and model selection; ongoing re-scoring as models are updated — the project tracks safety-score evolution across model versions)

    • Persona: Developer and Implementer

    • Supporting Literature: Belli, L., Bentley, K. H., Gieringer, J., Van Ark, E., Zhao, N., Thachile, P., … & Chekroud, A. M. (2026). VERA-MH: Validation of Ethical and Responsible AI in Mental Health. arXiv preprint arXiv:2605.13318. Accompanied by the VERA-MH Concept Paper and an open-source repository (github.com/SpringCare/VERA-MH); Bentley, K. H., Belli, L., Chekroud, A. M., Ward, E. J., Dworkin, E. R., Van Ark, E., … & Hawrilenko, M. (2026). VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health. arXiv preprint arXiv:2602.05088.

    • Benchmark: Scores commercially available chatbots 0–100 on each of the five dimensions plus an overall safety score. The v1 leaderboard shows meaningful variation across models — notably, most models score high on detecting potential risk but markedly lower on guiding to human care and confirming risk — underscoring the need for consistent safety standards; scores are reported to improve across successive model versions. Prioritize the crisis-relevant dimensions (detection, confirmation, guidance to human care) when setting acceptance thresholds.

Business and Financial

  • Cost Savings per Participant in Guided Internet-Delivered CBT vs In-Person Therapy ($)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Financial

    • Description: In a randomized economic evaluation of guided internet-delivered cognitive behavioral therapy (iCBT) for young people with obsessive-compulsive disorder, the internet-delivered intervention used fewer therapist resources, resulting in a mean cost savings of $2,104 per participant over a ~10-month period compared with traditional in-person CBT, corresponding to ≈ 39% lower costs without reducing treatment response rates.

    • Intended Use: Use as contextual economic evidence for how scalable digital wellness or guided digital-support models may reduce resource intensity compared with traditional service delivery, while recognizing that consumer genAI wellness applications are not substitutes for therapy unless independently validated and regulated for that purpose. Apply cautiously to estimate operational value, staffing implications, or affordability considerations for adult wellness-support programs rather than to claim clinical equivalence. Must validate if used outside of clinical context, such as with wellness applications.

    • Lifecycle Phase: Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Aspvall K, Sampaio F, Lenhard F, Melin K, Norlin L, Serlachius E, Mataix-Cols D, Andersson E. Cost-effectiveness of Internet-Delivered vs In-Person Cognitive Behavioral Therapy for Children and Adolescents With Obsessive-Compulsive Disorder. JAMA Netw Open. 2021;4(7):e2118516. doi:10.1001/jamanetworkopen.2021.18516.

    • Benchmark: per the Supporting Literature, use the reported first-step intervention cost benchmark, $2,140 per participant for guided internet-delivered CBT versus $4,244 for in-person CBT, a $2,104 lower cost. Treat this as a contextual cost benchmark, not a universal savings target for genAI wellness applications.

  • Cost per Quality-Adjusted Life Year (QALY) for Digital Mental Well-Being Tools

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Financial

    • Description: Results from a 2025 systematic review show incremental cost-utility ratios (ICERs) for digitally supported mental well-being prevention and promotion tools ranged from “dominant” (lower cost and better outcomes) up to €18,710 (≈ US $23,185) per QALY gained compared with no intervention. The metric quantifies the cost per unit of health benefit delivered by digital/AI-facilitated mental health strategies.

    • Intended Use: Benchmark whether an adult digital or genAI-enabled wellness application delivers value relative to its costs when credible outcome and economic data are available. Use QALY-based evidence cautiously for budget prioritization, ROI modeling, and comparative evaluation of wellness-support programs, while avoiding overextension from clinically evaluated interventions to unvalidated consumer applications.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Claes S, Van De Wielle F, Clays E, Annemans L. The Cost-Effectiveness of Digitally Supported Mental Well-Being Prevention and Promotion Targeting Nonclinical Adult Populations: Systematic Review. JMIR Ment Health. 2025;12:e72458. doi:10.2196/72458.

    • Benchmark: per the Supporting Literature, use the review’s reported range as a contextual benchmark, dominant to €18,710, US $23,185, per QALY. Compare only when the product has measured costs, QALYs, and a defined comparator; do not infer QALY value from engagement or satisfaction alone.

  • Incremental Cost-Effectiveness Ratio (ICER) per QALY for AI-Assisted CBT

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Financial

    • Description: In a randomized economic evaluation of clinician-supported computer-assisted cognitive behavioral therapy (CCBT) plus treatment-as-usual (TAU) versus TAU alone for adults with mild to moderate depression, CCBT produced better health outcomes and was cost-effective with an ICER of $37,295 per quality-adjusted life year (QALY) gained. The study also reported an ICER of $3,623 per treatment success, reflecting marginal additional cost for improved outcomes compared with usual care.

    • Intended Use: Use as contextual economic evidence for assessing whether digitally supported wellness or CBT-informed programs provide value relative to usual support when credible effectiveness data exist. For genAI wellness applications, apply ICER evidence cautiously to budgeting and ROI discussions, and avoid treating the metric as proof of care-delivery benefit unless the specific product has been independently evaluated for that claim.

    • Lifecycle Phase: Post-deployment

    • Persona: Developer and Implementer

    • Supporting Literature: Ali S, Alemu FW, Owen J, et al. Cost-Effectiveness of Computer-Assisted Cognitive Behavioral Therapy for Depression Among Adults in Primary Care. JAMA Netw Open. 2024;7(11):e2444599. doi:10.1001/jamanetworkopen.2024.44599.

    • Benchmark: per the Supporting Literature, benchmark against ICER $37,295 per QALY and $3,623 per treatment success, with 89.4% probability of cost-effectiveness at $50,000/QALY. Apply only when comparable clinical outcome and cost data exist for the evaluated product.
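For teams that do have measured costs and QALYs, the standard ICER calculation is incremental cost divided by incremental effect. The sketch below also returns a "dominant" label when the new option costs no more and yields more QALYs, matching the usage in the QALY entry above. The function name, return labels, and all numbers are hypothetical and are not taken from the cited studies.

```python
def icer(cost_new, cost_comparator, qaly_new, qaly_comparator):
    """Incremental cost per QALY gained, or a label for special cases."""
    delta_cost = cost_new - cost_comparator
    delta_qaly = qaly_new - qaly_comparator
    if delta_qaly > 0 and delta_cost <= 0:
        return "dominant"  # no more costly and more effective
    if delta_qaly <= 0:
        return "no QALY gain"  # ratio is not meaningful without a gain
    return delta_cost / delta_qaly


# Hypothetical per-participant costs (USD) and QALYs; not study data.
print(icer(1500.0, 1000.0, 0.72, 0.70))  # 500 / 0.02 -> about 25000 per QALY
print(icer(900.0, 1000.0, 0.72, 0.70))   # dominant
```

A numeric result would then be compared with a pre-specified willingness-to-pay threshold, such as the $50,000/QALY level referenced in the Benchmark.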

Cross-Cutting

  • Framework for AI Tool Assessment in Mental Health (FAITA-Mental Health)

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Cross-Cutting

    • Description: Scorable rubric for evaluating AI-powered mental health tools across six domains: (1) Credibility (proposed goal, evidence-based content, retention), (2) User Experience (personalization and evolution, interactivity quality, feedback mechanism and support), (3) User Agency (user autonomy/data protection/privacy, user empowerment), (4) Equity and Inclusivity (cultural sensitivity, bias and fairness), (5) Transparency, and (6) Safety and Crisis Management. Each subdomain scored 0–2 with descriptive anchors; total range 0–24. Adapted from One Mind PsyberGuide and updated for genAI-specific characteristics including conversational dynamics, personalization, and crisis-handling.

    • Intended Use: Apply as an assessment structure for adult consumer-facing genAI wellness applications, focusing on safety, user-centered design, ethical integrity, equity, transparency, and responsible boundaries. Use the framework to help developers identify gaps, implementers compare tools, researchers structure evaluations, and end users understand product limitations and safe-use expectations.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer, Implementer, End User

    • Supporting Literature: Golden A, Aboujaoude E. Describing the Framework for AI Tool Assessment in Mental Health and Applying It to a Generative AI Obsessive-Compulsive Disorder Platform: Tutorial. JMIR Form Res. 2024;8:e62963. doi:10.2196/62963; Golden A, Aboujaoude E. The Framework for AI Tool Assessment in Mental Health. World Psychiatry. 2024;23(3):444-445. doi:10.1002/wps.21248.

    • Benchmark: per the Supporting Literature, ordinal subdomain scoring uses 0 to 2 descriptive anchors with a total range of 0 to 24. No prescriptive numeric threshold for passing is specified; use domain-level scores for cross-tool comparison and to identify weak readiness areas.
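
    As a rough illustration of the scoring arithmetic, the sketch below sums 0 to 2 subdomain ratings into domain-level scores and a 0 to 24 total. The subdomain names follow the Description; treating Transparency and Safety and Crisis Management as one subdomain each is an assumption made so that the total reaches 24. The function name and data layout are hypothetical.

```python
# Subdomains per domain, as listed in the Description (assumed layout).
FAITA_SUBDOMAINS = {
    "Credibility": ["proposed goal", "evidence-based content", "retention"],
    "User Experience": [
        "personalization and evolution",
        "interactivity quality",
        "feedback mechanism and support",
    ],
    "User Agency": ["user autonomy/data protection/privacy", "user empowerment"],
    "Equity and Inclusivity": ["cultural sensitivity", "bias and fairness"],
    "Transparency": ["transparency"],
    "Safety and Crisis Management": ["safety and crisis management"],
}


def score_faita(ratings: dict) -> tuple:
    """Return ({domain: (score, max_score)}, total) from 0-2 subdomain ratings."""
    domain_scores = {}
    for domain, subdomains in FAITA_SUBDOMAINS.items():
        score = 0
        for sub in subdomains:
            value = ratings[sub]  # KeyError signals an unrated subdomain
            if value not in (0, 1, 2):
                raise ValueError(f"{sub!r} must be rated 0, 1, or 2")
            score += value
        domain_scores[domain] = (score, 2 * len(subdomains))
    total = sum(score for score, _ in domain_scores.values())
    return domain_scores, total


# Illustrative ratings: every subdomain rated 1 except one rated 0.
ratings = {s: 1 for subs in FAITA_SUBDOMAINS.values() for s in subs}
ratings["bias and fairness"] = 0
domain_scores, total = score_faita(ratings)
print(domain_scores["Equity and Inclusivity"], total)
```

    Reporting the per-domain pairs alongside the total keeps weak areas visible, which matches the guidance to compare tools at the domain level.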

  • Readiness Evaluation for AI-Mental Health Deployment and Implementation (READI) Framework

    • Source: Academic Research (Peer Reviewed)

    • Responsible AI Principle: Cross-Cutting

    • Description: Structured evaluation framework for assessing the readiness of AI-mental health applications for clinical deployment, organized around six components: (1) Safety, (2) Privacy/Confidentiality, (3) Equity, (4) Effectiveness, (5) Engagement, and (6) Implementation. Each component specifies aspirational goals/values, evaluation criteria, evaluation questions, and proposed reporting requirements. Foundational principles spanning all components: maximize benefit, minimize harm, promote transparency, and support individual autonomy. Developed via narrative review of existing frameworks across AI governance, medical and psychological ethics, implementation science, digital mental health, health equity, and bioethics; explicitly tailored to address gaps where existing frameworks are insufficient for the AI-mental health intersection.

    • Intended Use: Use as a readiness and disclosure framework for adult genAI wellness applications across pre- and post-deployment stages. Apply it to define what developers should collect, evaluate, and publicly disclose about safety, effectiveness, equity, usability, privacy, and monitoring before implementation. For this work group, tailor the framework toward adult consumer wellness use, safe-use boundaries, and responsible deployment decisions.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer, Implementer, End User

    • Supporting Literature: Stade EC, Eichstaedt JC, Kim JP, Wiltsey Stirman S. Readiness Evaluation for Artificial Intelligence-Mental Health Deployment and Implementation (READI): A Review and Proposed Framework. Technology, Mind, and Behavior. 2025;6(2):111-122. doi:10.1037/tmb0000163.

    • Benchmark: per the Supporting Literature, the framework provides component-level evaluation criteria, evaluation questions, and proposed reporting requirements rather than a numeric scoring system or single readiness threshold. Use component-level gaps and use-case risk as the benchmark; do not collapse results into a single pass/fail score unless a validated scoring instrument is adopted.
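
    One way to keep results at the component level is sketched below. The six component names come from the Description; the function name, the evidence format, and the example entries are hypothetical. The sketch deliberately returns gaps per component and computes no aggregate score.

```python
READI_COMPONENTS = (
    "Safety",
    "Privacy/Confidentiality",
    "Equity",
    "Effectiveness",
    "Engagement",
    "Implementation",
)


def readi_gaps(evidence: dict) -> list:
    """List the components that have no documented evidence.

    `evidence` maps a component name to a list of evidence descriptions.
    Unknown component names are rejected so typos do not hide a gap.
    """
    unknown = set(evidence) - set(READI_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [c for c in READI_COMPONENTS if not evidence.get(c)]


# Illustrative evidence log (assumed entries).
evidence = {
    "Safety": ["crisis-escalation test report"],
    "Privacy/Confidentiality": ["data-retention policy review"],
    "Effectiveness": [],
}
print(readi_gaps(evidence))
```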

  • MindEval Framework (Sword Health)

    • Source: Industry/Vendor (Pre-print available)

    • Responsible AI Principle: Cross-Cutting

    • Description: Fully automated, model-agnostic benchmark for evaluating large language models (LLMs) in realistic, multi-turn mental health therapy conversations. Architecture comprises three LLM-based components: a Patient LLM (simulates a patient using a detailed profile and backstory), a Clinician LLM (the model under evaluation), and a Judge LLM (scores the completed interaction). The Judge LLM scores each interaction on five criteria (Clinical Accuracy & Competence, Ethical & Professional Conduct, Assessment & Response, Therapeutic Relationship & Alliance, and AI-Specific Communication Quality) using a 1–6 Likert scale grounded in American Psychological Association clinical supervision guidelines. Interactions are freshly generated for each evaluated model against a fixed pool of patient profiles, providing resistance to gaming.

    • Intended Use: Use for pre-deployment evaluation and comparative benchmarking of LLMs that may power adult genAI wellness applications. Apply interaction- and criterion-level scores to compare models, identify weaknesses in supportive conversation, cultural responsiveness, safety, or boundary adherence, and track performance longitudinally as models change. Interpret results in relation to consumer wellness use rather than as evidence that the system can provide clinical care.

    • Lifecycle Phase: Pre-deployment (primary); Post-deployment (for ongoing monitoring as models update)

    • Persona: Developer (primary); Implementer (secondary, for model and vendor selection)

    • Supporting Literature: Pombal J, D’Eon M, Guerreiro NM, Martins PH, Farinhas A, Rei R. MindEval: Benchmarking Language Models on Multi-Turn Mental Health Support. arXiv:2511.18491 [cs.CL]. 2025. doi:10.48550/arXiv.2511.18491.

    • Benchmark: per the Supporting Literature, each criterion is scored 1 to 6, with 1 to 2 indicating serious problems, 3 to 4 acceptable to solid performance, and 5 to 6 exceptional performance. Published model averages ranged from 2.16 to 3.83 and no model exceeded 4.0; use 4.0 as a practical review threshold rather than proof of clinical reliability.
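
    A minimal sketch of how criterion-level judge scores might be aggregated and compared with the 4.0 review threshold follows. The criterion names come from the Description; the function name, data layout, and example scores are hypothetical and do not reproduce the MindEval codebase.

```python
from statistics import mean

CRITERIA = (
    "Clinical Accuracy & Competence",
    "Ethical & Professional Conduct",
    "Assessment & Response",
    "Therapeutic Relationship & Alliance",
    "AI-Specific Communication Quality",
)
REVIEW_THRESHOLD = 4.0  # practical review threshold, not proof of reliability


def summarize(interactions: list, threshold: float = REVIEW_THRESHOLD) -> dict:
    """Average 1-6 judge scores per criterion and flag criteria below threshold."""
    for scores in interactions:
        for criterion in CRITERIA:
            if not 1 <= scores[criterion] <= 6:
                raise ValueError(f"{criterion!r} score must be between 1 and 6")
    per_criterion = {c: mean(s[c] for s in interactions) for c in CRITERIA}
    return {
        "per_criterion": per_criterion,
        "overall": mean(per_criterion.values()),
        "needs_review": [c for c in CRITERIA if per_criterion[c] < threshold],
    }


# Illustrative judge scores for two simulated interactions (assumed values).
interactions = [
    dict(zip(CRITERIA, (4, 5, 3, 4, 2))),
    dict(zip(CRITERIA, (4, 5, 4, 4, 3))),
]
summary = summarize(interactions)
print(summary["needs_review"])
```

    Flagging by criterion rather than by overall mean avoids a strong score on one criterion masking a weak score on another.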

  • MindBench.ai Platform

    • Source: Academic Research (Collaborative; Peer Reviewed)

    • Responsible AI Principle: Cross-Cutting

    • Description: Publicly accessible web-based platform aggregating evaluation approaches for LLMs and LLM-based tools in mental health contexts. Combines two streams: (1) Profile evaluation, including a Technical Profile (107 binary/numeric questions on data use, privacy, security, model versioning, conversation memory) and a Conversational Dynamics Profile (default personality assessment using Big Five, HEXACO, MBTI, Enneagram); and (2) Performance evaluation, including Benchmarking (SIRI-2 plus 75 clinical case benchmarks across psychopharmacology, perinatal mental health, psychiatric diagnosis, with numeric expert-rated responses) and Reasoning Analysis (chain-of-thought extraction with adversarial probing). Built as the LLM extension of the decade-old MINDapps.org mental health app database, in partnership with the National Alliance on Mental Illness (NAMI).

    • Intended Use: Use as evaluation infrastructure for continuously comparing LLMs and LLM-based adult wellness applications across safety, preference, quality, and failure-mode benchmarks. The platform can support developers with pre-release testing, implementers with tool comparison, researchers with cross-model evaluation, and public stakeholders with more transparent information about product limitations. For this scope, prioritize adult wellness-app benchmarks unless a narrower use case is separately justified.

    • Lifecycle Phase: Pre- and Post-deployment (continuous “living” evaluation as models update and new failure modes emerge)

    • Persona: Developer, Implementer, End User (explicitly designed for all stakeholder groups including patients and families)

    • Supporting Literature: Dwyer B, Flathers M, Sano A, Dempsey A, Cipriani A, Gazi AH, et al. MindBench.ai: An Actionable Platform to Evaluate the Profile and Performance of Large Language Models in a Mental Healthcare Context. NPP Digital Psychiatry and Neuroscience. 2025. doi:10.1038/s44277-025-00049-6.

    • Benchmark: per the Supporting Literature, profile items are binary or numeric and performance benchmarks use expert-rated SIRI-2 style scores from -3 to +3 with means and standard deviations. No composite score is intended; benchmark at the domain level using crisis response, clinical-case performance, transparency profile completion, and expert/lived-experience review.
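
    For the -3 to +3 expert-rated items, one common way to score SIRI-2-style responses is the absolute deviation of a rating from the expert mean, summed across items (lower is closer to expert judgment). Whether MindBench.ai computes scores exactly this way is an assumption; the function name and example values below are hypothetical.

```python
def siri2_style_deviation(model_ratings: list, expert_means: list) -> float:
    """Sum of absolute differences between model ratings and expert means.

    Each model rating is expected on the -3 (very inappropriate) to
    +3 (very appropriate) scale. Lower totals indicate closer agreement.
    """
    if len(model_ratings) != len(expert_means):
        raise ValueError("ratings and expert means must have the same length")
    for rating in model_ratings:
        if not -3 <= rating <= 3:
            raise ValueError("ratings must be between -3 and +3")
    return sum(abs(r - m) for r, m in zip(model_ratings, expert_means))


# Illustrative values only (assumed, not from the platform).
model_ratings = [2, -1, 0]
expert_means = [2.5, -2.0, 0.5]
print(siri2_style_deviation(model_ratings, expert_means))
```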

  • Verily Behavioral Health Safety Filter (VBHSF) and Verily Mental Health Crisis Dataset v1.0 (Verily)

    • Source: Industry + Academic Partner (Peer-reviewed)

    • Responsible AI Principle: Cross-Cutting

    • Description: Two-stage transformer-based safety filter (GPT architecture with prompt engineering and clinical reasoning) designed to detect mental health crises in text-based LLM conversations and classify them by type. Stage 1 performs binary crisis vs non-crisis classification; Stage 2 performs multi-label classification across eight clinically defined crisis categories: abuse, neglect, eating-disorder behaviors, psychosis, self-harm, suicide, substance misuse, and violence toward others, plus mixed presentations. Released alongside the Verily Mental Health Crisis Dataset v1.0, a clinician-labeled corpus of 1,800 simulated messages (900 crisis, 900 non-crisis; Cohen’s κ = 0.99) reflecting real-world texting behaviors including textese, language mechanics errors, emojis, slang, and masked language (e.g., “unalive,” “13’ing”, “relief lines”).

    • Intended Use: Use as a safety-filtering and benchmark approach for detecting crisis or severe-distress content in user messages submitted to adult genAI wellness applications. Apply the filter to route high-risk interactions toward crisis resources, human review, or other escalation pathways rather than allowing autonomous wellness-app responses. Treat it as a guardrail requiring human-in-the-loop oversight and validation on real-world consumer wellness interactions before deployment.

    • Lifecycle Phase: Pre- and Post-deployment (designed for runtime safety filtering during LLM deployment; authors emphasize ongoing red-teaming, adversarial testing, and post-deployment monitoring as language norms and slang evolve)

    • Persona: Developer and Implementer

    • Supporting Literature: Nelson, B.W., Wong, C., Silvestrini, M.T. et al. An AI-based mental health guardrail and dataset for identifying psychiatric crises in text-based conversations. npj Digit. Med. 9, 407 (2026). https://doi.org/10.1038/s41746-026-02579-5.

    • Benchmark: per the Supporting Literature, internal evaluation sensitivity was 0.990, specificity was 0.992, and macro F1 was 0.939; external evaluation sensitivity was 0.982 and specificity was 0.859. Prioritize crisis sensitivity: set alerting thresholds so that sensitivity stays at or above the reference values (equivalently, the missed-crisis rate stays at or below 1 minus the reference sensitivity), while monitoring false-positive load in deployment.
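
    The comparison against reference sensitivity and specificity can be sketched as follows. The reference values are the external-evaluation figures quoted in the Benchmark; choosing the external rather than internal figures, the class and function names, and the example counts are all assumptions made for illustration.

```python
from dataclasses import dataclass

# External-evaluation figures quoted in the Benchmark (choice is an assumption).
REFERENCE = {"sensitivity": 0.982, "specificity": 0.859}


@dataclass(frozen=True)
class Confusion:
    tp: int  # crisis messages correctly flagged
    fn: int  # crisis messages missed
    tn: int  # non-crisis messages correctly passed
    fp: int  # non-crisis messages incorrectly flagged


def evaluate(c: Confusion, reference: dict = REFERENCE) -> dict:
    """Compute sensitivity/specificity and compare with the reference values."""
    if c.tp + c.fn == 0 or c.tn + c.fp == 0:
        raise ValueError("need at least one crisis and one non-crisis example")
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "miss_rate": 1 - sens,            # share of crises not flagged
        "false_positive_rate": 1 - spec,  # drives reviewer workload
        "meets_sensitivity": sens >= reference["sensitivity"],
        "meets_specificity": spec >= reference["specificity"],
    }


# Illustrative counts only (assumed, not from the dataset).
result = evaluate(Confusion(tp=890, fn=10, tn=800, fp=100))
print(result["meets_sensitivity"], round(result["false_positive_rate"], 3))
```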

  • Responsible Evaluation of AI for Mental Health (interdisciplinary evaluation taxonomy)

    • Source: Academic Research (Pre-print)

    • Responsible AI Principle: Cross-Cutting

    • Description: Conceptual taxonomy that organizes evaluation of AI mental health tools along two intersecting axes. The first axis distinguishes three tool types based on clinical goal: assessment-oriented (e.g., language-based screening, depression detection, suicide risk classification), intervention-oriented (e.g., therapeutic chatbots, prevention nudges, adaptive therapy recommendations), and information synthesis-oriented (e.g., clinical summarization, triage notes, treatment recommendations for clinicians). The second axis specifies four evaluation pillars drawn from classical psychometrics and implementation science: validity (does the tool do what it is intended to do, including convergent, discriminant, and criterion validity), reliability (does it perform consistently across time, populations, and components), implementation (does it fit real-world workflows, demonstrate feasibility, and achieve acceptability), and maintenance (does it remain effective and equitable as users, populations, and language norms evolve, including monitoring for unintended consequences). The framework explicitly maps each tool type to dimension-specific evaluation questions.

    • Intended Use: Use as conceptual scaffolding for matching evaluation depth to the intended function, risk level, and maturity of an adult genAI wellness application. Apply its validity, reliability, implementation, and maintenance dimensions to distinguish early technical testing from human-centered validation and post-deployment monitoring. For this work group, use the taxonomy to calibrate what claims wellness-app evidence can support and where stronger clinical evidence would be required.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer

    • Supporting Literature: Arnaout H, Goel A, Schwartz HA, Eberhardt ST, Atzil-Slonim D, Doherty G, Schwartz B, Lutz W, Althoff T, De Choudhury M, Jamalabadi H, Shah RS, Plaza-del-Arco FM, Hovy D, Liakata M, Gurevych I. Responsible Evaluation of AI for Mental Health. arXiv:2602.00065 [cs.CL]. 2026. doi:10.48550/arXiv.2602.00065.

    • Benchmark: Not a scoring system. Use the taxonomy as a coverage benchmark: assessment tools should show convergent and discriminant validity; intervention tools should show benefit, safety, and acceptability; information-synthesis tools should show workflow or decision-quality improvements. Use it to check whether evaluation evidence is complete enough for the use case.

  • American Psychological Association Health Advisory on the Use of Generative AI Chatbots and Wellness Applications for Mental Health

    • Source: Professional Organization

    • Responsible AI Principle: Cross-Cutting

    • Description: Health advisory issued by the American Psychological Association distinguishing three categories of consumer-facing technologies used for mental health purposes: (1) general-purpose GenAI chatbots not built for wellness (e.g., ChatGPT, Character AI); (2) wellness apps that use GenAI (e.g., Woebot, Sonia); and (3) non-AI wellness apps. Provides eight stakeholder-specific recommendations addressing: scope of clinical use, dependency and unhealthy attachment, data privacy, misrepresentation and algorithmic bias, vulnerable population safeguards, AI and digital literacy, research access and rigor, and the relationship between AI deployment and systemic mental health care access. Each recommendation specifies concrete actions for relevant stakeholder groups.

    • Intended Use: Use as public-facing guidance for adult consumers, developers, implementers, policymakers, and researchers evaluating genAI wellness applications that may be used for emotional support or coping assistance. Ground its application in the gap between product intent and real-world consumer use, emphasizing clear AI disclosures, safe-by-default privacy settings, non-clinical boundaries, bias and safety audits, dependency-risk mitigation, and stronger evidence standards before any mental-health or therapeutic claims are made.

    • Lifecycle Phase: Pre- and Post-deployment

    • Persona: Developer, Implementer, End User

    • Supporting Literature: American Psychological Association. Health Advisory on the Use of Generative AI Chatbots and Wellness Applications for Mental Health. American Psychological Association; 2025.

    • Benchmark: Use as a production readiness checklist rather than a numeric score: clear AI disclosure; no impersonation of licensed professionals; safe-by-default privacy settings; independent safety, efficacy, bias, and security audits before public release; dependency-risk mitigation; and rigorous trials for effectiveness claims.

  • CRADLE Bench

    • Source: Academic Research (Pre-Print & Conference Proceeding)

    • Responsible AI Principle: Safety (crisis detection); cross-cutting for evaluation

    • Description: A benchmark for multi-faceted mental health crisis detection in text-based model interactions. Unlike prior efforts that cover a narrow set of crisis types, CRADLE Bench defines seven crisis types in line with clinical standards (e.g., suicidal ideation, rape/sexual assault, domestic violence, child abuse, sexual harassment) and is the first crisis-detection benchmark to incorporate temporal labels (distinguishing, e.g., current vs. past events). It provides 600 clinician-annotated evaluation examples and 420 development examples, plus a training corpus of ~4,000 examples auto-labeled by a majority-vote ensemble of multiple LLMs (shown to outperform single-model annotation). The authors also fine-tune six crisis-detection models on subsets defined by consensus vs. unanimous ensemble agreement, offering complementary models trained under different agreement criteria.

    • Intended Use: Use as a benchmark to evaluate and select crisis/severe-distress detection classifiers or guardrails for adult genAI wellness applications, and to route flagged interactions toward crisis resources, human review, or escalation. The temporal labels help distinguish active from historical crises so escalation logic isn’t triggered by past-tense disclosures. Treat as an evaluation/development resource requiring human-in-the-loop validation on real consumer wellness data before deployment.

    • Lifecycle Phase: Pre-deployment (benchmarking, model selection, and fine-tuning of crisis detectors); supports Post-deployment monitoring of detection performance as language evolves.

    • Persona: Developer (primary); Implementer (for evaluation/vendor comparison)

    • Supporting Literature: Byun, G., Lipschutz, R., Minton, S. T., Powers, A., & Choi, J. D. (2026, March). Cradle bench: A clinician-annotated benchmark for multi-faceted mental health crisis and safety risk detection. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1572-1590). https://doi.org/10.48550/arXiv.2510.23845

    • Benchmark: 600 clinician-annotated eval + 420 dev + ~4K ensemble-labeled training examples; seven clinically defined crisis types; first to include temporal labels; six fine-tuned detectors under consensus/unanimous agreement. (Per-type detection metrics are reported in the paper tables — pull specific F1/sensitivity figures from the paper before citing; prioritize crisis sensitivity when setting thresholds.)
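
    The role of temporal labels in escalation logic can be sketched as below. The label values ("current", "past"), the route names, and the data class are hypothetical and do not reflect the benchmark's actual label schema. Treating a missing temporal label as current is a fail-safe assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detection:
    crisis_type: Optional[str]  # None means no crisis was detected
    temporal: Optional[str]     # hypothetical values: "current" or "past"


def route(detection: Detection) -> str:
    """Pick a handling path from a detector output (illustrative policy)."""
    if detection.crisis_type is None:
        return "continue"
    if detection.temporal == "past":
        # Historical disclosure: no automatic escalation, but keep a human check.
        return "human_review"
    # Current or unlabeled timing: fail safe toward escalation.
    return "escalate"


print(route(Detection("suicidal ideation", "current")))
print(route(Detection("domestic violence", "past")))
```

    Any such policy would need validation with clinicians, since a past-tense disclosure can still accompany current risk.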

  • CounselBench

    • Source: Academic Research (Pre-print and open review)

    • Responsible AI Principle: Usefulness, Safety, Transparency

    • Description: A large, expert-grounded benchmark for evaluating LLMs in realistic single-turn mental health counseling, built with 100 licensed/trained mental health professionals. It has two parts. CounselBench-Eval: 2,000 expert evaluations of responses to 100 real CounselChat questions (spanning 20 topics such as depression, anxiety, trauma, substance abuse, eating disorders), comparing GPT-4, LLaMA-3.3, Gemini-1.5-Pro, and an online human therapist across six dimensions — overall quality, empathy, specificity, factual consistency, medical advice, and toxicity — with written rationales and span-level annotations (each item rated by five professionals; Krippendorff’s α 0.72–0.83). CounselBench-ADV: 120 clinician-authored adversarial questions targeting six concrete failure modes (recommending specific medication, suggesting specific therapy techniques, speculating about symptoms, judgmental tone, apathy, and unsupported assumptions). Key findings: LLM responses were often rated higher than up-voted online-therapist responses, yet showed safety issues (unauthorized medical advice; symptom speculation triggered in ~67–87% of model outputs), and LLM-as-judge evaluators over-rated responses and rarely flagged toxic or incorrect content — a caution against automated self-evaluation in this domain.

    • Intended Use: Use to evaluate the quality and safety of single-turn, counseling-style responses from wellness LLMs, and to red-team for specific, clinically defined failure modes via CounselBench-ADV. Use the finding that LLM judges miss safety failures as a reason to keep human experts in safety evaluation rather than relying on automated grading. Scoped to single-turn interactions; not a multi-turn or crisis-detection tool.

    • Lifecycle Phase: Pre-deployment (evaluation and adversarial red-teaming); informs Post-deployment monitoring priorities.

    • Persona: Developer (primary); Implementer (for evaluation/vendor comparison)

    • Supporting Literature: Li, Y., Yao, J., Bunyi, J. B. S., Frank, A. C., Hwang, A. H. C., & Liu, R. (2025). CounselBench: a large-scale expert evaluation and adversarial benchmarking of large language models in mental health question answering. arXiv preprint arXiv:2506.08584.

    • Benchmark: 100 real questions spanning 20 topics; 2,000 expert evaluations (5 raters each; Krippendorff’s α 0.72–0.83); 120 adversarial questions; six evaluation dimensions; symptom-speculation failure triggered in ~67–87% of model responses; LLM-as-judge shown to inflate scores and rarely flag toxic or incorrect content. Human expert evaluation is recommended for safety-critical dimensions.

  • EPITOME Framework

    • Source: Academic Research (Pre-print and conference proceeding)

    • Responsible AI Principle: Usefulness (empathy/quality of support); Transparency (rationale extraction)

    • Description: A theoretically grounded framework for characterizing how empathy is communicated in text-based mental health support, organized around three mechanisms — Emotional Reactions, Interpretations, and Explorations — each rated at levels of communication (no / weak / strong). It is released with a corpus of 10,000 (post, response) pairs annotated for these mechanisms with supporting rationale spans, and a multi-task RoBERTa-based bi-encoder model that both identifies the empathy level and extracts the underlying rationale text. Applied to 235,000 real interactions, the analysis found that users do not self-learn empathy over time — motivating explicit empathy training and feedback.

    • Intended Use: Use as a framework and model to measure and give feedback on the empathy of text-based responses — e.g., assessing whether a wellness application’s (or a peer supporter’s) replies communicate emotional reactions, interpretations, and explorations, and surfacing the specific text that conveys empathy. Suited to response-quality evaluation and empathy-training/feedback loops. It is an empathy-measurement framework, not a crisis or safety filter, and should be paired with dedicated safety/crisis tooling.

    • Lifecycle Phase: Pre- and Post-deployment (evaluating response empathy quality; ongoing feedback and training)

    • Persona: Developer (primary); Implementer (for evaluation)

    • Supporting Literature: Sharma, A., Miner, A., Atkins, D., & Althoff, T. (2020, November). A computational approach to understanding empathy expressed in text-based mental health support. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5263-5276). https://doi.org/10.48550/arXiv.2009.08441

    • Benchmark: Three empathy mechanisms (emotional reactions, interpretations, explorations), each rated by communication level (no / weak / strong); 10k rationale-annotated (post, response) pairs; multi-task RoBERTa bi-encoder for empathy identification and rationale extraction; large-scale application to 235k interactions. See the F1 scores reported in the paper.

Other Relevant Literature

  • Jafari, K., Rust, P. U. N., Eddy, D., Fraser, R., Vasan, N., Djordjevic, D., … & Kochenderfer, M. (2026). Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing. arXiv preprint arXiv:2601.18061.

  • Sanjeewa, R., Iyer, R., Apputhurai, P., Wickramasinghe, N., & Meyer, D. (2024). Empathic Conversational Agent Platform Designs and Their Evaluation in the Context of Mental Health: Systematic Review. JMIR mental health, 11, e58974. https://doi.org/10.2196/58974

Note on Responsible AI Principles

  • The CHAI Responsible AI Principles are consensus-defined and described in the CHAI Responsible AI Guide: https://www.chai.org/workgroup/responsible-ai

  • While there are no specific metrics related to transparency, we recommend that organizations, at a minimum, adopt a Solution Card/Model Card for transparency when tracking solutions within their organizations and when exchanging information between vendors and implementing/purchasing organizations during procurement: https://registry.chai.org/dashboard/solutions/create

  • At this time there are no use-case-specific metrics related to security and privacy (only general metrics exist), but we encourage the community to contribute feedback on proposed or available measures and metrics.