This page was generated from clinical-decision-support/evaluation-notebooks/Evidence_Concordance_Eval.ipynb. Download notebook.

2. Evidence (Guideline) Concordance — Evaluation Notebook 

Purpose: Evaluate evidence (guideline) concordance by comparing model recommendations against a versioned, case‑specific guideline answer key using exact‑ID and semantic matching, with transparent audit trails and summary metrics.

Date: 10/16/2025

Author: Ishaan Kejriwal

2.1. Overview 

This notebook implements an Evidence (Guideline) Concordance evaluation for Clinical Decision Support (CDS) systems.

It follows the same structured auto-eval pattern used in other CHAI metric notebooks:

Load a gold evidence key of guideline-aligned recommendations for each test case.
Load model outputs containing the system’s predicted recommendations and any cited guideline IDs.
Load a concept map linking aliases or synonyms (e.g., “ACE inhibitor”) to canonical guideline IDs.
Perform exact-ID and semantic matching between model recommendations and guideline standards.
Assign a categorical label (Exact, Semantic, For-consideration, or Not) and compute scores (binary, three-bucket, or weighted).
Summarize results as overall concordance rates and mean scores, generate basic visualizations, and export JSON/CSV audit tables for transparent review.

2.2. Metric Definition 

Metric: Evidence (Guideline) Concordance

Objective: Evaluate how well CDS or AI-generated recommendations align with established, evidence-based clinical guidelines.

What we check:

Whether the model’s recommendation matches a reference guideline ID or approved equivalent.
Exact-ID matches (model cites the correct guideline or equivalent).
Semantic matches (model uses a correct recommendation phrase or synonym mapped to the guideline).
For-consideration cases (related but less precise or lower-evidence recommendations).
Detection of non-concordant outputs that contradict or omit the guideline.

Recommended summary KPIs:

% cases concordant (Exact + Semantic)
Mean weighted concordance score
Distribution of label categories (Exact / Semantic / For-consideration / Not)
(Optional) Per-guideline or per-condition concordance rates

Suggested pass criteria (tunable):

≥ 90% of cases show Exact or Semantic concordance
Mean weighted concordance score ≥ 0.85
Minimal (<5%) Not-concordant outputs for critical guidelines

2.3. Setup & Imports 

[10]:

# Core
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple, Any
import ast
import json
import os
import math

# Data
import pandas as pd
import numpy as np

# Viz
import matplotlib.pyplot as plt

# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

2.4. Configuration & Inputs 

This section defines the file paths and core parameters used for the evaluation.

Users can point the notebook to their own datasets by updating the CSV paths below:

Evidence key → the reference “gold” guidelines for each case.
Concept map → alias-to-guideline mapping (for semantic matching).
Model outputs → system-generated recommendations with guideline references or text.

You can also configure scoring behavior:

TOP_K controls how many top recommendations per case are evaluated.
SCHEME selects the scoring method (binary, three_bucket, or weighted).
WEIGHTS adjusts the contribution of each label type to the final concordance score.

[ ]:

# Input paths (edit as needed)
DATA_DIR = "."  # If running in same folder as the CSVs; change to an absolute path if needed
EVIDENCE_KEY_CSV = os.path.join(DATA_DIR, "csvs/concept_map.csv")
CONCEPT_MAP_CSV = os.path.join(DATA_DIR, "csvs/concept_map.csv")
MODEL_OUTPUTS_CSV = os.path.join(DATA_DIR, "csvs/model_outputs.csv")

# Scoring options
TOP_K = 3  # consider top-K model recommendations per case
SCHEME = "weighted"  # one of: "binary", "three_bucket", "weighted"

# Label weights (used if SCHEME='weighted')
WEIGHTS = {"Exact": 1.0, "Semantic": 0.75, "For-consideration": 0.5, "Not": 0.0}

# For three     -bucket scoring, anything in {"Exact","Semantic"} -> Recommended,
# "For-consideration" stays as is, "Not" is not-recommended.

2.5. Data Schema & Loading 

This section loads the three core input files and prepares them for evaluation.

Each file has a defined schema to ensure consistent comparisons:

``evidence_key.csv`` → reference guideline data for each test case (ID, text, version, allowed equivalents, contraindications, evidence strength).
``concept_map.csv`` → mapping of aliases or synonyms to canonical guideline IDs for semantic matching.
``model_outputs.csv`` → model-predicted recommendations, each with rank, confidence, and optional guideline references.

The notebook reads each CSV into a DataFrame, converts list-like strings (e.g., ["A","B"]) into actual Python lists, and displays sample rows to confirm correct structure before scoring begins.

[13]:

# Expected columns:
# evidence_key.csv:
#   case_id, guideline_id, guideline_text, version, allowed_equivalents, contraindications, strength
# concept_map.csv:
#   alias, canonical_id
# model_outputs.csv:
#   case_id, rank, pred_id, pred_text, references, confidence

def _load_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df

evidence_key = _load_csv(EVIDENCE_KEY_CSV)
concept_map = _load_csv(CONCEPT_MAP_CSV)
model_outputs = _load_csv(MODEL_OUTPUTS_CSV)

# Parse list-like strings into Python lists (if present)
def _maybe_parse_list_like(x):
    if isinstance(x, str) and x.strip().startswith("[") and x.strip().endswith("]"):
        try:
            return ast.literal_eval(x)
        except Exception:
            return []
    return x

for col in ["allowed_equivalents", "contraindications"]:
    if col in evidence_key.columns:
        evidence_key[col] = evidence_key[col].apply(_maybe_parse_list_like)

if "references" in model_outputs.columns:
    model_outputs["references"] = model_outputs["references"].apply(_maybe_parse_list_like)

display(evidence_key.head(3))
display(concept_map.head(3))
display(model_outputs.head(3))

	case_id	guideline_id	guideline_text	version	allowed_equivalents	contraindications	strength
0	C001	ADA_2024_9_4	Start metformin for T2D unless eGFR<30	ADA_2024	[]	[]	Strong
1	C002	NCCN_BR_2025_A1	Avoid NSAIDs in stage 4-5 CKD	NCCN_2025	[]	[]	Strong
2	C003	ACC_AHA_HTN_2022_ACE1	Use ACE inhibitor for hypertension	ACC_AHA_2022	[]	[]	Strong

	alias	canonical_id
0	metformin	ADA_2024_9_4
1	biguanide	BIGUANIDE_CLASS
2	ace inhibitor	ACC_AHA_HTN_2022_ACE1

	case_id	rank	pred_id	pred_text	references	confidence
0	C001	1	Pred_Met_01	Start metformin	[]	0.82
1	C001	2	Pred_SGLT2_01	Consider SGLT2 inhibitor	[]	0.41
2	C002	1	Pred_NSAID_01	Avoid NSAIDs	[]	0.70

2.6. Normalization & Vocabulary 

This step standardizes text and builds a lightweight vocabulary map for semantic matching.
Aliases (e.g., “ACE inhibitor”, “metformin”) are linked to their canonical guideline IDs using the concept_map.csv file.
All text is lowercased and stripped of whitespace for consistent comparison.
The helper function map_to_canonical_id() performs a simple substring match to connect model recommendation text to its corresponding guideline ID — semantic alignment layer before scoring

[4]:

# Build a quick alias -> canonical_id map
ALIAS2CANON = dict(zip(concept_map["alias"].str.lower().str.strip(),
                       concept_map["canonical_id"].str.strip()))

def normalize_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    return text.lower().strip()

def map_to_canonical_id(text: str) -> Optional[str]:
    t = normalize_text(text)
    # naive contains match; for production, prefer a tokenizer + ontology
    for alias, can_id in ALIAS2CANON.items():
        if alias in t:
            return can_id
    return None

2.7. Matching Logic 

This section determines how a model recommendation is judged against the guideline key.

We evaluate each case’s top-K predictions in a fixed priority order:

Exact: the model cites a guideline ID that appears in the allowed set (primary ID or approved equivalents).
Semantic: the model’s text maps (via the alias→canonical vocabulary) to a guideline in the allowed set.
For-consideration: the text maps to a related canonical ID (e.g., class-level or nearby alias) but not listed as primary/allowed.
Not: no acceptable match (or disqualifiers would apply if patient gates are enabled).

The best label found among the top-K is used to compute a case score according to the selected scheme

(binary, three_bucket, or weighted). The result is a per-case record with the chosen label, score, and match metadata.

[14]:

MATCH_ORDER = ["Exact", "Semantic", "For-consideration", "Not"]

def _allowed_ids(row: pd.Series) -> set:
    ids = set([row["guideline_id"]])
    eqv = row.get("allowed_equivalents", []) or []
    for e in eqv:
        if isinstance(e, str):
            ids.add(e)
    return ids

def evaluate_one(case_row: pd.Series, pred_row: pd.Series) -> Tuple[str, Dict[str, Any]]:
    # 0) Disqualifiers demo (stub): in a real setup, you'd check patient-level features
    # For this template, we don't have patient attributes; keep placeholder logic.
    # If you add patient features (e.g., eGFR), evaluate here and return ("Not", {"reason": "..."})

    # 1) Exact-ID match: any overlap in references and allowed IDs
    ids_allowed = _allowed_ids(case_row)
    refs = set(pred_row.get("references", []) or [])
    if ids_allowed & refs:
        return "Exact", {"matched_ids": list(ids_allowed & refs)}

    # 2) Semantic concept match via alias->canonical mapping
    can_from_text = map_to_canonical_id(pred_row.get("pred_text", ""))
    if can_from_text and can_from_text in ids_allowed:
        return "Semantic", {"canonical_id": can_from_text}

    # 3) For-consideration (optional heuristic): same class or nearby alias not listed as primary
    if can_from_text:
        return "For-consideration", {"canonical_id": can_from_text}

    return "Not", {}

def score_case(case_id: str, top_k: int = 3) -> Dict[str, Any]:
    # Retrieve the reference (gold) row for the given case
    case = evidence_key.loc[evidence_key["case_id"] == case_id]
    if case.empty:
        return {"case_id": case_id, "label": "Not", "score": 0.0, "meta": {"error": "missing case"}}
    case_row = case.iloc[0]

    # Retrieve all model predictions for this case
    preds = model_outputs[model_outputs["case_id"] == case_id].copy()
    if preds.empty:
        return {"case_id": case_id, "label": "Not", "score": 0.0, "meta": {"error": "no predictions"}}

    # Sort predictions by rank (ascending) and confidence (descending)
    preds.sort_values(["rank", "confidence"], ascending=[True, False], inplace=True)
    preds = preds.head(top_k)  # keep only top-k predictions for scoring

    # Initialize best match tracking
    best_label, best_meta = "Not", {}
    for _, r in preds.iterrows():
        label, meta = evaluate_one(case_row, r)
        # Keep the highest-priority match according to MATCH_ORDER
        if MATCH_ORDER.index(label) < MATCH_ORDER.index(best_label):
            best_label, best_meta = label, meta

    # Compute numeric score based on the evaluation scheme
    if SCHEME == "binary":
        score = 1.0 if best_label in {"Exact", "Semantic"} else 0.0
    elif SCHEME == "three_bucket":
        # No numeric needed; but return a convenience score
        score = 1.0 if best_label in {"Exact", "Semantic"} else (0.5 if best_label == "For-consideration" else 0.0)
    else:  # weighted
        score = WEIGHTS.get(best_label, 0.0)

    # Return structured evaluation result for the given case

    return {"case_id": case_id, "label": best_label, "score": score, "meta": best_meta}

# Run scoring over all cases in the evidence key
results = pd.DataFrame([score_case(cid, top_k=TOP_K) for cid in evidence_key["case_id"].tolist()])
results

[14]:

	case_id	label	score	meta
0	C001	Semantic	0.75	{'canonical_id': 'ADA_2024_9_4'}
1	C002	Semantic	0.75	{'canonical_id': 'NCCN_BR_2025_A1'}
2	C003	Semantic	0.75	{'canonical_id': 'ACC_AHA_HTN_2022_ACE1'}

2.8. Scoring & Aggregation 

Here we convert per-case labels into headline metrics.

Binary concordance counts Exact and Semantic as concordant and reports the overall percentage.
Weighted score averages label weights (configurable via WEIGHTS) to provide a single summary index.
We also include a label distribution to show how often each category occurs.

The summarize() function aggregates these values across all cases, returning a compact table with:

total n_cases,
% cases concordant (Exact + Semantic),
mean_weighted_score, and
a JSON-style label_breakdown for quick QA and reporting.

[15]:

def summarize(results_df: pd.DataFrame) -> pd.DataFrame:
    n = len(results_df)
    pct_concordant_binary = (results_df["label"].isin(["Exact", "Semantic"]).mean() * 100.0) if n > 0 else 0.0
    mean_weighted = results_df["score"].mean() if n > 0 else 0.0
    breakdown = results_df["label"].value_counts(normalize=True).to_dict() if n > 0 else {}
    return pd.DataFrame({
        "n_cases":[n],
        "pct_concordant_binary":[pct_concordant_binary],
        "mean_weighted_score":[mean_weighted],
        "label_breakdown":[breakdown]
    })

summary = summarize(results)
summary

[15]:

	n_cases	pct_concordant_binary	mean_weighted_score	label_breakdown
0	3	100.0	0.75	{'Semantic': 1.0}

2.9. Reporting & Visuals 

This section provides quick visuals for QA and stakeholder review:

A bar chart of the label distribution (Exact / Semantic / For-consideration / Not) to spot mismatch patterns.
A per-case score plot to see spread, outliers, and any clustering by case index.

Use these to verify that most cases fall into concordant categories, and to flag cases with low scores for audit.

(For deeper analysis, consider stratifying by guideline version, condition, or site and adding confidence intervals.)

[16]:

# Bar chart: label distribution
label_counts = results["label"].value_counts().sort_index()
plt.figure()
label_counts.plot(kind="bar")
plt.title("Evidence Concordance — Label Distribution")
plt.xlabel("Label")
plt.ylabel("Count")
plt.show()

# Scatter: score by case
plt.figure()
plt.scatter(range(len(results)), results["score"])
plt.title("Per-Case Evidence Concordance Score")
plt.xlabel("Case Index")
plt.ylabel("Score")
plt.show()

../../_images/clinical-decision-support_evaluation-notebooks_Evidence_Concordance_Eval_17_0.png

../../_images/clinical-decision-support_evaluation-notebooks_Evidence_Concordance_Eval_17_1.png

2.10. Audit Tables & Exports 

This section produces a transparent audit view and saves machine-readable outputs:

Audit table: joins each case’s guideline ID/text/version with the model’s top recommendation and any cited guideline references, plus the final label and score. Use this to spot false positives/negatives and verify mapping logic.
Exports: writes a compact JSON summary and two CSVs (results and audit) into outputs/ for downstream dashboards or CI checks.

Files written:

outputs/evidence_concordance_summary.json — headline KPIs (n_cases, % concordant, mean score).
outputs/evidence_concordance_results.csv — per-case labels and scores.
outputs/evidence_concordance_audit.csv — side-by-side guideline vs. model details for human review.

[17]:

# Join back key columns for a readable audit
audit = results.merge(evidence_key[["case_id","guideline_id","guideline_text","version","strength"]],
                      on="case_id", how="left")

# Show side-by-side: top prediction text/refs for transparency
def pick_top_pred_text(cid: str) -> Dict[str, Any]:
    preds = model_outputs[model_outputs["case_id"] == cid].copy()
    if preds.empty:
        return {"top_pred_text": None, "top_refs": []}
    preds.sort_values(["rank", "confidence"], ascending=[True, False], inplace=True)
    r = preds.iloc[0]
    return {"top_pred_text": r.get("pred_text"), "top_refs": r.get("references", [])}

tops = [pick_top_pred_text(cid) for cid in audit["case_id"]]
audit["top_pred_text"] = [t["top_pred_text"] for t in tops]
audit["top_refs"] = [t["top_refs"] for t in tops]

display(audit)

# Export JSON + CSV
export_dir = "outputs"
os.makedirs(export_dir, exist_ok=True)

summary_path = os.path.join(export_dir, "evidence_concordance_summary.json")
results_path = os.path.join(export_dir, "evidence_concordance_results.csv")
audit_path   = os.path.join(export_dir, "evidence_concordance_audit.csv")

with open(summary_path, "w", encoding="utf-8") as f:
    json.dump(json.loads(summary.to_json(orient="records")), f, indent=2)

results.to_csv(results_path, index=False, encoding="utf-8")
audit.to_csv(audit_path, index=False, encoding="utf-8")

print("Wrote:")
print(" -", summary_path)
print(" -", results_path)
print(" -", audit_path)

	case_id	label	score	meta	guideline_id	guideline_text	version	strength	top_pred_text	top_refs
0	C001	Semantic	0.75	{'canonical_id': 'ADA_2024_9_4'}	ADA_2024_9_4	Start metformin for T2D unless eGFR<30	ADA_2024	Strong	Start metformin	[]
1	C002	Semantic	0.75	{'canonical_id': 'NCCN_BR_2025_A1'}	NCCN_BR_2025_A1	Avoid NSAIDs in stage 4-5 CKD	NCCN_2025	Strong	Avoid NSAIDs	[]
2	C003	Semantic	0.75	{'canonical_id': 'ACC_AHA_HTN_2022_ACE1'}	ACC_AHA_HTN_2022_ACE1	Use ACE inhibitor for hypertension	ACC_AHA_2022	Strong	Start ACE inhibitor	[]

Wrote:
 - outputs\evidence_concordance_summary.json
 - outputs\evidence_concordance_results.csv
 - outputs\evidence_concordance_audit.csv

2.11. Appendix: Helper Utilities 

This section serves as a lightweight sandbox for optional or future add-ons.

Users can place reusable helper functions here—for example:

Patient-feature gates (e.g., skip metformin if eGFR < 30).
Version-drift policies (treat 2023 vs 2025 guideline IDs as “For-consideration”).
Inter-rater checks (e.g., Cohen’s κ for manual labels).
Conversion helpers (e.g., map three-bucket labels → binary).

Keeping this section isolated ensures the core evaluation pipeline stays clean, while still leaving space for modular extensions later.

[18]:

# Placeholder for future helpers, e.g., patient-feature gating, version drift policies,
# or inter-rater agreement scaffolding.
#
# Example: simple policy to map three-bucket labels to a binary for headline reporting
def to_binary(label: str) -> int:
    return 1 if label in {"Exact","Semantic"} else 0