Rom89823974978 committed on
Commit
e8c3964
·
1 Parent(s): bdb49ae

Added dashboard and experiments

README.md CHANGED
@@ -1,5 +1,168 @@
1
- # RAG Evaluation Framework for Regulated Domains - Master's Thesis
 
2
 
3
- This repository contains a modular implementation of an evaluation framework for Retrieval‑Augmented Generation (RAG) systems.
4
 
5
- See `evaluation/` for library code and `tests/` for smoke tests.
 
7
+ # Retrieval-Augmented Generation Evaluation Framework
8
+ *(Legal & Financial domains, with full regulatory-grade metrics)*
9
+
10
+ > **Project context** – This code implements the software artefacts promised in the research proposal
11
+ > “**Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains**.”
12
+ > Each folder corresponds to a work-package from the proposal: retrieval pipelines, the metric library,
13
+ > robustness & statistical analysis, plus automation for Docker / CI.
14
+
15
+ ---
16
+
17
+ ## 1. Quick start
18
+
19
+ ```bash
20
+ # Clone and bootstrap
21
+ git clone https://github.com/<your-org>/rag-eval-framework.git
22
+ cd rag-eval-framework
23
+ python -m venv .venv && source .venv/bin/activate
24
+ pip install -r requirements.txt
25
+ pre-commit install # optional: local lint hooks
26
+
27
+ # Download / prepare a small corpus (makes ~200 docs)
28
+ bash scripts/download_data.sh
29
+
30
+ # Build sparse & dense indexes automatically on first run
31
+ python scripts/run_experiments.py \
32
+ --config configs/pipeline_hybrid_ce.yaml \
33
+ --queries data/sample_queries.jsonl
34
+ ```
35
+
36
+ The first invocation embeds the documents and builds both a **FAISS** dense index and a **Pyserini** (Lucene) sparse index. Subsequent runs reuse them.
37
+
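+ The `--queries` file is plain JSONL: one object per line with at least a `question` field, and any extra fields (such as the optional gold labels described in Section 6) are carried through into the results. A minimal sketch of how you might write your own query file (the path and questions below are placeholders):
+
+ ```python
+ # sketch: create a tiny JSONL query file; adjust the path and fields to your corpus
+ import json
+ from pathlib import Path
+
+ queries = [
+     {"question": "Which disclosures does the prospectus regulation require?", "human_correct": True},
+     {"question": "Summarise the termination clause of the sample loan agreement."},
+ ]
+
+ out = Path("data/my_queries.jsonl")  # placeholder path
+ out.parent.mkdir(parents=True, exist_ok=True)
+ with out.open("w") as f:
+     for q in queries:
+         f.write(json.dumps(q) + "\n")
+ ```
+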
38
+ ---
39
+
40
+ ## 2. Repository layout
41
+
42
+ ```
43
+ evaluation/ ← ⚙️ Core library
44
+ ├── config.py ⇒ Typed dataclasses (retriever, generator, stats, reranker)
45
+ ├── pipeline.py ⇒ Orchestrates retrieval → (optional) re-ranking → generation
46
+ │   └── … logs every stage to dict → downstream eval
47
+ ├── retrievers/ ⇒ BM25, Dense (Sentence-Transformers + FAISS), Hybrid
48
+ ├── rerankers/ ⇒ Cross-encoder re-ranker (optional second stage)
49
+ ├── generators/ ⇒ Hugging Face generator wrapper (T5/Flan/BART…)
50
+ ├── metrics/ ⇒ Retrieval, generation, composite RAG score
51
+ └── stats/ ⇒ Correlation, significance, robustness utilities
52
+ configs/ ← YAML templates (pipeline & stats settings)
53
+ scripts/ ← CLI helpers: run_experiments.py, download_data.sh …
54
+ tests/ ← PyTest smoke tests cover every public module
55
+ .github/workflows/ci.yml ← Lint + tests on push / PR
56
+ Dockerfile ← Slim runtime ready for reproducibility
57
+ ```
58
+
59
+ ---
60
+
61
+ ## 3. How each module maps to proposal tasks
62
+
63
+ | Proposal section | Code artefact | Purpose |
64
+ | -------------------------------------- | ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
65
+ | **Retrievers** (BM25, dense, hybrid) | `evaluation/retrievers/` | Implements **RQ1** experiments on classic vs. dense retrieval. Auto-builds indexes to ease replication. |
66
+ | **Generator** (Fixed seq2seq backbone) | `evaluation/generators/` | Holds the controlled decoding backend so retrieval changes are isolated. |
67
+ | **Cross-encoder re-ranker** | `evaluation/rerankers/` | Optional “advanced RAG” stage from Fig. 2 of the proposal; improves evidence precision. |
68
+ | **Metric taxonomy** | `evaluation/metrics/` | Classical IR metrics, semantic generation scores, and composite `rag_score` per WP3. |
69
+ | **Statistical tests & sensitivity** | `evaluation/stats/` + `StatsConfig` | Spearman/Kendall correlations (**RQ1, RQ2**), Wilcoxon + Holm-Bonferroni (**RQ2**), error-propagation χ² and robustness deltas (**RQ3, RQ4**; see the sketch after this table). |
70
+ | **Reproducibility** | Dockerfile, CI, pre-commit | Supports the EU AI Act’s technical-documentation and record-keeping requirements (Articles 11–12). |
71
+
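+ The error-propagation entry above (RQ3) boils down to a 2×2 contingency test on per-query flags. A minimal sketch of that idea with SciPy (the repository’s own `chi2_error_propagation` and `conditional_failure_rate` may differ in detail; the flags below are toy values):
+
+ ```python
+ # sketch: retrieval-error -> hallucination propagation as a 2x2 contingency test
+ import numpy as np
+ from scipy.stats import chi2_contingency
+
+ retrieval_error = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=bool)  # toy flags
+ hallucination   = np.array([0, 1, 0, 0, 1, 0, 1, 1], dtype=bool)  # toy flags
+
+ table = np.array([
+     [np.sum(~retrieval_error & ~hallucination), np.sum(~retrieval_error & hallucination)],
+     [np.sum( retrieval_error & ~hallucination), np.sum( retrieval_error & hallucination)],
+ ])
+ chi2, p, dof, _ = chi2_contingency(table)
+ p_halluc_given_error = table[1, 1] / table[1].sum()  # conditional failure rate
+ print(f"chi2={chi2:.2f}  p={p:.3g}  P(hallucination | retrieval error)={p_halluc_given_error:.2f}")
+ ```
+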
72
+ ---
73
+
74
+ ## 4. Configuration at a glance
75
+
76
+ ```yaml
77
+ # configs/pipeline_hybrid_ce.yaml
78
+ retriever:
79
+ name: hybrid # bm25 | dense | hybrid
80
+ bm25_index: indexes/legal_bm25
81
+ faiss_index: indexes/legal_dense.faiss
82
+ doc_store: data/legal_docs.jsonl
83
+ top_k: 10
84
+ alpha: 0.6
85
+
86
+ reranker:
87
+ enable: true # cross-encoder stage
88
+ model_name: cross-encoder/ms-marco-MiniLM-L-6-v2
89
+ first_stage_k: 50
90
+ final_k: 10
91
+ device: cuda:0
92
+
93
+ generator:
94
+ model_name: google/flan-t5-base
95
+ device: cuda:0
96
+ max_new_tokens: 256
97
+ temperature: 0.0
98
+
99
+ stats:
100
+ correlation_method: spearman
101
+ n_boot: 5000
102
+ ci: 0.95
103
+ wilcoxon_alternative: two-sided
104
+ multiple_correction: holm-bonferroni
105
+ alpha: 0.05
106
+ ```
107
+
108
+ All fields are documented in `evaluation/config.py`. You can override any field from the CLI (e.g. `--retriever.top_k 20`) if you parse the config with Hydra or OmegaConf, as sketched below.
109
+
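+ For instance, a minimal override sketch with OmegaConf (the bundled scripts use plain `yaml.safe_load` plus dataclass merging, so this wiring is an optional alternative rather than the shipped behaviour):
+
+ ```python
+ # sketch: dotlist overrides on top of the YAML config, using OmegaConf
+ from omegaconf import OmegaConf
+
+ base = OmegaConf.load("configs/pipeline_hybrid_ce.yaml")
+ overrides = OmegaConf.from_dotlist(["retriever.top_k=20", "generator.device=cpu"])
+ cfg = OmegaConf.merge(base, overrides)  # right-most argument wins
+ print(OmegaConf.to_yaml(cfg))
+ ```
+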
110
+ ---
111
+
112
+ ## 5. Index generation details
113
+
114
+ * **Sparse (BM25 / Lucene)**
115
+ If the `bm25_index` directory is absent, the `BM25Retriever` calls *Pyserini’s* CLI to build it from `doc_store` (JSONL with `{"id", "text"}`).
116
+ * **Dense (FAISS)**
117
+ Likewise, `DenseRetriever` embeds every document with the Sentence-Transformers model named in the config, normalises the vectors, and builds an inner-product (IP) FAISS index.
118
+
119
+ Both steps cache artefacts, so future runs start instantly.
120
+
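+ As a rough sketch of what the dense step amounts to (the real logic lives in `evaluation/retrievers/`; the embedding model name below is a placeholder, not necessarily the one in your config):
+
+ ```python
+ # sketch: embed a JSONL doc store and build an inner-product FAISS index
+ import json
+ import faiss
+ import numpy as np
+ from pathlib import Path
+ from sentence_transformers import SentenceTransformer
+
+ docs = [json.loads(line) for line in open("data/legal_docs.jsonl")]  # {"id", "text"} per line
+ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model
+ emb = model.encode([d["text"] for d in docs], normalize_embeddings=True)
+
+ index = faiss.IndexFlatIP(emb.shape[1])  # normalised vectors + inner product = cosine
+ index.add(np.asarray(emb, dtype="float32"))
+ Path("indexes").mkdir(parents=True, exist_ok=True)
+ faiss.write_index(index, "indexes/legal_dense.faiss")
+ ```
+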
121
+ ---
122
+
123
+ ## 6. Running the statistical evaluation
124
+
125
+ Each experiment run dumps a JSONL (`results.jsonl`) with per-query fields:
126
+
127
+ ```jsonc
128
+ {
129
+ "question": "...",
130
+ "answer": "...",
131
+ "contexts": ["..."],
132
+ "metrics": {
133
+ "precision@10": 0.9,
134
+ "rag_score": 0.71,
135
+ ...
136
+ },
137
+ "human_correct": true, // optional gold labels
138
+ "human_faithful": 0.8 // optional expert rating 0-1
139
+ }
140
+ ```
141
+
142
+ You can feed that into a notebook or CLI script:
143
+
144
+ ```python
145
+ from evaluation.stats import (
146
+ corr_ci, wilcoxon_signed_rank, holm_bonferroni,
147
+ delta_metric, conditional_failure_rate
148
+ )
149
+ from evaluation import StatsConfig
+ import json
150
+
151
+ cfg = StatsConfig(n_boot=5000)
152
+ # example: correlation of MRR vs. human correctness
+ rows = [json.loads(line) for line in open("outputs/results.jsonl")]  # per-query results
153
+ mrr = [r["metrics"]["mrr"] for r in rows]
154
+ gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
155
+ rho, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
156
+ print(f"Spearman ρ={rho:.2f} 95% CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
157
+ ```
158
+
159
+ All statistical primitives are implemented in pure NumPy+SciPy, ensuring compatibility with lightweight Docker images.
160
+
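+ For orientation, here is a self-contained sketch of the kind of percentile-bootstrap interval `corr_ci` reports (not the library’s exact implementation):
+
+ ```python
+ # sketch: percentile-bootstrap CI for a Spearman correlation, NumPy + SciPy only
+ import numpy as np
+ from scipy.stats import spearmanr
+
+ def bootstrap_spearman(x, y, n_boot=5000, ci=0.95, seed=0):
+     x, y = np.asarray(x, float), np.asarray(y, float)
+     rho, p = spearmanr(x, y)
+     rng = np.random.default_rng(seed)
+     idx = rng.integers(0, len(x), size=(n_boot, len(x)))  # resample query indices
+     boots = np.array([spearmanr(x[i], y[i])[0] for i in idx])
+     lo, hi = np.quantile(boots, [(1 - ci) / 2, 1 - (1 - ci) / 2])
+     return rho, (lo, hi), p
+ ```
+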
161
+ ---
162
+
163
+ ### Happy evaluating!
164
+
165
+ Questions or suggestions? Open an issue or discussion on the GitHub repo.
166
+
evaluation/__init__.py CHANGED
@@ -13,3 +13,4 @@ The public API re‑exports :class:`evaluation.pipeline.RAGPipeline`.
13
  """
14
 
15
  from .pipeline import RAGPipeline, PipelineConfig # noqa: F401
 
 
13
  """
14
 
15
  from .pipeline import RAGPipeline, PipelineConfig # noqa: F401
16
+ from .config import LoggingConfig
evaluation/config.py CHANGED
@@ -4,6 +4,13 @@ from dataclasses import dataclass
4
  from pathlib import Path
5
  from typing import Optional, Literal
6
 
 
 
 
 
 
 
 
7
  @dataclass
8
  class CrossEncoderConfig:
9
  enable: bool = False # master switch
@@ -64,6 +71,7 @@ class StatsConfig:
64
  @dataclass
65
  class PipelineConfig:
66
  """Top‑level pipeline configuration."""
 
67
  reranker: CrossEncoderConfig = CrossEncoderConfig()
68
  retriever: RetrieverConfig = RetrieverConfig()
69
  generator: GeneratorConfig = GeneratorConfig()
 
4
  from pathlib import Path
5
  from typing import Optional, Literal
6
 
7
+ @dataclass
8
+ class LoggingConfig:
9
+ log_dir: Path = Path("logs")
10
+ level: str = "INFO" # DEBUG | INFO | WARNING | ERROR | CRITICAL
11
+ max_mb: int = 5 # per-file size before rotation
12
+ backups: int = 5 # number of rotated files to keep
13
+
14
  @dataclass
15
  class CrossEncoderConfig:
16
  enable: bool = False # master switch
 
71
  @dataclass
72
  class PipelineConfig:
73
  """Top‑level pipeline configuration."""
74
+ logging: LoggingConfig = LoggingConfig()
75
  reranker: CrossEncoderConfig = CrossEncoderConfig()
76
  retriever: RetrieverConfig = RetrieverConfig()
77
  generator: GeneratorConfig = GeneratorConfig()
evaluation/stats/__init__.py CHANGED
@@ -1,5 +1,6 @@
1
  """Statistical utilities for analysis scripts."""
2
 
 
3
  from .correlation import corr_ci
4
  from .significance import wilcoxon_signed_rank, holm_bonferroni
5
  from .robustness import (
 
1
  """Statistical utilities for analysis scripts."""
2
 
3
+ from ..config import StatsConfig
4
  from .correlation import corr_ci
5
  from .significance import wilcoxon_signed_rank, holm_bonferroni
6
  from .robustness import (
evaluation/utils/logger.py ADDED
@@ -0,0 +1,56 @@
1
+ """Centralised logging initialisation (console + rotating file)."""
2
+
3
+ from __future__ import annotations
4
+ import logging
5
+ import logging.handlers
6
+ import os
7
+ import sys
8
+ from datetime import datetime, timezone
9
+ from pathlib import Path
10
+ from typing import Optional
11
+
12
+ __all__ = ["init_logging"]
13
+
14
+
15
+ def init_logging(
16
+ *,
17
+ log_dir: str | os.PathLike = "logs",
18
+ level: str | int = "INFO",
19
+ fmt: str = "%(asctime)s | %(levelname)s | %(name)s | %(message)s",
20
+ max_mb: int = 5,
21
+ backups: int = 5,
22
+ ) -> Path:
23
+ """Configure root logger for both console *and* rotating-file output.
24
+
25
+ Returns
26
+ -------
27
+ Path to the log file.
28
+ """
29
+ log_dir = Path(log_dir)
30
+ log_dir.mkdir(parents=True, exist_ok=True)
31
+ logfile = log_dir / f"{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.log"
32
+
33
+ if isinstance(level, str):
34
+ level = logging._nameToLevel.get(level.upper(), logging.INFO)
35
+ formatter = logging.Formatter(fmt)
36
+
37
+ root = logging.getLogger()
38
+ root.setLevel(level)
39
+ root.handlers.clear() # avoid duplicate handlers on re-init
40
+
41
+ # Console
42
+ ch = logging.StreamHandler(sys.stderr)
43
+ ch.setLevel(level)
44
+ ch.setFormatter(formatter)
45
+ root.addHandler(ch)
46
+
47
+ # Rotating file
48
+ fh = logging.handlers.RotatingFileHandler(
49
+ logfile, maxBytes=max_mb * 1024 * 1024, backupCount=backups
50
+ )
51
+ fh.setLevel(level)
52
+ fh.setFormatter(formatter)
53
+ root.addHandler(fh)
54
+
55
+ root.info("Logging initialised. File=%s Level=%s", logfile, logging.getLevelName(level))
56
+ return logfile
scripts/dashboard.py ADDED
@@ -0,0 +1,104 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ dashboard.py
4
+ ============
5
+
6
+ Launch with:
7
+ streamlit run scripts/dashboard.py
8
+
9
+ Relies on the directory structure produced by run_grid_experiments.py:
10
+ outputs/grid/<dataset>/<config>/{aggregates.yaml, rq1.yaml, ...}
11
+ """
12
+ from __future__ import annotations
13
+
14
+ import json
15
+ import yaml
16
+ from pathlib import Path
17
+
18
+ import pandas as pd
19
+ import streamlit as st
20
+ import matplotlib.pyplot as plt
21
+
22
+ BASE_DIR = Path("outputs/grid") # change if you store runs elsewhere
23
+ METRIC_KEY = "rag_score" # bar/box plots focus on this
24
+
25
+ # --------------------------------------------------------------------- Sidebar
26
+ st.sidebar.title("RAG-Eval Dashboard")
27
+
28
+ if not BASE_DIR.exists():
29
+ st.sidebar.error(f"Folder {BASE_DIR} not found – run experiments first.")
30
+ st.stop()
31
+
32
+ datasets = sorted([p.name for p in BASE_DIR.iterdir() if p.is_dir()])
33
+ dataset = st.sidebar.selectbox("Dataset", datasets)
34
+ conf_dir = BASE_DIR / dataset
35
+ configs = sorted([p.name for p in conf_dir.iterdir() if p.is_dir()])
36
+ sel_cfgs = st.sidebar.multiselect("Configurations", configs, default=configs)
37
+
38
+ if not sel_cfgs:
39
+ st.warning("Select at least one configuration.")
40
+ st.stop()
41
+
42
+ # ---------------------------------------------------------------- Load helpers
43
+ def _yaml(path: Path): return yaml.safe_load(path.read_text())
44
+ def _jsonl(path: Path): return [json.loads(l) for l in path.read_text().splitlines()]
45
+
46
+ # ---------------------------------------------------------------- Main view
47
+ st.title(f"Dataset: {dataset}")
48
+
49
+ # ── Aggregated metrics table ────────────────────────────────────────────────
50
+ agg = {c: _yaml(conf_dir / c / "aggregates.yaml") for c in sel_cfgs}
51
+ agg_df = pd.DataFrame(agg).T
52
+ st.subheader("Aggregated metrics")
53
+ st.dataframe(agg_df, use_container_width=True)
54
+
55
+ # ── Bar chart of rag_score means ────────────────────────────────────────────
56
+ st.subheader(f"Mean {METRIC_KEY}")
57
+ fig, ax = plt.subplots()
58
+ agg_df[METRIC_KEY].plot.bar(ax=ax)
59
+ ax.set_ylabel(METRIC_KEY)
60
+ ax.set_ylim(0, 1)
61
+ st.pyplot(fig)
62
+
63
+ # ── Scatter MRR vs Correctness per config ───────────────────────────────────
64
+ st.subheader("MRR vs Human Correctness")
65
+ cols = st.columns(len(sel_cfgs))
66
+ for col, cfg in zip(cols, sel_cfgs):
67
+ rows = _jsonl(conf_dir / cfg / "results.jsonl")
68
+ x = [r["metrics"].get("mrr", float("nan")) for r in rows]
69
+ y = [1 if r.get("human_correct") else 0 for r in rows]
70
+ fig, ax = plt.subplots()
71
+ ax.scatter(x, y, alpha=0.5)
72
+ ax.set(title=cfg, xlabel="MRR", ylabel="Correct?")
73
+ col.pyplot(fig)
74
+
75
+ # ── Pairwise Wilcoxon-Holm table (rag_score) ────────────────────────────────
76
+ wh_path = conf_dir / "wilcoxon_rag_holm.yaml"
77
+ if wh_path.exists():
78
+ st.subheader("Pairwise Wilcoxon-Holm (rag_score)")
79
+ wh_df = pd.Series(_yaml(wh_path), name="p_adj").to_frame()
80
+ st.dataframe(wh_df)
81
+ else:
82
+ st.info("Wilcoxon table not found – run_grid_experiments.py computes it.")
83
+
84
+ # ── Research-question YAMLs ─────────────────────────────────────────────────
85
+ rq_tabs = st.tabs([f"{cfg}" for cfg in sel_cfgs])
86
+ for tab, cfg in zip(rq_tabs, sel_cfgs):
87
+ with tab:
88
+ for rq in ("rq1", "rq2", "rq3", "rq4"):
89
+ path = conf_dir / cfg / f"{rq}.yaml"
90
+ if path.exists():
91
+ st.markdown(f"**{rq.upper()}**")
92
+ st.json(_yaml(path))
93
+ else:
94
+ st.markdown(f"*{rq.upper()} – not available*")
95
+
96
+ # ── Raw results download ────────────────────────────────────────────────────
97
+ st.sidebar.subheader("Download")
98
+ for cfg in sel_cfgs:
99
+ st.sidebar.download_button(
100
+ label=f"{cfg} results.jsonl",
101
+ data=(conf_dir / cfg / "results.jsonl").read_bytes(),
102
+ file_name=f"{dataset}_{cfg}_results.jsonl",
103
+ mime="application/jsonl",
104
+ )
scripts/run_experiments.py ADDED
@@ -0,0 +1,251 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ run_experiments.py
4
+ ==================
5
+
6
+ High-level driver that wires together:
7
+
8
+ 1. YAML / CLI β†’ `PipelineConfig` + `LoggingConfig`
9
+ 2. Initialises dual-sink logging (console + rotating file)
10
+ 3. Builds a `RAGPipeline`
11
+ 4. Streams a list of questions through the pipeline
12
+ 5. Logs progress, writes per-query JSONL results, and
13
+ (optionally) prints aggregate statistics.
14
+
15
+ You can keep it minimal – or expand the marked TODO sections to:
16
+ * compute metrics immediately
17
+ * push results to a tracker (W&B, MLflow, etc.)
18
+ * spawn multiple configs in parallel.
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import argparse
23
+ import json
24
+ import sys
25
+ from pathlib import Path
26
+ from typing import Any, Dict, Iterable, List, Mapping
27
+
28
+ import yaml
29
+
30
+ from evaluation import (
31
+ PipelineConfig,
32
+ RetrieverConfig,
33
+ GeneratorConfig,
34
+ CrossEncoderConfig,
35
+ StatsConfig,
36
+ LoggingConfig,
37
+ RAGPipeline,
38
+ )
39
+ from evaluation.utils.logger import init_logging
40
+
41
+ from evaluation.stats import (
42
+ corr_ci,
43
+ wilcoxon_signed_rank,
44
+ holm_bonferroni,
45
+ )
46
+
47
+ import matplotlib.pyplot as plt
48
+
49
+ # ──────────────────────────────────────────────────────────────────────────────
50
+ # Helpers
51
+ # ──────────────────────────────────────────────────────────────────────────────
52
+
53
+
54
+ def _merge_dataclass(dc_cls, default, override: Mapping[str, Any]):
55
+ """Return a new *dc_cls* where fields from *override* overwrite *default*."""
56
+ from dataclasses import asdict
57
+
58
+ merged = asdict(default)
59
+ merged.update({k: v for k, v in override.items() if v is not None})
60
+ return dc_cls(**merged)
61
+
62
+
63
+ def _load_pipeline_config(yaml_path: Path | None) -> PipelineConfig:
64
+ """Parse YAML into nested dataclasses; fall back to defaults."""
65
+ if yaml_path is None:
66
+ return PipelineConfig() # all defaults
67
+
68
+ data = yaml.safe_load(yaml_path.read_text())
69
+
70
+ # pass the dataclass *class* first so `dc_cls(**merged)` can construct a new instance
+ retr_cfg = _merge_dataclass(
71
+ RetrieverConfig, RetrieverConfig(), data.get("retriever", {})
72
+ )
73
+ gen_cfg = _merge_dataclass(
74
+ GeneratorConfig, GeneratorConfig(), data.get("generator", {})
75
+ )
76
+ rr_cfg = _merge_dataclass(
77
+ CrossEncoderConfig, CrossEncoderConfig(), data.get("reranker", {})
78
+ )
79
+ stats_cfg = _merge_dataclass(StatsConfig, StatsConfig(), data.get("stats", {}))
80
+ log_cfg = _merge_dataclass(LoggingConfig, LoggingConfig(), data.get("logging", {}))
81
+
82
+ return PipelineConfig(
83
+ retriever=retr_cfg,
84
+ generator=gen_cfg,
85
+ reranker=rr_cfg,
86
+ stats=stats_cfg,
87
+ logging=log_cfg,
88
+ )
89
+
90
+
91
+ def _read_jsonl(path: Path) -> List[Dict[str, Any]]:
92
+ with path.open() as f:
93
+ return [json.loads(line) for line in f]
94
+
95
+
96
+ def _write_jsonl(path: Path, rows: Iterable[Mapping[str, Any]]):
97
+ path.parent.mkdir(parents=True, exist_ok=True)
98
+ with path.open("w") as f:
99
+ for row in rows:
100
+ f.write(json.dumps(row) + "\n")
101
+
102
+ # Stats Helper
103
+ def aggregate_metrics(rows: list[dict[str, Any]]) -> dict[str, float]:
104
+ """Return mean of every numeric metric found under row['metrics']."""
105
+ import numpy as np
106
+ keys = rows[0]["metrics"].keys()
107
+ return {k: float(np.mean([r["metrics"][k] for r in rows])) for k in keys}
108
+
109
+
110
+ def correlation_with_gold(rows: list[dict[str, Any]], cfg: StatsConfig):
111
+ """Spearman/Kendall correlation between retrieval scores and correctness flag."""
112
+ if "human_correct" not in rows[0]:
113
+ return None # nothing to correlate
114
+ mrr = [r["metrics"].get("mrr", float("nan")) for r in rows]
115
+ gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
116
+ r, (lo, hi), p = corr_ci(
117
+ mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot, ci=cfg.ci
118
+ )
119
+ return dict(r=r, ci_low=lo, ci_high=hi, p=p)
120
+
121
+
122
+ def wilcoxon_against_baseline(
123
+ cur: list[dict[str, Any]],
124
+ base: list[dict[str, Any]],
125
+ cfg: StatsConfig,
126
+ ):
127
+ """Paired Wilcoxon + Holm-Bonferroni across all metric keys."""
128
+ from evaluation.stats import wilcoxon_signed_rank, holm_bonferroni
129
+
130
+ assert len(cur) == len(base), "Runs must have same #queries"
131
+ metrics = cur[0]["metrics"].keys()
132
+ p_raw = {}
133
+ for m in metrics:
134
+ cur_m = [r["metrics"][m] for r in cur]
135
+ base_m = [r["metrics"][m] for r in base]
136
+ _, p = wilcoxon_signed_rank(cur_m, base_m, alternative=cfg.wilcoxon_alternative)
137
+ p_raw[m] = p
138
+ return holm_bonferroni(p_raw)
139
+
140
+ # Plot helper
141
+ def save_scatter(rows, out_dir: Path):
142
+ out_dir.mkdir(parents=True, exist_ok=True)
143
+ x = [r["metrics"].get("mrr", float("nan")) for r in rows]  # keep x and y aligned per query
144
+ y = [1.0 if r.get("human_correct") else 0.0 for r in rows]
145
+ plt.figure()
146
+ plt.scatter(x, y, alpha=0.6)
147
+ plt.xlabel("MRR")
148
+ plt.ylabel("Correct (1=yes)")
149
+ plt.title("MRR vs. Human Correctness")
150
+ path = out_dir / "mrr_vs_correct.png"
151
+ plt.savefig(path, bbox_inches="tight")
152
+ plt.close()
153
+ return path
154
+
155
+ # ──────────────────────────────────────────────────────────────────────────────
156
+ # Main
157
+ # ──────────────────────────────────────────────────────────────────────────────
158
+ def main(argv: list[str] | None = None) -> None:
159
+ ap = argparse.ArgumentParser(description="Run RAG evaluation experiments.")
160
+ ap.add_argument("--config", type=Path, help="YAML config with pipeline settings")
161
+ ap.add_argument(
162
+ "--queries",
163
+ type=Path,
164
+ required=True,
165
+ help="JSONL file – each line must contain at least {'question': ...}",
166
+ )
167
+ ap.add_argument(
168
+ "--output",
169
+ type=Path,
170
+ default=Path("outputs/results.jsonl"),
171
+ help="Where to write JSONL results",
172
+ )
173
+ ap.add_argument("--dry-run", action="store_true", help="Do not execute pipeline")
174
+ ap.add_argument(
175
+ "--baseline",
176
+ type=Path,
177
+ help="Optional: JSONL with baseline run for significance tests",
178
+ )
179
+ ap.add_argument(
180
+ "--plots",
181
+ action="store_true",
182
+ help="Save diagnostic plots (PNG) alongside results",
183
+ )
184
+ args = ap.parse_args(argv)
185
+
186
+ # 1. Parse configuration
187
+ cfg = _load_pipeline_config(args.config)
188
+
189
+ # 2. Initialise logging (file + stderr)
190
+ init_logging(
191
+ log_dir=cfg.logging.log_dir,
192
+ level=cfg.logging.level,
193
+ max_mb=cfg.logging.max_mb,
194
+ backups=cfg.logging.backups,
195
+ )
196
+
197
+ import logging
198
+
199
+ logger = logging.getLogger(__name__)
200
+ logger.info("Loaded PipelineConfig:\n%s", cfg)
201
+
202
+ # 3. Build pipeline (retrieval β†’ (rerank) β†’ generation)
203
+ pipeline = RAGPipeline(cfg)
204
+
205
+ # 4. Load queries
206
+ rows = _read_jsonl(args.queries)
207
+ logger.info("Loaded %d queries from %s", len(rows), args.queries)
208
+
209
+ if args.dry_run:
210
+ logger.warning("Dry-run flag active – exiting before execution.")
211
+ sys.exit(0)
212
+
213
+ # 5. Execute pipeline
214
+ results: List[Dict[str, Any]] = []
215
+ for i, row in enumerate(rows, 1):
216
+ q = row["question"]
217
+ logger.info("[%d/%d] Q: %s", i, len(rows), q)
218
+ out = pipeline.run(q)
219
+ merged = {**row, **out} # keep any gold labels or metadata
220
+ results.append(merged)
221
+
222
+ # 6. Persist results
223
+ _write_jsonl(args.output, results)
224
+ logger.info("Wrote %d results to %s", len(results), args.output)
225
+
226
+ # 7. Aggregate statistics, significance tests, plots
227
+ agg = aggregate_metrics(results)
228
+ logger.info("Mean metrics: %s", json.dumps(agg, indent=2))
229
+
230
+ corr = correlation_with_gold(results, cfg.stats)
231
+ if corr:
232
+ logger.info(
233
+ "Correlation MRR↔gold %s=%.3f 95%%CI=[%.3f, %.3f] p=%.3g",
234
+ cfg.stats.correlation_method,
235
+ corr["r"],
236
+ corr["ci_low"],
237
+ corr["ci_high"],
238
+ corr["p"],
239
+ )
240
+
241
+ if args.baseline:
242
+ baseline_rows = _read_jsonl(args.baseline)
243
+ p_adj = wilcoxon_against_baseline(results, baseline_rows, cfg.stats)
244
+ logger.info("Wilcoxon vs baseline (Holm-Bonferroni Ξ±=%s): %s", cfg.stats.alpha, p_adj)
245
+
246
+ if args.plots:
247
+ plot_path = save_scatter(results, args.output.parent)
248
+ logger.info("Saved plot β†’ %s", plot_path)
249
+
250
+ if __name__ == "__main__":
251
+ main()
scripts/run_grid_experiments.py ADDED
@@ -0,0 +1,239 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ run_grid_experiments.py
4
+ =======================
5
+ Batch driver for *config Γ— dataset* evaluation, including:
6
+
7
+ * RQ1 – Correlation of classical retrieval metrics with factual-correctness
8
+ * RQ2 – Correlation of faithfulness metrics with expert judgements
9
+ * RQ3 – Retrieval-error ➜ hallucination propagation (χ² + conditional rates)
10
+ * RQ4 – Robustness under adversarial perturbations (Δ-metrics, Cohen's d)
11
+
12
+ Features
13
+ --------
14
+ * Incremental mode – pass **one** new --config and it is compared to all
16
+ previous runs already found under --outdir/<dataset>/.
16
+ * Saves:
17
+ - `results.jsonl`
18
+ - `aggregates.yaml`
19
+ - `rq1.yaml`, `rq2.yaml`, `rq3.yaml`, `rq4.yaml`
20
+ - pairwise Wilcoxon/Holm tables
21
+ - bar-, box-, scatter-plots (if --plots flag)
22
+ """
23
+
24
+ from __future__ import annotations
25
+
26
+ import argparse
27
+ import itertools
28
+ import json
29
+ import logging
30
+ import os
31
+ from pathlib import Path
32
+ from typing import Any, Dict, Iterable, List, Mapping
33
+
34
+ import matplotlib.pyplot as plt
35
+ import numpy as np
36
+ import yaml
37
+
38
+ from evaluation import (
39
+ PipelineConfig,
40
+ RetrieverConfig,
41
+ GeneratorConfig,
42
+ CrossEncoderConfig,
43
+ StatsConfig,
44
+ LoggingConfig,
45
+ RAGPipeline,
46
+ )
47
+ from evaluation.stats import (
48
+ corr_ci,
49
+ wilcoxon_signed_rank,
50
+ holm_bonferroni,
51
+ conditional_failure_rate,
52
+ chi2_error_propagation,
53
+ delta_metric,
54
+ )
55
+ from evaluation.utils.logger import init_logging
56
+
57
+ # ─────────────────────────────── I/O helpers ────────────────────────────────
58
+
59
+
60
+ def read_jsonl(path: Path) -> List[Dict[str, Any]]:
61
+ with path.open() as f:
62
+ return [json.loads(line) for line in f]
63
+
64
+
65
+ def write_jsonl(path: Path, rows: Iterable[Mapping[str, Any]]) -> None:
66
+ path.parent.mkdir(parents=True, exist_ok=True)
67
+ with path.open("w") as f:
68
+ for row in rows:
69
+ f.write(json.dumps(row) + "\n")
70
+
71
+
72
+ def save_yaml(path: Path, obj: Mapping[str, Any]) -> None:
73
+ path.parent.mkdir(parents=True, exist_ok=True)
74
+ path.write_text(yaml.safe_dump(obj, sort_keys=False))
75
+
76
+
77
+ # ─────────────────────── config merge (same as earlier) ─────────────────────
78
+
79
+
80
+ def merge_dataclass(dc_cls, override: Mapping[str, Any]):
81
+ from dataclasses import asdict
82
+
83
+ base = asdict(dc_cls())
84
+ base.update({k: v for k, v in override.items() if v is not None})
85
+ return dc_cls(**base)
86
+
87
+
88
+ def load_pipeline_config(yaml_path: Path) -> PipelineConfig:
89
+ data = yaml.safe_load(yaml_path.read_text())
90
+ return PipelineConfig(
91
+ retriever=merge_dataclass(RetrieverConfig, data.get("retriever", {})),
92
+ generator=merge_dataclass(GeneratorConfig, data.get("generator", {})),
93
+ reranker=merge_dataclass(CrossEncoderConfig, data.get("reranker", {})),
94
+ stats=merge_dataclass(StatsConfig, data.get("stats", {})),
95
+ logging=merge_dataclass(LoggingConfig, data.get("logging", {})),
96
+ )
97
+
98
+
99
+ # ───────────────────────────── stats helpers ────────────────────────────────
100
+ def agg_mean(rows: List[dict[str, Any]]) -> dict[str, float]:
101
+ keys = rows[0]["metrics"].keys()
102
+ return {k: float(np.mean([r["metrics"][k] for r in rows])) for k in keys}
103
+
104
+
105
+ def rq1_correlation(rows, cfg: StatsConfig):
106
+ if "human_correct" not in rows[0]:
107
+ return {}
108
+ retrieval_keys = [k for k in rows[0]["metrics"] if k in {"mrr", "map", "precision@10"}]
109
+ gold = [1.0 if r["human_correct"] else 0.0 for r in rows]
110
+ out = {}
111
+ for k in retrieval_keys:
112
+ vec = [r["metrics"][k] for r in rows]
113
+ r, (lo, hi), p = corr_ci(vec, gold, method=cfg.correlation_method,
114
+ n_boot=cfg.n_boot, ci=cfg.ci)
115
+ out[k] = dict(r=r, ci=[lo, hi], p=p)
116
+ return out
117
+
118
+
119
+ def rq2_faithfulness(rows, cfg: StatsConfig):
120
+ if "human_faithful" not in rows[0]:
121
+ return {}
122
+ faith_keys = [k for k in rows[0]["metrics"] if k.lower().startswith(("faith", "qags", "fact", "ragas"))]
123
+ gold = [r["human_faithful"] for r in rows]
124
+ out = {}
125
+ for k in faith_keys:
126
+ vec = [r["metrics"][k] for r in rows]
127
+ r, (lo, hi), p = corr_ci(vec, gold, method=cfg.correlation_method,
128
+ n_boot=cfg.n_boot, ci=cfg.ci)
129
+ out[k] = dict(r=r, ci=[lo, hi], p=p)
130
+ return out
131
+
132
+
133
+ def rq3_error_propagation(rows):
134
+ if "retrieval_error" not in rows[0] or "hallucination" not in rows[0]:
135
+ return {}
136
+ ret_err = [r["retrieval_error"] for r in rows]
137
+ halluc = [r["hallucination"] for r in rows]
138
+ cond = conditional_failure_rate(ret_err, halluc)
139
+ chi2 = chi2_error_propagation(ret_err, halluc)
140
+ return {"conditional": cond, "chi2": chi2}
141
+
142
+
143
+ def rq4_robustness(orig_rows, pert_rows):
144
+ if pert_rows is None:
145
+ return {}
146
+ metrics = orig_rows[0]["metrics"].keys()
147
+ out = {}
148
+ for m in metrics:
149
+ d, eff = delta_metric(
150
+ [r["metrics"][m] for r in orig_rows],
151
+ [r["metrics"][m] for r in pert_rows],
152
+ )
153
+ out[m] = dict(delta=d, cohen_d=eff)
154
+ return out
155
+
156
+
157
+ # ─────────────────────────── plotting helpers ───────────────────────────────
158
+ def scatter_mrr_vs_correct(rows, path: Path):
159
+ x = [r["metrics"].get("mrr", np.nan) for r in rows]
160
+ y = [1 if r.get("human_correct") else 0 for r in rows]
161
+ plt.figure()
162
+ plt.scatter(x, y, alpha=0.5)
163
+ plt.xlabel("MRR"); plt.ylabel("Correct (1)")
164
+ plt.title("MRR vs. Human Correctness")
165
+ plt.tight_layout(); plt.savefig(path); plt.close()
166
+
167
+
168
+ # ────────────────────────────────── main ────────────────────────────────────
169
+ def main(argv: list[str] | None = None) -> None:
170
+ ap = argparse.ArgumentParser()
171
+ ap.add_argument("--configs", nargs="+", type=Path, required=True,
172
+ help="One or more YAML configs; if one, compared against prior runs.")
173
+ ap.add_argument("--datasets", nargs="+", type=Path, required=True)
174
+ ap.add_argument("--outdir", type=Path, default=Path("outputs/grid"))
175
+ ap.add_argument("--plots", action="store_true")
176
+ ap.add_argument("--perturbed-suffix", default="_pert",
177
+ help="If dataset perturbed version exists (name+suffix.jsonl) it's used for RQ4.")
178
+ args = ap.parse_args(argv)
179
+
180
+ init_logging(log_dir=args.outdir / "logs", level="INFO")
181
+ log = logging.getLogger("grid")
182
+
183
+ for dataset in args.datasets:
184
+ log.info("Dataset: %s", dataset.name)
185
+ queries = read_jsonl(dataset)
186
+ pert_path = dataset.with_stem(dataset.stem + args.perturbed_suffix)
187
+ pert_rows = read_jsonl(pert_path) if pert_path.exists() else None
188
+
189
+ # discover historical configs to compare against if incremental mode
190
+ hist_dirs = (args.outdir / dataset.stem).glob("*") if len(args.configs) == 1 else []
191
+ historical = {d.name: read_jsonl(d / "results.jsonl") for d in hist_dirs if d.is_dir()}
192
+
193
+ for cfg_yaml in args.configs:
194
+ cfg_name = cfg_yaml.stem
195
+ log.info(" Config: %s", cfg_name)
196
+ cfg = load_pipeline_config(cfg_yaml)
197
+ pipe = RAGPipeline(cfg)
198
+
199
+ # skip if results already exist
200
+ run_dir = args.outdir / dataset.stem / cfg_name
201
+ if (run_dir / "results.jsonl").exists():
202
+ log.info(" results already present – loading.")
203
+ rows = read_jsonl(run_dir / "results.jsonl")
204
+ else:
205
+ rows = [pipe.run(q["question"]) | q for q in queries]
206
+ write_jsonl(run_dir / "results.jsonl", rows)
207
+
208
+ # aggregates & RQ1–4
209
+ save_yaml(run_dir / "aggregates.yaml", agg_mean(rows))
210
+ save_yaml(run_dir / "rq1.yaml", rq1_correlation(rows, cfg.stats))
211
+ save_yaml(run_dir / "rq2.yaml", rq2_faithfulness(rows, cfg.stats))
212
+ save_yaml(run_dir / "rq3.yaml", rq3_error_propagation(rows))
213
+
214
+ if pert_rows:
215
+ save_yaml(run_dir / "rq4.yaml", rq4_robustness(rows, pert_rows))
216
+
217
+ if args.plots:
218
+ scatter_mrr_vs_correct(rows, run_dir / "mrr_vs_correct.png")
219
+
220
+ historical[cfg_name] = rows # include current for pairwise tests
221
+
222
+ # pairwise Wilcoxon on rag_score
223
+ if len(historical) > 1:
224
+ pairs = {}
225
+ names = list(historical)
226
+ for a, b in itertools.combinations(names, 2):
227
+ x = [r["metrics"]["rag_score"] for r in historical[a]]
228
+ y = [r["metrics"]["rag_score"] for r in historical[b]]
229
+ _, p = wilcoxon_signed_rank(x, y)
230
+ pairs[f"{a}~{b}"] = p
231
+ save_yaml(args.outdir / dataset.stem / "wilcoxon_rag_raw.yaml", pairs)
232
+ save_yaml(args.outdir / dataset.stem / "wilcoxon_rag_holm.yaml",
233
+ holm_bonferroni(pairs))
234
+
235
+ log.info(" Pairwise rag_score significance stored (Holm adjusted).")
236
+
237
+
238
+ if __name__ == "__main__":
239
+ main()