{ "cells": [ { "cell_type": "markdown", "id": "intro-header", "metadata": {}, "source": [ "# Statistical Methods Tutorial: Understanding WASP2's Beta-Binomial Framework\n", "\n", "**Estimated time:** ~45 minutes\n", "\n", "This interactive tutorial provides a deep dive into the statistical methods used by WASP2 for detecting allelic imbalance. You will learn:\n", "\n", "1. **Why overdispersion matters** - Why the simple binomial model fails for sequencing data\n", "2. **Beta-binomial distributions** - The statistical foundation of WASP2\n", "3. **Dispersion estimation** - MLE vs Method of Moments approaches\n", "4. **QQ plots** - Visualizing model fit and calibration\n", "5. **FDR correction** - Benjamini-Hochberg and alternatives\n", "\n", "## Prerequisites\n", "\n", "This tutorial assumes familiarity with:\n", "- Basic probability distributions (binomial)\n", "- Hypothesis testing concepts (p-values)\n", "- Python data analysis (numpy, pandas, matplotlib)\n", "\n", "No prior knowledge of beta-binomial distributions or overdispersion is required.\n", "\n", "## Relationship to WASP2 Source Code\n", "\n", "The functions in this tutorial mirror the implementations in:\n", "- `src/analysis/as_analysis.py` - Core statistical functions (`clamp_rho`, `opt_prob`, `opt_linear`)\n", "- `src/analysis/compare_ai.py` - Comparative analysis between groups\n", "- `src/analysis/as_analysis_sc.py` - Single-cell extensions" ] }, { "cell_type": "code", "execution_count": null, "id": "setup-imports", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "from scipy.stats import binom, betabinom, chi2, false_discovery_control\n", "from scipy.special import expit, logit\n", "from scipy.optimize import minimize_scalar, minimize\n", "from typing import Tuple, Union\n", "from numpy.typing import NDArray\n", "import warnings\n", "\n", "# Configure plotting\n", "plt.style.use('seaborn-v0_8-whitegrid')\n", "plt.rcParams['figure.figsize'] = (10, 6)\n", "plt.rcParams['font.size'] = 11\n", "np.random.seed(42)\n", "\n", "print(\"Statistical Methods Tutorial\")\n", "print(\"=\" * 40)\n", "print(f\"NumPy version: {np.__version__}\")\n", "print(f\"SciPy version: {stats.scipy.__version__ if hasattr(stats, 'scipy') else 'N/A'}\")" ] }, { "cell_type": "markdown", "id": "constants-header", "metadata": {}, "source": [ "## Critical Constants and Helper Functions\n", "\n", "WASP2 defines critical numerical constants to prevent division by zero and numerical overflow. These are defined in `src/analysis/as_analysis.py` (Issue #228).\n", "\n", "**Why this matters:** The beta-binomial parameterization uses $\\alpha = \\mu \\cdot \\frac{1-\\rho}{\\rho}$, which:\n", "- Causes **division by zero** when $\\rho = 0$\n", "- Produces **zero alpha/beta** when $\\rho = 1$\n", "\n", "WASP2 clamps $\\rho$ to $(\\epsilon, 1-\\epsilon)$ where $\\epsilon = 10^{-10}$." 
] }, { "cell_type": "code", "execution_count": null, "id": "constants-definition", "metadata": {}, "outputs": [], "source": [ "# =============================================================================\n", "# BETA-BINOMIAL RHO PARAMETER BOUNDS (Issue #228)\n", "# =============================================================================\n", "# Matches WASP2's src/analysis/as_analysis.py:33\n", "# =============================================================================\n", "\n", "RHO_EPSILON: float = 1e-10\n", "\n", "\n", "def clamp_rho(rho: Union[float, NDArray[np.float64]]) -> Union[float, NDArray[np.float64]]:\n", " \"\"\"\n", " Clamp dispersion parameter rho to safe range (epsilon, 1-epsilon).\n", " \n", " This function mirrors WASP2's src/analysis/as_analysis.py:36-50.\n", " \n", " The beta-binomial parameterization uses alpha = mu * (1-rho) / rho, which\n", " causes division by zero when rho=0 and produces zero alpha/beta when rho=1.\n", " This function prevents these boundary issues.\n", " \n", " Args:\n", " rho: Dispersion parameter (scalar or array), expected in [0, 1]\n", " \n", " Returns:\n", " Clamped rho in range (RHO_EPSILON, 1 - RHO_EPSILON)\n", " \n", " Example:\n", " >>> clamp_rho(0.0) # Would cause division by zero\n", " 1e-10\n", " >>> clamp_rho(1.0) # Would produce zero alpha/beta\n", " 0.9999999999\n", " \"\"\"\n", " return np.clip(rho, RHO_EPSILON, 1.0 - RHO_EPSILON)\n", "\n", "\n", "# Demonstrate the importance of clamping\n", "print(\"Demonstrating rho clamping (Issue #228):\")\n", "print(f\" RHO_EPSILON = {RHO_EPSILON}\")\n", "print(f\" clamp_rho(0.0) = {clamp_rho(0.0)}\")\n", "print(f\" clamp_rho(1.0) = {clamp_rho(1.0)}\")\n", "print(f\" clamp_rho(0.05) = {clamp_rho(0.05)} # No change needed\")" ] }, { "cell_type": "markdown", "id": "section1-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 1: Understanding Overdispersion\n", "\n", "### 1.1 The Naive Binomial Model\n", "\n", "When counting alleles at a heterozygous SNP, the simplest model assumes each read is an independent coin flip with probability 0.5 of coming from each allele:\n", "\n", "$$X \\sim \\text{Binomial}(N, p=0.5)$$\n", "\n", "where $X$ is the reference allele count and $N$ is the total read count.\n", "\n", "**Expected variance:** $\\text{Var}(X) = N \\cdot p \\cdot (1-p) = N/4$\n", "\n", "Let's simulate what this looks like:" ] }, { "cell_type": "code", "execution_count": null, "id": "binomial-simulation", "metadata": {}, "outputs": [], "source": [ "# Simulate ideal binomial data (balanced, no overdispersion)\n", "n_snps = 1000\n", "read_depth = 50 # Total reads per SNP\n", "\n", "# Input validation\n", "assert n_snps > 0, \"n_snps must be positive\"\n", "assert read_depth > 0, \"read_depth must be positive\"\n", "\n", "# Perfect binomial sampling\n", "ideal_ref_counts = np.random.binomial(n=read_depth, p=0.5, size=n_snps)\n", "ideal_ratios = ideal_ref_counts / read_depth\n", "\n", "# Calculate observed vs expected variance\n", "expected_var = read_depth * 0.5 * 0.5\n", "observed_var = np.var(ideal_ref_counts)\n", "\n", "print(f\"Binomial Model (N={read_depth}, p=0.5)\")\n", "print(f\" Expected variance: {expected_var:.2f}\")\n", "print(f\" Observed variance: {observed_var:.2f}\")\n", "print(f\" Ratio (observed/expected): {observed_var/expected_var:.3f}\")" ] }, { "cell_type": "markdown", "id": "overdispersion-intro", "metadata": {}, "source": [ "### 1.2 The Problem: Real Data Shows Overdispersion\n", "\n", "In real sequencing data, the observed variance is typically 
**larger** than expected from a binomial model. This is called **overdispersion**.\n", "\n", "**Sources of overdispersion in allele counting:**\n", "1. **PCR amplification bias** - Some fragments amplify better than others\n", "2. **Library preparation effects** - Batch effects during sample prep\n", "3. **Technical variability** - Across lanes, flowcells, and sequencing runs\n", "4. **Mapping bias** - Reference allele may map slightly better (even after WASP correction)\n", "\n", "Let's simulate data with overdispersion:" ] }, { "cell_type": "code", "execution_count": null, "id": "overdispersed-simulation", "metadata": {}, "outputs": [], "source": [ "def simulate_overdispersed(\n", " n_snps: int, \n", " read_depth: int, \n", " rho: float = 0.05\n", ") -> NDArray[np.int64]:\n", " \"\"\"\n", " Simulate overdispersed allele counts using beta-binomial.\n", " \n", " Args:\n", " n_snps: Number of SNPs to simulate\n", " read_depth: Total read depth per SNP\n", " rho: Dispersion parameter (0 = binomial, higher = more overdispersed)\n", " \n", " Returns:\n", " Array of reference allele counts\n", " \n", " Raises:\n", " ValueError: If parameters are out of valid range\n", " \"\"\"\n", " # Input validation\n", " if n_snps <= 0:\n", " raise ValueError(f\"n_snps must be positive, got {n_snps}\")\n", " if read_depth <= 0:\n", " raise ValueError(f\"read_depth must be positive, got {read_depth}\")\n", " if not 0 < rho < 1:\n", " raise ValueError(f\"rho must be in (0, 1), got {rho}\")\n", " \n", " mu = 0.5 # Balanced (no true imbalance)\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " \n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " \n", " return betabinom.rvs(n=read_depth, a=alpha, b=beta, size=n_snps)\n", "\n", "\n", "# Simulate with different dispersion levels\n", "rho_values = [0.001, 0.02, 0.05, 0.10]\n", "overdispersed_data = {rho: simulate_overdispersed(n_snps, read_depth, rho) \n", " for rho in rho_values}\n", "\n", "# Compare variances\n", "print(f\"Expected binomial variance: {expected_var:.2f}\")\n", "print(\"\\nObserved variances by dispersion (rho):\")\n", "for rho, data in overdispersed_data.items():\n", " obs_var = np.var(data)\n", " print(f\" rho={rho:.3f}: variance={obs_var:.2f} (ratio={obs_var/expected_var:.2f}x)\")" ] }, { "cell_type": "code", "execution_count": null, "id": "overdispersion-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize the effect of overdispersion\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", "\n", "for idx, (rho, data) in enumerate(overdispersed_data.items()):\n", " ax = axes.flat[idx]\n", " ratios = data / read_depth\n", " \n", " # Histogram of observed ratios\n", " ax.hist(ratios, bins=30, density=True, alpha=0.7, color='steelblue', \n", " edgecolor='black', label='Observed')\n", " \n", " # Overlay expected binomial distribution\n", " x = np.arange(0, read_depth + 1)\n", " binomial_pmf = binom.pmf(x, read_depth, 0.5)\n", " ax.plot(x/read_depth, binomial_pmf * read_depth, 'r-', lw=2, \n", " label='Expected (Binomial)')\n", " \n", " obs_var = np.var(data)\n", " ax.set_title(f'rho = {rho:.3f}\\nVariance ratio: {obs_var/expected_var:.2f}x', fontsize=11)\n", " ax.set_xlabel('Reference Allele Frequency')\n", " ax.set_ylabel('Density')\n", " ax.legend(fontsize=9)\n", " ax.set_xlim(0, 1)\n", "\n", "plt.suptitle('Effect of Overdispersion on Allele Ratio Distributions', fontsize=14, y=1.02)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\nKey observation: As rho increases, the distribution 
becomes wider\")\n", "print(\"than expected from binomial sampling alone.\")" ] }, { "cell_type": "markdown", "id": "consequences-header", "metadata": {}, "source": [ "### 1.3 Consequences of Ignoring Overdispersion\n", "\n", "If we use a binomial model on overdispersed data, we will:\n", "\n", "1. **Underestimate variance** → p-values too small\n", "2. **Inflate false positive rate** → Many spurious \"significant\" results\n", "3. **Poor calibration** → QQ plots show massive inflation\n", "\n", "Let's demonstrate this problem:" ] }, { "cell_type": "code", "execution_count": null, "id": "false-positive-demo", "metadata": {}, "outputs": [], "source": [ "def binomial_test_pvalue(ref_count: int, total_count: int) -> float:\n", " \"\"\"\n", " Two-sided binomial test for deviation from 0.5.\n", " \n", " Args:\n", " ref_count: Number of reference allele reads\n", " total_count: Total number of reads\n", " \n", " Returns:\n", " Two-sided p-value\n", " \"\"\"\n", " # Input validation\n", " if ref_count < 0 or ref_count > total_count:\n", " raise ValueError(f\"Invalid counts: ref={ref_count}, total={total_count}\")\n", " if total_count <= 0:\n", " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", " \n", " result = stats.binomtest(ref_count, total_count, p=0.5, alternative='two-sided')\n", " return result.pvalue\n", "\n", "\n", "# Test on overdispersed data (rho=0.05) - no TRUE imbalance!\n", "test_data = overdispersed_data[0.05]\n", "pvalues_binomial = [binomial_test_pvalue(int(k), read_depth) for k in test_data]\n", "\n", "# Count \"significant\" results at different thresholds\n", "pvalues_binomial = np.array(pvalues_binomial)\n", "\n", "print(\"False positive rates using binomial test on overdispersed data:\")\n", "print(\"(Remember: there is NO true imbalance in this data!)\\n\")\n", "for alpha in [0.05, 0.01, 0.001]:\n", " fp_rate = (pvalues_binomial < alpha).mean()\n", " print(f\" Alpha = {alpha:.3f}: {fp_rate*100:.1f}% significant (expected: {alpha*100:.1f}%)\")\n", "\n", "print(f\"\\nThis represents a {(pvalues_binomial < 0.05).mean() / 0.05:.1f}x inflation!\")" ] }, { "cell_type": "markdown", "id": "section2-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 2: The Beta-Binomial Distribution\n", "\n", "### 2.1 Mathematical Foundation\n", "\n", "The **beta-binomial distribution** extends the binomial by allowing the success probability $p$ to vary according to a Beta distribution:\n", "\n", "$$p \\sim \\text{Beta}(\\alpha, \\beta)$$\n", "$$X | p \\sim \\text{Binomial}(N, p)$$\n", "\n", "The marginal distribution of $X$ is the beta-binomial.\n", "\n", "**WASP2's parameterization** uses mean ($\\mu$) and dispersion ($\\rho$):\n", "\n", "$$\\alpha = \\mu \\cdot \\frac{1 - \\rho}{\\rho}, \\quad \\beta = (1 - \\mu) \\cdot \\frac{1 - \\rho}{\\rho}$$\n", "\n", "**Key properties:**\n", "- Mean: $E[X] = N \\cdot \\mu$ (same as binomial)\n", "- Variance: $\\text{Var}(X) = N \\cdot \\mu \\cdot (1-\\mu) \\cdot [1 + (N-1) \\cdot \\rho]$ (inflated!)\n", "\n", "When $\\rho \\to 0$, the beta-binomial converges to the binomial." 
] }, { "cell_type": "code", "execution_count": null, "id": "betabinom-params", "metadata": {}, "outputs": [], "source": [ "def mu_rho_to_alpha_beta(\n", " mu: float, \n", " rho: float\n", ") -> Tuple[float, float]:\n", " \"\"\"\n", " Convert WASP2's (mu, rho) parameterization to (alpha, beta).\n", " \n", " This mirrors WASP2's parameterization in src/analysis/as_analysis.py:104-105.\n", " \n", " Args:\n", " mu: Mean parameter (allele frequency), in (0, 1)\n", " rho: Dispersion parameter, in (0, 1)\n", " \n", " Returns:\n", " Tuple of (alpha, beta) parameters for scipy.stats.betabinom\n", " \n", " Warning:\n", " When rho is near 0 or 1, numerical instability can occur.\n", " Use clamp_rho() to ensure safe values.\n", " \"\"\"\n", " # Apply WASP2's clamping to prevent numerical issues\n", " rho = clamp_rho(rho)\n", " \n", " # Validate mu\n", " if not 0 < mu < 1:\n", " raise ValueError(f\"mu must be in (0, 1), got {mu}\")\n", " \n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " return alpha, beta\n", "\n", "\n", "def betabinom_variance(n: int, mu: float, rho: float) -> float:\n", " \"\"\"\n", " Compute beta-binomial variance.\n", " \n", " Args:\n", " n: Number of trials (total read count)\n", " mu: Mean parameter (allele frequency)\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Variance of the beta-binomial distribution\n", " \"\"\"\n", " return n * mu * (1 - mu) * (1 + (n - 1) * rho)\n", "\n", "\n", "# Demonstrate variance inflation\n", "N = 50\n", "mu = 0.5\n", "binomial_var = N * mu * (1 - mu)\n", "\n", "print(f\"Variance comparison (N={N}, mu={mu}):\\n\")\n", "print(f\"Binomial variance: {binomial_var:.2f}\")\n", "print(\"\\nBeta-binomial variance by rho:\")\n", "for rho in [0.001, 0.01, 0.05, 0.10, 0.20]:\n", " bb_var = betabinom_variance(N, mu, rho)\n", " inflation = bb_var / binomial_var\n", " print(f\" rho={rho:.3f}: {bb_var:.2f} ({inflation:.2f}x inflation)\")" ] }, { "cell_type": "markdown", "id": "edge-cases-header", "metadata": {}, "source": [ "### 2.1.1 Edge Cases and Numerical Stability\n", "\n", "**Warning:** The beta-binomial parameterization has dangerous edge cases that can cause numerical errors. This section demonstrates why WASP2's `clamp_rho()` function is essential." ] }, { "cell_type": "code", "execution_count": null, "id": "edge-cases-demo", "metadata": {}, "outputs": [], "source": [ "# Demonstrate edge cases that clamp_rho() prevents\n", "print(\"Edge Cases in Beta-Binomial Parameterization\")\n", "print(\"=\" * 50)\n", "\n", "def unsafe_mu_rho_to_alpha_beta(mu, rho):\n", " \"\"\"Unsafe version WITHOUT clamping - for demonstration only.\"\"\"\n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " return alpha, beta\n", "\n", "# Case 1: rho = 0 causes division by zero\n", "print(\"\\n1. rho = 0 (division by zero):\")\n", "try:\n", " with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\")\n", " alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 0.0)\n", " print(f\" alpha = {alpha}, beta = {beta}\")\n", " print(\" Result: inf values - causes downstream errors!\")\n", "except Exception as e:\n", " print(f\" Error: {e}\")\n", "\n", "# Case 2: rho = 1 produces zero alpha/beta\n", "print(\"\\n2. rho = 1 (zero parameters):\")\n", "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1.0)\n", "print(f\" alpha = {alpha}, beta = {beta}\")\n", "print(\" Result: zero values - invalid for Beta distribution!\")\n", "\n", "# Case 3: Very small rho produces huge alpha/beta\n", "print(\"\\n3. 
rho = 1e-15 (numerical overflow risk):\")\n", "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1e-15)\n", "print(f\" alpha = {alpha:.2e}, beta = {beta:.2e}\")\n", "print(\" Result: huge values - may overflow in gamma functions!\")\n", "\n", "# Safe version with clamping\n", "print(\"\\n\" + \"=\" * 50)\n", "print(\"With WASP2's clamp_rho():\")\n", "print(\"=\" * 50)\n", "\n", "for rho_input in [0.0, 1.0, 1e-15]:\n", " rho_safe = clamp_rho(rho_input)\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho_safe)\n", " print(f\"\\nInput rho={rho_input} -> clamped to {rho_safe}\")\n", " print(f\" alpha = {alpha:.4f}, beta = {beta:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "betabinom-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize beta-binomial vs binomial PMFs\n", "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n", "\n", "N = 50\n", "x = np.arange(0, N + 1)\n", "\n", "# Panel 1: Different rho values (mu=0.5)\n", "ax = axes[0]\n", "ax.plot(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')\n", "for rho, color in [(0.02, 'blue'), (0.05, 'orange'), (0.10, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ax.plot(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2, \n", " color=color, label=f'rho={rho}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability')\n", "ax.set_title('Effect of Dispersion (mu=0.5)')\n", "ax.legend()\n", "\n", "# Panel 2: Different mu values (rho=0.05)\n", "ax = axes[1]\n", "rho = 0.05\n", "for mu, color in [(0.5, 'gray'), (0.6, 'blue'), (0.7, 'orange'), (0.8, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", " ax.plot(x, betabinom.pmf(x, N, alpha, beta), '-', lw=2, \n", " color=color, label=f'mu={mu}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability')\n", "ax.set_title(f'Effect of Imbalance (rho={rho})')\n", "ax.legend()\n", "\n", "# Panel 3: Log-scale tails comparison\n", "ax = axes[2]\n", "ax.semilogy(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')\n", "for rho, color in [(0.02, 'blue'), (0.10, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ax.semilogy(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2, \n", " color=color, label=f'rho={rho}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability (log scale)')\n", "ax.set_title('Tail Behavior (log scale)')\n", "ax.legend()\n", "ax.set_ylim(1e-15, 1)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Key insight: Beta-binomial has heavier tails than binomial,\")\n", "print(\"making extreme counts more likely under the null hypothesis.\")" ] }, { "cell_type": "markdown", "id": "lrt-header", "metadata": {}, "source": [ "### 2.2 Likelihood Ratio Test for Imbalance\n", "\n", "WASP2 uses a **likelihood ratio test (LRT)** to detect allelic imbalance:\n", "\n", "- **Null hypothesis ($H_0$):** $\\mu = 0.5$ (balanced)\n", "- **Alternative ($H_1$):** $\\mu \\neq 0.5$ (imbalanced)\n", "\n", "The test statistic is:\n", "\n", "$$\\Lambda = -2 \\left[ \\log L(H_0) - \\log L(H_1) \\right] \\sim \\chi^2_1$$\n", "\n", "This follows a chi-squared distribution with 1 degree of freedom under $H_0$.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:322-323`" ] }, { "cell_type": "code", "execution_count": null, "id": "lrt-implementation", "metadata": {}, "outputs": [], "source": [ "def betabinom_loglik(\n", " ref_count: int, \n", " total_count: int, \n", " mu: float, \n", " rho: float\n", ") -> float:\n", " \"\"\"\n", " Compute beta-binomial log-likelihood.\n", " \n", " 
Mirrors WASP2's src/analysis/as_analysis.py:81-112 (opt_prob function).\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " mu: Mean parameter (allele frequency)\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Log-likelihood value\n", " \"\"\"\n", " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", " return betabinom.logpmf(ref_count, total_count, alpha, beta)\n", "\n", "\n", "def find_mle_mu(\n", " ref_count: int, \n", " total_count: int, \n", " rho: float\n", ") -> float:\n", " \"\"\"\n", " Find MLE of mu given fixed rho.\n", " \n", " Mirrors WASP2's src/analysis/as_analysis.py:184-248 (parse_opt function).\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " rho: Fixed dispersion parameter\n", " \n", " Returns:\n", " Maximum likelihood estimate of mu\n", " \"\"\"\n", " def neg_loglik(mu):\n", " return -betabinom_loglik(ref_count, total_count, mu, rho)\n", " \n", " # Use bounded optimization with safe mu range\n", " result = minimize_scalar(\n", " neg_loglik, \n", " bounds=(0.001, 0.999), \n", " method='bounded'\n", " )\n", " return result.x\n", "\n", "\n", "def likelihood_ratio_test(\n", " ref_count: int, \n", " total_count: int, \n", " rho: float\n", ") -> Tuple[float, float, float]:\n", " \"\"\"\n", " Perform LRT for allelic imbalance.\n", " \n", " Mirrors WASP2's calculation in src/analysis/as_analysis.py:322-323.\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Tuple of (lrt_statistic, p_value, mle_mu)\n", " \"\"\"\n", " # Input validation\n", " if ref_count < 0 or ref_count > total_count:\n", " raise ValueError(f\"Invalid: ref_count={ref_count}, total_count={total_count}\")\n", " if total_count <= 0:\n", " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", " \n", " # Null likelihood (mu = 0.5)\n", " null_ll = betabinom_loglik(ref_count, total_count, 0.5, rho)\n", " \n", " # Alternative likelihood (mu = MLE)\n", " mle_mu = find_mle_mu(ref_count, total_count, rho)\n", " alt_ll = betabinom_loglik(ref_count, total_count, mle_mu, rho)\n", " \n", " # LRT statistic (matches WASP2: -2 * (null_ll - alt_ll))\n", " lrt = -2 * (null_ll - alt_ll)\n", " lrt = max(0, lrt) # Ensure non-negative due to numerical precision\n", " \n", " # P-value from chi-squared distribution (df=1)\n", " pvalue = chi2.sf(lrt, df=1)\n", " \n", " return lrt, pvalue, mle_mu\n", "\n", "\n", "# Example calculation\n", "ref, total = 35, 50 # Observed: 35 ref out of 50 total\n", "rho = 0.05\n", "\n", "lrt, pval, mu_hat = likelihood_ratio_test(ref, total, rho)\n", "\n", "print(f\"Example LRT calculation:\")\n", "print(f\" Observed: {ref} ref / {total} total (ratio = {ref/total:.2f})\")\n", "print(f\" Dispersion (rho): {rho}\")\n", "print(f\" MLE of mu: {mu_hat:.3f}\")\n", "print(f\" LRT statistic: {lrt:.3f}\")\n", "print(f\" P-value: {pval:.4f}\")" ] }, { "cell_type": "markdown", "id": "section3-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 3: Dispersion Estimation\n", "\n", "A critical step is estimating the dispersion parameter $\\rho$ from the data. WASP2 offers two approaches:\n", "\n", "1. **Single dispersion model** - One $\\rho$ for the entire dataset\n", "2. 
**Linear dispersion model** - $\\rho$ varies with read depth: $\\text{logit}(\\rho) = \\beta_0 + \\beta_1 \\cdot N$\n", "\n", "### 3.1 Maximum Likelihood Estimation (MLE)\n", "\n", "MLE finds the $\\rho$ that maximizes the likelihood under $H_0$ (balanced).\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:251-325` (single_model function)" ] }, { "cell_type": "code", "execution_count": null, "id": "mle-dispersion", "metadata": {}, "outputs": [], "source": [ "def estimate_dispersion_mle(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> float:\n", " \"\"\"\n", " Estimate single dispersion parameter via MLE.\n", " \n", " Mirrors WASP2's single_model in src/analysis/as_analysis.py:251-325.\n", " Assumes mu=0.5 (null hypothesis) for all observations.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " MLE estimate of dispersion parameter rho\n", " \n", " Raises:\n", " ValueError: If arrays have different lengths or invalid values\n", " \"\"\"\n", " # Input validation\n", " ref_counts = np.asarray(ref_counts)\n", " total_counts = np.asarray(total_counts)\n", " \n", " if len(ref_counts) != len(total_counts):\n", " raise ValueError(\"ref_counts and total_counts must have same length\")\n", " if len(ref_counts) == 0:\n", " raise ValueError(\"Arrays must not be empty\")\n", " if np.any(ref_counts < 0) or np.any(ref_counts > total_counts):\n", " raise ValueError(\"Invalid counts: ref_count must be in [0, total_count]\")\n", " \n", " def neg_loglik(rho):\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", " return -ll\n", " \n", " result = minimize_scalar(\n", " neg_loglik, \n", " bounds=(RHO_EPSILON, 0.5), # Use RHO_EPSILON as lower bound\n", " method='bounded'\n", " )\n", " return result.x\n", "\n", "\n", "# Estimate dispersion from our simulated data\n", "true_rho = 0.05\n", "test_data = overdispersed_data[true_rho]\n", "total_counts = np.full(len(test_data), read_depth)\n", "\n", "estimated_rho = estimate_dispersion_mle(test_data, total_counts)\n", "\n", "print(f\"Dispersion estimation (MLE):\")\n", "print(f\" True rho: {true_rho:.4f}\")\n", "print(f\" Estimated rho: {estimated_rho:.4f}\")\n", "print(f\" Relative error: {abs(estimated_rho - true_rho) / true_rho * 100:.1f}%\")" ] }, { "cell_type": "markdown", "id": "mom-header", "metadata": {}, "source": [ "### 3.2 Method of Moments (MoM)\n", "\n", "An alternative is the **Method of Moments**, which matches the observed variance to the expected beta-binomial variance:\n", "\n", "$$\\hat{\\rho}_{\\text{MoM}} = \\frac{\\text{Var}(X/N) - \\mu(1-\\mu)/\\bar{N}}{\\mu(1-\\mu)(1 - 1/\\bar{N})}$$\n", "\n", "MoM is faster but may be less efficient than MLE." 
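, "\n", "\n", "**Where it comes from** (a quick derivation from the variance formula in Section 2.1): dividing $\\text{Var}(X) = N\\mu(1-\\mu)[1 + (N-1)\\rho]$ by $N^2$ gives\n", "\n", "$$\\text{Var}(X/N) = \\frac{\\mu(1-\\mu)}{N} + \\mu(1-\\mu)\\left(1 - \\frac{1}{N}\\right)\\rho,$$\n", "\n", "and solving for $\\rho$ (using the average depth $\\bar{N}$ when depths vary) yields the estimator above."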
] }, { "cell_type": "code", "execution_count": null, "id": "mom-dispersion", "metadata": {}, "outputs": [], "source": [ "def estimate_dispersion_mom(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> float:\n", " \"\"\"\n", " Estimate dispersion using Method of Moments.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " MoM estimate of dispersion parameter rho\n", " \"\"\"\n", " # Input validation\n", " ref_counts = np.asarray(ref_counts)\n", " total_counts = np.asarray(total_counts)\n", " \n", " if len(ref_counts) != len(total_counts):\n", " raise ValueError(\"Arrays must have same length\")\n", " \n", " ratios = ref_counts / total_counts\n", " mu = 0.5 # Assume balanced under null\n", " \n", " # Observed variance of ratios\n", " obs_var = np.var(ratios)\n", " \n", " # Expected binomial variance (if rho=0)\n", " mean_n = np.mean(total_counts)\n", " binom_var = mu * (1 - mu) / mean_n\n", " \n", " # Solve for rho with safeguards\n", " numerator = obs_var - binom_var\n", " denominator = mu * (1 - mu) * (1 - 1/mean_n)\n", " \n", " # Handle edge case where denominator is near zero\n", " if abs(denominator) < 1e-10:\n", " return RHO_EPSILON\n", " \n", " rho_mom = numerator / denominator\n", " \n", " # Clamp to valid range using WASP2's constant\n", " return float(clamp_rho(rho_mom))\n", "\n", "\n", "# Compare MLE vs MoM\n", "rho_mom = estimate_dispersion_mom(test_data, total_counts)\n", "\n", "print(f\"Comparison of estimation methods:\")\n", "print(f\" True rho: {true_rho:.4f}\")\n", "print(f\" MLE estimate: {estimated_rho:.4f}\")\n", "print(f\" MoM estimate: {rho_mom:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "mle-mom-comparison", "metadata": {}, "outputs": [], "source": [ "# Systematic comparison across different true rho values\n", "comparison_results = []\n", "\n", "for true_rho in [0.01, 0.02, 0.05, 0.10, 0.15]:\n", " # Simulate data\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, true_rho)\n", " sim_data = betabinom.rvs(n=50, a=alpha, b=beta, size=1000)\n", " sim_totals = np.full(1000, 50)\n", " \n", " # Estimate with both methods\n", " mle_est = estimate_dispersion_mle(sim_data, sim_totals)\n", " mom_est = estimate_dispersion_mom(sim_data, sim_totals)\n", " \n", " comparison_results.append({\n", " 'true_rho': true_rho,\n", " 'mle_estimate': mle_est,\n", " 'mom_estimate': mom_est,\n", " 'mle_error': abs(mle_est - true_rho) / true_rho * 100,\n", " 'mom_error': abs(mom_est - true_rho) / true_rho * 100\n", " })\n", "\n", "comparison_df = pd.DataFrame(comparison_results)\n", "print(\"MLE vs MoM Comparison:\")\n", "print(comparison_df.round(4).to_string(index=False))" ] }, { "cell_type": "markdown", "id": "linear-model-header", "metadata": {}, "source": [ "### 3.3 Linear Dispersion Model\n", "\n", "In some datasets, dispersion varies with read depth. 
The **linear model** parameterizes:\n", "\n", "$$\\text{logit}(\\rho_i) = \\beta_0 + \\beta_1 \\cdot N_i$$\n", "\n", "This allows higher-depth regions to have different dispersion than lower-depth regions.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:53-78` (opt_linear function)" ] }, { "cell_type": "code", "execution_count": null, "id": "linear-model-demo", "metadata": {}, "outputs": [], "source": [ "def estimate_linear_dispersion(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> Tuple[float, float]:\n", " \"\"\"\n", " Estimate depth-dependent dispersion: logit(rho) = beta0 + beta1 * N\n", " \n", " Mirrors WASP2's opt_linear in src/analysis/as_analysis.py:53-78.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " Tuple of (beta0, beta1) parameters\n", " \"\"\"\n", " def neg_loglik(params):\n", " beta0, beta1 = params\n", " \n", " # Compute rho for each observation\n", " linear_pred = beta0 + beta1 * total_counts\n", " # Clip to prevent overflow (matches WASP2's approach)\n", " linear_pred = np.clip(linear_pred, -10, 10)\n", " rho = expit(linear_pred)\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " \n", " # Compute likelihood\n", " alpha = 0.5 * (1 - rho) / rho\n", " beta = 0.5 * (1 - rho) / rho\n", " \n", " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", " return -ll\n", " \n", " result = minimize(neg_loglik, x0=[-3, 0], method='Nelder-Mead')\n", " return tuple(result.x)\n", "\n", "\n", "# Simulate data with depth-dependent dispersion\n", "np.random.seed(42)\n", "n_snps = 2000\n", "depths = np.random.choice([20, 50, 100, 200], size=n_snps)\n", "\n", "# True model: logit(rho) = -3 + 0.01 * N\n", "true_beta0, true_beta1 = -3, 0.01\n", "true_rhos = expit(true_beta0 + true_beta1 * depths)\n", "\n", "# Generate counts with proper clamping\n", "ref_counts_linear = np.array([\n", " betabinom.rvs(\n", " n=n, \n", " a=0.5*(1-clamp_rho(rho))/clamp_rho(rho), \n", " b=0.5*(1-clamp_rho(rho))/clamp_rho(rho)\n", " )\n", " for n, rho in zip(depths, true_rhos)\n", "])\n", "\n", "# Estimate\n", "est_beta0, est_beta1 = estimate_linear_dispersion(ref_counts_linear, depths)\n", "\n", "print(\"Linear dispersion model estimation:\")\n", "print(f\" True: logit(rho) = {true_beta0:.2f} + {true_beta1:.4f} * N\")\n", "print(f\" Est: logit(rho) = {est_beta0:.2f} + {est_beta1:.4f} * N\")" ] }, { "cell_type": "code", "execution_count": null, "id": "linear-model-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize the linear dispersion model\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "# Panel 1: Rho vs depth\n", "ax = axes[0]\n", "depth_range = np.linspace(10, 250, 100)\n", "true_curve = expit(true_beta0 + true_beta1 * depth_range)\n", "est_curve = expit(est_beta0 + est_beta1 * depth_range)\n", "\n", "ax.plot(depth_range, true_curve, 'b-', lw=2, label='True')\n", "ax.plot(depth_range, est_curve, 'r--', lw=2, label='Estimated')\n", "ax.set_xlabel('Read Depth (N)')\n", "ax.set_ylabel('Dispersion (rho)')\n", "ax.set_title('Linear Dispersion Model')\n", "ax.legend()\n", "\n", "# Panel 2: Show effect on variance\n", "ax = axes[1]\n", "binom_var = depth_range * 0.5 * 0.5\n", "bb_var_true = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * true_curve)\n", "bb_var_const = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * 0.05)\n", "\n", "ax.plot(depth_range, binom_var, 'k-', lw=2, label='Binomial')\n", 
"ax.plot(depth_range, bb_var_const, 'g--', lw=2, label='Constant rho=0.05')\n", "ax.plot(depth_range, bb_var_true, 'b-', lw=2, label='Linear model')\n", "ax.set_xlabel('Read Depth (N)')\n", "ax.set_ylabel('Variance of Ref Count')\n", "ax.set_title('Variance Scaling with Depth')\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "section4-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 4: QQ Plots for Model Calibration\n", "\n", "**Quantile-Quantile (QQ) plots** compare observed p-values to their expected distribution under the null. If the model is well-calibrated:\n", "\n", "- P-values should be uniformly distributed under $H_0$\n", "- QQ plot should follow the diagonal line\n", "\n", "### 4.1 Constructing QQ Plots" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-function", "metadata": {}, "outputs": [], "source": [ "def qq_plot(\n", " pvalues: NDArray[np.float64], \n", " ax=None, \n", " label: str = 'Observed', \n", " color: str = 'steelblue'\n", "):\n", " \"\"\"\n", " Create a QQ plot of -log10(p-values).\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " ax: Matplotlib axes (created if None)\n", " label: Label for the scatter plot\n", " color: Color for the points\n", " \n", " Returns:\n", " Matplotlib axes object\n", " \"\"\"\n", " if ax is None:\n", " fig, ax = plt.subplots()\n", " \n", " # Input validation and cleaning\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " pvalues = pvalues[~np.isnan(pvalues)] # Remove NaN\n", " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)] # Valid p-values only\n", " \n", " if len(pvalues) == 0:\n", " raise ValueError(\"No valid p-values after filtering\")\n", " \n", " pvalues = np.sort(pvalues)\n", " \n", " # Expected p-values under uniform distribution\n", " n = len(pvalues)\n", " expected = (np.arange(1, n + 1) - 0.5) / n\n", " \n", " # Transform to -log10 scale with safe clipping\n", " obs_log = -np.log10(np.clip(pvalues, 1e-300, 1))\n", " exp_log = -np.log10(expected)\n", " \n", " # Plot\n", " ax.scatter(exp_log, obs_log, s=10, alpha=0.6, color=color, label=label)\n", " \n", " # Diagonal line\n", " max_val = max(exp_log.max(), obs_log.max())\n", " ax.plot([0, max_val], [0, max_val], 'r--', lw=1, label='Expected')\n", " \n", " ax.set_xlabel('Expected -log10(p)')\n", " ax.set_ylabel('Observed -log10(p)')\n", " \n", " return ax\n", "\n", "\n", "def genomic_inflation_factor(pvalues: NDArray[np.float64]) -> float:\n", " \"\"\"\n", " Calculate genomic inflation factor (lambda).\n", " \n", " Lambda = 1 indicates perfect calibration.\n", " Lambda > 1 indicates inflation (too many false positives).\n", " Lambda < 1 indicates deflation (too conservative).\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " \n", " Returns:\n", " Genomic inflation factor\n", " \"\"\"\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " pvalues = pvalues[~np.isnan(pvalues)]\n", " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]\n", " \n", " if len(pvalues) == 0:\n", " return np.nan\n", " \n", " chi2_stats = chi2.ppf(1 - pvalues, df=1)\n", " return np.median(chi2_stats) / chi2.ppf(0.5, df=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-comparison", "metadata": {}, "outputs": [], "source": [ "# Generate p-values using binomial vs beta-binomial tests on overdispersed data\n", "true_rho = 0.05\n", "test_data = overdispersed_data[true_rho]\n", "n_total = read_depth\n", "\n", "# Binomial test (wrong model)\n", "pvals_binomial = 
[stats.binomtest(int(k), n_total, 0.5).pvalue for k in test_data]\n", "\n", "# Beta-binomial test (correct model)\n", "est_rho = estimate_dispersion_mle(test_data, np.full(len(test_data), n_total))\n", "pvals_betabinom = [likelihood_ratio_test(int(k), n_total, est_rho)[1] for k in test_data]\n", "\n", "# Create QQ plots\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "ax = axes[0]\n", "qq_plot(np.array(pvals_binomial), ax=ax, label='Binomial', color='firebrick')\n", "lambda_binom = genomic_inflation_factor(np.array(pvals_binomial))\n", "ax.set_title(f'Binomial Test\\n(lambda = {lambda_binom:.2f})')\n", "ax.legend()\n", "\n", "ax = axes[1]\n", "qq_plot(np.array(pvals_betabinom), ax=ax, label='Beta-binomial', color='steelblue')\n", "lambda_bb = genomic_inflation_factor(np.array(pvals_betabinom))\n", "ax.set_title(f'Beta-Binomial Test\\n(lambda = {lambda_bb:.2f})')\n", "ax.legend()\n", "\n", "plt.suptitle(f'QQ Plots on Null Data (true rho = {true_rho})', fontsize=14, y=1.02)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(f\"Genomic inflation factor (lambda):\")\n", "print(f\" Binomial test: {lambda_binom:.3f} (expected: 1.0)\")\n", "print(f\" Beta-binomial: {lambda_bb:.3f} (expected: 1.0)\")\n", "print(f\"\\nA lambda >> 1 indicates systematic inflation (too many false positives).\")" ] }, { "cell_type": "markdown", "id": "qq-interpretation", "metadata": {}, "source": [ "### 4.2 Interpreting QQ Plots\n", "\n", "| Pattern | Interpretation |\n", "|---------|----------------|\n", "| Points on diagonal | Well-calibrated (good!) |\n", "| Points above diagonal | Inflation (too many small p-values) |\n", "| Points below diagonal | Deflation (test too conservative) |\n", "| Lift at the tail only | True signal mixed with null |\n", "\n", "The **genomic inflation factor (lambda)** quantifies deviation:\n", "- $\\lambda = 1$: Perfect calibration\n", "- $\\lambda > 1$: Inflation\n", "- $\\lambda < 1$: Deflation" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-with-signal", "metadata": {}, "outputs": [], "source": [ "# Demonstrate QQ plot with true signal\n", "np.random.seed(42)\n", "\n", "# Generate mixed data: 90% null + 10% truly imbalanced\n", "n_snps = 1000\n", "n_signal = 100\n", "rho = 0.05\n", "\n", "# Null data (mu = 0.5)\n", "alpha_null, beta_null = mu_rho_to_alpha_beta(0.5, rho)\n", "null_counts = betabinom.rvs(n=50, a=alpha_null, b=beta_null, size=n_snps - n_signal)\n", "\n", "# Signal data (mu = 0.7, strong imbalance)\n", "alpha_sig, beta_sig = mu_rho_to_alpha_beta(0.7, rho)\n", "signal_counts = betabinom.rvs(n=50, a=alpha_sig, b=beta_sig, size=n_signal)\n", "\n", "# Combine\n", "mixed_counts = np.concatenate([null_counts, signal_counts])\n", "is_signal = np.concatenate([np.zeros(n_snps - n_signal), np.ones(n_signal)]).astype(bool)\n", "\n", "# Test all SNPs\n", "est_rho = estimate_dispersion_mle(mixed_counts, np.full(len(mixed_counts), 50))\n", "mixed_pvals = np.array([likelihood_ratio_test(int(k), 50, est_rho)[1] for k in mixed_counts])\n", "\n", "# QQ plot\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "qq_plot(mixed_pvals, ax=ax)\n", "ax.set_title(f'QQ Plot with {n_signal} True Signals\\n(10% imbalanced, mu=0.7)')\n", "ax.legend()\n", "plt.show()\n", "\n", "# Check detection power\n", "detected = mixed_pvals < 0.05\n", "print(f\"Detection results (alpha = 0.05):\")\n", "print(f\" True positives: {detected[is_signal].sum()} / {n_signal}\")\n", "print(f\" False positives: {detected[~is_signal].sum()} / {n_snps - n_signal}\")" ] }, { "cell_type": 
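"markdown", "id": "lambda-mixed-note", "metadata": {}, "source": [ "An illustrative follow-up (not part of the WASP2 pipeline): the interpretation table above notes that a lift confined to the tail indicates true signal mixed with null. The sketch below checks this numerically by computing the median-based inflation factor $\\lambda$ on the mixed dataset, both overall and on the null-only subset, using the `genomic_inflation_factor()` helper defined earlier." ] }, { "cell_type": "code", "execution_count": null, "id": "lambda-mixed-check", "metadata": {}, "outputs": [], "source": [ "# Illustrative check (not from WASP2): lambda on the mixed dataset.\n", "# A sparse fraction of true signals lifts the QQ tail but should move the\n", "# median-based lambda far less than genuine miscalibration would.\n", "lambda_mixed = genomic_inflation_factor(mixed_pvals)\n", "lambda_null_only = genomic_inflation_factor(mixed_pvals[~is_signal])\n", "\n", "print(f\"Lambda, all {len(mixed_pvals)} tests (90% null + 10% signal): {lambda_mixed:.3f}\")\n", "print(f\"Lambda, null-only subset ({(~is_signal).sum()} tests): {lambda_null_only:.3f}\")\n", "\n", "print(\"\\nA lifted tail with a near-diagonal bulk points to sparse true signal,\")\n", "print(\"not miscalibration (see the interpretation table above).\")" ] }, { "cell_type": 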
"markdown", "id": "section5-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 5: False Discovery Rate (FDR) Correction\n", "\n", "When testing thousands of SNPs, multiple testing correction is essential. WASP2 uses **Benjamini-Hochberg (BH)** FDR correction via `scipy.stats.false_discovery_control()`.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:492` which uses `false_discovery_control(pvals, method=\"bh\")`\n", "\n", "### 5.1 The Multiple Testing Problem\n", "\n", "If we test 10,000 SNPs at $\\alpha = 0.05$, we expect 500 false positives even if there's no true signal!" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-motivation", "metadata": {}, "outputs": [], "source": [ "# Demonstrate multiple testing problem\n", "n_tests = 10000\n", "alpha = 0.05\n", "\n", "# All null (no true signal)\n", "null_pvalues = np.random.uniform(0, 1, n_tests)\n", "\n", "# Without correction\n", "false_positives_raw = (null_pvalues < alpha).sum()\n", "\n", "print(f\"Multiple testing with {n_tests} tests (no true signal):\")\n", "print(f\" Expected false positives at alpha={alpha}: {n_tests * alpha:.0f}\")\n", "print(f\" Observed false positives: {false_positives_raw}\")\n", "print(f\"\\nThis is why we need FDR correction!\")" ] }, { "cell_type": "markdown", "id": "bh-header", "metadata": {}, "source": [ "### 5.2 Benjamini-Hochberg Procedure\n", "\n", "The BH procedure controls the **false discovery rate** - the expected proportion of false positives among all discoveries.\n", "\n", "**Algorithm:**\n", "1. Sort p-values: $p_{(1)} \\leq p_{(2)} \\leq \\ldots \\leq p_{(m)}$\n", "2. Find largest $k$ such that $p_{(k)} \\leq \\frac{k}{m} \\cdot q$ where $q$ is target FDR\n", "3. Reject all hypotheses with $p_{(i)} \\leq p_{(k)}$\n", "\n", "**Note:** WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")` for this." ] }, { "cell_type": "code", "execution_count": null, "id": "bh-implementation", "metadata": {}, "outputs": [], "source": [ "def benjamini_hochberg(pvalues: NDArray[np.float64]) -> NDArray[np.float64]:\n", " \"\"\"\n", " Apply Benjamini-Hochberg FDR correction.\n", " \n", " Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method=\"bh\")\n", " which is equivalent. 
This implementation is for educational purposes.\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " \n", " Returns:\n", " Array of adjusted p-values (q-values)\n", " \"\"\"\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " n = len(pvalues)\n", " \n", " if n == 0:\n", " return np.array([])\n", " \n", " # Sort indices\n", " sorted_idx = np.argsort(pvalues)\n", " sorted_pvals = pvalues[sorted_idx]\n", " \n", " # BH adjustment: p_adj[i] = p[i] * n / rank[i]\n", " adjusted = sorted_pvals * n / np.arange(1, n + 1)\n", " \n", " # Ensure monotonicity (cumulative minimum from the end)\n", " adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]\n", " \n", " # Cap at 1\n", " adjusted = np.minimum(adjusted, 1)\n", " \n", " # Restore original order\n", " result = np.empty(n)\n", " result[sorted_idx] = adjusted\n", " \n", " return result\n", "\n", "\n", "# Compare our implementation with scipy's\n", "print(\"Comparing BH implementations:\")\n", "test_pvals = np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5])\n", "our_adjusted = benjamini_hochberg(test_pvals)\n", "scipy_adjusted = false_discovery_control(test_pvals, method='bh')\n", "\n", "print(f\" Raw p-values: {test_pvals}\")\n", "print(f\" Our BH adjusted: {our_adjusted.round(4)}\")\n", "print(f\" SciPy BH adjusted:{scipy_adjusted.round(4)}\")\n", "print(f\" Match: {np.allclose(our_adjusted, scipy_adjusted)}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-application", "metadata": {}, "outputs": [], "source": [ "# Apply FDR correction to mixed data\n", "# Using scipy's implementation (same as WASP2)\n", "fdr_pvals = false_discovery_control(mixed_pvals, method='bh')\n", "\n", "# Compare results\n", "print(\"FDR correction comparison (alpha/FDR = 0.05):\")\n", "print(f\"\\nRaw p-values:\")\n", "print(f\" Significant: {(mixed_pvals < 0.05).sum()}\")\n", "print(f\" True positives: {((mixed_pvals < 0.05) & is_signal).sum()}\")\n", "print(f\" False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}\")\n", "\n", "print(f\"\\nBH-adjusted p-values (scipy.stats.false_discovery_control):\")\n", "print(f\" Significant: {(fdr_pvals < 0.05).sum()}\")\n", "print(f\" True positives: {((fdr_pvals < 0.05) & is_signal).sum()}\")\n", "print(f\" False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}\")" ] }, { "cell_type": "markdown", "id": "fdr-alternatives", "metadata": {}, "source": [ "### 5.3 Alternative FDR Methods\n", "\n", "WASP2 primarily uses BH, but other options exist:\n", "\n", "| Method | Description | Use Case |\n", "|--------|-------------|----------|\n", "| **Benjamini-Hochberg (BH)** | Standard FDR control | Default choice |\n", "| **Benjamini-Yekutieli (BY)** | Controls FDR under dependency | Correlated tests |\n", "| **Storey's q-value** | Estimates proportion of true nulls | Large-scale testing |\n", "| **mid-p correction** | Less conservative binomial test | Discrete distributions |" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize FDR correction effect\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "# Panel 1: Raw vs adjusted p-values\n", "ax = axes[0]\n", "sorted_idx = np.argsort(mixed_pvals)\n", "ax.plot(mixed_pvals[sorted_idx], label='Raw p-value', lw=2)\n", "ax.plot(fdr_pvals[sorted_idx], label='BH-adjusted', lw=2)\n", "ax.axhline(0.05, color='red', linestyle='--', label='Threshold (0.05)')\n", "ax.set_xlabel('Rank')\n", "ax.set_ylabel('P-value')\n", "ax.set_title('Raw vs FDR-Adjusted P-values')\n", 
"ax.set_xlim(0, 200) # Focus on top hits\n", "ax.legend()\n", "\n", "# Panel 2: Number of discoveries at different thresholds\n", "ax = axes[1]\n", "thresholds = np.linspace(0.001, 0.2, 50)\n", "raw_discoveries = [(mixed_pvals < t).sum() for t in thresholds]\n", "fdr_discoveries = [(fdr_pvals < t).sum() for t in thresholds]\n", "\n", "ax.plot(thresholds, raw_discoveries, label='Raw p-value', lw=2)\n", "ax.plot(thresholds, fdr_discoveries, label='FDR-adjusted', lw=2)\n", "ax.axhline(n_signal, color='green', linestyle=':', label=f'True signals ({n_signal})')\n", "ax.axvline(0.05, color='red', linestyle='--', alpha=0.5)\n", "ax.set_xlabel('Threshold')\n", "ax.set_ylabel('Number of Discoveries')\n", "ax.set_title('Discoveries at Different Thresholds')\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "summary-header", "metadata": {}, "source": [ "---\n", "\n", "## Summary\n", "\n", "In this tutorial, you learned:\n", "\n", "### Key Concepts\n", "\n", "1. **Overdispersion** - Sequencing data shows more variance than the binomial model predicts\n", " - Sources: PCR bias, library prep effects, technical variability\n", " - Consequence: Binomial test has inflated false positive rate\n", "\n", "2. **Beta-binomial distribution** - Naturally handles overdispersion\n", " - Parameters: mean ($\\mu$) and dispersion ($\\rho$)\n", " - Variance: $N\\mu(1-\\mu)[1 + (N-1)\\rho]$\n", " - Reduces to binomial when $\\rho \\to 0$\n", " - **Critical:** Use `clamp_rho()` to prevent numerical instability at boundaries\n", "\n", "3. **Dispersion estimation**\n", " - **MLE**: More efficient, used by WASP2\n", " - **MoM**: Faster, closed-form solution\n", " - **Linear model**: Allows depth-dependent dispersion\n", "\n", "4. **QQ plots** - Diagnostic tool for model calibration\n", " - Points on diagonal = well-calibrated\n", " - Inflation factor ($\\lambda$) quantifies deviation\n", "\n", "5. **FDR correction** - Essential for multiple testing\n", " - Benjamini-Hochberg controls false discovery rate\n", " - WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")`\n", "\n", "### WASP2 Implementation Reference\n", "\n", "| Tutorial Function | WASP2 Source |\n", "|-------------------|-------------|\n", "| `clamp_rho()` | `src/analysis/as_analysis.py:36-50` |\n", "| `mu_rho_to_alpha_beta()` | `src/analysis/as_analysis.py:104-105` |\n", "| `betabinom_loglik()` | `src/analysis/as_analysis.py:81-112` (opt_prob) |\n", "| `likelihood_ratio_test()` | `src/analysis/as_analysis.py:322-323` |\n", "| `estimate_dispersion_mle()` | `src/analysis/as_analysis.py:251-325` (single_model) |\n", "| `estimate_linear_dispersion()` | `src/analysis/as_analysis.py:53-78` (opt_linear) |\n", "| FDR correction | `src/analysis/as_analysis.py:492` |\n", "\n", "### Further Reading\n", "\n", "- [Statistical Models Documentation](../methods/statistical_models.rst)\n", "- [Dispersion Estimation Methods](../methods/dispersion_estimation.rst)\n", "- Skelly et al. 
(2011) - Beta-binomial framework for allelic imbalance\n", "- Benjamini & Hochberg (1995) - Controlling false discovery rate" ] }, { "cell_type": "code", "execution_count": null, "id": "final-cell", "metadata": {}, "outputs": [], "source": [ "print(\"Tutorial complete!\")\n", "print(\"\\nYou now understand the statistical foundations of WASP2's\")\n", "print(\"allelic imbalance detection framework.\")\n", "print(\"\\nKey takeaways:\")\n", "print(f\" - Always use RHO_EPSILON ({RHO_EPSILON}) to prevent numerical instability\")\n", "print(\" - Beta-binomial handles overdispersion that binomial cannot\")\n", "print(\" - QQ plots diagnose model calibration\")\n", "print(\" - FDR correction is essential for genome-wide testing\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }