{ "cells": [ { "cell_type": "markdown", "id": "intro-header", "metadata": {}, "source": [ "# Statistical Methods Tutorial: Understanding WASP2's Beta-Binomial Framework\n", "\n", "**Estimated time:** ~45 minutes\n", "\n", "This interactive tutorial provides a deep dive into the statistical methods used by WASP2 for detecting allelic imbalance. You will learn:\n", "\n", "1. **Why overdispersion matters** - Why the simple binomial model fails for sequencing data\n", "2. **Beta-binomial distributions** - The statistical foundation of WASP2\n", "3. **Dispersion estimation** - MLE vs Method of Moments approaches\n", "4. **QQ plots** - Visualizing model fit and calibration\n", "5. **FDR correction** - Benjamini-Hochberg and alternatives\n", "\n", "## Prerequisites\n", "\n", "This tutorial assumes familiarity with:\n", "- Basic probability distributions (binomial)\n", "- Hypothesis testing concepts (p-values)\n", "- Python data analysis (numpy, pandas, matplotlib)\n", "\n", "No prior knowledge of beta-binomial distributions or overdispersion is required.\n", "\n", "## Relationship to WASP2 Source Code\n", "\n", "The functions in this tutorial mirror the implementations in:\n", "- `src/analysis/as_analysis.py` - Core statistical functions (`clamp_rho`, `opt_prob`, `opt_linear`)\n", "- `src/analysis/compare_ai.py` - Comparative analysis between groups\n", "- `src/analysis/as_analysis_sc.py` - Single-cell extensions" ] }, { "cell_type": "code", "execution_count": null, "id": "setup-imports", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "from scipy.stats import binom, betabinom, chi2, false_discovery_control\n", "from scipy.special import expit, logit\n", "from scipy.optimize import minimize_scalar, minimize\n", "from typing import Tuple, Union\n", "from numpy.typing import NDArray\n", "import warnings\n", "\n", "# Configure plotting\n", "plt.style.use('seaborn-v0_8-whitegrid')\n", "plt.rcParams['figure.figsize'] = (10, 6)\n", "plt.rcParams['font.size'] = 11\n", "np.random.seed(42)\n", "\n", "print(\"Statistical Methods Tutorial\")\n", "print(\"=\" * 40)\n", "print(f\"NumPy version: {np.__version__}\")\n", "print(f\"SciPy version: {stats.scipy.__version__ if hasattr(stats, 'scipy') else 'N/A'}\")" ] }, { "cell_type": "markdown", "id": "constants-header", "metadata": {}, "source": [ "## Critical Constants and Helper Functions\n", "\n", "WASP2 defines critical numerical constants to prevent division by zero and numerical overflow. These are defined in `src/analysis/as_analysis.py` (Issue #228).\n", "\n", "**Why this matters:** The beta-binomial parameterization uses $\\alpha = \\mu \\cdot \\frac{1-\\rho}{\\rho}$, which:\n", "- Causes **division by zero** when $\\rho = 0$\n", "- Produces **zero alpha/beta** when $\\rho = 1$\n", "\n", "WASP2 clamps $\\rho$ to $(\\epsilon, 1-\\epsilon)$ where $\\epsilon = 10^{-10}$." 
] }, { "cell_type": "code", "execution_count": null, "id": "constants-definition", "metadata": {}, "outputs": [], "source": [ "# =============================================================================\n", "# BETA-BINOMIAL RHO PARAMETER BOUNDS (Issue #228)\n", "# =============================================================================\n", "# Matches WASP2's src/analysis/as_analysis.py:33\n", "# =============================================================================\n", "\n", "RHO_EPSILON: float = 1e-10\n", "\n", "\n", "def clamp_rho(rho: Union[float, NDArray[np.float64]]) -> Union[float, NDArray[np.float64]]:\n", " \"\"\"\n", " Clamp dispersion parameter rho to safe range (epsilon, 1-epsilon).\n", " \n", " This function mirrors WASP2's src/analysis/as_analysis.py:36-50.\n", " \n", " The beta-binomial parameterization uses alpha = mu * (1-rho) / rho, which\n", " causes division by zero when rho=0 and produces zero alpha/beta when rho=1.\n", " This function prevents these boundary issues.\n", " \n", " Args:\n", " rho: Dispersion parameter (scalar or array), expected in [0, 1]\n", " \n", " Returns:\n", " Clamped rho in range (RHO_EPSILON, 1 - RHO_EPSILON)\n", " \n", " Example:\n", " >>> clamp_rho(0.0) # Would cause division by zero\n", " 1e-10\n", " >>> clamp_rho(1.0) # Would produce zero alpha/beta\n", " 0.9999999999\n", " \"\"\"\n", " return np.clip(rho, RHO_EPSILON, 1.0 - RHO_EPSILON)\n", "\n", "\n", "# Demonstrate the importance of clamping\n", "print(\"Demonstrating rho clamping (Issue #228):\")\n", "print(f\" RHO_EPSILON = {RHO_EPSILON}\")\n", "print(f\" clamp_rho(0.0) = {clamp_rho(0.0)}\")\n", "print(f\" clamp_rho(1.0) = {clamp_rho(1.0)}\")\n", "print(f\" clamp_rho(0.05) = {clamp_rho(0.05)} # No change needed\")" ] }, { "cell_type": "markdown", "id": "section1-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 1: Understanding Overdispersion\n", "\n", "### 1.1 The Naive Binomial Model\n", "\n", "When counting alleles at a heterozygous SNP, the simplest model assumes each read is an independent coin flip with probability 0.5 of coming from each allele:\n", "\n", "$$X \\sim \\text{Binomial}(N, p=0.5)$$\n", "\n", "where $X$ is the reference allele count and $N$ is the total read count.\n", "\n", "**Expected variance:** $\\text{Var}(X) = N \\cdot p \\cdot (1-p) = N/4$\n", "\n", "Let's simulate what this looks like:" ] }, { "cell_type": "code", "execution_count": null, "id": "binomial-simulation", "metadata": {}, "outputs": [], "source": [ "# Simulate ideal binomial data (balanced, no overdispersion)\n", "n_snps = 1000\n", "read_depth = 50 # Total reads per SNP\n", "\n", "# Input validation\n", "assert n_snps > 0, \"n_snps must be positive\"\n", "assert read_depth > 0, \"read_depth must be positive\"\n", "\n", "# Perfect binomial sampling\n", "ideal_ref_counts = np.random.binomial(n=read_depth, p=0.5, size=n_snps)\n", "ideal_ratios = ideal_ref_counts / read_depth\n", "\n", "# Calculate observed vs expected variance\n", "expected_var = read_depth * 0.5 * 0.5\n", "observed_var = np.var(ideal_ref_counts)\n", "\n", "print(f\"Binomial Model (N={read_depth}, p=0.5)\")\n", "print(f\" Expected variance: {expected_var:.2f}\")\n", "print(f\" Observed variance: {observed_var:.2f}\")\n", "print(f\" Ratio (observed/expected): {observed_var/expected_var:.3f}\")" ] }, { "cell_type": "markdown", "id": "overdispersion-intro", "metadata": {}, "source": [ "### 1.2 The Problem: Real Data Shows Overdispersion\n", "\n", "In real sequencing data, the observed variance is typically 
**larger** than expected from a binomial model. This is called **overdispersion**.\n", "\n", "**Sources of overdispersion in allele counting:**\n", "1. **PCR amplification bias** - Some fragments amplify better than others\n", "2. **Library preparation effects** - Batch effects during sample prep\n", "3. **Technical variability** - Across lanes, flowcells, and sequencing runs\n", "4. **Mapping bias** - Reference allele may map slightly better (even after WASP correction)\n", "\n", "Let's simulate data with overdispersion:" ] }, { "cell_type": "code", "execution_count": null, "id": "overdispersed-simulation", "metadata": {}, "outputs": [], "source": [ "def simulate_overdispersed(\n", " n_snps: int, \n", " read_depth: int, \n", " rho: float = 0.05\n", ") -> NDArray[np.int64]:\n", " \"\"\"\n", " Simulate overdispersed allele counts using beta-binomial.\n", " \n", " Args:\n", " n_snps: Number of SNPs to simulate\n", " read_depth: Total read depth per SNP\n", " rho: Dispersion parameter (0 = binomial, higher = more overdispersed)\n", " \n", " Returns:\n", " Array of reference allele counts\n", " \n", " Raises:\n", " ValueError: If parameters are out of valid range\n", " \"\"\"\n", " # Input validation\n", " if n_snps <= 0:\n", " raise ValueError(f\"n_snps must be positive, got {n_snps}\")\n", " if read_depth <= 0:\n", " raise ValueError(f\"read_depth must be positive, got {read_depth}\")\n", " if not 0 < rho < 1:\n", " raise ValueError(f\"rho must be in (0, 1), got {rho}\")\n", " \n", " mu = 0.5 # Balanced (no true imbalance)\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " \n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " \n", " return betabinom.rvs(n=read_depth, a=alpha, b=beta, size=n_snps)\n", "\n", "\n", "# Simulate with different dispersion levels\n", "rho_values = [0.001, 0.02, 0.05, 0.10]\n", "overdispersed_data = {rho: simulate_overdispersed(n_snps, read_depth, rho) \n", " for rho in rho_values}\n", "\n", "# Compare variances\n", "print(f\"Expected binomial variance: {expected_var:.2f}\")\n", "print(\"\\nObserved variances by dispersion (rho):\")\n", "for rho, data in overdispersed_data.items():\n", " obs_var = np.var(data)\n", " print(f\" rho={rho:.3f}: variance={obs_var:.2f} (ratio={obs_var/expected_var:.2f}x)\")" ] }, { "cell_type": "code", "execution_count": null, "id": "overdispersion-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize the effect of overdispersion\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", "\n", "for idx, (rho, data) in enumerate(overdispersed_data.items()):\n", " ax = axes.flat[idx]\n", " ratios = data / read_depth\n", " \n", " # Histogram of observed ratios\n", " ax.hist(ratios, bins=30, density=True, alpha=0.7, color='steelblue', \n", " edgecolor='black', label='Observed')\n", " \n", " # Overlay expected binomial distribution\n", " x = np.arange(0, read_depth + 1)\n", " binomial_pmf = binom.pmf(x, read_depth, 0.5)\n", " ax.plot(x/read_depth, binomial_pmf * read_depth, 'r-', lw=2, \n", " label='Expected (Binomial)')\n", " \n", " obs_var = np.var(data)\n", " ax.set_title(f'rho = {rho:.3f}\\nVariance ratio: {obs_var/expected_var:.2f}x', fontsize=11)\n", " ax.set_xlabel('Reference Allele Frequency')\n", " ax.set_ylabel('Density')\n", " ax.legend(fontsize=9)\n", " ax.set_xlim(0, 1)\n", "\n", "plt.suptitle('Effect of Overdispersion on Allele Ratio Distributions', fontsize=14, y=1.02)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\nKey observation: As rho increases, the distribution 
becomes wider\")\n", "print(\"than expected from binomial sampling alone.\")" ] }, { "cell_type": "markdown", "id": "consequences-header", "metadata": {}, "source": [ "### 1.3 Consequences of Ignoring Overdispersion\n", "\n", "If we use a binomial model on overdispersed data, we will:\n", "\n", "1. **Underestimate variance** → p-values too small\n", "2. **Inflate false positive rate** → Many spurious \"significant\" results\n", "3. **Poor calibration** → QQ plots show massive inflation\n", "\n", "Let's demonstrate this problem:" ] }, { "cell_type": "code", "execution_count": null, "id": "false-positive-demo", "metadata": {}, "outputs": [], "source": [ "def binomial_test_pvalue(ref_count: int, total_count: int) -> float:\n", " \"\"\"\n", " Two-sided binomial test for deviation from 0.5.\n", " \n", " Args:\n", " ref_count: Number of reference allele reads\n", " total_count: Total number of reads\n", " \n", " Returns:\n", " Two-sided p-value\n", " \"\"\"\n", " # Input validation\n", " if ref_count < 0 or ref_count > total_count:\n", " raise ValueError(f\"Invalid counts: ref={ref_count}, total={total_count}\")\n", " if total_count <= 0:\n", " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", " \n", " result = stats.binomtest(ref_count, total_count, p=0.5, alternative='two-sided')\n", " return result.pvalue\n", "\n", "\n", "# Test on overdispersed data (rho=0.05) - no TRUE imbalance!\n", "test_data = overdispersed_data[0.05]\n", "pvalues_binomial = [binomial_test_pvalue(int(k), read_depth) for k in test_data]\n", "\n", "# Count \"significant\" results at different thresholds\n", "pvalues_binomial = np.array(pvalues_binomial)\n", "\n", "print(\"False positive rates using binomial test on overdispersed data:\")\n", "print(\"(Remember: there is NO true imbalance in this data!)\\n\")\n", "for alpha in [0.05, 0.01, 0.001]:\n", " fp_rate = (pvalues_binomial < alpha).mean()\n", " print(f\" Alpha = {alpha:.3f}: {fp_rate*100:.1f}% significant (expected: {alpha*100:.1f}%)\")\n", "\n", "print(f\"\\nThis represents a {(pvalues_binomial < 0.05).mean() / 0.05:.1f}x inflation!\")" ] }, { "cell_type": "markdown", "id": "section2-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 2: The Beta-Binomial Distribution\n", "\n", "### 2.1 Mathematical Foundation\n", "\n", "The **beta-binomial distribution** extends the binomial by allowing the success probability $p$ to vary according to a Beta distribution:\n", "\n", "$$p \\sim \\text{Beta}(\\alpha, \\beta)$$\n", "$$X | p \\sim \\text{Binomial}(N, p)$$\n", "\n", "The marginal distribution of $X$ is the beta-binomial.\n", "\n", "**WASP2's parameterization** uses mean ($\\mu$) and dispersion ($\\rho$):\n", "\n", "$$\\alpha = \\mu \\cdot \\frac{1 - \\rho}{\\rho}, \\quad \\beta = (1 - \\mu) \\cdot \\frac{1 - \\rho}{\\rho}$$\n", "\n", "**Key properties:**\n", "- Mean: $E[X] = N \\cdot \\mu$ (same as binomial)\n", "- Variance: $\\text{Var}(X) = N \\cdot \\mu \\cdot (1-\\mu) \\cdot [1 + (N-1) \\cdot \\rho]$ (inflated!)\n", "\n", "When $\\rho \\to 0$, the beta-binomial converges to the binomial." 
] }, { "cell_type": "code", "execution_count": null, "id": "betabinom-params", "metadata": {}, "outputs": [], "source": [ "def mu_rho_to_alpha_beta(\n", " mu: float, \n", " rho: float\n", ") -> Tuple[float, float]:\n", " \"\"\"\n", " Convert WASP2's (mu, rho) parameterization to (alpha, beta).\n", " \n", " This mirrors WASP2's parameterization in src/analysis/as_analysis.py:104-105.\n", " \n", " Args:\n", " mu: Mean parameter (allele frequency), in (0, 1)\n", " rho: Dispersion parameter, in (0, 1)\n", " \n", " Returns:\n", " Tuple of (alpha, beta) parameters for scipy.stats.betabinom\n", " \n", " Warning:\n", " When rho is near 0 or 1, numerical instability can occur.\n", " Use clamp_rho() to ensure safe values.\n", " \"\"\"\n", " # Apply WASP2's clamping to prevent numerical issues\n", " rho = clamp_rho(rho)\n", " \n", " # Validate mu\n", " if not 0 < mu < 1:\n", " raise ValueError(f\"mu must be in (0, 1), got {mu}\")\n", " \n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " return alpha, beta\n", "\n", "\n", "def betabinom_variance(n: int, mu: float, rho: float) -> float:\n", " \"\"\"\n", " Compute beta-binomial variance.\n", " \n", " Args:\n", " n: Number of trials (total read count)\n", " mu: Mean parameter (allele frequency)\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Variance of the beta-binomial distribution\n", " \"\"\"\n", " return n * mu * (1 - mu) * (1 + (n - 1) * rho)\n", "\n", "\n", "# Demonstrate variance inflation\n", "N = 50\n", "mu = 0.5\n", "binomial_var = N * mu * (1 - mu)\n", "\n", "print(f\"Variance comparison (N={N}, mu={mu}):\\n\")\n", "print(f\"Binomial variance: {binomial_var:.2f}\")\n", "print(\"\\nBeta-binomial variance by rho:\")\n", "for rho in [0.001, 0.01, 0.05, 0.10, 0.20]:\n", " bb_var = betabinom_variance(N, mu, rho)\n", " inflation = bb_var / binomial_var\n", " print(f\" rho={rho:.3f}: {bb_var:.2f} ({inflation:.2f}x inflation)\")" ] }, { "cell_type": "markdown", "id": "edge-cases-header", "metadata": {}, "source": [ "### 2.1.1 Edge Cases and Numerical Stability\n", "\n", "**Warning:** The beta-binomial parameterization has dangerous edge cases that can cause numerical errors. This section demonstrates why WASP2's `clamp_rho()` function is essential." ] }, { "cell_type": "code", "execution_count": null, "id": "edge-cases-demo", "metadata": {}, "outputs": [], "source": [ "# Demonstrate edge cases that clamp_rho() prevents\n", "print(\"Edge Cases in Beta-Binomial Parameterization\")\n", "print(\"=\" * 50)\n", "\n", "def unsafe_mu_rho_to_alpha_beta(mu, rho):\n", " \"\"\"Unsafe version WITHOUT clamping - for demonstration only.\"\"\"\n", " alpha = mu * (1 - rho) / rho\n", " beta = (1 - mu) * (1 - rho) / rho\n", " return alpha, beta\n", "\n", "# Case 1: rho = 0 causes division by zero\n", "print(\"\\n1. rho = 0 (division by zero):\")\n", "try:\n", " with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\")\n", " alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 0.0)\n", " print(f\" alpha = {alpha}, beta = {beta}\")\n", " print(\" Result: inf values - causes downstream errors!\")\n", "except Exception as e:\n", " print(f\" Error: {e}\")\n", "\n", "# Case 2: rho = 1 produces zero alpha/beta\n", "print(\"\\n2. rho = 1 (zero parameters):\")\n", "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1.0)\n", "print(f\" alpha = {alpha}, beta = {beta}\")\n", "print(\" Result: zero values - invalid for Beta distribution!\")\n", "\n", "# Case 3: Very small rho produces huge alpha/beta\n", "print(\"\\n3. 
rho = 1e-15 (numerical overflow risk):\")\n", "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1e-15)\n", "print(f\" alpha = {alpha:.2e}, beta = {beta:.2e}\")\n", "print(\" Result: huge values - may overflow in gamma functions!\")\n", "\n", "# Safe version with clamping\n", "print(\"\\n\" + \"=\" * 50)\n", "print(\"With WASP2's clamp_rho():\")\n", "print(\"=\" * 50)\n", "\n", "for rho_input in [0.0, 1.0, 1e-15]:\n", " rho_safe = clamp_rho(rho_input)\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho_safe)\n", " print(f\"\\nInput rho={rho_input} -> clamped to {rho_safe}\")\n", " print(f\" alpha = {alpha:.4f}, beta = {beta:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "betabinom-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize beta-binomial vs binomial PMFs\n", "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n", "\n", "N = 50\n", "x = np.arange(0, N + 1)\n", "\n", "# Panel 1: Different rho values (mu=0.5)\n", "ax = axes[0]\n", "ax.plot(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')\n", "for rho, color in [(0.02, 'blue'), (0.05, 'orange'), (0.10, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ax.plot(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2, \n", " color=color, label=f'rho={rho}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability')\n", "ax.set_title('Effect of Dispersion (mu=0.5)')\n", "ax.legend()\n", "\n", "# Panel 2: Different mu values (rho=0.05)\n", "ax = axes[1]\n", "rho = 0.05\n", "for mu, color in [(0.5, 'gray'), (0.6, 'blue'), (0.7, 'orange'), (0.8, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", " ax.plot(x, betabinom.pmf(x, N, alpha, beta), '-', lw=2, \n", " color=color, label=f'mu={mu}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability')\n", "ax.set_title(f'Effect of Imbalance (rho={rho})')\n", "ax.legend()\n", "\n", "# Panel 3: Log-scale tails comparison\n", "ax = axes[2]\n", "ax.semilogy(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')\n", "for rho, color in [(0.02, 'blue'), (0.10, 'red')]:\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ax.semilogy(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2, \n", " color=color, label=f'rho={rho}')\n", "ax.set_xlabel('Reference Count')\n", "ax.set_ylabel('Probability (log scale)')\n", "ax.set_title('Tail Behavior (log scale)')\n", "ax.legend()\n", "ax.set_ylim(1e-15, 1)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Key insight: Beta-binomial has heavier tails than binomial,\")\n", "print(\"making extreme counts more likely under the null hypothesis.\")" ] }, { "cell_type": "markdown", "id": "lrt-header", "metadata": {}, "source": [ "### 2.2 Likelihood Ratio Test for Imbalance\n", "\n", "WASP2 uses a **likelihood ratio test (LRT)** to detect allelic imbalance:\n", "\n", "- **Null hypothesis ($H_0$):** $\\mu = 0.5$ (balanced)\n", "- **Alternative ($H_1$):** $\\mu \\neq 0.5$ (imbalanced)\n", "\n", "The test statistic is:\n", "\n", "$$\\Lambda = -2 \\left[ \\log L(H_0) - \\log L(H_1) \\right] \\sim \\chi^2_1$$\n", "\n", "This follows a chi-squared distribution with 1 degree of freedom under $H_0$.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:322-323`" ] }, { "cell_type": "code", "execution_count": null, "id": "lrt-implementation", "metadata": {}, "outputs": [], "source": [ "def betabinom_loglik(\n", " ref_count: int, \n", " total_count: int, \n", " mu: float, \n", " rho: float\n", ") -> float:\n", " \"\"\"\n", " Compute beta-binomial log-likelihood.\n", " \n", " 
Mirrors WASP2's src/analysis/as_analysis.py:81-112 (opt_prob function).\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " mu: Mean parameter (allele frequency)\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Log-likelihood value\n", " \"\"\"\n", " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", " return betabinom.logpmf(ref_count, total_count, alpha, beta)\n", "\n", "\n", "def find_mle_mu(\n", " ref_count: int, \n", " total_count: int, \n", " rho: float\n", ") -> float:\n", " \"\"\"\n", " Find MLE of mu given fixed rho.\n", " \n", " Mirrors WASP2's src/analysis/as_analysis.py:184-248 (parse_opt function).\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " rho: Fixed dispersion parameter\n", " \n", " Returns:\n", " Maximum likelihood estimate of mu\n", " \"\"\"\n", " def neg_loglik(mu):\n", " return -betabinom_loglik(ref_count, total_count, mu, rho)\n", " \n", " # Use bounded optimization with safe mu range\n", " result = minimize_scalar(\n", " neg_loglik, \n", " bounds=(0.001, 0.999), \n", " method='bounded'\n", " )\n", " return result.x\n", "\n", "\n", "def likelihood_ratio_test(\n", " ref_count: int, \n", " total_count: int, \n", " rho: float\n", ") -> Tuple[float, float, float]:\n", " \"\"\"\n", " Perform LRT for allelic imbalance.\n", " \n", " Mirrors WASP2's calculation in src/analysis/as_analysis.py:322-323.\n", " \n", " Args:\n", " ref_count: Reference allele count\n", " total_count: Total read count\n", " rho: Dispersion parameter\n", " \n", " Returns:\n", " Tuple of (lrt_statistic, p_value, mle_mu)\n", " \"\"\"\n", " # Input validation\n", " if ref_count < 0 or ref_count > total_count:\n", " raise ValueError(f\"Invalid: ref_count={ref_count}, total_count={total_count}\")\n", " if total_count <= 0:\n", " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", " \n", " # Null likelihood (mu = 0.5)\n", " null_ll = betabinom_loglik(ref_count, total_count, 0.5, rho)\n", " \n", " # Alternative likelihood (mu = MLE)\n", " mle_mu = find_mle_mu(ref_count, total_count, rho)\n", " alt_ll = betabinom_loglik(ref_count, total_count, mle_mu, rho)\n", " \n", " # LRT statistic (matches WASP2: -2 * (null_ll - alt_ll))\n", " lrt = -2 * (null_ll - alt_ll)\n", " lrt = max(0, lrt) # Ensure non-negative due to numerical precision\n", " \n", " # P-value from chi-squared distribution (df=1)\n", " pvalue = chi2.sf(lrt, df=1)\n", " \n", " return lrt, pvalue, mle_mu\n", "\n", "\n", "# Example calculation\n", "ref, total = 35, 50 # Observed: 35 ref out of 50 total\n", "rho = 0.05\n", "\n", "lrt, pval, mu_hat = likelihood_ratio_test(ref, total, rho)\n", "\n", "print(f\"Example LRT calculation:\")\n", "print(f\" Observed: {ref} ref / {total} total (ratio = {ref/total:.2f})\")\n", "print(f\" Dispersion (rho): {rho}\")\n", "print(f\" MLE of mu: {mu_hat:.3f}\")\n", "print(f\" LRT statistic: {lrt:.3f}\")\n", "print(f\" P-value: {pval:.4f}\")" ] }, { "cell_type": "markdown", "id": "section3-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 3: Dispersion Estimation\n", "\n", "A critical step is estimating the dispersion parameter $\\rho$ from the data. WASP2 offers two approaches:\n", "\n", "1. **Single dispersion model** - One $\\rho$ for the entire dataset\n", "2. 
**Linear dispersion model** - $\\rho$ varies with read depth: $\\text{logit}(\\rho) = \\beta_0 + \\beta_1 \\cdot N$\n", "\n", "### 3.1 Maximum Likelihood Estimation (MLE)\n", "\n", "MLE finds the $\\rho$ that maximizes the likelihood under $H_0$ (balanced).\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:251-325` (single_model function)" ] }, { "cell_type": "code", "execution_count": null, "id": "mle-dispersion", "metadata": {}, "outputs": [], "source": [ "def estimate_dispersion_mle(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> float:\n", " \"\"\"\n", " Estimate single dispersion parameter via MLE.\n", " \n", " Mirrors WASP2's single_model in src/analysis/as_analysis.py:251-325.\n", " Assumes mu=0.5 (null hypothesis) for all observations.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " MLE estimate of dispersion parameter rho\n", " \n", " Raises:\n", " ValueError: If arrays have different lengths or invalid values\n", " \"\"\"\n", " # Input validation\n", " ref_counts = np.asarray(ref_counts)\n", " total_counts = np.asarray(total_counts)\n", " \n", " if len(ref_counts) != len(total_counts):\n", " raise ValueError(\"ref_counts and total_counts must have same length\")\n", " if len(ref_counts) == 0:\n", " raise ValueError(\"Arrays must not be empty\")\n", " if np.any(ref_counts < 0) or np.any(ref_counts > total_counts):\n", " raise ValueError(\"Invalid counts: ref_count must be in [0, total_count]\")\n", " \n", " def neg_loglik(rho):\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", " return -ll\n", " \n", " result = minimize_scalar(\n", " neg_loglik, \n", " bounds=(RHO_EPSILON, 0.5), # Use RHO_EPSILON as lower bound\n", " method='bounded'\n", " )\n", " return result.x\n", "\n", "\n", "# Estimate dispersion from our simulated data\n", "true_rho = 0.05\n", "test_data = overdispersed_data[true_rho]\n", "total_counts = np.full(len(test_data), read_depth)\n", "\n", "estimated_rho = estimate_dispersion_mle(test_data, total_counts)\n", "\n", "print(f\"Dispersion estimation (MLE):\")\n", "print(f\" True rho: {true_rho:.4f}\")\n", "print(f\" Estimated rho: {estimated_rho:.4f}\")\n", "print(f\" Relative error: {abs(estimated_rho - true_rho) / true_rho * 100:.1f}%\")" ] }, { "cell_type": "markdown", "id": "mom-header", "metadata": {}, "source": [ "### 3.2 Method of Moments (MoM)\n", "\n", "An alternative is the **Method of Moments**, which matches the observed variance to the expected beta-binomial variance:\n", "\n", "$$\\hat{\\rho}_{\\text{MoM}} = \\frac{\\text{Var}(X/N) - \\mu(1-\\mu)/\\bar{N}}{\\mu(1-\\mu)(1 - 1/\\bar{N})}$$\n", "\n", "MoM is faster but may be less efficient than MLE." 
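, "\n", "\n", "**Where it comes from** (a quick derivation from the variance formula in Section 2.1): dividing $\\text{Var}(X) = N\\mu(1-\\mu)[1 + (N-1)\\rho]$ by $N^2$ gives\n", "\n", "$$\\text{Var}(X/N) = \\frac{\\mu(1-\\mu)}{N} + \\mu(1-\\mu)\\left(1 - \\frac{1}{N}\\right)\\rho,$$\n", "\n", "and solving for $\\rho$ (using the average depth $\\bar{N}$ when depths vary) yields the estimator above."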
] }, { "cell_type": "code", "execution_count": null, "id": "mom-dispersion", "metadata": {}, "outputs": [], "source": [ "def estimate_dispersion_mom(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> float:\n", " \"\"\"\n", " Estimate dispersion using Method of Moments.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " MoM estimate of dispersion parameter rho\n", " \"\"\"\n", " # Input validation\n", " ref_counts = np.asarray(ref_counts)\n", " total_counts = np.asarray(total_counts)\n", " \n", " if len(ref_counts) != len(total_counts):\n", " raise ValueError(\"Arrays must have same length\")\n", " \n", " ratios = ref_counts / total_counts\n", " mu = 0.5 # Assume balanced under null\n", " \n", " # Observed variance of ratios\n", " obs_var = np.var(ratios)\n", " \n", " # Expected binomial variance (if rho=0)\n", " mean_n = np.mean(total_counts)\n", " binom_var = mu * (1 - mu) / mean_n\n", " \n", " # Solve for rho with safeguards\n", " numerator = obs_var - binom_var\n", " denominator = mu * (1 - mu) * (1 - 1/mean_n)\n", " \n", " # Handle edge case where denominator is near zero\n", " if abs(denominator) < 1e-10:\n", " return RHO_EPSILON\n", " \n", " rho_mom = numerator / denominator\n", " \n", " # Clamp to valid range using WASP2's constant\n", " return float(clamp_rho(rho_mom))\n", "\n", "\n", "# Compare MLE vs MoM\n", "rho_mom = estimate_dispersion_mom(test_data, total_counts)\n", "\n", "print(f\"Comparison of estimation methods:\")\n", "print(f\" True rho: {true_rho:.4f}\")\n", "print(f\" MLE estimate: {estimated_rho:.4f}\")\n", "print(f\" MoM estimate: {rho_mom:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "mle-mom-comparison", "metadata": {}, "outputs": [], "source": [ "# Systematic comparison across different true rho values\n", "comparison_results = []\n", "\n", "for true_rho in [0.01, 0.02, 0.05, 0.10, 0.15]:\n", " # Simulate data\n", " alpha, beta = mu_rho_to_alpha_beta(0.5, true_rho)\n", " sim_data = betabinom.rvs(n=50, a=alpha, b=beta, size=1000)\n", " sim_totals = np.full(1000, 50)\n", " \n", " # Estimate with both methods\n", " mle_est = estimate_dispersion_mle(sim_data, sim_totals)\n", " mom_est = estimate_dispersion_mom(sim_data, sim_totals)\n", " \n", " comparison_results.append({\n", " 'true_rho': true_rho,\n", " 'mle_estimate': mle_est,\n", " 'mom_estimate': mom_est,\n", " 'mle_error': abs(mle_est - true_rho) / true_rho * 100,\n", " 'mom_error': abs(mom_est - true_rho) / true_rho * 100\n", " })\n", "\n", "comparison_df = pd.DataFrame(comparison_results)\n", "print(\"MLE vs MoM Comparison:\")\n", "print(comparison_df.round(4).to_string(index=False))" ] }, { "cell_type": "markdown", "id": "linear-model-header", "metadata": {}, "source": [ "### 3.3 Linear Dispersion Model\n", "\n", "In some datasets, dispersion varies with read depth. 
The **linear model** parameterizes:\n", "\n", "$$\\text{logit}(\\rho_i) = \\beta_0 + \\beta_1 \\cdot N_i$$\n", "\n", "This allows higher-depth regions to have different dispersion than lower-depth regions.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:53-78` (opt_linear function)" ] }, { "cell_type": "code", "execution_count": null, "id": "linear-model-demo", "metadata": {}, "outputs": [], "source": [ "def estimate_linear_dispersion(\n", " ref_counts: NDArray[np.int64], \n", " total_counts: NDArray[np.int64]\n", ") -> Tuple[float, float]:\n", " \"\"\"\n", " Estimate depth-dependent dispersion: logit(rho) = beta0 + beta1 * N\n", " \n", " Mirrors WASP2's opt_linear in src/analysis/as_analysis.py:53-78.\n", " \n", " Args:\n", " ref_counts: Array of reference allele counts\n", " total_counts: Array of total read counts\n", " \n", " Returns:\n", " Tuple of (beta0, beta1) parameters\n", " \"\"\"\n", " def neg_loglik(params):\n", " beta0, beta1 = params\n", " \n", " # Compute rho for each observation\n", " linear_pred = beta0 + beta1 * total_counts\n", " # Clip to prevent overflow (matches WASP2's approach)\n", " linear_pred = np.clip(linear_pred, -10, 10)\n", " rho = expit(linear_pred)\n", " rho = clamp_rho(rho) # Apply WASP2's clamping\n", " \n", " # Compute likelihood\n", " alpha = 0.5 * (1 - rho) / rho\n", " beta = 0.5 * (1 - rho) / rho\n", " \n", " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", " return -ll\n", " \n", " result = minimize(neg_loglik, x0=[-3, 0], method='Nelder-Mead')\n", " return tuple(result.x)\n", "\n", "\n", "# Simulate data with depth-dependent dispersion\n", "np.random.seed(42)\n", "n_snps = 2000\n", "depths = np.random.choice([20, 50, 100, 200], size=n_snps)\n", "\n", "# True model: logit(rho) = -3 + 0.01 * N\n", "true_beta0, true_beta1 = -3, 0.01\n", "true_rhos = expit(true_beta0 + true_beta1 * depths)\n", "\n", "# Generate counts with proper clamping\n", "ref_counts_linear = np.array([\n", " betabinom.rvs(\n", " n=n, \n", " a=0.5*(1-clamp_rho(rho))/clamp_rho(rho), \n", " b=0.5*(1-clamp_rho(rho))/clamp_rho(rho)\n", " )\n", " for n, rho in zip(depths, true_rhos)\n", "])\n", "\n", "# Estimate\n", "est_beta0, est_beta1 = estimate_linear_dispersion(ref_counts_linear, depths)\n", "\n", "print(\"Linear dispersion model estimation:\")\n", "print(f\" True: logit(rho) = {true_beta0:.2f} + {true_beta1:.4f} * N\")\n", "print(f\" Est: logit(rho) = {est_beta0:.2f} + {est_beta1:.4f} * N\")" ] }, { "cell_type": "code", "execution_count": null, "id": "linear-model-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize the linear dispersion model\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "# Panel 1: Rho vs depth\n", "ax = axes[0]\n", "depth_range = np.linspace(10, 250, 100)\n", "true_curve = expit(true_beta0 + true_beta1 * depth_range)\n", "est_curve = expit(est_beta0 + est_beta1 * depth_range)\n", "\n", "ax.plot(depth_range, true_curve, 'b-', lw=2, label='True')\n", "ax.plot(depth_range, est_curve, 'r--', lw=2, label='Estimated')\n", "ax.set_xlabel('Read Depth (N)')\n", "ax.set_ylabel('Dispersion (rho)')\n", "ax.set_title('Linear Dispersion Model')\n", "ax.legend()\n", "\n", "# Panel 2: Show effect on variance\n", "ax = axes[1]\n", "binom_var = depth_range * 0.5 * 0.5\n", "bb_var_true = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * true_curve)\n", "bb_var_const = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * 0.05)\n", "\n", "ax.plot(depth_range, binom_var, 'k-', lw=2, label='Binomial')\n", 
"ax.plot(depth_range, bb_var_const, 'g--', lw=2, label='Constant rho=0.05')\n", "ax.plot(depth_range, bb_var_true, 'b-', lw=2, label='Linear model')\n", "ax.set_xlabel('Read Depth (N)')\n", "ax.set_ylabel('Variance of Ref Count')\n", "ax.set_title('Variance Scaling with Depth')\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "section4-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 4: QQ Plots for Model Calibration\n", "\n", "**Quantile-Quantile (QQ) plots** compare observed p-values to their expected distribution under the null. If the model is well-calibrated:\n", "\n", "- P-values should be uniformly distributed under $H_0$\n", "- QQ plot should follow the diagonal line\n", "\n", "### 4.1 Constructing QQ Plots" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-function", "metadata": {}, "outputs": [], "source": [ "def qq_plot(\n", " pvalues: NDArray[np.float64], \n", " ax=None, \n", " label: str = 'Observed', \n", " color: str = 'steelblue'\n", "):\n", " \"\"\"\n", " Create a QQ plot of -log10(p-values).\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " ax: Matplotlib axes (created if None)\n", " label: Label for the scatter plot\n", " color: Color for the points\n", " \n", " Returns:\n", " Matplotlib axes object\n", " \"\"\"\n", " if ax is None:\n", " fig, ax = plt.subplots()\n", " \n", " # Input validation and cleaning\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " pvalues = pvalues[~np.isnan(pvalues)] # Remove NaN\n", " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)] # Valid p-values only\n", " \n", " if len(pvalues) == 0:\n", " raise ValueError(\"No valid p-values after filtering\")\n", " \n", " pvalues = np.sort(pvalues)\n", " \n", " # Expected p-values under uniform distribution\n", " n = len(pvalues)\n", " expected = (np.arange(1, n + 1) - 0.5) / n\n", " \n", " # Transform to -log10 scale with safe clipping\n", " obs_log = -np.log10(np.clip(pvalues, 1e-300, 1))\n", " exp_log = -np.log10(expected)\n", " \n", " # Plot\n", " ax.scatter(exp_log, obs_log, s=10, alpha=0.6, color=color, label=label)\n", " \n", " # Diagonal line\n", " max_val = max(exp_log.max(), obs_log.max())\n", " ax.plot([0, max_val], [0, max_val], 'r--', lw=1, label='Expected')\n", " \n", " ax.set_xlabel('Expected -log10(p)')\n", " ax.set_ylabel('Observed -log10(p)')\n", " \n", " return ax\n", "\n", "\n", "def genomic_inflation_factor(pvalues: NDArray[np.float64]) -> float:\n", " \"\"\"\n", " Calculate genomic inflation factor (lambda).\n", " \n", " Lambda = 1 indicates perfect calibration.\n", " Lambda > 1 indicates inflation (too many false positives).\n", " Lambda < 1 indicates deflation (too conservative).\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " \n", " Returns:\n", " Genomic inflation factor\n", " \"\"\"\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " pvalues = pvalues[~np.isnan(pvalues)]\n", " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]\n", " \n", " if len(pvalues) == 0:\n", " return np.nan\n", " \n", " chi2_stats = chi2.ppf(1 - pvalues, df=1)\n", " return np.median(chi2_stats) / chi2.ppf(0.5, df=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-comparison", "metadata": {}, "outputs": [], "source": [ "# Generate p-values using binomial vs beta-binomial tests on overdispersed data\n", "true_rho = 0.05\n", "test_data = overdispersed_data[true_rho]\n", "n_total = read_depth\n", "\n", "# Binomial test (wrong model)\n", "pvals_binomial = 
[stats.binomtest(int(k), n_total, 0.5).pvalue for k in test_data]\n", "\n", "# Beta-binomial test (correct model)\n", "est_rho = estimate_dispersion_mle(test_data, np.full(len(test_data), n_total))\n", "pvals_betabinom = [likelihood_ratio_test(int(k), n_total, est_rho)[1] for k in test_data]\n", "\n", "# Create QQ plots\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "ax = axes[0]\n", "qq_plot(np.array(pvals_binomial), ax=ax, label='Binomial', color='firebrick')\n", "lambda_binom = genomic_inflation_factor(np.array(pvals_binomial))\n", "ax.set_title(f'Binomial Test\\n(lambda = {lambda_binom:.2f})')\n", "ax.legend()\n", "\n", "ax = axes[1]\n", "qq_plot(np.array(pvals_betabinom), ax=ax, label='Beta-binomial', color='steelblue')\n", "lambda_bb = genomic_inflation_factor(np.array(pvals_betabinom))\n", "ax.set_title(f'Beta-Binomial Test\\n(lambda = {lambda_bb:.2f})')\n", "ax.legend()\n", "\n", "plt.suptitle(f'QQ Plots on Null Data (true rho = {true_rho})', fontsize=14, y=1.02)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(f\"Genomic inflation factor (lambda):\")\n", "print(f\" Binomial test: {lambda_binom:.3f} (expected: 1.0)\")\n", "print(f\" Beta-binomial: {lambda_bb:.3f} (expected: 1.0)\")\n", "print(f\"\\nA lambda >> 1 indicates systematic inflation (too many false positives).\")" ] }, { "cell_type": "markdown", "id": "qq-interpretation", "metadata": {}, "source": [ "### 4.2 Interpreting QQ Plots\n", "\n", "| Pattern | Interpretation |\n", "|---------|----------------|\n", "| Points on diagonal | Well-calibrated (good!) |\n", "| Points above diagonal | Inflation (too many small p-values) |\n", "| Points below diagonal | Deflation (test too conservative) |\n", "| Lift at the tail only | True signal mixed with null |\n", "\n", "The **genomic inflation factor (lambda)** quantifies deviation:\n", "- $\\lambda = 1$: Perfect calibration\n", "- $\\lambda > 1$: Inflation\n", "- $\\lambda < 1$: Deflation" ] }, { "cell_type": "code", "execution_count": null, "id": "qq-with-signal", "metadata": {}, "outputs": [], "source": [ "# Demonstrate QQ plot with true signal\n", "np.random.seed(42)\n", "\n", "# Generate mixed data: 90% null + 10% truly imbalanced\n", "n_snps = 1000\n", "n_signal = 100\n", "rho = 0.05\n", "\n", "# Null data (mu = 0.5)\n", "alpha_null, beta_null = mu_rho_to_alpha_beta(0.5, rho)\n", "null_counts = betabinom.rvs(n=50, a=alpha_null, b=beta_null, size=n_snps - n_signal)\n", "\n", "# Signal data (mu = 0.7, strong imbalance)\n", "alpha_sig, beta_sig = mu_rho_to_alpha_beta(0.7, rho)\n", "signal_counts = betabinom.rvs(n=50, a=alpha_sig, b=beta_sig, size=n_signal)\n", "\n", "# Combine\n", "mixed_counts = np.concatenate([null_counts, signal_counts])\n", "is_signal = np.concatenate([np.zeros(n_snps - n_signal), np.ones(n_signal)]).astype(bool)\n", "\n", "# Test all SNPs\n", "est_rho = estimate_dispersion_mle(mixed_counts, np.full(len(mixed_counts), 50))\n", "mixed_pvals = np.array([likelihood_ratio_test(int(k), 50, est_rho)[1] for k in mixed_counts])\n", "\n", "# QQ plot\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "qq_plot(mixed_pvals, ax=ax)\n", "ax.set_title(f'QQ Plot with {n_signal} True Signals\\n(10% imbalanced, mu=0.7)')\n", "ax.legend()\n", "plt.show()\n", "\n", "# Check detection power\n", "detected = mixed_pvals < 0.05\n", "print(f\"Detection results (alpha = 0.05):\")\n", "print(f\" True positives: {detected[is_signal].sum()} / {n_signal}\")\n", "print(f\" False positives: {detected[~is_signal].sum()} / {n_snps - n_signal}\")" ] }, { "cell_type": 
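"markdown", "id": "lambda-mixed-note", "metadata": {}, "source": [ "An illustrative follow-up (not part of the WASP2 pipeline): the interpretation table above notes that a lift confined to the tail indicates true signal mixed with null. The sketch below checks this numerically by computing the median-based inflation factor $\\lambda$ on the mixed dataset, both overall and on the null-only subset, using the `genomic_inflation_factor()` helper defined earlier." ] }, { "cell_type": "code", "execution_count": null, "id": "lambda-mixed-check", "metadata": {}, "outputs": [], "source": [ "# Illustrative check (not from WASP2): lambda on the mixed dataset.\n", "# A sparse fraction of true signals lifts the QQ tail but should move the\n", "# median-based lambda far less than genuine miscalibration would.\n", "lambda_mixed = genomic_inflation_factor(mixed_pvals)\n", "lambda_null_only = genomic_inflation_factor(mixed_pvals[~is_signal])\n", "\n", "print(f\"Lambda, all {len(mixed_pvals)} tests (90% null + 10% signal): {lambda_mixed:.3f}\")\n", "print(f\"Lambda, null-only subset ({(~is_signal).sum()} tests): {lambda_null_only:.3f}\")\n", "\n", "print(\"\\nA lifted tail with a near-diagonal bulk points to sparse true signal,\")\n", "print(\"not miscalibration (see the interpretation table above).\")" ] }, { "cell_type": 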
"markdown", "id": "section5-header", "metadata": {}, "source": [ "---\n", "\n", "## Section 5: False Discovery Rate (FDR) Correction\n", "\n", "When testing thousands of SNPs, multiple testing correction is essential. WASP2 uses **Benjamini-Hochberg (BH)** FDR correction via `scipy.stats.false_discovery_control()`.\n", "\n", "**WASP2 Implementation:** See `src/analysis/as_analysis.py:492` which uses `false_discovery_control(pvals, method=\"bh\")`\n", "\n", "### 5.1 The Multiple Testing Problem\n", "\n", "If we test 10,000 SNPs at $\\alpha = 0.05$, we expect 500 false positives even if there's no true signal!" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-motivation", "metadata": {}, "outputs": [], "source": [ "# Demonstrate multiple testing problem\n", "n_tests = 10000\n", "alpha = 0.05\n", "\n", "# All null (no true signal)\n", "null_pvalues = np.random.uniform(0, 1, n_tests)\n", "\n", "# Without correction\n", "false_positives_raw = (null_pvalues < alpha).sum()\n", "\n", "print(f\"Multiple testing with {n_tests} tests (no true signal):\")\n", "print(f\" Expected false positives at alpha={alpha}: {n_tests * alpha:.0f}\")\n", "print(f\" Observed false positives: {false_positives_raw}\")\n", "print(f\"\\nThis is why we need FDR correction!\")" ] }, { "cell_type": "markdown", "id": "bh-header", "metadata": {}, "source": [ "### 5.2 Benjamini-Hochberg Procedure\n", "\n", "The BH procedure controls the **false discovery rate** - the expected proportion of false positives among all discoveries.\n", "\n", "**Algorithm:**\n", "1. Sort p-values: $p_{(1)} \\leq p_{(2)} \\leq \\ldots \\leq p_{(m)}$\n", "2. Find largest $k$ such that $p_{(k)} \\leq \\frac{k}{m} \\cdot q$ where $q$ is target FDR\n", "3. Reject all hypotheses with $p_{(i)} \\leq p_{(k)}$\n", "\n", "**Note:** WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")` for this." ] }, { "cell_type": "code", "execution_count": null, "id": "bh-implementation", "metadata": {}, "outputs": [], "source": [ "def benjamini_hochberg(pvalues: NDArray[np.float64]) -> NDArray[np.float64]:\n", " \"\"\"\n", " Apply Benjamini-Hochberg FDR correction.\n", " \n", " Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method=\"bh\")\n", " which is equivalent. 
This implementation is for educational purposes.\n", " \n", " Args:\n", " pvalues: Array of p-values\n", " \n", " Returns:\n", " Array of adjusted p-values (q-values)\n", " \"\"\"\n", " pvalues = np.asarray(pvalues, dtype=np.float64)\n", " n = len(pvalues)\n", " \n", " if n == 0:\n", " return np.array([])\n", " \n", " # Sort indices\n", " sorted_idx = np.argsort(pvalues)\n", " sorted_pvals = pvalues[sorted_idx]\n", " \n", " # BH adjustment: p_adj[i] = p[i] * n / rank[i]\n", " adjusted = sorted_pvals * n / np.arange(1, n + 1)\n", " \n", " # Ensure monotonicity (cumulative minimum from the end)\n", " adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]\n", " \n", " # Cap at 1\n", " adjusted = np.minimum(adjusted, 1)\n", " \n", " # Restore original order\n", " result = np.empty(n)\n", " result[sorted_idx] = adjusted\n", " \n", " return result\n", "\n", "\n", "# Compare our implementation with scipy's\n", "print(\"Comparing BH implementations:\")\n", "test_pvals = np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5])\n", "our_adjusted = benjamini_hochberg(test_pvals)\n", "scipy_adjusted = false_discovery_control(test_pvals, method='bh')\n", "\n", "print(f\" Raw p-values: {test_pvals}\")\n", "print(f\" Our BH adjusted: {our_adjusted.round(4)}\")\n", "print(f\" SciPy BH adjusted:{scipy_adjusted.round(4)}\")\n", "print(f\" Match: {np.allclose(our_adjusted, scipy_adjusted)}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-application", "metadata": {}, "outputs": [], "source": [ "# Apply FDR correction to mixed data\n", "# Using scipy's implementation (same as WASP2)\n", "fdr_pvals = false_discovery_control(mixed_pvals, method='bh')\n", "\n", "# Compare results\n", "print(\"FDR correction comparison (alpha/FDR = 0.05):\")\n", "print(f\"\\nRaw p-values:\")\n", "print(f\" Significant: {(mixed_pvals < 0.05).sum()}\")\n", "print(f\" True positives: {((mixed_pvals < 0.05) & is_signal).sum()}\")\n", "print(f\" False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}\")\n", "\n", "print(f\"\\nBH-adjusted p-values (scipy.stats.false_discovery_control):\")\n", "print(f\" Significant: {(fdr_pvals < 0.05).sum()}\")\n", "print(f\" True positives: {((fdr_pvals < 0.05) & is_signal).sum()}\")\n", "print(f\" False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}\")" ] }, { "cell_type": "markdown", "id": "fdr-alternatives", "metadata": {}, "source": [ "### 5.3 Alternative FDR Methods\n", "\n", "WASP2 primarily uses BH, but other options exist:\n", "\n", "| Method | Description | Use Case |\n", "|--------|-------------|----------|\n", "| **Benjamini-Hochberg (BH)** | Standard FDR control | Default choice |\n", "| **Benjamini-Yekutieli (BY)** | Controls FDR under dependency | Correlated tests |\n", "| **Storey's q-value** | Estimates proportion of true nulls | Large-scale testing |\n", "| **mid-p correction** | Less conservative binomial test | Discrete distributions |" ] }, { "cell_type": "code", "execution_count": null, "id": "fdr-visual", "metadata": {}, "outputs": [], "source": [ "# Visualize FDR correction effect\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", "\n", "# Panel 1: Raw vs adjusted p-values\n", "ax = axes[0]\n", "sorted_idx = np.argsort(mixed_pvals)\n", "ax.plot(mixed_pvals[sorted_idx], label='Raw p-value', lw=2)\n", "ax.plot(fdr_pvals[sorted_idx], label='BH-adjusted', lw=2)\n", "ax.axhline(0.05, color='red', linestyle='--', label='Threshold (0.05)')\n", "ax.set_xlabel('Rank')\n", "ax.set_ylabel('P-value')\n", "ax.set_title('Raw vs FDR-Adjusted P-values')\n", 
"ax.set_xlim(0, 200) # Focus on top hits\n", "ax.legend()\n", "\n", "# Panel 2: Number of discoveries at different thresholds\n", "ax = axes[1]\n", "thresholds = np.linspace(0.001, 0.2, 50)\n", "raw_discoveries = [(mixed_pvals < t).sum() for t in thresholds]\n", "fdr_discoveries = [(fdr_pvals < t).sum() for t in thresholds]\n", "\n", "ax.plot(thresholds, raw_discoveries, label='Raw p-value', lw=2)\n", "ax.plot(thresholds, fdr_discoveries, label='FDR-adjusted', lw=2)\n", "ax.axhline(n_signal, color='green', linestyle=':', label=f'True signals ({n_signal})')\n", "ax.axvline(0.05, color='red', linestyle='--', alpha=0.5)\n", "ax.set_xlabel('Threshold')\n", "ax.set_ylabel('Number of Discoveries')\n", "ax.set_title('Discoveries at Different Thresholds')\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "summary-header", "metadata": {}, "source": [ "---\n", "\n", "## Summary\n", "\n", "In this tutorial, you learned:\n", "\n", "### Key Concepts\n", "\n", "1. **Overdispersion** - Sequencing data shows more variance than the binomial model predicts\n", " - Sources: PCR bias, library prep effects, technical variability\n", " - Consequence: Binomial test has inflated false positive rate\n", "\n", "2. **Beta-binomial distribution** - Naturally handles overdispersion\n", " - Parameters: mean ($\\mu$) and dispersion ($\\rho$)\n", " - Variance: $N\\mu(1-\\mu)[1 + (N-1)\\rho]$\n", " - Reduces to binomial when $\\rho \\to 0$\n", " - **Critical:** Use `clamp_rho()` to prevent numerical instability at boundaries\n", "\n", "3. **Dispersion estimation**\n", " - **MLE**: More efficient, used by WASP2\n", " - **MoM**: Faster, closed-form solution\n", " - **Linear model**: Allows depth-dependent dispersion\n", "\n", "4. **QQ plots** - Diagnostic tool for model calibration\n", " - Points on diagonal = well-calibrated\n", " - Inflation factor ($\\lambda$) quantifies deviation\n", "\n", "5. **FDR correction** - Essential for multiple testing\n", " - Benjamini-Hochberg controls false discovery rate\n", " - WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")`\n", "\n", "### WASP2 Implementation Reference\n", "\n", "| Tutorial Function | WASP2 Source |\n", "|-------------------|-------------|\n", "| `clamp_rho()` | `src/analysis/as_analysis.py:36-50` |\n", "| `mu_rho_to_alpha_beta()` | `src/analysis/as_analysis.py:104-105` |\n", "| `betabinom_loglik()` | `src/analysis/as_analysis.py:81-112` (opt_prob) |\n", "| `likelihood_ratio_test()` | `src/analysis/as_analysis.py:322-323` |\n", "| `estimate_dispersion_mle()` | `src/analysis/as_analysis.py:251-325` (single_model) |\n", "| `estimate_linear_dispersion()` | `src/analysis/as_analysis.py:53-78` (opt_linear) |\n", "| FDR correction | `src/analysis/as_analysis.py:492` |\n", "\n", "### Further Reading\n", "\n", "- [Statistical Models Documentation](../methods/statistical_models.rst)\n", "- [Dispersion Estimation Methods](../methods/dispersion_estimation.rst)\n", "- Skelly et al. 
(2011) - Beta-binomial framework for allelic imbalance\n", "- Benjamini & Hochberg (1995) - Controlling false discovery rate" ] }, { "cell_type": "code", "execution_count": null, "id": "final-cell", "metadata": {}, "outputs": [], "source": [ "print(\"Tutorial complete!\")\n", "print(\"\\nYou now understand the statistical foundations of WASP2's\")\n", "print(\"allelic imbalance detection framework.\")\n", "print(\"\\nKey takeaways:\")\n", "print(f\" - Always use RHO_EPSILON ({RHO_EPSILON}) to prevent numerical instability\")\n", "print(\" - Beta-binomial handles overdispersion that binomial cannot\")\n", "print(\" - QQ plots diagnose model calibration\")\n", "print(\" - FDR correction is essential for genome-wide testing\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }