Statistical Methods Tutorial: Understanding WASP2’s Beta-Binomial Framework#
Estimated time: ~45 minutes
This interactive tutorial provides a deep dive into the statistical methods used by WASP2 for detecting allelic imbalance. You will learn:
Why overdispersion matters - Why the simple binomial model fails for sequencing data
Beta-binomial distributions - The statistical foundation of WASP2
Dispersion estimation - MLE vs Method of Moments approaches
QQ plots - Visualizing model fit and calibration
FDR correction - Benjamini-Hochberg and alternatives
Prerequisites#
This tutorial assumes familiarity with:
Basic probability distributions (binomial)
Hypothesis testing concepts (p-values)
Python data analysis (numpy, pandas, matplotlib)
No prior knowledge of beta-binomial distributions or overdispersion is required.
Relationship to WASP2 Source Code#
The functions in this tutorial mirror the implementations in:
src/analysis/as_analysis.py - Core statistical functions (clamp_rho, opt_prob, opt_linear)
src/analysis/compare_ai.py - Comparative analysis between groups
src/analysis/as_analysis_sc.py - Single-cell extensions
[ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import stats
from scipy.stats import binom, betabinom, chi2, false_discovery_control
from scipy.special import expit, logit
from scipy.optimize import minimize_scalar, minimize
from typing import Tuple, Union
from numpy.typing import NDArray
import warnings
# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
np.random.seed(42)
print("Statistical Methods Tutorial")
print("=" * 40)
print(f"NumPy version: {np.__version__}")
print(f"SciPy version: {stats.scipy.__version__ if hasattr(stats, 'scipy') else 'N/A'}")
Critical Constants and Helper Functions#
WASP2 defines critical numerical constants to prevent division by zero and numerical overflow. These are defined in src/analysis/as_analysis.py (Issue #228).
Why this matters: The beta-binomial parameterization uses \(\alpha = \mu \cdot \frac{1-\rho}{\rho}\) and \(\beta = (1-\mu) \cdot \frac{1-\rho}{\rho}\), which:
Causes division by zero when \(\rho = 0\)
Produces zero alpha/beta when \(\rho = 1\)
WASP2 clamps \(\rho\) to \((\epsilon, 1-\epsilon)\) where \(\epsilon = 10^{-10}\).
[ ]:
# =============================================================================
# BETA-BINOMIAL RHO PARAMETER BOUNDS (Issue #228)
# =============================================================================
# Matches WASP2's src/analysis/as_analysis.py:33
# =============================================================================
RHO_EPSILON: float = 1e-10
def clamp_rho(rho: Union[float, NDArray[np.float64]]) -> Union[float, NDArray[np.float64]]:
"""
Clamp dispersion parameter rho to safe range (epsilon, 1-epsilon).
This function mirrors WASP2's src/analysis/as_analysis.py:36-50.
The beta-binomial parameterization uses alpha = mu * (1-rho) / rho, which
causes division by zero when rho=0 and produces zero alpha/beta when rho=1.
This function prevents these boundary issues.
Args:
rho: Dispersion parameter (scalar or array), expected in [0, 1]
Returns:
Clamped rho in range (RHO_EPSILON, 1 - RHO_EPSILON)
Example:
>>> clamp_rho(0.0) # Would cause division by zero
1e-10
>>> clamp_rho(1.0) # Would produce zero alpha/beta
0.9999999999
"""
return np.clip(rho, RHO_EPSILON, 1.0 - RHO_EPSILON)
# Demonstrate the importance of clamping
print("Demonstrating rho clamping (Issue #228):")
print(f" RHO_EPSILON = {RHO_EPSILON}")
print(f" clamp_rho(0.0) = {clamp_rho(0.0)}")
print(f" clamp_rho(1.0) = {clamp_rho(1.0)}")
print(f" clamp_rho(0.05) = {clamp_rho(0.05)} # No change needed")
Section 1: Understanding Overdispersion#
1.1 The Naive Binomial Model#
When counting alleles at a heterozygous SNP, the simplest model assumes each read is an independent coin flip with probability 0.5 of coming from each allele:
\(X \sim \text{Binomial}(N, p = 0.5)\)
where \(X\) is the reference allele count and \(N\) is the total read count.
Expected variance: \(\text{Var}(X) = N \cdot p \cdot (1-p) = N/4\)
Let’s simulate what this looks like:
[ ]:
# Simulate ideal binomial data (balanced, no overdispersion)
n_snps = 1000
read_depth = 50 # Total reads per SNP
# Input validation
assert n_snps > 0, "n_snps must be positive"
assert read_depth > 0, "read_depth must be positive"
# Perfect binomial sampling
ideal_ref_counts = np.random.binomial(n=read_depth, p=0.5, size=n_snps)
ideal_ratios = ideal_ref_counts / read_depth
# Calculate observed vs expected variance
expected_var = read_depth * 0.5 * 0.5
observed_var = np.var(ideal_ref_counts)
print(f"Binomial Model (N={read_depth}, p=0.5)")
print(f" Expected variance: {expected_var:.2f}")
print(f" Observed variance: {observed_var:.2f}")
print(f" Ratio (observed/expected): {observed_var/expected_var:.3f}")
1.2 The Problem: Real Data Shows Overdispersion#
In real sequencing data, the observed variance is typically larger than expected from a binomial model. This is called overdispersion.
Sources of overdispersion in allele counting:
PCR amplification bias - Some fragments amplify better than others
Library preparation effects - Batch effects during sample prep
Technical variability - Across lanes, flowcells, and sequencing runs
Mapping bias - Reference allele may map slightly better (even after WASP correction)
Let’s simulate data with overdispersion:
[ ]:
def simulate_overdispersed(
n_snps: int,
read_depth: int,
rho: float = 0.05
) -> NDArray[np.int64]:
"""
Simulate overdispersed allele counts using beta-binomial.
Args:
n_snps: Number of SNPs to simulate
read_depth: Total read depth per SNP
rho: Dispersion parameter (0 = binomial, higher = more overdispersed)
Returns:
Array of reference allele counts
Raises:
ValueError: If parameters are out of valid range
"""
# Input validation
if n_snps <= 0:
raise ValueError(f"n_snps must be positive, got {n_snps}")
if read_depth <= 0:
raise ValueError(f"read_depth must be positive, got {read_depth}")
if not 0 < rho < 1:
raise ValueError(f"rho must be in (0, 1), got {rho}")
mu = 0.5 # Balanced (no true imbalance)
rho = clamp_rho(rho) # Apply WASP2's clamping
alpha = mu * (1 - rho) / rho
beta = (1 - mu) * (1 - rho) / rho
return betabinom.rvs(n=read_depth, a=alpha, b=beta, size=n_snps)
# Simulate with different dispersion levels
rho_values = [0.001, 0.02, 0.05, 0.10]
overdispersed_data = {rho: simulate_overdispersed(n_snps, read_depth, rho)
for rho in rho_values}
# Compare variances
print(f"Expected binomial variance: {expected_var:.2f}")
print("\nObserved variances by dispersion (rho):")
for rho, data in overdispersed_data.items():
obs_var = np.var(data)
print(f" rho={rho:.3f}: variance={obs_var:.2f} (ratio={obs_var/expected_var:.2f}x)")
[ ]:
# Visualize the effect of overdispersion
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, (rho, data) in enumerate(overdispersed_data.items()):
ax = axes.flat[idx]
ratios = data / read_depth
# Histogram of observed ratios
ax.hist(ratios, bins=30, density=True, alpha=0.7, color='steelblue',
edgecolor='black', label='Observed')
# Overlay expected binomial distribution
x = np.arange(0, read_depth + 1)
binomial_pmf = binom.pmf(x, read_depth, 0.5)
ax.plot(x/read_depth, binomial_pmf * read_depth, 'r-', lw=2,
label='Expected (Binomial)')
obs_var = np.var(data)
ax.set_title(f'rho = {rho:.3f}\nVariance ratio: {obs_var/expected_var:.2f}x', fontsize=11)
ax.set_xlabel('Reference Allele Frequency')
ax.set_ylabel('Density')
ax.legend(fontsize=9)
ax.set_xlim(0, 1)
plt.suptitle('Effect of Overdispersion on Allele Ratio Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("\nKey observation: As rho increases, the distribution becomes wider")
print("than expected from binomial sampling alone.")
1.3 Consequences of Ignoring Overdispersion#
If we use a binomial model on overdispersed data, we will:
Underestimate variance → p-values too small
Inflate false positive rate → Many spurious “significant” results
Poor calibration → QQ plots show massive inflation
Let’s demonstrate this problem:
[ ]:
def binomial_test_pvalue(ref_count: int, total_count: int) -> float:
"""
Two-sided binomial test for deviation from 0.5.
Args:
ref_count: Number of reference allele reads
total_count: Total number of reads
Returns:
Two-sided p-value
"""
# Input validation
if ref_count < 0 or ref_count > total_count:
raise ValueError(f"Invalid counts: ref={ref_count}, total={total_count}")
if total_count <= 0:
raise ValueError(f"total_count must be positive, got {total_count}")
result = stats.binomtest(ref_count, total_count, p=0.5, alternative='two-sided')
return result.pvalue
# Test on overdispersed data (rho=0.05) - no TRUE imbalance!
test_data = overdispersed_data[0.05]
pvalues_binomial = [binomial_test_pvalue(int(k), read_depth) for k in test_data]
# Count "significant" results at different thresholds
pvalues_binomial = np.array(pvalues_binomial)
print("False positive rates using binomial test on overdispersed data:")
print("(Remember: there is NO true imbalance in this data!)\n")
for alpha in [0.05, 0.01, 0.001]:
fp_rate = (pvalues_binomial < alpha).mean()
print(f" Alpha = {alpha:.3f}: {fp_rate*100:.1f}% significant (expected: {alpha*100:.1f}%)")
print(f"\nThis represents a {(pvalues_binomial < 0.05).mean() / 0.05:.1f}x inflation!")
Section 2: The Beta-Binomial Distribution#
2.1 Mathematical Foundation#
The beta-binomial distribution extends the binomial by allowing the success probability \(p\) to vary according to a Beta distribution:
\(p \sim \text{Beta}(\alpha, \beta), \qquad X \mid p \sim \text{Binomial}(N, p)\)
The marginal distribution of \(X\) is the beta-binomial.
WASP2's parameterization uses mean (\(\mu\)) and dispersion (\(\rho\)):
\(\alpha = \mu \cdot \frac{1-\rho}{\rho}, \qquad \beta = (1-\mu) \cdot \frac{1-\rho}{\rho}\)
Key properties:
Mean: \(E[X] = N \cdot \mu\) (same as binomial)
Variance: \(\text{Var}(X) = N \cdot \mu \cdot (1-\mu) \cdot [1 + (N-1) \cdot \rho]\) (inflated!)
When \(\rho \to 0\), the beta-binomial converges to the binomial.
[ ]:
def mu_rho_to_alpha_beta(
mu: float,
rho: float
) -> Tuple[float, float]:
"""
Convert WASP2's (mu, rho) parameterization to (alpha, beta).
This mirrors WASP2's parameterization in src/analysis/as_analysis.py:104-105.
Args:
mu: Mean parameter (allele frequency), in (0, 1)
rho: Dispersion parameter, in (0, 1)
Returns:
Tuple of (alpha, beta) parameters for scipy.stats.betabinom
Warning:
When rho is near 0 or 1, numerical instability can occur.
Use clamp_rho() to ensure safe values.
"""
# Apply WASP2's clamping to prevent numerical issues
rho = clamp_rho(rho)
# Validate mu
if not 0 < mu < 1:
raise ValueError(f"mu must be in (0, 1), got {mu}")
alpha = mu * (1 - rho) / rho
beta = (1 - mu) * (1 - rho) / rho
return alpha, beta
def betabinom_variance(n: int, mu: float, rho: float) -> float:
"""
Compute beta-binomial variance.
Args:
n: Number of trials (total read count)
mu: Mean parameter (allele frequency)
rho: Dispersion parameter
Returns:
Variance of the beta-binomial distribution
"""
return n * mu * (1 - mu) * (1 + (n - 1) * rho)
# Demonstrate variance inflation
N = 50
mu = 0.5
binomial_var = N * mu * (1 - mu)
print(f"Variance comparison (N={N}, mu={mu}):\n")
print(f"Binomial variance: {binomial_var:.2f}")
print("\nBeta-binomial variance by rho:")
for rho in [0.001, 0.01, 0.05, 0.10, 0.20]:
bb_var = betabinom_variance(N, mu, rho)
inflation = bb_var / binomial_var
print(f" rho={rho:.3f}: {bb_var:.2f} ({inflation:.2f}x inflation)")
2.1.1 Edge Cases and Numerical Stability#
Warning: The beta-binomial parameterization has dangerous edge cases that can cause numerical errors. This section demonstrates why WASP2’s clamp_rho() function is essential.
[ ]:
# Demonstrate edge cases that clamp_rho() prevents
print("Edge Cases in Beta-Binomial Parameterization")
print("=" * 50)
def unsafe_mu_rho_to_alpha_beta(mu, rho):
    """Unsafe version WITHOUT clamping - for demonstration only."""
    # Use numpy floats so rho=0 yields inf (with a suppressible warning)
    # rather than raising ZeroDivisionError, as Python scalars would
    mu, rho = np.float64(mu), np.float64(rho)
    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    return alpha, beta
# Case 1: rho = 0 causes division by zero
print("\n1. rho = 0 (division by zero):")
try:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 0.0)
print(f" alpha = {alpha}, beta = {beta}")
print(" Result: inf values - causes downstream errors!")
except Exception as e:
print(f" Error: {e}")
# Case 2: rho = 1 produces zero alpha/beta
print("\n2. rho = 1 (zero parameters):")
alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1.0)
print(f" alpha = {alpha}, beta = {beta}")
print(" Result: zero values - invalid for Beta distribution!")
# Case 3: Very small rho produces huge alpha/beta
print("\n3. rho = 1e-15 (numerical overflow risk):")
alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1e-15)
print(f" alpha = {alpha:.2e}, beta = {beta:.2e}")
print(" Result: huge values - may overflow in gamma functions!")
# Safe version with clamping
print("\n" + "=" * 50)
print("With WASP2's clamp_rho():")
print("=" * 50)
for rho_input in [0.0, 1.0, 1e-15]:
rho_safe = clamp_rho(rho_input)
alpha, beta = mu_rho_to_alpha_beta(0.5, rho_safe)
print(f"\nInput rho={rho_input} -> clamped to {rho_safe}")
print(f" alpha = {alpha:.4f}, beta = {beta:.4f}")
[ ]:
# Visualize beta-binomial vs binomial PMFs
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
N = 50
x = np.arange(0, N + 1)
# Panel 1: Different rho values (mu=0.5)
ax = axes[0]
ax.plot(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')
for rho, color in [(0.02, 'blue'), (0.05, 'orange'), (0.10, 'red')]:
alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
ax.plot(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2,
color=color, label=f'rho={rho}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability')
ax.set_title('Effect of Dispersion (mu=0.5)')
ax.legend()
# Panel 2: Different mu values (rho=0.05)
ax = axes[1]
rho = 0.05
for mu, color in [(0.5, 'gray'), (0.6, 'blue'), (0.7, 'orange'), (0.8, 'red')]:
alpha, beta = mu_rho_to_alpha_beta(mu, rho)
ax.plot(x, betabinom.pmf(x, N, alpha, beta), '-', lw=2,
color=color, label=f'mu={mu}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability')
ax.set_title(f'Effect of Imbalance (rho={rho})')
ax.legend()
# Panel 3: Log-scale tails comparison
ax = axes[2]
ax.semilogy(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')
for rho, color in [(0.02, 'blue'), (0.10, 'red')]:
alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
ax.semilogy(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2,
color=color, label=f'rho={rho}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability (log scale)')
ax.set_title('Tail Behavior (log scale)')
ax.legend()
ax.set_ylim(1e-15, 1)
plt.tight_layout()
plt.show()
print("Key insight: Beta-binomial has heavier tails than binomial,")
print("making extreme counts more likely under the null hypothesis.")
2.2 Likelihood Ratio Test for Imbalance#
WASP2 uses a likelihood ratio test (LRT) to detect allelic imbalance:
Null hypothesis (\(H_0\)): \(\mu = 0.5\) (balanced)
Alternative (\(H_1\)): \(\mu \neq 0.5\) (imbalanced)
The test statistic is:
\(\text{LRT} = -2\,[\ell(\mu = 0.5, \rho) - \ell(\hat{\mu}, \rho)]\)
where \(\ell\) is the beta-binomial log-likelihood. This follows a chi-squared distribution with 1 degree of freedom under \(H_0\).
WASP2 Implementation: See src/analysis/as_analysis.py:322-323
[ ]:
def betabinom_loglik(
ref_count: int,
total_count: int,
mu: float,
rho: float
) -> float:
"""
Compute beta-binomial log-likelihood.
Mirrors WASP2's src/analysis/as_analysis.py:81-112 (opt_prob function).
Args:
ref_count: Reference allele count
total_count: Total read count
mu: Mean parameter (allele frequency)
rho: Dispersion parameter
Returns:
Log-likelihood value
"""
alpha, beta = mu_rho_to_alpha_beta(mu, rho)
return betabinom.logpmf(ref_count, total_count, alpha, beta)
def find_mle_mu(
ref_count: int,
total_count: int,
rho: float
) -> float:
"""
Find MLE of mu given fixed rho.
Mirrors WASP2's src/analysis/as_analysis.py:184-248 (parse_opt function).
Args:
ref_count: Reference allele count
total_count: Total read count
rho: Fixed dispersion parameter
Returns:
Maximum likelihood estimate of mu
"""
def neg_loglik(mu):
return -betabinom_loglik(ref_count, total_count, mu, rho)
# Use bounded optimization with safe mu range
result = minimize_scalar(
neg_loglik,
bounds=(0.001, 0.999),
method='bounded'
)
return result.x
def likelihood_ratio_test(
ref_count: int,
total_count: int,
rho: float
) -> Tuple[float, float, float]:
"""
Perform LRT for allelic imbalance.
Mirrors WASP2's calculation in src/analysis/as_analysis.py:322-323.
Args:
ref_count: Reference allele count
total_count: Total read count
rho: Dispersion parameter
Returns:
Tuple of (lrt_statistic, p_value, mle_mu)
"""
# Input validation
if ref_count < 0 or ref_count > total_count:
raise ValueError(f"Invalid: ref_count={ref_count}, total_count={total_count}")
if total_count <= 0:
raise ValueError(f"total_count must be positive, got {total_count}")
# Null likelihood (mu = 0.5)
null_ll = betabinom_loglik(ref_count, total_count, 0.5, rho)
# Alternative likelihood (mu = MLE)
mle_mu = find_mle_mu(ref_count, total_count, rho)
alt_ll = betabinom_loglik(ref_count, total_count, mle_mu, rho)
# LRT statistic (matches WASP2: -2 * (null_ll - alt_ll))
lrt = -2 * (null_ll - alt_ll)
lrt = max(0, lrt) # Ensure non-negative due to numerical precision
# P-value from chi-squared distribution (df=1)
pvalue = chi2.sf(lrt, df=1)
return lrt, pvalue, mle_mu
# Example calculation
ref, total = 35, 50 # Observed: 35 ref out of 50 total
rho = 0.05
lrt, pval, mu_hat = likelihood_ratio_test(ref, total, rho)
print(f"Example LRT calculation:")
print(f" Observed: {ref} ref / {total} total (ratio = {ref/total:.2f})")
print(f" Dispersion (rho): {rho}")
print(f" MLE of mu: {mu_hat:.3f}")
print(f" LRT statistic: {lrt:.3f}")
print(f" P-value: {pval:.4f}")
Section 3: Dispersion Estimation#
A critical step is estimating the dispersion parameter \(\rho\) from the data. WASP2 offers two approaches:
Single dispersion model - One \(\rho\) for the entire dataset
Linear dispersion model - \(\rho\) varies with read depth: \(\text{logit}(\rho) = \beta_0 + \beta_1 \cdot N\)
3.1 Maximum Likelihood Estimation (MLE)#
MLE finds the \(\rho\) that maximizes the likelihood under \(H_0\) (balanced).
WASP2 Implementation: See src/analysis/as_analysis.py:251-325 (single_model function)
[ ]:
def estimate_dispersion_mle(
ref_counts: NDArray[np.int64],
total_counts: NDArray[np.int64]
) -> float:
"""
Estimate single dispersion parameter via MLE.
Mirrors WASP2's single_model in src/analysis/as_analysis.py:251-325.
Assumes mu=0.5 (null hypothesis) for all observations.
Args:
ref_counts: Array of reference allele counts
total_counts: Array of total read counts
Returns:
MLE estimate of dispersion parameter rho
Raises:
ValueError: If arrays have different lengths or invalid values
"""
# Input validation
ref_counts = np.asarray(ref_counts)
total_counts = np.asarray(total_counts)
if len(ref_counts) != len(total_counts):
raise ValueError("ref_counts and total_counts must have same length")
if len(ref_counts) == 0:
raise ValueError("Arrays must not be empty")
if np.any(ref_counts < 0) or np.any(ref_counts > total_counts):
raise ValueError("Invalid counts: ref_count must be in [0, total_count]")
def neg_loglik(rho):
rho = clamp_rho(rho) # Apply WASP2's clamping
alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))
return -ll
result = minimize_scalar(
neg_loglik,
bounds=(RHO_EPSILON, 0.5), # Use RHO_EPSILON as lower bound
method='bounded'
)
return result.x
# Estimate dispersion from our simulated data
true_rho = 0.05
test_data = overdispersed_data[true_rho]
total_counts = np.full(len(test_data), read_depth)
estimated_rho = estimate_dispersion_mle(test_data, total_counts)
print(f"Dispersion estimation (MLE):")
print(f" True rho: {true_rho:.4f}")
print(f" Estimated rho: {estimated_rho:.4f}")
print(f" Relative error: {abs(estimated_rho - true_rho) / true_rho * 100:.1f}%")
3.2 Method of Moments (MoM)#
An alternative is the Method of Moments, which matches the observed variance to the expected beta-binomial variance and solves for the dispersion:
\(\hat{\rho} = \dfrac{s^2 - \mu(1-\mu)/\bar{N}}{\mu(1-\mu)\,(1 - 1/\bar{N})}\)
where \(s^2\) is the observed variance of the allele ratios, \(\mu = 0.5\) under the null, and \(\bar{N}\) is the mean read depth.
MoM is faster but may be less statistically efficient than MLE.
[ ]:
def estimate_dispersion_mom(
ref_counts: NDArray[np.int64],
total_counts: NDArray[np.int64]
) -> float:
"""
Estimate dispersion using Method of Moments.
Args:
ref_counts: Array of reference allele counts
total_counts: Array of total read counts
Returns:
MoM estimate of dispersion parameter rho
"""
# Input validation
ref_counts = np.asarray(ref_counts)
total_counts = np.asarray(total_counts)
if len(ref_counts) != len(total_counts):
raise ValueError("Arrays must have same length")
ratios = ref_counts / total_counts
mu = 0.5 # Assume balanced under null
# Observed variance of ratios
obs_var = np.var(ratios)
# Expected binomial variance (if rho=0)
mean_n = np.mean(total_counts)
binom_var = mu * (1 - mu) / mean_n
# Solve for rho with safeguards
numerator = obs_var - binom_var
denominator = mu * (1 - mu) * (1 - 1/mean_n)
# Handle edge case where denominator is near zero
if abs(denominator) < 1e-10:
return RHO_EPSILON
rho_mom = numerator / denominator
# Clamp to valid range using WASP2's constant
return float(clamp_rho(rho_mom))
# Compare MLE vs MoM
rho_mom = estimate_dispersion_mom(test_data, total_counts)
print(f"Comparison of estimation methods:")
print(f" True rho: {true_rho:.4f}")
print(f" MLE estimate: {estimated_rho:.4f}")
print(f" MoM estimate: {rho_mom:.4f}")
[ ]:
# Systematic comparison across different true rho values
comparison_results = []
for true_rho in [0.01, 0.02, 0.05, 0.10, 0.15]:
# Simulate data
alpha, beta = mu_rho_to_alpha_beta(0.5, true_rho)
sim_data = betabinom.rvs(n=50, a=alpha, b=beta, size=1000)
sim_totals = np.full(1000, 50)
# Estimate with both methods
mle_est = estimate_dispersion_mle(sim_data, sim_totals)
mom_est = estimate_dispersion_mom(sim_data, sim_totals)
comparison_results.append({
'true_rho': true_rho,
'mle_estimate': mle_est,
'mom_estimate': mom_est,
'mle_error': abs(mle_est - true_rho) / true_rho * 100,
'mom_error': abs(mom_est - true_rho) / true_rho * 100
})
comparison_df = pd.DataFrame(comparison_results)
print("MLE vs MoM Comparison:")
print(comparison_df.round(4).to_string(index=False))
3.3 Linear Dispersion Model#
In some datasets, dispersion varies with read depth. The linear model parameterizes:
\(\text{logit}(\rho_i) = \beta_0 + \beta_1 \cdot N_i\)
This allows higher-depth regions to have different dispersion than lower-depth regions.
WASP2 Implementation: See src/analysis/as_analysis.py:53-78 (opt_linear function)
[ ]:
def estimate_linear_dispersion(
ref_counts: NDArray[np.int64],
total_counts: NDArray[np.int64]
) -> Tuple[float, float]:
"""
Estimate depth-dependent dispersion: logit(rho) = beta0 + beta1 * N
Mirrors WASP2's opt_linear in src/analysis/as_analysis.py:53-78.
Args:
ref_counts: Array of reference allele counts
total_counts: Array of total read counts
Returns:
Tuple of (beta0, beta1) parameters
"""
def neg_loglik(params):
beta0, beta1 = params
# Compute rho for each observation
linear_pred = beta0 + beta1 * total_counts
# Clip to prevent overflow (matches WASP2's approach)
linear_pred = np.clip(linear_pred, -10, 10)
rho = expit(linear_pred)
rho = clamp_rho(rho) # Apply WASP2's clamping
# Compute likelihood
alpha = 0.5 * (1 - rho) / rho
beta = 0.5 * (1 - rho) / rho
ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))
return -ll
result = minimize(neg_loglik, x0=[-3, 0], method='Nelder-Mead')
return tuple(result.x)
# Simulate data with depth-dependent dispersion
np.random.seed(42)
n_snps = 2000
depths = np.random.choice([20, 50, 100, 200], size=n_snps)
# True model: logit(rho) = -3 + 0.01 * N
true_beta0, true_beta1 = -3, 0.01
true_rhos = expit(true_beta0 + true_beta1 * depths)
# Generate counts with proper clamping
ref_counts_linear = np.array([
betabinom.rvs(
n=n,
a=0.5*(1-clamp_rho(rho))/clamp_rho(rho),
b=0.5*(1-clamp_rho(rho))/clamp_rho(rho)
)
for n, rho in zip(depths, true_rhos)
])
# Estimate
est_beta0, est_beta1 = estimate_linear_dispersion(ref_counts_linear, depths)
print("Linear dispersion model estimation:")
print(f" True: logit(rho) = {true_beta0:.2f} + {true_beta1:.4f} * N")
print(f" Est: logit(rho) = {est_beta0:.2f} + {est_beta1:.4f} * N")
[ ]:
# Visualize the linear dispersion model
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Panel 1: Rho vs depth
ax = axes[0]
depth_range = np.linspace(10, 250, 100)
true_curve = expit(true_beta0 + true_beta1 * depth_range)
est_curve = expit(est_beta0 + est_beta1 * depth_range)
ax.plot(depth_range, true_curve, 'b-', lw=2, label='True')
ax.plot(depth_range, est_curve, 'r--', lw=2, label='Estimated')
ax.set_xlabel('Read Depth (N)')
ax.set_ylabel('Dispersion (rho)')
ax.set_title('Linear Dispersion Model')
ax.legend()
# Panel 2: Show effect on variance
ax = axes[1]
binom_var = depth_range * 0.5 * 0.5
bb_var_true = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * true_curve)
bb_var_const = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * 0.05)
ax.plot(depth_range, binom_var, 'k-', lw=2, label='Binomial')
ax.plot(depth_range, bb_var_const, 'g--', lw=2, label='Constant rho=0.05')
ax.plot(depth_range, bb_var_true, 'b-', lw=2, label='Linear model')
ax.set_xlabel('Read Depth (N)')
ax.set_ylabel('Variance of Ref Count')
ax.set_title('Variance Scaling with Depth')
ax.legend()
plt.tight_layout()
plt.show()
Section 4: QQ Plots for Model Calibration#
Quantile-Quantile (QQ) plots compare observed p-values to their expected distribution under the null. If the model is well-calibrated:
P-values should be uniformly distributed under \(H_0\)
QQ plot should follow the diagonal line
4.1 Constructing QQ Plots#
[ ]:
def qq_plot(
pvalues: NDArray[np.float64],
ax=None,
label: str = 'Observed',
color: str = 'steelblue'
):
"""
Create a QQ plot of -log10(p-values).
Args:
pvalues: Array of p-values
ax: Matplotlib axes (created if None)
label: Label for the scatter plot
color: Color for the points
Returns:
Matplotlib axes object
"""
if ax is None:
fig, ax = plt.subplots()
# Input validation and cleaning
pvalues = np.asarray(pvalues, dtype=np.float64)
pvalues = pvalues[~np.isnan(pvalues)] # Remove NaN
pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)] # Valid p-values only
if len(pvalues) == 0:
raise ValueError("No valid p-values after filtering")
pvalues = np.sort(pvalues)
# Expected p-values under uniform distribution
n = len(pvalues)
expected = (np.arange(1, n + 1) - 0.5) / n
# Transform to -log10 scale with safe clipping
obs_log = -np.log10(np.clip(pvalues, 1e-300, 1))
exp_log = -np.log10(expected)
# Plot
ax.scatter(exp_log, obs_log, s=10, alpha=0.6, color=color, label=label)
# Diagonal line
max_val = max(exp_log.max(), obs_log.max())
ax.plot([0, max_val], [0, max_val], 'r--', lw=1, label='Expected')
ax.set_xlabel('Expected -log10(p)')
ax.set_ylabel('Observed -log10(p)')
return ax
def genomic_inflation_factor(pvalues: NDArray[np.float64]) -> float:
"""
Calculate genomic inflation factor (lambda).
Lambda = 1 indicates perfect calibration.
Lambda > 1 indicates inflation (too many false positives).
Lambda < 1 indicates deflation (too conservative).
Args:
pvalues: Array of p-values
Returns:
Genomic inflation factor
"""
pvalues = np.asarray(pvalues, dtype=np.float64)
pvalues = pvalues[~np.isnan(pvalues)]
pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]
if len(pvalues) == 0:
return np.nan
chi2_stats = chi2.ppf(1 - pvalues, df=1)
return np.median(chi2_stats) / chi2.ppf(0.5, df=1)
[ ]:
# Generate p-values using binomial vs beta-binomial tests on overdispersed data
true_rho = 0.05
test_data = overdispersed_data[true_rho]
n_total = read_depth
# Binomial test (wrong model)
pvals_binomial = [stats.binomtest(int(k), n_total, 0.5).pvalue for k in test_data]
# Beta-binomial test (correct model)
est_rho = estimate_dispersion_mle(test_data, np.full(len(test_data), n_total))
pvals_betabinom = [likelihood_ratio_test(int(k), n_total, est_rho)[1] for k in test_data]
# Create QQ plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
ax = axes[0]
qq_plot(np.array(pvals_binomial), ax=ax, label='Binomial', color='firebrick')
lambda_binom = genomic_inflation_factor(np.array(pvals_binomial))
ax.set_title(f'Binomial Test\n(lambda = {lambda_binom:.2f})')
ax.legend()
ax = axes[1]
qq_plot(np.array(pvals_betabinom), ax=ax, label='Beta-binomial', color='steelblue')
lambda_bb = genomic_inflation_factor(np.array(pvals_betabinom))
ax.set_title(f'Beta-Binomial Test\n(lambda = {lambda_bb:.2f})')
ax.legend()
plt.suptitle(f'QQ Plots on Null Data (true rho = {true_rho})', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f"Genomic inflation factor (lambda):")
print(f" Binomial test: {lambda_binom:.3f} (expected: 1.0)")
print(f" Beta-binomial: {lambda_bb:.3f} (expected: 1.0)")
print(f"\nA lambda >> 1 indicates systematic inflation (too many false positives).")
4.2 Interpreting QQ Plots#
| Pattern | Interpretation |
|---|---|
| Points on diagonal | Well-calibrated (good!) |
| Points above diagonal | Inflation (too many small p-values) |
| Points below diagonal | Deflation (test too conservative) |
| Lift at the tail only | True signal mixed with null |
The genomic inflation factor (lambda) quantifies deviation:
\(\lambda = 1\): Perfect calibration
\(\lambda > 1\): Inflation
\(\lambda < 1\): Deflation
[ ]:
# Demonstrate QQ plot with true signal
np.random.seed(42)
# Generate mixed data: 90% null + 10% truly imbalanced
n_snps = 1000
n_signal = 100
rho = 0.05
# Null data (mu = 0.5)
alpha_null, beta_null = mu_rho_to_alpha_beta(0.5, rho)
null_counts = betabinom.rvs(n=50, a=alpha_null, b=beta_null, size=n_snps - n_signal)
# Signal data (mu = 0.7, strong imbalance)
alpha_sig, beta_sig = mu_rho_to_alpha_beta(0.7, rho)
signal_counts = betabinom.rvs(n=50, a=alpha_sig, b=beta_sig, size=n_signal)
# Combine
mixed_counts = np.concatenate([null_counts, signal_counts])
is_signal = np.concatenate([np.zeros(n_snps - n_signal), np.ones(n_signal)]).astype(bool)
# Test all SNPs
est_rho = estimate_dispersion_mle(mixed_counts, np.full(len(mixed_counts), 50))
mixed_pvals = np.array([likelihood_ratio_test(int(k), 50, est_rho)[1] for k in mixed_counts])
# QQ plot
fig, ax = plt.subplots(figsize=(8, 8))
qq_plot(mixed_pvals, ax=ax)
ax.set_title(f'QQ Plot with {n_signal} True Signals\n(10% imbalanced, mu=0.7)')
ax.legend()
plt.show()
# Check detection power
detected = mixed_pvals < 0.05
print(f"Detection results (alpha = 0.05):")
print(f" True positives: {detected[is_signal].sum()} / {n_signal}")
print(f" False positives: {detected[~is_signal].sum()} / {n_snps - n_signal}")
Section 5: False Discovery Rate (FDR) Correction#
When testing thousands of SNPs, multiple testing correction is essential. WASP2 uses Benjamini-Hochberg (BH) FDR correction via scipy.stats.false_discovery_control().
WASP2 Implementation: See src/analysis/as_analysis.py:492 which uses false_discovery_control(pvals, method="bh")
5.1 The Multiple Testing Problem#
If we test 10,000 SNPs at \(\alpha = 0.05\), we expect 500 false positives even if there’s no true signal!
[ ]:
# Demonstrate multiple testing problem
n_tests = 10000
alpha = 0.05
# All null (no true signal)
null_pvalues = np.random.uniform(0, 1, n_tests)
# Without correction
false_positives_raw = (null_pvalues < alpha).sum()
print(f"Multiple testing with {n_tests} tests (no true signal):")
print(f" Expected false positives at alpha={alpha}: {n_tests * alpha:.0f}")
print(f" Observed false positives: {false_positives_raw}")
print(f"\nThis is why we need FDR correction!")
5.2 Benjamini-Hochberg Procedure#
The BH procedure controls the false discovery rate - the expected proportion of false positives among all discoveries.
Algorithm:
Sort p-values: \(p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}\)
Find largest \(k\) such that \(p_{(k)} \leq \frac{k}{m} \cdot q\) where \(q\) is target FDR
Reject all hypotheses with \(p_{(i)} \leq p_{(k)}\)
Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh") for this.
[ ]:
def benjamini_hochberg(pvalues: NDArray[np.float64]) -> NDArray[np.float64]:
"""
Apply Benjamini-Hochberg FDR correction.
Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh")
which is equivalent. This implementation is for educational purposes.
Args:
pvalues: Array of p-values
Returns:
Array of adjusted p-values (q-values)
"""
pvalues = np.asarray(pvalues, dtype=np.float64)
n = len(pvalues)
if n == 0:
return np.array([])
# Sort indices
sorted_idx = np.argsort(pvalues)
sorted_pvals = pvalues[sorted_idx]
# BH adjustment: p_adj[i] = p[i] * n / rank[i]
adjusted = sorted_pvals * n / np.arange(1, n + 1)
# Ensure monotonicity (cumulative minimum from the end)
adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]
# Cap at 1
adjusted = np.minimum(adjusted, 1)
# Restore original order
result = np.empty(n)
result[sorted_idx] = adjusted
return result
# Compare our implementation with scipy's
print("Comparing BH implementations:")
test_pvals = np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5])
our_adjusted = benjamini_hochberg(test_pvals)
scipy_adjusted = false_discovery_control(test_pvals, method='bh')
print(f" Raw p-values: {test_pvals}")
print(f" Our BH adjusted: {our_adjusted.round(4)}")
print(f" SciPy BH adjusted:{scipy_adjusted.round(4)}")
print(f" Match: {np.allclose(our_adjusted, scipy_adjusted)}")
[ ]:
# Apply FDR correction to mixed data
# Using scipy's implementation (same as WASP2)
fdr_pvals = false_discovery_control(mixed_pvals, method='bh')
# Compare results
print("FDR correction comparison (alpha/FDR = 0.05):")
print(f"\nRaw p-values:")
print(f" Significant: {(mixed_pvals < 0.05).sum()}")
print(f" True positives: {((mixed_pvals < 0.05) & is_signal).sum()}")
print(f" False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}")
print(f"\nBH-adjusted p-values (scipy.stats.false_discovery_control):")
print(f" Significant: {(fdr_pvals < 0.05).sum()}")
print(f" True positives: {((fdr_pvals < 0.05) & is_signal).sum()}")
print(f" False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}")
5.3 Alternative FDR Methods#
WASP2 primarily uses BH, but other options exist:
| Method | Description | Use Case |
|---|---|---|
| Benjamini-Hochberg (BH) | Standard FDR control | Default choice |
| Benjamini-Yekutieli (BY) | Controls FDR under dependency | Correlated tests |
| Storey's q-value | Estimates proportion of true nulls | Large-scale testing |
| mid-p correction | Less conservative binomial test | Discrete distributions |
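Of these, Benjamini-Yekutieli is also available in SciPy via false_discovery_control(..., method='by'), so we can compare it with BH on the mixed data from Section 4. This is a brief sketch; expect BY to be more conservative, since it guarantees FDR control under arbitrary dependence between tests.
[ ]:
# Compare BH with the more conservative Benjamini-Yekutieli (BY) procedure
by_pvals = false_discovery_control(mixed_pvals, method='by')
print("Discoveries at FDR = 0.05:")
print(f"  BH: {(fdr_pvals < 0.05).sum()}")
print(f"  BY: {(by_pvals < 0.05).sum()}")
print(f"  BY true positives: {((by_pvals < 0.05) & is_signal).sum()} / {n_signal}")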
[ ]:
# Visualize FDR correction effect
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Panel 1: Raw vs adjusted p-values
ax = axes[0]
sorted_idx = np.argsort(mixed_pvals)
ax.plot(mixed_pvals[sorted_idx], label='Raw p-value', lw=2)
ax.plot(fdr_pvals[sorted_idx], label='BH-adjusted', lw=2)
ax.axhline(0.05, color='red', linestyle='--', label='Threshold (0.05)')
ax.set_xlabel('Rank')
ax.set_ylabel('P-value')
ax.set_title('Raw vs FDR-Adjusted P-values')
ax.set_xlim(0, 200) # Focus on top hits
ax.legend()
# Panel 2: Number of discoveries at different thresholds
ax = axes[1]
thresholds = np.linspace(0.001, 0.2, 50)
raw_discoveries = [(mixed_pvals < t).sum() for t in thresholds]
fdr_discoveries = [(fdr_pvals < t).sum() for t in thresholds]
ax.plot(thresholds, raw_discoveries, label='Raw p-value', lw=2)
ax.plot(thresholds, fdr_discoveries, label='FDR-adjusted', lw=2)
ax.axhline(n_signal, color='green', linestyle=':', label=f'True signals ({n_signal})')
ax.axvline(0.05, color='red', linestyle='--', alpha=0.5)
ax.set_xlabel('Threshold')
ax.set_ylabel('Number of Discoveries')
ax.set_title('Discoveries at Different Thresholds')
ax.legend()
plt.tight_layout()
plt.show()
Summary#
In this tutorial, you learned:
Key Concepts#
Overdispersion - Sequencing data shows more variance than the binomial model predicts
Sources: PCR bias, library prep effects, technical variability
Consequence: Binomial test has inflated false positive rate
Beta-binomial distribution - Naturally handles overdispersion
Parameters: mean (\(\mu\)) and dispersion (\(\rho\))
Variance: \(N\mu(1-\mu)[1 + (N-1)\rho]\)
Reduces to binomial when \(\rho \to 0\)
Critical: Use clamp_rho() to prevent numerical instability at boundaries
Dispersion estimation
MLE: More efficient, used by WASP2
MoM: Faster, closed-form solution
Linear model: Allows depth-dependent dispersion
QQ plots - Diagnostic tool for model calibration
Points on diagonal = well-calibrated
Inflation factor (\(\lambda\)) quantifies deviation
FDR correction - Essential for multiple testing
Benjamini-Hochberg controls false discovery rate
WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh")
WASP2 Implementation Reference#
| Tutorial Function | WASP2 Source |
|---|---|
| clamp_rho() | src/analysis/as_analysis.py:36-50 |
| mu_rho_to_alpha_beta() | src/analysis/as_analysis.py:104-105 |
| betabinom_loglik() | src/analysis/as_analysis.py:81-112 (opt_prob) |
| find_mle_mu() | src/analysis/as_analysis.py:184-248 (parse_opt) |
| likelihood_ratio_test() | src/analysis/as_analysis.py:322-323 |
| estimate_dispersion_mle() | src/analysis/as_analysis.py:251-325 (single_model) |
| estimate_linear_dispersion() | src/analysis/as_analysis.py:53-78 (opt_linear) |
| FDR correction | src/analysis/as_analysis.py:492 (false_discovery_control) |
Further Reading#
Skelly et al. (2011) - Beta-binomial framework for allelic imbalance
Benjamini & Hochberg (1995) - Controlling false discovery rate
[ ]:
print("Tutorial complete!")
print("\nYou now understand the statistical foundations of WASP2's")
print("allelic imbalance detection framework.")
print("\nKey takeaways:")
print(f" - Always use RHO_EPSILON ({RHO_EPSILON}) to prevent numerical instability")
print(" - Beta-binomial handles overdispersion that binomial cannot")
print(" - QQ plots diagnose model calibration")
print(" - FDR correction is essential for genome-wide testing")