Statistical Methods Tutorial: Understanding WASP2’s Beta-Binomial Framework#

Estimated time: ~45 minutes

This interactive tutorial provides a deep dive into the statistical methods used by WASP2 for detecting allelic imbalance. You will learn:

  1. Why overdispersion matters - How the simple binomial model fails for sequencing data

  2. Beta-binomial distributions - The statistical foundation of WASP2

  3. Dispersion estimation - MLE vs Method of Moments approaches

  4. QQ plots - Visualizing model fit and calibration

  5. FDR correction - Benjamini-Hochberg and alternatives

Prerequisites#

This tutorial assumes familiarity with:

  • Basic probability distributions (binomial)

  • Hypothesis testing concepts (p-values)

  • Python data analysis (numpy, pandas, matplotlib)

No prior knowledge of beta-binomial distributions or overdispersion is required.

Relationship to WASP2 Source Code#

The functions in this tutorial mirror the implementations in:

  • src/analysis/as_analysis.py - Core statistical functions (clamp_rho, opt_prob, opt_linear)

  • src/analysis/compare_ai.py - Comparative analysis between groups

  • src/analysis/as_analysis_sc.py - Single-cell extensions

[ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import stats
from scipy.stats import binom, betabinom, chi2, false_discovery_control
from scipy.special import expit, logit
from scipy.optimize import minimize_scalar, minimize
from typing import Tuple, Union
from numpy.typing import NDArray
import warnings

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
np.random.seed(42)

print("Statistical Methods Tutorial")
print("=" * 40)
print(f"NumPy version: {np.__version__}")
print(f"SciPy version: {stats.scipy.__version__ if hasattr(stats, 'scipy') else 'N/A'}")

Critical Constants and Helper Functions#

WASP2 defines critical numerical constants to prevent division by zero and numerical overflow. These are defined in src/analysis/as_analysis.py (Issue #228).

Why this matters: The beta-binomial parameterization uses \(\alpha = \mu \cdot \frac{1-\rho}{\rho}\), which:

  • Causes division by zero when \(\rho = 0\)

  • Produces zero alpha/beta when \(\rho = 1\)

WASP2 clamps \(\rho\) to \((\epsilon, 1-\epsilon)\) where \(\epsilon = 10^{-10}\).

[ ]:
# =============================================================================
# BETA-BINOMIAL RHO PARAMETER BOUNDS (Issue #228)
# =============================================================================
# Matches WASP2's src/analysis/as_analysis.py:33
# =============================================================================

RHO_EPSILON: float = 1e-10


def clamp_rho(rho: Union[float, NDArray[np.float64]]) -> Union[float, NDArray[np.float64]]:
    """
    Clamp dispersion parameter rho to safe range (epsilon, 1-epsilon).

    This function mirrors WASP2's src/analysis/as_analysis.py:36-50.

    The beta-binomial parameterization uses alpha = mu * (1-rho) / rho, which
    causes division by zero when rho=0 and produces zero alpha/beta when rho=1.
    This function prevents these boundary issues.

    Args:
        rho: Dispersion parameter (scalar or array), expected in [0, 1]

    Returns:
        Clamped rho in range (RHO_EPSILON, 1 - RHO_EPSILON)

    Example:
        >>> clamp_rho(0.0)  # Would cause division by zero
        1e-10
        >>> clamp_rho(1.0)  # Would produce zero alpha/beta
        0.9999999999
    """
    return np.clip(rho, RHO_EPSILON, 1.0 - RHO_EPSILON)


# Demonstrate the importance of clamping
print("Demonstrating rho clamping (Issue #228):")
print(f"  RHO_EPSILON = {RHO_EPSILON}")
print(f"  clamp_rho(0.0) = {clamp_rho(0.0)}")
print(f"  clamp_rho(1.0) = {clamp_rho(1.0)}")
print(f"  clamp_rho(0.05) = {clamp_rho(0.05)}  # No change needed")

Section 1: Understanding Overdispersion#

1.1 The Naive Binomial Model#

When counting alleles at a heterozygous SNP, the simplest model assumes each read is an independent coin flip with probability 0.5 of coming from each allele:

\[X \sim \text{Binomial}(N, p=0.5)\]

where \(X\) is the reference allele count and \(N\) is the total read count.

Expected variance: \(\text{Var}(X) = N \cdot p \cdot (1-p) = N/4\)

Let’s simulate what this looks like:

[ ]:
# Simulate ideal binomial data (balanced, no overdispersion)
n_snps = 1000
read_depth = 50  # Total reads per SNP

# Input validation
assert n_snps > 0, "n_snps must be positive"
assert read_depth > 0, "read_depth must be positive"

# Perfect binomial sampling
ideal_ref_counts = np.random.binomial(n=read_depth, p=0.5, size=n_snps)
ideal_ratios = ideal_ref_counts / read_depth

# Calculate observed vs expected variance
expected_var = read_depth * 0.5 * 0.5
observed_var = np.var(ideal_ref_counts)

print(f"Binomial Model (N={read_depth}, p=0.5)")
print(f"  Expected variance: {expected_var:.2f}")
print(f"  Observed variance: {observed_var:.2f}")
print(f"  Ratio (observed/expected): {observed_var/expected_var:.3f}")

1.2 The Problem: Real Data Shows Overdispersion#

In real sequencing data, the observed variance is typically larger than expected from a binomial model. This is called overdispersion.

Sources of overdispersion in allele counting:

  1. PCR amplification bias - Some fragments amplify better than others

  2. Library preparation effects - Batch effects during sample prep

  3. Technical variability - Across lanes, flowcells, and sequencing runs

  4. Mapping bias - Reference allele may map slightly better (even after WASP correction)

Let’s simulate data with overdispersion:

[ ]:
def simulate_overdispersed(
    n_snps: int,
    read_depth: int,
    rho: float = 0.05
) -> NDArray[np.int64]:
    """
    Simulate overdispersed allele counts using beta-binomial.

    Args:
        n_snps: Number of SNPs to simulate
        read_depth: Total read depth per SNP
        rho: Dispersion parameter (0 = binomial, higher = more overdispersed)

    Returns:
        Array of reference allele counts

    Raises:
        ValueError: If parameters are out of valid range
    """
    # Input validation
    if n_snps <= 0:
        raise ValueError(f"n_snps must be positive, got {n_snps}")
    if read_depth <= 0:
        raise ValueError(f"read_depth must be positive, got {read_depth}")
    if not 0 < rho < 1:
        raise ValueError(f"rho must be in (0, 1), got {rho}")

    mu = 0.5  # Balanced (no true imbalance)
    rho = clamp_rho(rho)  # Apply WASP2's clamping

    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho

    return betabinom.rvs(n=read_depth, a=alpha, b=beta, size=n_snps)


# Simulate with different dispersion levels
rho_values = [0.001, 0.02, 0.05, 0.10]
overdispersed_data = {rho: simulate_overdispersed(n_snps, read_depth, rho)
                      for rho in rho_values}

# Compare variances
print(f"Expected binomial variance: {expected_var:.2f}")
print("\nObserved variances by dispersion (rho):")
for rho, data in overdispersed_data.items():
    obs_var = np.var(data)
    print(f"  rho={rho:.3f}: variance={obs_var:.2f} (ratio={obs_var/expected_var:.2f}x)")
[ ]:
# Visualize the effect of overdispersion
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, (rho, data) in enumerate(overdispersed_data.items()):
    ax = axes.flat[idx]
    ratios = data / read_depth

    # Histogram of observed ratios
    ax.hist(ratios, bins=30, density=True, alpha=0.7, color='steelblue',
            edgecolor='black', label='Observed')

    # Overlay expected binomial distribution
    x = np.arange(0, read_depth + 1)
    binomial_pmf = binom.pmf(x, read_depth, 0.5)
    ax.plot(x/read_depth, binomial_pmf * read_depth, 'r-', lw=2,
            label='Expected (Binomial)')

    obs_var = np.var(data)
    ax.set_title(f'rho = {rho:.3f}\nVariance ratio: {obs_var/expected_var:.2f}x', fontsize=11)
    ax.set_xlabel('Reference Allele Frequency')
    ax.set_ylabel('Density')
    ax.legend(fontsize=9)
    ax.set_xlim(0, 1)

plt.suptitle('Effect of Overdispersion on Allele Ratio Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("\nKey observation: As rho increases, the distribution becomes wider")
print("than expected from binomial sampling alone.")

1.3 Consequences of Ignoring Overdispersion#

If we use a binomial model on overdispersed data, we will:

  1. Underestimate variance → p-values too small

  2. Inflate false positive rate → Many spurious “significant” results

  3. Poor calibration → QQ plots show massive inflation

Let’s demonstrate this problem:

[ ]:
def binomial_test_pvalue(ref_count: int, total_count: int) -> float:
    """
    Two-sided binomial test for deviation from 0.5.

    Args:
        ref_count: Number of reference allele reads
        total_count: Total number of reads

    Returns:
        Two-sided p-value
    """
    # Input validation
    if ref_count < 0 or ref_count > total_count:
        raise ValueError(f"Invalid counts: ref={ref_count}, total={total_count}")
    if total_count <= 0:
        raise ValueError(f"total_count must be positive, got {total_count}")

    result = stats.binomtest(ref_count, total_count, p=0.5, alternative='two-sided')
    return result.pvalue


# Test on overdispersed data (rho=0.05) - no TRUE imbalance!
test_data = overdispersed_data[0.05]
pvalues_binomial = [binomial_test_pvalue(int(k), read_depth) for k in test_data]

# Count "significant" results at different thresholds
pvalues_binomial = np.array(pvalues_binomial)

print("False positive rates using binomial test on overdispersed data:")
print("(Remember: there is NO true imbalance in this data!)\n")
for alpha in [0.05, 0.01, 0.001]:
    fp_rate = (pvalues_binomial < alpha).mean()
    print(f"  Alpha = {alpha:.3f}: {fp_rate*100:.1f}% significant (expected: {alpha*100:.1f}%)")

print(f"\nThis represents a {(pvalues_binomial < 0.05).mean() / 0.05:.1f}x inflation!")

Section 2: The Beta-Binomial Distribution#

2.1 Mathematical Foundation#

The beta-binomial distribution extends the binomial by allowing the success probability \(p\) to vary according to a Beta distribution:

\[p \sim \text{Beta}(\alpha, \beta)\]
\[X | p \sim \text{Binomial}(N, p)\]

The marginal distribution of \(X\) is the beta-binomial.

WASP2’s parameterization uses mean (\(\mu\)) and dispersion (\(\rho\)):

\[\alpha = \mu \cdot \frac{1 - \rho}{\rho}, \quad \beta = (1 - \mu) \cdot \frac{1 - \rho}{\rho}\]

Key properties:

  • Mean: \(E[X] = N \cdot \mu\) (same as binomial)

  • Variance: \(\text{Var}(X) = N \cdot \mu \cdot (1-\mu) \cdot [1 + (N-1) \cdot \rho]\) (inflated!)

When \(\rho \to 0\), the beta-binomial converges to the binomial.

[ ]:
def mu_rho_to_alpha_beta(
    mu: float,
    rho: float
) -> Tuple[float, float]:
    """
    Convert WASP2's (mu, rho) parameterization to (alpha, beta).

    This mirrors WASP2's parameterization in src/analysis/as_analysis.py:104-105.

    Args:
        mu: Mean parameter (allele frequency), in (0, 1)
        rho: Dispersion parameter, in (0, 1)

    Returns:
        Tuple of (alpha, beta) parameters for scipy.stats.betabinom

    Warning:
        When rho is near 0 or 1, numerical instability can occur.
        Use clamp_rho() to ensure safe values.
    """
    # Apply WASP2's clamping to prevent numerical issues
    rho = clamp_rho(rho)

    # Validate mu
    if not 0 < mu < 1:
        raise ValueError(f"mu must be in (0, 1), got {mu}")

    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    return alpha, beta


def betabinom_variance(n: int, mu: float, rho: float) -> float:
    """
    Compute beta-binomial variance.

    Args:
        n: Number of trials (total read count)
        mu: Mean parameter (allele frequency)
        rho: Dispersion parameter

    Returns:
        Variance of the beta-binomial distribution
    """
    return n * mu * (1 - mu) * (1 + (n - 1) * rho)


# Demonstrate variance inflation
N = 50
mu = 0.5
binomial_var = N * mu * (1 - mu)

print(f"Variance comparison (N={N}, mu={mu}):\n")
print(f"Binomial variance: {binomial_var:.2f}")
print("\nBeta-binomial variance by rho:")
for rho in [0.001, 0.01, 0.05, 0.10, 0.20]:
    bb_var = betabinom_variance(N, mu, rho)
    inflation = bb_var / binomial_var
    print(f"  rho={rho:.3f}: {bb_var:.2f} ({inflation:.2f}x inflation)")

2.1.1 Edge Cases and Numerical Stability#

Warning: The beta-binomial parameterization has dangerous edge cases that can cause numerical errors. This section demonstrates why WASP2’s clamp_rho() function is essential.

[ ]:
# Demonstrate edge cases that clamp_rho() prevents
print("Edge Cases in Beta-Binomial Parameterization")
print("=" * 50)

def unsafe_mu_rho_to_alpha_beta(mu, rho):
    """Unsafe version WITHOUT clamping - for demonstration only."""
    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    return alpha, beta

# Case 1: rho = 0 causes division by zero
print("\n1. rho = 0 (division by zero):")
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 0.0)
    print(f"   alpha = {alpha}, beta = {beta}")
    print("   Result: inf values - causes downstream errors!")
except Exception as e:
    print(f"   Error: {e}")

# Case 2: rho = 1 produces zero alpha/beta
print("\n2. rho = 1 (zero parameters):")
alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1.0)
print(f"   alpha = {alpha}, beta = {beta}")
print("   Result: zero values - invalid for Beta distribution!")

# Case 3: Very small rho produces huge alpha/beta
print("\n3. rho = 1e-15 (numerical overflow risk):")
alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1e-15)
print(f"   alpha = {alpha:.2e}, beta = {beta:.2e}")
print("   Result: huge values - may overflow in gamma functions!")

# Safe version with clamping
print("\n" + "=" * 50)
print("With WASP2's clamp_rho():")
print("=" * 50)

for rho_input in [0.0, 1.0, 1e-15]:
    rho_safe = clamp_rho(rho_input)
    alpha, beta = mu_rho_to_alpha_beta(0.5, rho_safe)
    print(f"\nInput rho={rho_input} -> clamped to {rho_safe}")
    print(f"   alpha = {alpha:.4f}, beta = {beta:.4f}")
[ ]:
# Visualize beta-binomial vs binomial PMFs
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

N = 50
x = np.arange(0, N + 1)

# Panel 1: Different rho values (mu=0.5)
ax = axes[0]
ax.plot(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')
for rho, color in [(0.02, 'blue'), (0.05, 'orange'), (0.10, 'red')]:
    alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
    ax.plot(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2,
            color=color, label=f'rho={rho}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability')
ax.set_title('Effect of Dispersion (mu=0.5)')
ax.legend()

# Panel 2: Different mu values (rho=0.05)
ax = axes[1]
rho = 0.05
for mu, color in [(0.5, 'gray'), (0.6, 'blue'), (0.7, 'orange'), (0.8, 'red')]:
    alpha, beta = mu_rho_to_alpha_beta(mu, rho)
    ax.plot(x, betabinom.pmf(x, N, alpha, beta), '-', lw=2,
            color=color, label=f'mu={mu}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability')
ax.set_title(f'Effect of Imbalance (rho={rho})')
ax.legend()

# Panel 3: Log-scale tails comparison
ax = axes[2]
ax.semilogy(x, binom.pmf(x, N, 0.5), 'k-', lw=2, label='Binomial')
for rho, color in [(0.02, 'blue'), (0.10, 'red')]:
    alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
    ax.semilogy(x, betabinom.pmf(x, N, alpha, beta), '--', lw=2,
                color=color, label=f'rho={rho}')
ax.set_xlabel('Reference Count')
ax.set_ylabel('Probability (log scale)')
ax.set_title('Tail Behavior (log scale)')
ax.legend()
ax.set_ylim(1e-15, 1)

plt.tight_layout()
plt.show()

print("Key insight: Beta-binomial has heavier tails than binomial,")
print("making extreme counts more likely under the null hypothesis.")

2.2 Likelihood Ratio Test for Imbalance#

WASP2 uses a likelihood ratio test (LRT) to detect allelic imbalance:

  • Null hypothesis (\(H_0\)): \(\mu = 0.5\) (balanced)

  • Alternative (\(H_1\)): \(\mu \neq 0.5\) (imbalanced)

The test statistic is:

\[\Lambda = -2 \left[ \log L(H_0) - \log L(H_1) \right] \sim \chi^2_1\]

This follows a chi-squared distribution with 1 degree of freedom under \(H_0\).

WASP2 Implementation: See src/analysis/as_analysis.py:322-323

[ ]:
def betabinom_loglik(
    ref_count: int,
    total_count: int,
    mu: float,
    rho: float
) -> float:
    """
    Compute beta-binomial log-likelihood.

    Mirrors WASP2's src/analysis/as_analysis.py:81-112 (opt_prob function).

    Args:
        ref_count: Reference allele count
        total_count: Total read count
        mu: Mean parameter (allele frequency)
        rho: Dispersion parameter

    Returns:
        Log-likelihood value
    """
    alpha, beta = mu_rho_to_alpha_beta(mu, rho)
    return betabinom.logpmf(ref_count, total_count, alpha, beta)


def find_mle_mu(
    ref_count: int,
    total_count: int,
    rho: float
) -> float:
    """
    Find MLE of mu given fixed rho.

    Mirrors WASP2's src/analysis/as_analysis.py:184-248 (parse_opt function).

    Args:
        ref_count: Reference allele count
        total_count: Total read count
        rho: Fixed dispersion parameter

    Returns:
        Maximum likelihood estimate of mu
    """
    def neg_loglik(mu):
        return -betabinom_loglik(ref_count, total_count, mu, rho)

    # Use bounded optimization with safe mu range
    result = minimize_scalar(
        neg_loglik,
        bounds=(0.001, 0.999),
        method='bounded'
    )
    return result.x


def likelihood_ratio_test(
    ref_count: int,
    total_count: int,
    rho: float
) -> Tuple[float, float, float]:
    """
    Perform LRT for allelic imbalance.

    Mirrors WASP2's calculation in src/analysis/as_analysis.py:322-323.

    Args:
        ref_count: Reference allele count
        total_count: Total read count
        rho: Dispersion parameter

    Returns:
        Tuple of (lrt_statistic, p_value, mle_mu)
    """
    # Input validation
    if ref_count < 0 or ref_count > total_count:
        raise ValueError(f"Invalid: ref_count={ref_count}, total_count={total_count}")
    if total_count <= 0:
        raise ValueError(f"total_count must be positive, got {total_count}")

    # Null likelihood (mu = 0.5)
    null_ll = betabinom_loglik(ref_count, total_count, 0.5, rho)

    # Alternative likelihood (mu = MLE)
    mle_mu = find_mle_mu(ref_count, total_count, rho)
    alt_ll = betabinom_loglik(ref_count, total_count, mle_mu, rho)

    # LRT statistic (matches WASP2: -2 * (null_ll - alt_ll))
    lrt = -2 * (null_ll - alt_ll)
    lrt = max(0, lrt)  # Ensure non-negative due to numerical precision

    # P-value from chi-squared distribution (df=1)
    pvalue = chi2.sf(lrt, df=1)

    return lrt, pvalue, mle_mu


# Example calculation
ref, total = 35, 50  # Observed: 35 ref out of 50 total
rho = 0.05

lrt, pval, mu_hat = likelihood_ratio_test(ref, total, rho)

print(f"Example LRT calculation:")
print(f"  Observed: {ref} ref / {total} total (ratio = {ref/total:.2f})")
print(f"  Dispersion (rho): {rho}")
print(f"  MLE of mu: {mu_hat:.3f}")
print(f"  LRT statistic: {lrt:.3f}")
print(f"  P-value: {pval:.4f}")

Section 3: Dispersion Estimation#

A critical step is estimating the dispersion parameter \(\rho\) from the data. WASP2 offers two approaches:

  1. Single dispersion model - One \(\rho\) for the entire dataset

  2. Linear dispersion model - \(\rho\) varies with read depth: \(\text{logit}(\rho) = \beta_0 + \beta_1 \cdot N\)

3.1 Maximum Likelihood Estimation (MLE)#

MLE finds the \(\rho\) that maximizes the likelihood under \(H_0\) (balanced).

WASP2 Implementation: See src/analysis/as_analysis.py:251-325 (single_model function)

[ ]:
def estimate_dispersion_mle(
    ref_counts: NDArray[np.int64],
    total_counts: NDArray[np.int64]
) -> float:
    """
    Estimate single dispersion parameter via MLE.

    Mirrors WASP2's single_model in src/analysis/as_analysis.py:251-325.
    Assumes mu=0.5 (null hypothesis) for all observations.

    Args:
        ref_counts: Array of reference allele counts
        total_counts: Array of total read counts

    Returns:
        MLE estimate of dispersion parameter rho

    Raises:
        ValueError: If arrays have different lengths or invalid values
    """
    # Input validation
    ref_counts = np.asarray(ref_counts)
    total_counts = np.asarray(total_counts)

    if len(ref_counts) != len(total_counts):
        raise ValueError("ref_counts and total_counts must have same length")
    if len(ref_counts) == 0:
        raise ValueError("Arrays must not be empty")
    if np.any(ref_counts < 0) or np.any(ref_counts > total_counts):
        raise ValueError("Invalid counts: ref_count must be in [0, total_count]")

    def neg_loglik(rho):
        rho = clamp_rho(rho)  # Apply WASP2's clamping
        alpha, beta = mu_rho_to_alpha_beta(0.5, rho)
        ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))
        return -ll

    result = minimize_scalar(
        neg_loglik,
        bounds=(RHO_EPSILON, 0.5),  # Use RHO_EPSILON as lower bound
        method='bounded'
    )
    return result.x


# Estimate dispersion from our simulated data
true_rho = 0.05
test_data = overdispersed_data[true_rho]
total_counts = np.full(len(test_data), read_depth)

estimated_rho = estimate_dispersion_mle(test_data, total_counts)

print(f"Dispersion estimation (MLE):")
print(f"  True rho: {true_rho:.4f}")
print(f"  Estimated rho: {estimated_rho:.4f}")
print(f"  Relative error: {abs(estimated_rho - true_rho) / true_rho * 100:.1f}%")

3.2 Method of Moments (MoM)#

An alternative is the Method of Moments, which matches the observed variance to the expected beta-binomial variance:

\[\hat{\rho}_{\text{MoM}} = \frac{\text{Var}(X/N) - \mu(1-\mu)/\bar{N}}{\mu(1-\mu)(1 - 1/\bar{N})}\]

MoM is faster but may be less efficient than MLE.

[ ]:
def estimate_dispersion_mom(
    ref_counts: NDArray[np.int64],
    total_counts: NDArray[np.int64]
) -> float:
    """
    Estimate dispersion using Method of Moments.

    Args:
        ref_counts: Array of reference allele counts
        total_counts: Array of total read counts

    Returns:
        MoM estimate of dispersion parameter rho
    """
    # Input validation
    ref_counts = np.asarray(ref_counts)
    total_counts = np.asarray(total_counts)

    if len(ref_counts) != len(total_counts):
        raise ValueError("Arrays must have same length")

    ratios = ref_counts / total_counts
    mu = 0.5  # Assume balanced under null

    # Observed variance of ratios
    obs_var = np.var(ratios)

    # Expected binomial variance (if rho=0)
    mean_n = np.mean(total_counts)
    binom_var = mu * (1 - mu) / mean_n

    # Solve for rho with safeguards
    numerator = obs_var - binom_var
    denominator = mu * (1 - mu) * (1 - 1/mean_n)

    # Handle edge case where denominator is near zero
    if abs(denominator) < 1e-10:
        return RHO_EPSILON

    rho_mom = numerator / denominator

    # Clamp to valid range using WASP2's constant
    return float(clamp_rho(rho_mom))


# Compare MLE vs MoM
rho_mom = estimate_dispersion_mom(test_data, total_counts)

print(f"Comparison of estimation methods:")
print(f"  True rho: {true_rho:.4f}")
print(f"  MLE estimate: {estimated_rho:.4f}")
print(f"  MoM estimate: {rho_mom:.4f}")
[ ]:
# Systematic comparison across different true rho values
comparison_results = []

for true_rho in [0.01, 0.02, 0.05, 0.10, 0.15]:
    # Simulate data
    alpha, beta = mu_rho_to_alpha_beta(0.5, true_rho)
    sim_data = betabinom.rvs(n=50, a=alpha, b=beta, size=1000)
    sim_totals = np.full(1000, 50)

    # Estimate with both methods
    mle_est = estimate_dispersion_mle(sim_data, sim_totals)
    mom_est = estimate_dispersion_mom(sim_data, sim_totals)

    comparison_results.append({
        'true_rho': true_rho,
        'mle_estimate': mle_est,
        'mom_estimate': mom_est,
        'mle_error': abs(mle_est - true_rho) / true_rho * 100,
        'mom_error': abs(mom_est - true_rho) / true_rho * 100
    })

comparison_df = pd.DataFrame(comparison_results)
print("MLE vs MoM Comparison:")
print(comparison_df.round(4).to_string(index=False))

3.3 Linear Dispersion Model#

In some datasets, dispersion varies with read depth. The linear model parameterizes:

\[\text{logit}(\rho_i) = \beta_0 + \beta_1 \cdot N_i\]

This allows higher-depth regions to have different dispersion than lower-depth regions.

WASP2 Implementation: See src/analysis/as_analysis.py:53-78 (opt_linear function)

[ ]:
def estimate_linear_dispersion(
    ref_counts: NDArray[np.int64],
    total_counts: NDArray[np.int64]
) -> Tuple[float, float]:
    """
    Estimate depth-dependent dispersion: logit(rho) = beta0 + beta1 * N

    Mirrors WASP2's opt_linear in src/analysis/as_analysis.py:53-78.

    Args:
        ref_counts: Array of reference allele counts
        total_counts: Array of total read counts

    Returns:
        Tuple of (beta0, beta1) parameters
    """
    def neg_loglik(params):
        beta0, beta1 = params

        # Compute rho for each observation
        linear_pred = beta0 + beta1 * total_counts
        # Clip to prevent overflow (matches WASP2's approach)
        linear_pred = np.clip(linear_pred, -10, 10)
        rho = expit(linear_pred)
        rho = clamp_rho(rho)  # Apply WASP2's clamping

        # Compute likelihood
        alpha = 0.5 * (1 - rho) / rho
        beta = 0.5 * (1 - rho) / rho

        ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))
        return -ll

    result = minimize(neg_loglik, x0=[-3, 0], method='Nelder-Mead')
    return tuple(result.x)


# Simulate data with depth-dependent dispersion
np.random.seed(42)
n_snps = 2000
depths = np.random.choice([20, 50, 100, 200], size=n_snps)

# True model: logit(rho) = -3 + 0.01 * N
true_beta0, true_beta1 = -3, 0.01
true_rhos = expit(true_beta0 + true_beta1 * depths)

# Generate counts with proper clamping
ref_counts_linear = np.array([
    betabinom.rvs(
        n=n,
        a=0.5*(1-clamp_rho(rho))/clamp_rho(rho),
        b=0.5*(1-clamp_rho(rho))/clamp_rho(rho)
    )
    for n, rho in zip(depths, true_rhos)
])

# Estimate
est_beta0, est_beta1 = estimate_linear_dispersion(ref_counts_linear, depths)

print("Linear dispersion model estimation:")
print(f"  True: logit(rho) = {true_beta0:.2f} + {true_beta1:.4f} * N")
print(f"  Est:  logit(rho) = {est_beta0:.2f} + {est_beta1:.4f} * N")
[ ]:
# Visualize the linear dispersion model
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Panel 1: Rho vs depth
ax = axes[0]
depth_range = np.linspace(10, 250, 100)
true_curve = expit(true_beta0 + true_beta1 * depth_range)
est_curve = expit(est_beta0 + est_beta1 * depth_range)

ax.plot(depth_range, true_curve, 'b-', lw=2, label='True')
ax.plot(depth_range, est_curve, 'r--', lw=2, label='Estimated')
ax.set_xlabel('Read Depth (N)')
ax.set_ylabel('Dispersion (rho)')
ax.set_title('Linear Dispersion Model')
ax.legend()

# Panel 2: Show effect on variance
ax = axes[1]
binom_var = depth_range * 0.5 * 0.5
bb_var_true = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * true_curve)
bb_var_const = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * 0.05)

ax.plot(depth_range, binom_var, 'k-', lw=2, label='Binomial')
ax.plot(depth_range, bb_var_const, 'g--', lw=2, label='Constant rho=0.05')
ax.plot(depth_range, bb_var_true, 'b-', lw=2, label='Linear model')
ax.set_xlabel('Read Depth (N)')
ax.set_ylabel('Variance of Ref Count')
ax.set_title('Variance Scaling with Depth')
ax.legend()

plt.tight_layout()
plt.show()

Section 4: QQ Plots for Model Calibration#

Quantile-Quantile (QQ) plots compare observed p-values to their expected distribution under the null. If the model is well-calibrated:

  • P-values should be uniformly distributed under \(H_0\)

  • QQ plot should follow the diagonal line

4.1 Constructing QQ Plots#

[ ]:
def qq_plot(
    pvalues: NDArray[np.float64],
    ax=None,
    label: str = 'Observed',
    color: str = 'steelblue'
):
    """
    Create a QQ plot of -log10(p-values).

    Args:
        pvalues: Array of p-values
        ax: Matplotlib axes (created if None)
        label: Label for the scatter plot
        color: Color for the points

    Returns:
        Matplotlib axes object
    """
    if ax is None:
        fig, ax = plt.subplots()

    # Input validation and cleaning
    pvalues = np.asarray(pvalues, dtype=np.float64)
    pvalues = pvalues[~np.isnan(pvalues)]  # Remove NaN
    pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]  # Valid p-values only

    if len(pvalues) == 0:
        raise ValueError("No valid p-values after filtering")

    pvalues = np.sort(pvalues)

    # Expected p-values under uniform distribution
    n = len(pvalues)
    expected = (np.arange(1, n + 1) - 0.5) / n

    # Transform to -log10 scale with safe clipping
    obs_log = -np.log10(np.clip(pvalues, 1e-300, 1))
    exp_log = -np.log10(expected)

    # Plot
    ax.scatter(exp_log, obs_log, s=10, alpha=0.6, color=color, label=label)

    # Diagonal line
    max_val = max(exp_log.max(), obs_log.max())
    ax.plot([0, max_val], [0, max_val], 'r--', lw=1, label='Expected')

    ax.set_xlabel('Expected -log10(p)')
    ax.set_ylabel('Observed -log10(p)')

    return ax


def genomic_inflation_factor(pvalues: NDArray[np.float64]) -> float:
    """
    Calculate genomic inflation factor (lambda).

    Lambda = 1 indicates perfect calibration.
    Lambda > 1 indicates inflation (too many false positives).
    Lambda < 1 indicates deflation (too conservative).

    Args:
        pvalues: Array of p-values

    Returns:
        Genomic inflation factor
    """
    pvalues = np.asarray(pvalues, dtype=np.float64)
    pvalues = pvalues[~np.isnan(pvalues)]
    pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]

    if len(pvalues) == 0:
        return np.nan

    chi2_stats = chi2.ppf(1 - pvalues, df=1)
    return np.median(chi2_stats) / chi2.ppf(0.5, df=1)
[ ]:
# Generate p-values using binomial vs beta-binomial tests on overdispersed data
true_rho = 0.05
test_data = overdispersed_data[true_rho]
n_total = read_depth

# Binomial test (wrong model)
pvals_binomial = [stats.binomtest(int(k), n_total, 0.5).pvalue for k in test_data]

# Beta-binomial test (correct model)
est_rho = estimate_dispersion_mle(test_data, np.full(len(test_data), n_total))
pvals_betabinom = [likelihood_ratio_test(int(k), n_total, est_rho)[1] for k in test_data]

# Create QQ plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
qq_plot(np.array(pvals_binomial), ax=ax, label='Binomial', color='firebrick')
lambda_binom = genomic_inflation_factor(np.array(pvals_binomial))
ax.set_title(f'Binomial Test\n(lambda = {lambda_binom:.2f})')
ax.legend()

ax = axes[1]
qq_plot(np.array(pvals_betabinom), ax=ax, label='Beta-binomial', color='steelblue')
lambda_bb = genomic_inflation_factor(np.array(pvals_betabinom))
ax.set_title(f'Beta-Binomial Test\n(lambda = {lambda_bb:.2f})')
ax.legend()

plt.suptitle(f'QQ Plots on Null Data (true rho = {true_rho})', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print(f"Genomic inflation factor (lambda):")
print(f"  Binomial test: {lambda_binom:.3f} (expected: 1.0)")
print(f"  Beta-binomial: {lambda_bb:.3f} (expected: 1.0)")
print(f"\nA lambda >> 1 indicates systematic inflation (too many false positives).")

4.2 Interpreting QQ Plots#

| Pattern | Interpretation |
|---|---|
| Points on diagonal | Well-calibrated (good!) |
| Points above diagonal | Inflation (too many small p-values) |
| Points below diagonal | Deflation (test too conservative) |
| Lift at the tail only | True signal mixed with null |

The genomic inflation factor (lambda) quantifies deviation:

  • \(\lambda = 1\): Perfect calibration

  • \(\lambda > 1\): Inflation

  • \(\lambda < 1\): Deflation
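
As a quick calibration check (a minimal sketch reusing the genomic_inflation_factor() helper defined above), uniformly distributed p-values from a pure null should give a lambda close to 1:

[ ]:
# Sanity check: uniform p-values (pure null) should yield lambda ~ 1
uniform_pvals = np.random.uniform(0, 1, 10000)
print(f"Lambda on uniform p-values: {genomic_inflation_factor(uniform_pvals):.3f}")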

[ ]:
# Demonstrate QQ plot with true signal
np.random.seed(42)

# Generate mixed data: 90% null + 10% truly imbalanced
n_snps = 1000
n_signal = 100
rho = 0.05

# Null data (mu = 0.5)
alpha_null, beta_null = mu_rho_to_alpha_beta(0.5, rho)
null_counts = betabinom.rvs(n=50, a=alpha_null, b=beta_null, size=n_snps - n_signal)

# Signal data (mu = 0.7, strong imbalance)
alpha_sig, beta_sig = mu_rho_to_alpha_beta(0.7, rho)
signal_counts = betabinom.rvs(n=50, a=alpha_sig, b=beta_sig, size=n_signal)

# Combine
mixed_counts = np.concatenate([null_counts, signal_counts])
is_signal = np.concatenate([np.zeros(n_snps - n_signal), np.ones(n_signal)]).astype(bool)

# Test all SNPs
est_rho = estimate_dispersion_mle(mixed_counts, np.full(len(mixed_counts), 50))
mixed_pvals = np.array([likelihood_ratio_test(int(k), 50, est_rho)[1] for k in mixed_counts])

# QQ plot
fig, ax = plt.subplots(figsize=(8, 8))
qq_plot(mixed_pvals, ax=ax)
ax.set_title(f'QQ Plot with {n_signal} True Signals\n(10% imbalanced, mu=0.7)')
ax.legend()
plt.show()

# Check detection power
detected = mixed_pvals < 0.05
print(f"Detection results (alpha = 0.05):")
print(f"  True positives: {detected[is_signal].sum()} / {n_signal}")
print(f"  False positives: {detected[~is_signal].sum()} / {n_snps - n_signal}")

Section 5: False Discovery Rate (FDR) Correction#

When testing thousands of SNPs, multiple testing correction is essential. WASP2 uses Benjamini-Hochberg (BH) FDR correction via scipy.stats.false_discovery_control().

WASP2 Implementation: See src/analysis/as_analysis.py:492 which uses false_discovery_control(pvals, method="bh")

5.1 The Multiple Testing Problem#

If we test 10,000 SNPs at \(\alpha = 0.05\), we expect 500 false positives even if there’s no true signal!

[ ]:
# Demonstrate multiple testing problem
n_tests = 10000
alpha = 0.05

# All null (no true signal)
null_pvalues = np.random.uniform(0, 1, n_tests)

# Without correction
false_positives_raw = (null_pvalues < alpha).sum()

print(f"Multiple testing with {n_tests} tests (no true signal):")
print(f"  Expected false positives at alpha={alpha}: {n_tests * alpha:.0f}")
print(f"  Observed false positives: {false_positives_raw}")
print(f"\nThis is why we need FDR correction!")

5.2 Benjamini-Hochberg Procedure#

The BH procedure controls the false discovery rate - the expected proportion of false positives among all discoveries.

Algorithm:

  1. Sort p-values: \(p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}\)

  2. Find largest \(k\) such that \(p_{(k)} \leq \frac{k}{m} \cdot q\) where \(q\) is target FDR

  3. Reject all hypotheses with \(p_{(i)} \leq p_{(k)}\)

Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh") for this.
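
Before the adjusted-p-value implementation below, here is a minimal sketch of the threshold (step-up) form of the algorithm above, applied to a toy set of p-values (illustrative values, not WASP2 output):

[ ]:
# Step-up form of BH: find the largest k with p_(k) <= (k/m) * q
p_sorted = np.sort(np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5]))
m, q = len(p_sorted), 0.05
thresholds = np.arange(1, m + 1) / m * q          # (k/m) * q for k = 1..m
passing = np.nonzero(p_sorted <= thresholds)[0]   # ranks satisfying the criterion

if passing.size > 0:
    k = passing.max() + 1  # largest passing rank (1-based)
    print(f"Reject the {k} smallest p-values: {p_sorted[:k]}")
else:
    print("No rejections at this FDR level")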

[ ]:
def benjamini_hochberg(pvalues: NDArray[np.float64]) -> NDArray[np.float64]:
    """
    Apply Benjamini-Hochberg FDR correction.

    Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh")
    which is equivalent. This implementation is for educational purposes.

    Args:
        pvalues: Array of p-values

    Returns:
        Array of adjusted p-values (q-values)
    """
    pvalues = np.asarray(pvalues, dtype=np.float64)
    n = len(pvalues)

    if n == 0:
        return np.array([])

    # Sort indices
    sorted_idx = np.argsort(pvalues)
    sorted_pvals = pvalues[sorted_idx]

    # BH adjustment: p_adj[i] = p[i] * n / rank[i]
    adjusted = sorted_pvals * n / np.arange(1, n + 1)

    # Ensure monotonicity (cumulative minimum from the end)
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]

    # Cap at 1
    adjusted = np.minimum(adjusted, 1)

    # Restore original order
    result = np.empty(n)
    result[sorted_idx] = adjusted

    return result


# Compare our implementation with scipy's
print("Comparing BH implementations:")
test_pvals = np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5])
our_adjusted = benjamini_hochberg(test_pvals)
scipy_adjusted = false_discovery_control(test_pvals, method='bh')

print(f"  Raw p-values:     {test_pvals}")
print(f"  Our BH adjusted:  {our_adjusted.round(4)}")
print(f"  SciPy BH adjusted:{scipy_adjusted.round(4)}")
print(f"  Match: {np.allclose(our_adjusted, scipy_adjusted)}")
[ ]:
# Apply FDR correction to mixed data
# Using scipy's implementation (same as WASP2)
fdr_pvals = false_discovery_control(mixed_pvals, method='bh')

# Compare results
print("FDR correction comparison (alpha/FDR = 0.05):")
print(f"\nRaw p-values:")
print(f"  Significant: {(mixed_pvals < 0.05).sum()}")
print(f"  True positives: {((mixed_pvals < 0.05) & is_signal).sum()}")
print(f"  False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}")

print(f"\nBH-adjusted p-values (scipy.stats.false_discovery_control):")
print(f"  Significant: {(fdr_pvals < 0.05).sum()}")
print(f"  True positives: {((fdr_pvals < 0.05) & is_signal).sum()}")
print(f"  False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}")

5.3 Alternative FDR Methods#

WASP2 primarily uses BH, but other options exist:

| Method | Description | Use Case |
|---|---|---|
| Benjamini-Hochberg (BH) | Standard FDR control | Default choice |
| Benjamini-Yekutieli (BY) | Controls FDR under dependency | Correlated tests |
| Storey's q-value | Estimates proportion of true nulls | Large-scale testing |
| mid-p correction | Less conservative binomial test | Discrete distributions |
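
SciPy's false_discovery_control() also implements Benjamini-Yekutieli via method='by'. As a minimal sketch (reusing the mixed p-values from Section 4), BY is more conservative than BH because it adds a harmonic-series penalty to remain valid under arbitrary dependence:

[ ]:
# Compare BH and BY corrections on the mixed p-values from Section 4
by_pvals = false_discovery_control(mixed_pvals, method='by')

print("Discoveries at FDR = 0.05:")
print(f"  BH: {(fdr_pvals < 0.05).sum()}")
print(f"  BY: {(by_pvals < 0.05).sum()}")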

[ ]:
# Visualize FDR correction effect
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Panel 1: Raw vs adjusted p-values
ax = axes[0]
sorted_idx = np.argsort(mixed_pvals)
ax.plot(mixed_pvals[sorted_idx], label='Raw p-value', lw=2)
ax.plot(fdr_pvals[sorted_idx], label='BH-adjusted', lw=2)
ax.axhline(0.05, color='red', linestyle='--', label='Threshold (0.05)')
ax.set_xlabel('Rank')
ax.set_ylabel('P-value')
ax.set_title('Raw vs FDR-Adjusted P-values')
ax.set_xlim(0, 200)  # Focus on top hits
ax.legend()

# Panel 2: Number of discoveries at different thresholds
ax = axes[1]
thresholds = np.linspace(0.001, 0.2, 50)
raw_discoveries = [(mixed_pvals < t).sum() for t in thresholds]
fdr_discoveries = [(fdr_pvals < t).sum() for t in thresholds]

ax.plot(thresholds, raw_discoveries, label='Raw p-value', lw=2)
ax.plot(thresholds, fdr_discoveries, label='FDR-adjusted', lw=2)
ax.axhline(n_signal, color='green', linestyle=':', label=f'True signals ({n_signal})')
ax.axvline(0.05, color='red', linestyle='--', alpha=0.5)
ax.set_xlabel('Threshold')
ax.set_ylabel('Number of Discoveries')
ax.set_title('Discoveries at Different Thresholds')
ax.legend()

plt.tight_layout()
plt.show()

Summary#

In this tutorial, you learned:

Key Concepts#

  1. Overdispersion - Sequencing data shows more variance than the binomial model predicts

    • Sources: PCR bias, library prep effects, technical variability

    • Consequence: Binomial test has inflated false positive rate

  2. Beta-binomial distribution - Naturally handles overdispersion

    • Parameters: mean (\(\mu\)) and dispersion (\(\rho\))

    • Variance: \(N\mu(1-\mu)[1 + (N-1)\rho]\)

    • Reduces to binomial when \(\rho \to 0\)

    • Critical: Use clamp_rho() to prevent numerical instability at boundaries

  3. Dispersion estimation

    • MLE: More efficient, used by WASP2

    • MoM: Faster, closed-form solution

    • Linear model: Allows depth-dependent dispersion

  4. QQ plots - Diagnostic tool for model calibration

    • Points on diagonal = well-calibrated

    • Inflation factor (\(\lambda\)) quantifies deviation

  5. FDR correction - Essential for multiple testing

    • Benjamini-Hochberg controls false discovery rate

    • WASP2 uses scipy.stats.false_discovery_control(pvals, method="bh")

WASP2 Implementation Reference#

| Tutorial Function | WASP2 Source |
|---|---|
| clamp_rho() | src/analysis/as_analysis.py:36-50 |
| mu_rho_to_alpha_beta() | src/analysis/as_analysis.py:104-105 |
| betabinom_loglik() | src/analysis/as_analysis.py:81-112 (opt_prob) |
| likelihood_ratio_test() | src/analysis/as_analysis.py:322-323 |
| estimate_dispersion_mle() | src/analysis/as_analysis.py:251-325 (single_model) |
| estimate_linear_dispersion() | src/analysis/as_analysis.py:53-78 (opt_linear) |
| FDR correction | src/analysis/as_analysis.py:492 |

[ ]:
print("Tutorial complete!")
print("\nYou now understand the statistical foundations of WASP2's")
print("allelic imbalance detection framework.")
print("\nKey takeaways:")
print(f"  - Always use RHO_EPSILON ({RHO_EPSILON}) to prevent numerical instability")
print("  - Beta-binomial handles overdispersion that binomial cannot")
print("  - QQ plots diagnose model calibration")
print("  - FDR correction is essential for genome-wide testing")