Dispersion Parameter Estimation#

This document describes how WASP2 estimates the overdispersion parameter \(\rho\) in the beta-binomial model.

The Dispersion Parameter#

The dispersion parameter \(\rho\) quantifies the excess variance beyond the binomial expectation. In the beta-binomial model:

\[\text{Var}(X) = N\mu(1-\mu)[1 + (N-1)\rho]\]

where:

  • \(\rho = 0\): Binomial variance (no overdispersion)

  • \(\rho > 0\): Variance inflation due to correlated sampling

Estimation Methods#

WASP2 supports two approaches for dispersion estimation:

  1. Maximum Likelihood Estimation (MLE): Optimize the likelihood function

  2. Method of Moments (MoM): Solve equations based on sample moments

Both methods have trade-offs in terms of efficiency, bias, and computational cost.

Maximum Likelihood Estimation#

MLE finds the value of \(\rho\) that maximizes the likelihood of the observed data.

Single Dispersion Model#

WASP2’s default approach estimates a single \(\rho\) across all observations:

\[\hat{\rho}_{\text{MLE}} = \arg\max_{\rho} \sum_{i=1}^{n} \log P(X_i | N_i, \mu=0.5, \rho)\]

Under the null hypothesis of no imbalance (\(\mu = 0.5\)), the log-likelihood is:

\[\ell(\rho) = \sum_{i=1}^{n} \log \text{BetaBinom}(X_i; N_i, \alpha(\rho), \beta(\rho))\]

where \(\alpha = \beta = 0.5(1-\rho)/\rho\).

Implementation:

from scipy.optimize import minimize_scalar
from scipy.stats import betabinom
import numpy as np

def neg_log_likelihood(rho, ref_counts, n_counts):
    alpha = 0.5 * (1 - rho) / rho
    beta = 0.5 * (1 - rho) / rho
    return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta))

result = minimize_scalar(
    neg_log_likelihood,
    args=(ref_array, n_array),
    method='bounded',
    bounds=(1e-6, 1 - 1e-6)
)
rho_mle = result.x

Properties of MLE:

  • Asymptotically unbiased as \(n \to \infty\)

  • Achieves the Cramér-Rao lower bound (efficient)

  • Computationally requires numerical optimization

  • May be sensitive to outliers

Linear Dispersion Model#

For large datasets with variable coverage, WASP2 offers a model where dispersion varies linearly with total count on the logit scale:

\[\text{logit}(\rho_i) = \beta_0 + \beta_1 \cdot N_i\]

The logit link ensures \(\rho_i \in (0, 1)\):

\[\rho_i = \frac{\exp(\beta_0 + \beta_1 N_i)}{1 + \exp(\beta_0 + \beta_1 N_i)}\]

Motivation:

Empirically, regions with different coverage levels may exhibit different dispersion characteristics:

  • Low coverage: Greater sampling noise, potentially higher apparent dispersion

  • High coverage: More stable estimates, may reveal true biological variance

Implementation:

from scipy.optimize import minimize
from scipy.special import expit

def neg_ll_linear(params, ref_counts, n_counts):
    beta0, beta1 = params
    logit_rho = beta0 + beta1 * n_counts
    # Clip to avoid numerical issues
    logit_rho = np.clip(logit_rho, -10, 10)
    rho = expit(logit_rho)
    alpha = 0.5 * (1 - rho) / rho
    beta = 0.5 * (1 - rho) / rho
    return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta))

result = minimize(
    neg_ll_linear,
    x0=(0, 0),
    method='Nelder-Mead',
    args=(ref_array, n_array)
)
beta0, beta1 = result.x

Method of Moments#

MoM estimates \(\rho\) by equating theoretical and sample moments.

Variance-Based Estimator#

For a beta-binomial with \(\mu = 0.5\), the variance is:

\[\text{Var}(X) = \frac{N}{4}[1 + (N-1)\rho]\]

Solving for \(\rho\):

\[\hat{\rho}_{\text{MoM}} = \frac{4S^2/N - 1}{N - 1}\]

where \(S^2\) is the sample variance of \(X/N\) (the allelic ratio).

Pooled Estimator:

For observations with varying \(N\):

\[\hat{\rho}_{\text{MoM}} = \frac{\sum_i (X_i - N_i/2)^2 - \sum_i N_i/4}{\sum_i N_i(N_i-1)/4}\]

Properties of MoM:

  • Closed-form solution (fast computation)

  • May produce negative estimates (truncate to 0)

  • Less efficient than MLE, especially for small samples

  • More robust to model misspecification

Comparison: MLE vs MoM#

MLE vs MoM for Dispersion Estimation#

Property

MLE

MoM

Computation

Iterative optimization

Closed-form

Efficiency

Optimal (achieves CRLB)

Suboptimal

Bias

Asymptotically unbiased

May be biased for small n

Robustness

Sensitive to outliers

More robust

Boundary behavior

Always in (0,1)

May give ρ < 0

WASP2 default

Yes

No

WASP2 uses MLE because:

  1. The beta-binomial likelihood is well-behaved

  2. Modern optimization is fast enough for typical datasets

  3. MLE provides consistent estimates across sample sizes

Practical Considerations#

Convergence Issues#

The MLE optimizer may fail to converge if:

  • \(\rho\) is very close to 0 (nearly binomial data)

  • \(\rho\) is very close to 1 (extreme overdispersion)

  • The data contains extreme outliers

WASP2 handles these by:

  • Bounding \(\rho\) away from 0 and 1

  • Clipping logit values to avoid overflow

  • Using robust optimization methods (bounded, Nelder-Mead)

Sample Size Requirements#

MLE performance depends on sample size:

Dispersion Estimate Quality by Sample Size#

n (regions)

CV of estimate

Recommendation

< 50

> 50%

Use pooled estimate or prior

50-200

20-50%

MLE reasonable but uncertain

200-1000

10-20%

MLE reliable

> 1000

< 10%

MLE highly accurate

Model Selection#

Choosing between single and linear dispersion models:

Use Single Dispersion When:

  • Dataset is small (< 1000 regions)

  • Coverage is relatively uniform

  • Quick analysis is needed

Use Linear Dispersion When:

  • Large dataset (> 10,000 regions)

  • Wide range of coverage values

  • Systematic coverage-dispersion relationship suspected

Model comparison can be done via AIC/BIC:

\[\text{AIC} = 2k - 2\ell(\hat{\theta})\]

where \(k\) is the number of parameters (1 for single, 2 for linear).

See Also#

References#

[Robinson2010]

Robinson MD, Smyth GK (2010). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9:321-332.

[Yu2013]

Yu D, Huber W, Vitek O (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29:1275-1282.