Beta-Binomial Model for Allelic Imbalance#

This document describes the statistical framework WASP2 uses to detect allelic imbalance from allele count data.

Motivation #

Why Not Use the Binomial?#

The simplest model for allele counts is the binomial distribution. If reads are sampled independently from two alleles with equal probability:

\[X \sim \text{Binomial}(N, 0.5)\]

where \(X\) is the reference count and \(N\) is the total count.

However, real allele count data exhibits overdispersion: the variance exceeds the binomial expectation. Sources of overdispersion include:

Biological variation: True allelic imbalance varies across cells
Technical noise: PCR amplification introduces correlated errors
Aggregation effects: Combining counts across SNPs within a region
Sampling from a population: Different individuals have different AI

The Beta-Binomial Model #

The beta-binomial distribution naturally accommodates overdispersion by modeling the success probability as a random variable:

\[\begin{split}p &\sim \text{Beta}(\alpha, \beta) \\ X | p &\sim \text{Binomial}(N, p)\end{split}\]

Marginalizing over \(p\) gives the beta-binomial:

\[P(X = k) = \binom{N}{k} \frac{B(k + \alpha, N - k + \beta)}{B(\alpha, \beta)}\]

where \(B(\cdot, \cdot)\) is the beta function.

Parameterization #

WASP2 uses the mean-dispersion parameterization:

\[\begin{split}\mu &= \frac{\alpha}{\alpha + \beta} \\ \rho &= \frac{1}{\alpha + \beta + 1}\end{split}\]

The dispersion parameter \(\rho \in (0, 1)\) controls overdispersion:

\(\rho \to 0\): Approaches binomial (no overdispersion)
\(\rho \to 1\): Maximum overdispersion (all probability at 0 or N)

The inverse transformation:

\[\begin{split}\alpha &= \mu \cdot \frac{1 - \rho}{\rho} \\ \beta &= (1 - \mu) \cdot \frac{1 - \rho}{\rho}\end{split}\]

Variance Structure #

The beta-binomial variance is:

\[\text{Var}(X) = N\mu(1-\mu) \left[ 1 + (N-1)\rho \right]\]

The factor \([1 + (N-1)\rho]\) is the variance inflation factor. For typical values (\(\rho \approx 0.01\), \(N \approx 100\)), this gives roughly 2x the binomial variance.

Hypothesis Testing #

WASP2 tests for allelic imbalance using a likelihood ratio test:

Null Hypothesis \(H_0\): No imbalance (\(\mu = 0.5\))

Alternative Hypothesis \(H_1\): Imbalance present (\(\mu \neq 0.5\))

Likelihood Functions #

Under the null hypothesis:

\[\mathcal{L}_0 = \prod_{i=1}^{n} P(X_i | N_i, \mu=0.5, \rho)\]

Under the alternative:

\[\mathcal{L}_1 = \prod_{i=1}^{n} P(X_i | N_i, \hat{\mu}_{\text{MLE}}, \rho)\]

where the product is over SNPs within a region (peak, gene).

Likelihood Ratio Test #

The test statistic:

\[\Lambda = -2 \left[ \log \mathcal{L}_0 - \log \mathcal{L}_1 \right]\]

Under the null hypothesis, \(\Lambda\) follows a chi-squared distribution with 1 degree of freedom:

\[\Lambda \sim \chi^2_1\]

The p-value is:

\[p = P(\chi^2_1 > \Lambda)\]

MLE Estimation #

The MLE for \(\mu\) under the alternative is found by numerical optimization:

from scipy.optimize import minimize_scalar
from scipy.stats import betabinom

def neg_log_likelihood(mu, ref_counts, n_counts, rho):
    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta))

result = minimize_scalar(
    neg_log_likelihood,
    args=(ref_counts, n_counts, rho),
    method='bounded',
    bounds=(0, 1)
)
mu_mle = result.x

Implementation in WASP2 #

Single Dispersion Model #

The default model estimates a single \(\rho\) for all data:

from analysis.as_analysis import single_model

results = single_model(df, region_col="region", phased=False)

This assumes homogeneous overdispersion across regions, which is often reasonable for moderately-sized datasets.

Linear Dispersion Model #

For large datasets, WASP2 offers a linear model where dispersion varies with total count:

\[\text{logit}(\rho) = \beta_0 + \beta_1 \cdot N\]

This captures the observation that regions with higher coverage often show different dispersion characteristics:

from analysis.as_analysis import linear_model

results = linear_model(df, region_col="region", phased=False)

Phased vs Unphased Analysis #

WASP2 supports both phased and unphased genotype data.

Unphased Model #

When genotype phase is unknown, WASP2 uses a mixture model that marginalizes over possible phase configurations:

For a region with multiple SNPs, if we don’t know which haplotype each ref allele belongs to, we sum over phase assignments using dynamic programming:

\[P(\mathbf{X}) = \sum_{\phi \in \text{phases}} P(\mathbf{X} | \phi) \cdot P(\phi)\]

where \(\phi\) indexes phase configurations with prior \(P(\phi) = 0.5^{n-1}\).

Phased Model #

With phased genotypes (e.g., from read-backed phasing), the model is simpler:

\[P(X_i | \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho)\]

where \(\mu_{\text{hap}_i}\) is \(\mu\) or \(1-\mu\) depending on which haplotype the reference allele belongs to.

Output Interpretation #

WASP2 returns the following statistics for each region:

Beta-Binomial Analysis Output#
Column	Type	Interpretation
null_ll	float	Log-likelihood under null (μ=0.5)
alt_ll	float	Log-likelihood under alternative (μ=MLE)
mu	float	MLE of imbalance proportion
lrt	float	Likelihood ratio test statistic
pval	float	p-value from χ² distribution
fdr_pval	float	FDR-corrected p-value (BH method)

Interpreting μ (mu):

\(\mu = 0.5\): No imbalance (equal allele expression)
\(\mu > 0.5\): Reference allele preference
\(\mu < 0.5\): Alternate allele preference
\(|\mu - 0.5|\): Effect size (magnitude of imbalance)

Practical Considerations #

Pseudocounts #

WASP2 adds pseudocounts to avoid log(0) issues:

\[X' = X + c, \quad N' = N + 2c\]

Default \(c = 1\) (Laplace smoothing). This slightly shrinks estimates toward 0.5, providing conservative inference.

Minimum Count Threshold #

Regions with low total counts have poor statistical power. WASP2 filters regions with \(N < \text{min\_count}\) (default: 10).

Power depends on coverage and effect size:

Approximate Power (α=0.05)#
Total N	μ=0.55	μ=0.60	μ=0.65	μ=0.70
20	5%	10%	20%	35%
50	8%	25%	50%	75%
100	15%	45%	80%	95%
200	30%	75%	95%	99%

Region Aggregation #

Analyzing at the region level (genes, peaks) rather than individual SNPs:

Increases power: More counts per test
Reduces multiple testing burden: Fewer tests to correct
Captures regulatory effects: ASE often affects entire genes

References #

[Mayba2014]

Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: allele-specific expression detection in cancer tissues and cell lines. Genome Biology 15:405.