Beta-Binomial Model for Allelic Imbalance#

This document describes the statistical framework WASP2 uses to detect allelic imbalance from allele count data.

Motivation#

Why Not Use the Binomial?#

The simplest model for allele counts is the binomial distribution. If reads are sampled independently from two alleles with equal probability:

\[X \sim \text{Binomial}(N, 0.5)\]

where \(X\) is the reference count and \(N\) is the total count.

However, real allele count data exhibits overdispersion: the variance exceeds the binomial expectation. Sources of overdispersion include:

  • Biological variation: True allelic imbalance varies across cells

  • Technical noise: PCR amplification introduces correlated errors

  • Aggregation effects: Combining counts across SNPs within a region

  • Sampling from a population: Different individuals have different AI

The Beta-Binomial Model#

The beta-binomial distribution naturally accommodates overdispersion by modeling the success probability as a random variable:

\[\begin{split}p &\sim \text{Beta}(\alpha, \beta) \\ X | p &\sim \text{Binomial}(N, p)\end{split}\]

Marginalizing over \(p\) gives the beta-binomial:

\[P(X = k) = \binom{N}{k} \frac{B(k + \alpha, N - k + \beta)}{B(\alpha, \beta)}\]

where \(B(\cdot, \cdot)\) is the beta function.

Parameterization#

WASP2 uses the mean-dispersion parameterization:

\[\begin{split}\mu &= \frac{\alpha}{\alpha + \beta} \\ \rho &= \frac{1}{\alpha + \beta + 1}\end{split}\]

The dispersion parameter \(\rho \in (0, 1)\) controls overdispersion:

  • \(\rho \to 0\): Approaches binomial (no overdispersion)

  • \(\rho \to 1\): Maximum overdispersion (all probability at 0 or N)

The inverse transformation:

\[\begin{split}\alpha &= \mu \cdot \frac{1 - \rho}{\rho} \\ \beta &= (1 - \mu) \cdot \frac{1 - \rho}{\rho}\end{split}\]

Variance Structure#

The beta-binomial variance is:

\[\text{Var}(X) = N\mu(1-\mu) \left[ 1 + (N-1)\rho \right]\]

The factor \([1 + (N-1)\rho]\) is the variance inflation factor. For typical values (\(\rho \approx 0.01\), \(N \approx 100\)), this gives roughly 2x the binomial variance.

Hypothesis Testing#

WASP2 tests for allelic imbalance using a likelihood ratio test:

Null Hypothesis \(H_0\): No imbalance (\(\mu = 0.5\))

Alternative Hypothesis \(H_1\): Imbalance present (\(\mu \neq 0.5\))

Likelihood Functions#

Under the null hypothesis:

\[\mathcal{L}_0 = \prod_{i=1}^{n} P(X_i | N_i, \mu=0.5, \rho)\]

Under the alternative:

\[\mathcal{L}_1 = \prod_{i=1}^{n} P(X_i | N_i, \hat{\mu}_{\text{MLE}}, \rho)\]

where the product is over SNPs within a region (peak, gene).

Likelihood Ratio Test#

The test statistic:

\[\Lambda = -2 \left[ \log \mathcal{L}_0 - \log \mathcal{L}_1 \right]\]

Under the null hypothesis, \(\Lambda\) follows a chi-squared distribution with 1 degree of freedom:

\[\Lambda \sim \chi^2_1\]

The p-value is:

\[p = P(\chi^2_1 > \Lambda)\]

MLE Estimation#

The MLE for \(\mu\) under the alternative is found by numerical optimization:

from scipy.optimize import minimize_scalar
from scipy.stats import betabinom

def neg_log_likelihood(mu, ref_counts, n_counts, rho):
    alpha = mu * (1 - rho) / rho
    beta = (1 - mu) * (1 - rho) / rho
    return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta))

result = minimize_scalar(
    neg_log_likelihood,
    args=(ref_counts, n_counts, rho),
    method='bounded',
    bounds=(0, 1)
)
mu_mle = result.x

Implementation in WASP2#

Single Dispersion Model#

The default model estimates a single \(\rho\) for all data:

from analysis.as_analysis import single_model

results = single_model(df, region_col="region", phased=False)

This assumes homogeneous overdispersion across regions, which is often reasonable for moderately-sized datasets.

Linear Dispersion Model#

For large datasets, WASP2 offers a linear model where dispersion varies with total count:

\[\text{logit}(\rho) = \beta_0 + \beta_1 \cdot N\]

This captures the observation that regions with higher coverage often show different dispersion characteristics:

from analysis.as_analysis import linear_model

results = linear_model(df, region_col="region", phased=False)

Phased vs Unphased Analysis#

WASP2 supports both phased and unphased genotype data.

Unphased Model#

When genotype phase is unknown, WASP2 uses a mixture model that marginalizes over possible phase configurations:

For a region with multiple SNPs, if we don’t know which haplotype each ref allele belongs to, we sum over phase assignments using dynamic programming:

\[P(\mathbf{X}) = \sum_{\phi \in \text{phases}} P(\mathbf{X} | \phi) \cdot P(\phi)\]

where \(\phi\) indexes phase configurations with prior \(P(\phi) = 0.5^{n-1}\).

Phased Model#

With phased genotypes (e.g., from read-backed phasing), the model is simpler:

\[P(X_i | \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho)\]

where \(\mu_{\text{hap}_i}\) is \(\mu\) or \(1-\mu\) depending on which haplotype the reference allele belongs to.

Output Interpretation#

WASP2 returns the following statistics for each region:

Beta-Binomial Analysis Output#

Column

Type

Interpretation

null_ll

float

Log-likelihood under null (μ=0.5)

alt_ll

float

Log-likelihood under alternative (μ=MLE)

mu

float

MLE of imbalance proportion

lrt

float

Likelihood ratio test statistic

pval

float

p-value from χ² distribution

fdr_pval

float

FDR-corrected p-value (BH method)

Interpreting μ (mu):

  • \(\mu = 0.5\): No imbalance (equal allele expression)

  • \(\mu > 0.5\): Reference allele preference

  • \(\mu < 0.5\): Alternate allele preference

  • \(|\mu - 0.5|\): Effect size (magnitude of imbalance)

Practical Considerations#

Pseudocounts#

WASP2 adds pseudocounts to avoid log(0) issues:

\[X' = X + c, \quad N' = N + 2c\]

Default \(c = 1\) (Laplace smoothing). This slightly shrinks estimates toward 0.5, providing conservative inference.

Minimum Count Threshold#

Regions with low total counts have poor statistical power. WASP2 filters regions with \(N < \text{min\_count}\) (default: 10).

Power depends on coverage and effect size:

Approximate Power (α=0.05)#

Total N

μ=0.55

μ=0.60

μ=0.65

μ=0.70

20

5%

10%

20%

35%

50

8%

25%

50%

75%

100

15%

45%

80%

95%

200

30%

75%

95%

99%

Region Aggregation#

Analyzing at the region level (genes, peaks) rather than individual SNPs:

  • Increases power: More counts per test

  • Reduces multiple testing burden: Fewer tests to correct

  • Captures regulatory effects: ASE often affects entire genes

See Also#

References#

[Mayba2014]

Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: allele-specific expression detection in cancer tissues and cell lines. Genome Biology 15:405.