Beta-Binomial Model for Allelic Imbalance#
This document describes the statistical framework WASP2 uses to detect allelic imbalance from allele count data.
Motivation#
Why Not Use the Binomial?#
The simplest model for allele counts is the binomial distribution. If reads are sampled independently from two alleles with equal probability:
where \(X\) is the reference count and \(N\) is the total count.
However, real allele count data exhibits overdispersion: the variance exceeds the binomial expectation. Sources of overdispersion include:
Biological variation: True allelic imbalance varies across cells
Technical noise: PCR amplification introduces correlated errors
Aggregation effects: Combining counts across SNPs within a region
Sampling from a population: Different individuals have different AI
The Beta-Binomial Model#
The beta-binomial distribution naturally accommodates overdispersion by modeling the success probability as a random variable:
Marginalizing over \(p\) gives the beta-binomial:
where \(B(\cdot, \cdot)\) is the beta function.
Parameterization#
WASP2 uses the mean-dispersion parameterization:
The dispersion parameter \(\rho \in (0, 1)\) controls overdispersion:
\(\rho \to 0\): Approaches binomial (no overdispersion)
\(\rho \to 1\): Maximum overdispersion (all probability at 0 or N)
The inverse transformation:
Variance Structure#
The beta-binomial variance is:
The factor \([1 + (N-1)\rho]\) is the variance inflation factor. For typical values (\(\rho \approx 0.01\), \(N \approx 100\)), this gives roughly 2x the binomial variance.
Hypothesis Testing#
WASP2 tests for allelic imbalance using a likelihood ratio test:
Null Hypothesis \(H_0\): No imbalance (\(\mu = 0.5\))
Alternative Hypothesis \(H_1\): Imbalance present (\(\mu \neq 0.5\))
Likelihood Functions#
Under the null hypothesis:
Under the alternative:
where the product is over SNPs within a region (peak, gene).
Likelihood Ratio Test#
The test statistic:
Under the null hypothesis, \(\Lambda\) follows a chi-squared distribution with 1 degree of freedom:
The p-value is:
MLE Estimation#
The MLE for \(\mu\) under the alternative is found by numerical optimization:
from scipy.optimize import minimize_scalar
from scipy.stats import betabinom
def neg_log_likelihood(mu, ref_counts, n_counts, rho):
alpha = mu * (1 - rho) / rho
beta = (1 - mu) * (1 - rho) / rho
return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta))
result = minimize_scalar(
neg_log_likelihood,
args=(ref_counts, n_counts, rho),
method='bounded',
bounds=(0, 1)
)
mu_mle = result.x
Implementation in WASP2#
Single Dispersion Model#
The default model estimates a single \(\rho\) for all data:
from analysis.as_analysis import single_model
results = single_model(df, region_col="region", phased=False)
This assumes homogeneous overdispersion across regions, which is often reasonable for moderately-sized datasets.
Linear Dispersion Model#
For large datasets, WASP2 offers a linear model where dispersion varies with total count:
This captures the observation that regions with higher coverage often show different dispersion characteristics:
from analysis.as_analysis import linear_model
results = linear_model(df, region_col="region", phased=False)
Phased vs Unphased Analysis#
WASP2 supports both phased and unphased genotype data.
Unphased Model#
When genotype phase is unknown, WASP2 uses a mixture model that marginalizes over possible phase configurations:
For a region with multiple SNPs, if we don’t know which haplotype each ref allele belongs to, we sum over phase assignments using dynamic programming:
where \(\phi\) indexes phase configurations with prior \(P(\phi) = 0.5^{n-1}\).
Phased Model#
With phased genotypes (e.g., from read-backed phasing), the model is simpler:
where \(\mu_{\text{hap}_i}\) is \(\mu\) or \(1-\mu\) depending on which haplotype the reference allele belongs to.
Output Interpretation#
WASP2 returns the following statistics for each region:
Column |
Type |
Interpretation |
|---|---|---|
null_ll |
float |
Log-likelihood under null (μ=0.5) |
alt_ll |
float |
Log-likelihood under alternative (μ=MLE) |
mu |
float |
MLE of imbalance proportion |
lrt |
float |
Likelihood ratio test statistic |
pval |
float |
p-value from χ² distribution |
fdr_pval |
float |
FDR-corrected p-value (BH method) |
Interpreting μ (mu):
\(\mu = 0.5\): No imbalance (equal allele expression)
\(\mu > 0.5\): Reference allele preference
\(\mu < 0.5\): Alternate allele preference
\(|\mu - 0.5|\): Effect size (magnitude of imbalance)
Practical Considerations#
Pseudocounts#
WASP2 adds pseudocounts to avoid log(0) issues:
Default \(c = 1\) (Laplace smoothing). This slightly shrinks estimates toward 0.5, providing conservative inference.
Minimum Count Threshold#
Regions with low total counts have poor statistical power. WASP2 filters regions with \(N < \text{min\_count}\) (default: 10).
Power depends on coverage and effect size:
Total N |
μ=0.55 |
μ=0.60 |
μ=0.65 |
μ=0.70 |
|---|---|---|---|---|
20 |
5% |
10% |
20% |
35% |
50 |
8% |
25% |
50% |
75% |
100 |
15% |
45% |
80% |
95% |
200 |
30% |
75% |
95% |
99% |
Region Aggregation#
Analyzing at the region level (genes, peaks) rather than individual SNPs:
Increases power: More counts per test
Reduces multiple testing burden: Fewer tests to correct
Captures regulatory effects: ASE often affects entire genes
See Also#
Dispersion Parameter Estimation - Estimating the dispersion parameter
False Discovery Rate Correction - Multiple testing correction methods
References#
Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: allele-specific expression detection in cancer tissues and cell lines. Genome Biology 15:405.