Beta-Binomial Model for Allelic Imbalance ========================================== This document describes the statistical framework WASP2 uses to detect allelic imbalance from allele count data. .. contents:: Contents :local: :depth: 2 Motivation ---------- Why Not Use the Binomial? ^^^^^^^^^^^^^^^^^^^^^^^^^ The simplest model for allele counts is the binomial distribution. If reads are sampled independently from two alleles with equal probability: .. math:: X \sim \text{Binomial}(N, 0.5) where :math:`X` is the reference count and :math:`N` is the total count. However, real allele count data exhibits **overdispersion**: the variance exceeds the binomial expectation. Sources of overdispersion include: - **Biological variation**: True allelic imbalance varies across cells - **Technical noise**: PCR amplification introduces correlated errors - **Aggregation effects**: Combining counts across SNPs within a region - **Sampling from a population**: Different individuals have different AI The Beta-Binomial Model ----------------------- The beta-binomial distribution naturally accommodates overdispersion by modeling the success probability as a random variable: .. math:: p &\sim \text{Beta}(\alpha, \beta) \\ X | p &\sim \text{Binomial}(N, p) Marginalizing over :math:`p` gives the beta-binomial: .. math:: P(X = k) = \binom{N}{k} \frac{B(k + \alpha, N - k + \beta)}{B(\alpha, \beta)} where :math:`B(\cdot, \cdot)` is the beta function. Parameterization ^^^^^^^^^^^^^^^^ WASP2 uses the **mean-dispersion parameterization**: .. math:: \mu &= \frac{\alpha}{\alpha + \beta} \\ \rho &= \frac{1}{\alpha + \beta + 1} The dispersion parameter :math:`\rho \in (0, 1)` controls overdispersion: - :math:`\rho \to 0`: Approaches binomial (no overdispersion) - :math:`\rho \to 1`: Maximum overdispersion (all probability at 0 or N) The inverse transformation: .. math:: \alpha &= \mu \cdot \frac{1 - \rho}{\rho} \\ \beta &= (1 - \mu) \cdot \frac{1 - \rho}{\rho} Variance Structure ^^^^^^^^^^^^^^^^^^ The beta-binomial variance is: .. math:: \text{Var}(X) = N\mu(1-\mu) \left[ 1 + (N-1)\rho \right] The factor :math:`[1 + (N-1)\rho]` is the **variance inflation factor**. For typical values (:math:`\rho \approx 0.01`, :math:`N \approx 100`), this gives roughly 2x the binomial variance. Hypothesis Testing ------------------ WASP2 tests for allelic imbalance using a likelihood ratio test: **Null Hypothesis** :math:`H_0`: No imbalance (:math:`\mu = 0.5`) **Alternative Hypothesis** :math:`H_1`: Imbalance present (:math:`\mu \neq 0.5`) Likelihood Functions ^^^^^^^^^^^^^^^^^^^^ Under the null hypothesis: .. math:: \mathcal{L}_0 = \prod_{i=1}^{n} P(X_i | N_i, \mu=0.5, \rho) Under the alternative: .. math:: \mathcal{L}_1 = \prod_{i=1}^{n} P(X_i | N_i, \hat{\mu}_{\text{MLE}}, \rho) where the product is over SNPs within a region (peak, gene). Likelihood Ratio Test ^^^^^^^^^^^^^^^^^^^^^ The test statistic: .. math:: \Lambda = -2 \left[ \log \mathcal{L}_0 - \log \mathcal{L}_1 \right] Under the null hypothesis, :math:`\Lambda` follows a chi-squared distribution with 1 degree of freedom: .. math:: \Lambda \sim \chi^2_1 The p-value is: .. math:: p = P(\chi^2_1 > \Lambda) MLE Estimation ^^^^^^^^^^^^^^ The MLE for :math:`\mu` under the alternative is found by numerical optimization: .. code-block:: python from scipy.optimize import minimize_scalar from scipy.stats import betabinom def neg_log_likelihood(mu, ref_counts, n_counts, rho): alpha = mu * (1 - rho) / rho beta = (1 - mu) * (1 - rho) / rho return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) result = minimize_scalar( neg_log_likelihood, args=(ref_counts, n_counts, rho), method='bounded', bounds=(0, 1) ) mu_mle = result.x Implementation in WASP2 ----------------------- Single Dispersion Model ^^^^^^^^^^^^^^^^^^^^^^^ The default model estimates a single :math:`\rho` for all data: .. code-block:: python from analysis.as_analysis import single_model results = single_model(df, region_col="region", phased=False) This assumes homogeneous overdispersion across regions, which is often reasonable for moderately-sized datasets. Linear Dispersion Model ^^^^^^^^^^^^^^^^^^^^^^^ For large datasets, WASP2 offers a linear model where dispersion varies with total count: .. math:: \text{logit}(\rho) = \beta_0 + \beta_1 \cdot N This captures the observation that regions with higher coverage often show different dispersion characteristics: .. code-block:: python from analysis.as_analysis import linear_model results = linear_model(df, region_col="region", phased=False) Phased vs Unphased Analysis --------------------------- WASP2 supports both phased and unphased genotype data. Unphased Model ^^^^^^^^^^^^^^ When genotype phase is unknown, WASP2 uses a mixture model that marginalizes over possible phase configurations: For a region with multiple SNPs, if we don't know which haplotype each ref allele belongs to, we sum over phase assignments using dynamic programming: .. math:: P(\mathbf{X}) = \sum_{\phi \in \text{phases}} P(\mathbf{X} | \phi) \cdot P(\phi) where :math:`\phi` indexes phase configurations with prior :math:`P(\phi) = 0.5^{n-1}`. Phased Model ^^^^^^^^^^^^ With phased genotypes (e.g., from read-backed phasing), the model is simpler: .. math:: P(X_i | \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho) where :math:`\mu_{\text{hap}_i}` is :math:`\mu` or :math:`1-\mu` depending on which haplotype the reference allele belongs to. Output Interpretation --------------------- WASP2 returns the following statistics for each region: .. table:: Beta-Binomial Analysis Output :widths: 20 15 65 ============ ======== ====================================================== Column Type Interpretation ============ ======== ====================================================== null_ll float Log-likelihood under null (μ=0.5) alt_ll float Log-likelihood under alternative (μ=MLE) mu float MLE of imbalance proportion lrt float Likelihood ratio test statistic pval float p-value from χ² distribution fdr_pval float FDR-corrected p-value (BH method) ============ ======== ====================================================== **Interpreting μ (mu)**: - :math:`\mu = 0.5`: No imbalance (equal allele expression) - :math:`\mu > 0.5`: Reference allele preference - :math:`\mu < 0.5`: Alternate allele preference - :math:`|\mu - 0.5|`: Effect size (magnitude of imbalance) Practical Considerations ------------------------ Pseudocounts ^^^^^^^^^^^^ WASP2 adds pseudocounts to avoid log(0) issues: .. math:: X' = X + c, \quad N' = N + 2c Default :math:`c = 1` (Laplace smoothing). This slightly shrinks estimates toward 0.5, providing conservative inference. Minimum Count Threshold ^^^^^^^^^^^^^^^^^^^^^^^ Regions with low total counts have poor statistical power. WASP2 filters regions with :math:`N < \text{min\_count}` (default: 10). Power depends on coverage and effect size: .. table:: Approximate Power (α=0.05) :widths: 20 20 20 20 20 ========= ======== ======== ======== ======== Total N μ=0.55 μ=0.60 μ=0.65 μ=0.70 ========= ======== ======== ======== ======== 20 5% 10% 20% 35% 50 8% 25% 50% 75% 100 15% 45% 80% 95% 200 30% 75% 95% 99% ========= ======== ======== ======== ======== Region Aggregation ^^^^^^^^^^^^^^^^^^ Analyzing at the region level (genes, peaks) rather than individual SNPs: - **Increases power**: More counts per test - **Reduces multiple testing burden**: Fewer tests to correct - **Captures regulatory effects**: ASE often affects entire genes See Also -------- - :doc:`dispersion_estimation` - Estimating the dispersion parameter - :doc:`fdr_correction` - Multiple testing correction methods References ---------- .. [Mayba2014] Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: allele-specific expression detection in cancer tissues and cell lines. *Genome Biology* 15:405.