Analysis Module#
Overview#
The analysis module detects statistically significant allelic imbalance using beta-binomial models.
Purpose#
Detect allelic imbalance at genomic regions
Control for biological and technical variation
Support single-cell and bulk RNA-seq
Compare imbalance between groups/conditions
Statistical Models#
Beta-Binomial Model#
WASP2 uses beta-binomial distribution to model: * Overdispersion (variation beyond binomial) * Biological variability between regions * Technical noise in sequencing
The model: * Null hypothesis: Equal expression from both alleles (p=0.5) * Alternative: Allelic imbalance (p ≠ 0.5) * FDR correction for multiple testing
Dispersion Parameter#
Two models: 1. Single: One dispersion parameter for all regions 2. Linear: Dispersion varies with read depth
CLI Usage#
Basic Analysis#
wasp2-analyze find-imbalance counts.tsv
Options#
wasp2-analyze find-imbalance \
counts.tsv \
--min-count 10 \
--pseudocount 1 \
--model single \
--output results.tsv
Parameters#
--min-count#
Minimum total read count per region (default: 10):
--min-count 20 # More stringent
--pseudocount#
Pseudocount added to avoid zero counts (default: 1):
--pseudocount 0 # No pseudocount
--model#
Dispersion model (default: single):
--model linear # Depth-dependent dispersion
--phased#
Use phased genotype information:
--phased # Requires phased VCF
Output Format#
Tab-separated file with columns:
Statistical Columns#
region: Genomic region identifierref_count: Total reference allele countsalt_count: Total alternate allele countsp_value: Likelihood ratio test p-valuefdr_pval: FDR-corrected p-valueeffect_size: Log2 fold-change (ref/alt)
Model Parameters#
dispersion: Beta-binomial dispersion parameterlog_likelihood_null: Null model log-likelihoodlog_likelihood_alt: Alternative model log-likelihood
Interpreting Results#
Significant Imbalance#
FDR < 0.05 indicates significant imbalance:
Biological: cis-regulatory variation, ASE
Technical: mapping bias (check WASP), PCR artifacts
Effect Size#
log2FC > 1: Strong imbalance (2-fold difference)
log2FC > 2: Very strong imbalance (4-fold difference)
Single-Cell Analysis#
For single-cell data, WASP2 detects allelic imbalance within specific cell populations using aggregated counts across cells of the same type.
wasp2-analyze find-imbalance-sc \
adata.h5ad \
barcode_map.tsv \
--sample donor1 \
--min-count 10
Output: Per-celltype TSV files (ai_results_[CELLTYPE].tsv).
Single-Cell Comparative Imbalance#
Overview#
The comparative imbalance analysis detects differential allelic imbalance between cell types, conditions, or biological groups. This is useful for identifying:
Cell-type-specific regulatory variation
Sex differences in chromatin accessibility
Condition-dependent allelic effects (e.g., treatment vs control)
Developmental stage-specific imbalance
Statistical Model#
The analysis uses a likelihood ratio test (LRT) comparing two hypotheses:
Null (H0): Both groups share the same allelic imbalance (μ_combined)
Alternative (H1): Groups have different imbalance (μ₁ ≠ μ₂)
The test statistic follows a chi-squared distribution with 1 degree of freedom:
LRT = -2 × (log L_null - log L_alt)
p-value = P(χ²(df=1) > LRT)
Input Format: Count Matrix (.h5ad)#
The count matrix must be an AnnData object with the following structure:
AnnData object (n_obs × n_vars)
├── .obs # SNP metadata (rows)
│ ├── index # SNP identifiers (0, 1, 2, ...)
│ └── [sample_name] # Genotypes: '0|1', '1|0', '0/1', '1/0'
│
├── .var # Cell metadata (columns)
│ └── group # Cell type/group assignment
│
├── .layers
│ ├── "ref" # Reference allele counts (sparse matrix)
│ └── "alt" # Alternate allele counts (sparse matrix)
│
└── .uns
├── feature # DataFrame: SNP → region mapping
└── samples # List of sample names
Example count matrix creation:
# Generate counts from BAM + variants + barcodes
wasp2-count count-variants-sc \
aligned.bam \
variants.vcf.gz \
barcodes.txt \
--samples NA12878 \
--feature peaks.bed \
--out_file allele_counts.h5ad
Input Format: Barcode Map (TSV)#
A two-column TSV file (no header) mapping cell barcodes to groups:
AAACGAACAGTCAGTT-1 excitatory_neurons
AAACGAAGTCGCTCTA-1 inhibitory_neurons
AAACGAAGTGAACCTA-1 excitatory_neurons
AAAGGATCATCGATGT-1 astrocytes
AAAGGATGTGCAACGA-1 microglia
Requirements:
Tab-separated (
\t)No header row
Barcodes must match those in the count matrix
Groups can be cell types, conditions, sex, or any categorical variable
Basic Usage#
Compare imbalance between all groups:
wasp2-analyze compare-imbalance \
allele_counts.h5ad \
barcode_map.tsv
Compare specific groups only:
wasp2-analyze compare-imbalance \
allele_counts.h5ad \
barcode_map.tsv \
--groups "excitatory_neurons,inhibitory_neurons"
Output Format#
Results are written to ai_results_[GROUP1]_[GROUP2].tsv:
Column |
Description |
|---|---|
|
Genomic region identifier (peak or gene) |
|
Number of shared heterozygous SNPs in region |
|
Reference allele frequency under null hypothesis (shared) |
|
Reference allele frequency in group 1 |
|
Reference allele frequency in group 2 |
|
Log-likelihood under null (shared μ) |
|
Log-likelihood under alternative (separate μ values) |
|
Likelihood ratio test p-value |
|
FDR-corrected p-value (Benjamini-Hochberg) |
Interpreting results:
fdr_pval < 0.05: Significant differential imbalance|mu1 - mu2| > 0.1: Meaningful effect size (~20% difference)mu < 0.5: Alternate allele favored;mu > 0.5: Reference allele favored
Parameters#
--groupsComma-separated list of groups to compare. If omitted, compares all pairwise combinations found in the barcode map.
--minMinimum total allele count per region per group (default: 10). Higher values increase statistical power but reduce the number of testable regions.
--pseudocountPseudocount added to avoid zero counts (default: 1). Affects dispersion estimation.
--sampleSample name for heterozygous SNP filtering. Required if multiple samples are present in the count matrix.
--phasedUse phased genotype information from VCF. Requires genotypes in
0|1or1|0format. Improves power when haplotype phase is known.-z/--z_cutoffRemove SNPs with counts exceeding this z-score threshold. Useful for removing outliers caused by mapping artifacts or copy number variation.
Example: Sex Differences Analysis#
Identify chromatin accessibility regions with sex-biased allelic imbalance:
Step 1: Prepare barcode map with sex labels
# barcode_sex_map.tsv
AAACGAACAGTCAGTT-1 male
AAACGAAGTCGCTCTA-1 female
AAACGAAGTGAACCTA-1 male
AAAGGATCATCGATGT-1 female
Step 2: Run comparative analysis
wasp2-analyze compare-imbalance \
allele_counts.h5ad \
barcode_sex_map.tsv \
--groups "male,female" \
--min 20 \
--out_file ai_results_sex_comparison.tsv
Step 3: Filter significant results
# Extract regions with significant sex differences
awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_male_female.tsv > significant_sex_diff.tsv
# Find regions with large effect size
awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_sex_diff.tsv
Example: snATAC-seq Cell Type Analysis#
Complete workflow for analyzing cell-type-specific chromatin accessibility imbalance:
Step 1: Count alleles from snATAC-seq BAM
# Extract valid barcodes from Cell Ranger output
zcat filtered_barcodes.tsv.gz > barcodes.txt
# Count alleles at heterozygous SNPs overlapping peaks
wasp2-count count-variants-sc \
possorted_bam.bam \
phased_variants.vcf.gz \
barcodes.txt \
--samples sample1 \
--feature atac_peaks.bed \
--out_file snATAC_counts.h5ad
Step 2: Create barcode-to-celltype mapping
Export cell type annotations from your clustering analysis (e.g., Seurat, ArchR):
# R/Seurat example
write.table(
data.frame(barcode = Cells(seurat_obj),
celltype = Idents(seurat_obj)),
"barcode_celltype_map.tsv",
sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)
Step 3: Run single-cell imbalance analysis
# Per-celltype analysis
wasp2-analyze find-imbalance-sc \
snATAC_counts.h5ad \
barcode_celltype_map.tsv \
--sample sample1 \
--phased \
--min 10 \
-z 3.0
Step 4: Compare imbalance between cell types
# Compare specific cell types
wasp2-analyze compare-imbalance \
snATAC_counts.h5ad \
barcode_celltype_map.tsv \
--sample sample1 \
--groups "excitatory,inhibitory,astrocyte" \
--phased \
--min 15
# This produces:
# - ai_results_excitatory_inhibitory.tsv
# - ai_results_excitatory_astrocyte.tsv
# - ai_results_inhibitory_astrocyte.tsv
Step 5: Identify cell-type-specific regulatory regions
# Find peaks with differential imbalance between excitatory and inhibitory neurons
awk -F'\t' '$9 < 0.01 && ($4 > 0.6 || $4 < 0.4)' \
ai_results_excitatory_inhibitory.tsv > neuron_subtype_specific_peaks.tsv
Best Practices for Single-Cell Analysis#
Data Quality:
Use WASP-filtered BAM files to remove mapping bias
Require ≥10 total counts per region per group (
--min 10)Apply z-score filtering to remove outliers (
-z 3.0)
Statistical Power:
Merge similar cell types if individual populations have low coverage
Use phased genotypes when available (
--phased)Focus on regions with multiple SNPs for better estimates
Interpretation:
Consider biological replication across samples
Validate top hits with orthogonal methods (allele-specific CRISPR, etc.)
Integrate with eQTL data to identify causal variants
Example Workflow#
# 1. Count alleles
wasp2-count count-variants \
wasp_filtered.bam \
variants.vcf \
--region genes.gtf \
--samples NA12878 \
--output counts.tsv
# 2. Analyze imbalance
wasp2-analyze find-imbalance \
counts.tsv \
--min-count 20 \
--model single \
--output imbalance.tsv
# 3. Filter significant results
awk '$5 < 0.05' imbalance.tsv > significant.tsv
Best Practices#
Read Depth#
Minimum 10 reads per region (use
--min-count)Higher depth = more power
Consider downsampling very deep regions
Quality Control#
Use WASP-filtered reads
Remove low-complexity regions
Filter low-quality SNPs
Multiple Testing#
FDR correction is automatic
Consider Bonferroni for very important regions
Validate top hits experimentally
Common Issues#
No Significant Results#
Increase sample size
Check read depth (use deeper sequencing)
Verify heterozygous SNPs present
Many Significant Results#
Check for batch effects
Verify WASP filtering was applied
Consider stricter FDR threshold
Next Steps#
Validate results with qPCR or DNA-seq
Integrate with eQTL data
Perform pathway enrichment analysis