Analysis Module#

Overview#

The analysis module detects statistically significant allelic imbalance using beta-binomial models.

Purpose#

  • Detect allelic imbalance at genomic regions

  • Control for biological and technical variation

  • Support single-cell and bulk RNA-seq

  • Compare imbalance between groups/conditions

Statistical Models#

Beta-Binomial Model#

WASP2 uses beta-binomial distribution to model: * Overdispersion (variation beyond binomial) * Biological variability between regions * Technical noise in sequencing

The model: * Null hypothesis: Equal expression from both alleles (p=0.5) * Alternative: Allelic imbalance (p ≠ 0.5) * FDR correction for multiple testing

Dispersion Parameter#

Two models: 1. Single: One dispersion parameter for all regions 2. Linear: Dispersion varies with read depth

CLI Usage#

Basic Analysis#

wasp2-analyze find-imbalance counts.tsv

Options#

wasp2-analyze find-imbalance \
  counts.tsv \
  --min-count 10 \
  --pseudocount 1 \
  --model single \
  --output results.tsv

Parameters#

--min-count#

Minimum total read count per region (default: 10):

--min-count 20  # More stringent

--pseudocount#

Pseudocount added to avoid zero counts (default: 1):

--pseudocount 0  # No pseudocount

--model#

Dispersion model (default: single):

--model linear  # Depth-dependent dispersion

--phased#

Use phased genotype information:

--phased  # Requires phased VCF

Output Format#

Tab-separated file with columns:

Statistical Columns#

  • region: Genomic region identifier

  • ref_count: Total reference allele counts

  • alt_count: Total alternate allele counts

  • p_value: Likelihood ratio test p-value

  • fdr_pval: FDR-corrected p-value

  • effect_size: Log2 fold-change (ref/alt)

Model Parameters#

  • dispersion: Beta-binomial dispersion parameter

  • log_likelihood_null: Null model log-likelihood

  • log_likelihood_alt: Alternative model log-likelihood

Interpreting Results#

Significant Imbalance#

FDR < 0.05 indicates significant imbalance:

  • Biological: cis-regulatory variation, ASE

  • Technical: mapping bias (check WASP), PCR artifacts

Effect Size#

  • log2FC > 1: Strong imbalance (2-fold difference)

  • log2FC > 2: Very strong imbalance (4-fold difference)

Single-Cell Analysis#

For single-cell data, WASP2 detects allelic imbalance within specific cell populations using aggregated counts across cells of the same type.

wasp2-analyze find-imbalance-sc \
  adata.h5ad \
  barcode_map.tsv \
  --sample donor1 \
  --min-count 10

Output: Per-celltype TSV files (ai_results_[CELLTYPE].tsv).

Single-Cell Comparative Imbalance#

Overview#

The comparative imbalance analysis detects differential allelic imbalance between cell types, conditions, or biological groups. This is useful for identifying:

  • Cell-type-specific regulatory variation

  • Sex differences in chromatin accessibility

  • Condition-dependent allelic effects (e.g., treatment vs control)

  • Developmental stage-specific imbalance

Statistical Model#

The analysis uses a likelihood ratio test (LRT) comparing two hypotheses:

  • Null (H0): Both groups share the same allelic imbalance (μ_combined)

  • Alternative (H1): Groups have different imbalance (μ₁ ≠ μ₂)

The test statistic follows a chi-squared distribution with 1 degree of freedom:

LRT = -2 × (log L_null - log L_alt)
p-value = P(χ²(df=1) > LRT)

Input Format: Count Matrix (.h5ad)#

The count matrix must be an AnnData object with the following structure:

AnnData object (n_obs × n_vars)
├── .obs                 # SNP metadata (rows)
│   ├── index            # SNP identifiers (0, 1, 2, ...)
│   └── [sample_name]    # Genotypes: '0|1', '1|0', '0/1', '1/0'
│
├── .var                 # Cell metadata (columns)
│   └── group            # Cell type/group assignment
│
├── .layers
│   ├── "ref"            # Reference allele counts (sparse matrix)
│   └── "alt"            # Alternate allele counts (sparse matrix)
│
└── .uns
    ├── feature          # DataFrame: SNP → region mapping
    └── samples          # List of sample names

Example count matrix creation:

# Generate counts from BAM + variants + barcodes
wasp2-count count-variants-sc \
  aligned.bam \
  variants.vcf.gz \
  barcodes.txt \
  --samples NA12878 \
  --feature peaks.bed \
  --out_file allele_counts.h5ad

Input Format: Barcode Map (TSV)#

A two-column TSV file (no header) mapping cell barcodes to groups:

AAACGAACAGTCAGTT-1    excitatory_neurons
AAACGAAGTCGCTCTA-1    inhibitory_neurons
AAACGAAGTGAACCTA-1    excitatory_neurons
AAAGGATCATCGATGT-1    astrocytes
AAAGGATGTGCAACGA-1    microglia

Requirements:

  • Tab-separated (\t)

  • No header row

  • Barcodes must match those in the count matrix

  • Groups can be cell types, conditions, sex, or any categorical variable

Basic Usage#

Compare imbalance between all groups:

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_map.tsv

Compare specific groups only:

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_map.tsv \
  --groups "excitatory_neurons,inhibitory_neurons"

Output Format#

Results are written to ai_results_[GROUP1]_[GROUP2].tsv:

Column

Description

region

Genomic region identifier (peak or gene)

num_snps

Number of shared heterozygous SNPs in region

combined_mu

Reference allele frequency under null hypothesis (shared)

mu1

Reference allele frequency in group 1

mu2

Reference allele frequency in group 2

null_ll

Log-likelihood under null (shared μ)

alt_ll

Log-likelihood under alternative (separate μ values)

pval

Likelihood ratio test p-value

fdr_pval

FDR-corrected p-value (Benjamini-Hochberg)

Interpreting results:

  • fdr_pval < 0.05: Significant differential imbalance

  • |mu1 - mu2| > 0.1: Meaningful effect size (~20% difference)

  • mu < 0.5: Alternate allele favored; mu > 0.5: Reference allele favored

Parameters#

--groups

Comma-separated list of groups to compare. If omitted, compares all pairwise combinations found in the barcode map.

--min

Minimum total allele count per region per group (default: 10). Higher values increase statistical power but reduce the number of testable regions.

--pseudocount

Pseudocount added to avoid zero counts (default: 1). Affects dispersion estimation.

--sample

Sample name for heterozygous SNP filtering. Required if multiple samples are present in the count matrix.

--phased

Use phased genotype information from VCF. Requires genotypes in 0|1 or 1|0 format. Improves power when haplotype phase is known.

-z/--z_cutoff

Remove SNPs with counts exceeding this z-score threshold. Useful for removing outliers caused by mapping artifacts or copy number variation.

Example: Sex Differences Analysis#

Identify chromatin accessibility regions with sex-biased allelic imbalance:

Step 1: Prepare barcode map with sex labels

# barcode_sex_map.tsv
AAACGAACAGTCAGTT-1    male
AAACGAAGTCGCTCTA-1    female
AAACGAAGTGAACCTA-1    male
AAAGGATCATCGATGT-1    female

Step 2: Run comparative analysis

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_sex_map.tsv \
  --groups "male,female" \
  --min 20 \
  --out_file ai_results_sex_comparison.tsv

Step 3: Filter significant results

# Extract regions with significant sex differences
awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_male_female.tsv > significant_sex_diff.tsv

# Find regions with large effect size
awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_sex_diff.tsv

Example: snATAC-seq Cell Type Analysis#

Complete workflow for analyzing cell-type-specific chromatin accessibility imbalance:

Step 1: Count alleles from snATAC-seq BAM

# Extract valid barcodes from Cell Ranger output
zcat filtered_barcodes.tsv.gz > barcodes.txt

# Count alleles at heterozygous SNPs overlapping peaks
wasp2-count count-variants-sc \
  possorted_bam.bam \
  phased_variants.vcf.gz \
  barcodes.txt \
  --samples sample1 \
  --feature atac_peaks.bed \
  --out_file snATAC_counts.h5ad

Step 2: Create barcode-to-celltype mapping

Export cell type annotations from your clustering analysis (e.g., Seurat, ArchR):

# R/Seurat example
write.table(
  data.frame(barcode = Cells(seurat_obj),
             celltype = Idents(seurat_obj)),
  "barcode_celltype_map.tsv",
  sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)

Step 3: Run single-cell imbalance analysis

# Per-celltype analysis
wasp2-analyze find-imbalance-sc \
  snATAC_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample sample1 \
  --phased \
  --min 10 \
  -z 3.0

Step 4: Compare imbalance between cell types

# Compare specific cell types
wasp2-analyze compare-imbalance \
  snATAC_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample sample1 \
  --groups "excitatory,inhibitory,astrocyte" \
  --phased \
  --min 15

# This produces:
# - ai_results_excitatory_inhibitory.tsv
# - ai_results_excitatory_astrocyte.tsv
# - ai_results_inhibitory_astrocyte.tsv

Step 5: Identify cell-type-specific regulatory regions

# Find peaks with differential imbalance between excitatory and inhibitory neurons
awk -F'\t' '$9 < 0.01 && ($4 > 0.6 || $4 < 0.4)' \
  ai_results_excitatory_inhibitory.tsv > neuron_subtype_specific_peaks.tsv

Best Practices for Single-Cell Analysis#

Data Quality:

  • Use WASP-filtered BAM files to remove mapping bias

  • Require ≥10 total counts per region per group (--min 10)

  • Apply z-score filtering to remove outliers (-z 3.0)

Statistical Power:

  • Merge similar cell types if individual populations have low coverage

  • Use phased genotypes when available (--phased)

  • Focus on regions with multiple SNPs for better estimates

Interpretation:

  • Consider biological replication across samples

  • Validate top hits with orthogonal methods (allele-specific CRISPR, etc.)

  • Integrate with eQTL data to identify causal variants

Example Workflow#

# 1. Count alleles
wasp2-count count-variants \
  wasp_filtered.bam \
  variants.vcf \
  --region genes.gtf \
  --samples NA12878 \
  --output counts.tsv

# 2. Analyze imbalance
wasp2-analyze find-imbalance \
  counts.tsv \
  --min-count 20 \
  --model single \
  --output imbalance.tsv

# 3. Filter significant results
awk '$5 < 0.05' imbalance.tsv > significant.tsv

Best Practices#

Read Depth#

  • Minimum 10 reads per region (use --min-count)

  • Higher depth = more power

  • Consider downsampling very deep regions

Quality Control#

  • Use WASP-filtered reads

  • Remove low-complexity regions

  • Filter low-quality SNPs

Multiple Testing#

  • FDR correction is automatic

  • Consider Bonferroni for very important regions

  • Validate top hits experimentally

Common Issues#

No Significant Results#

  • Increase sample size

  • Check read depth (use deeper sequencing)

  • Verify heterozygous SNPs present

Many Significant Results#

  • Check for batch effects

  • Verify WASP filtering was applied

  • Consider stricter FDR threshold

Next Steps#

  • Validate results with qPCR or DNA-seq

  • Integrate with eQTL data

  • Perform pathway enrichment analysis