Analysis Module#

Overview#

The analysis module detects statistically significant allelic imbalance using beta-binomial models.

Purpose#

Detect allelic imbalance at genomic regions
Control for biological and technical variation
Support single-cell and bulk RNA-seq
Compare imbalance between groups/conditions

Statistical Models#

Beta-Binomial Model#

WASP2 uses beta-binomial distribution to model: * Overdispersion (variation beyond binomial) * Biological variability between regions * Technical noise in sequencing

The model: * Null hypothesis: Equal expression from both alleles (p=0.5) * Alternative: Allelic imbalance (p ≠ 0.5) * FDR correction for multiple testing

Dispersion Parameter#

Two models: 1. Single: One dispersion parameter for all regions 2. Linear: Dispersion varies with read depth

CLI Usage#

Basic Analysis#

wasp2-analyze find-imbalance counts.tsv

Options#

wasp2-analyze find-imbalance \
  counts.tsv \
  --min-count 10 \
  --pseudocount 1 \
  --model single \
  --output results.tsv

Parameters#

`--min-count`#

Minimum total read count per region (default: 10):

--min-count 20  # More stringent

`--pseudocount`#

Pseudocount added to avoid zero counts (default: 1):

--pseudocount 0  # No pseudocount

`--model`#

Dispersion model (default: single):

--model linear  # Depth-dependent dispersion

`--phased`#

Use phased genotype information:

--phased  # Requires phased VCF

Output Format#

Tab-separated file with columns:

Statistical Columns#

region: Genomic region identifier
ref_count: Total reference allele counts
alt_count: Total alternate allele counts
p_value: Likelihood ratio test p-value
fdr_pval: FDR-corrected p-value
effect_size: Log2 fold-change (ref/alt)

Model Parameters#

dispersion: Beta-binomial dispersion parameter
log_likelihood_null: Null model log-likelihood
log_likelihood_alt: Alternative model log-likelihood

Interpreting Results#

Significant Imbalance#

FDR < 0.05 indicates significant imbalance:

Biological: cis-regulatory variation, ASE
Technical: mapping bias (check WASP), PCR artifacts

Effect Size#

log2FC > 1: Strong imbalance (2-fold difference)
log2FC > 2: Very strong imbalance (4-fold difference)

Single-Cell Analysis#

For single-cell data, WASP2 detects allelic imbalance within specific cell populations using aggregated counts across cells of the same type.

wasp2-analyze find-imbalance-sc \
  adata.h5ad \
  barcode_map.tsv \
  --sample donor1 \
  --min-count 10

Output: Per-celltype TSV files (ai_results_[CELLTYPE].tsv).

Single-Cell Comparative Imbalance#

Overview#

The comparative imbalance analysis detects differential allelic imbalance between cell types, conditions, or biological groups. This is useful for identifying:

Cell-type-specific regulatory variation
Sex differences in chromatin accessibility
Condition-dependent allelic effects (e.g., treatment vs control)
Developmental stage-specific imbalance

Statistical Model#

The analysis uses a likelihood ratio test (LRT) comparing two hypotheses:

Null (H0): Both groups share the same allelic imbalance (μ_combined)
Alternative (H1): Groups have different imbalance (μ₁ ≠ μ₂)

The test statistic follows a chi-squared distribution with 1 degree of freedom:

LRT = -2 × (log L_null - log L_alt)
p-value = P(χ²(df=1) > LRT)

Input Format: Count Matrix (.h5ad)#

The count matrix must be an AnnData object with the following structure:

AnnData object (n_obs × n_vars)
├── .obs                 # SNP metadata (rows)
│   ├── index            # SNP identifiers (0, 1, 2, ...)
│   └── [sample_name]    # Genotypes: '0|1', '1|0', '0/1', '1/0'
│
├── .var                 # Cell metadata (columns)
│   └── group            # Cell type/group assignment
│
├── .layers
│   ├── "ref"            # Reference allele counts (sparse matrix)
│   └── "alt"            # Alternate allele counts (sparse matrix)
│
└── .uns
    ├── feature          # DataFrame: SNP → region mapping
    └── samples          # List of sample names

Example count matrix creation:

# Generate counts from BAM + variants + barcodes
wasp2-count count-variants-sc \
  aligned.bam \
  variants.vcf.gz \
  barcodes.txt \
  --samples NA12878 \
  --feature peaks.bed \
  --out_file allele_counts.h5ad

Input Format: Barcode Map (TSV)#

A two-column TSV file (no header) mapping cell barcodes to groups:

AAACGAACAGTCAGTT-1    excitatory_neurons
AAACGAAGTCGCTCTA-1    inhibitory_neurons
AAACGAAGTGAACCTA-1    excitatory_neurons
AAAGGATCATCGATGT-1    astrocytes
AAAGGATGTGCAACGA-1    microglia

Requirements:

Tab-separated (\t)
No header row
Barcodes must match those in the count matrix
Groups can be cell types, conditions, sex, or any categorical variable

Basic Usage#

Compare imbalance between all groups:

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_map.tsv

Compare specific groups only:

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_map.tsv \
  --groups "excitatory_neurons,inhibitory_neurons"

Output Format#

Results are written to ai_results_[GROUP1]_[GROUP2].tsv:

Column	Description
`region`	Genomic region identifier (peak or gene)
`num_snps`	Number of shared heterozygous SNPs in region
`combined_mu`	Reference allele frequency under null hypothesis (shared)
`mu1`	Reference allele frequency in group 1
`mu2`	Reference allele frequency in group 2
`null_ll`	Log-likelihood under null (shared μ)
`alt_ll`	Log-likelihood under alternative (separate μ values)
`pval`	Likelihood ratio test p-value
`fdr_pval`	FDR-corrected p-value (Benjamini-Hochberg)

Interpreting results:

fdr_pval < 0.05: Significant differential imbalance
|mu1 - mu2| > 0.1: Meaningful effect size (~20% difference)
mu < 0.5: Alternate allele favored; mu > 0.5: Reference allele favored

Parameters#

--groups: Comma-separated list of groups to compare. If omitted, compares all pairwise combinations found in the barcode map.
--min: Minimum total allele count per region per group (default: 10). Higher values increase statistical power but reduce the number of testable regions.
--pseudocount: Pseudocount added to avoid zero counts (default: 1). Affects dispersion estimation.
--sample: Sample name for heterozygous SNP filtering. Required if multiple samples are present in the count matrix.
--phased: Use phased genotype information from VCF. Requires genotypes in 0|1 or 1|0 format. Improves power when haplotype phase is known.
-z/--z_cutoff: Remove SNPs with counts exceeding this z-score threshold. Useful for removing outliers caused by mapping artifacts or copy number variation.

Example: Sex Differences Analysis#

Identify chromatin accessibility regions with sex-biased allelic imbalance:

Step 1: Prepare barcode map with sex labels

# barcode_sex_map.tsv
AAACGAACAGTCAGTT-1    male
AAACGAAGTCGCTCTA-1    female
AAACGAAGTGAACCTA-1    male
AAAGGATCATCGATGT-1    female

Step 2: Run comparative analysis

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_sex_map.tsv \
  --groups "male,female" \
  --min 20 \
  --out_file ai_results_sex_comparison.tsv

Step 3: Filter significant results

# Extract regions with significant sex differences
awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_male_female.tsv > significant_sex_diff.tsv

# Find regions with large effect size
awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_sex_diff.tsv

Example: snATAC-seq Cell Type Analysis#

Complete workflow for analyzing cell-type-specific chromatin accessibility imbalance:

Step 1: Count alleles from snATAC-seq BAM

# Extract valid barcodes from Cell Ranger output
zcat filtered_barcodes.tsv.gz > barcodes.txt

# Count alleles at heterozygous SNPs overlapping peaks
wasp2-count count-variants-sc \
  possorted_bam.bam \
  phased_variants.vcf.gz \
  barcodes.txt \
  --samples sample1 \
  --feature atac_peaks.bed \
  --out_file snATAC_counts.h5ad

Step 2: Create barcode-to-celltype mapping

Export cell type annotations from your clustering analysis (e.g., Seurat, ArchR):

# R/Seurat example
write.table(
  data.frame(barcode = Cells(seurat_obj),
             celltype = Idents(seurat_obj)),
  "barcode_celltype_map.tsv",
  sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)

Step 3: Run single-cell imbalance analysis

# Per-celltype analysis
wasp2-analyze find-imbalance-sc \
  snATAC_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample sample1 \
  --phased \
  --min 10 \
  -z 3.0

Step 4: Compare imbalance between cell types

# Compare specific cell types
wasp2-analyze compare-imbalance \
  snATAC_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample sample1 \
  --groups "excitatory,inhibitory,astrocyte" \
  --phased \
  --min 15

# This produces:
# - ai_results_excitatory_inhibitory.tsv
# - ai_results_excitatory_astrocyte.tsv
# - ai_results_inhibitory_astrocyte.tsv

Step 5: Identify cell-type-specific regulatory regions

# Find peaks with differential imbalance between excitatory and inhibitory neurons
awk -F'\t' '$9 < 0.01 && ($4 > 0.6 || $4 < 0.4)' \
  ai_results_excitatory_inhibitory.tsv > neuron_subtype_specific_peaks.tsv

Best Practices for Single-Cell Analysis#

Data Quality:

Use WASP-filtered BAM files to remove mapping bias
Require ≥10 total counts per region per group (--min 10)
Apply z-score filtering to remove outliers (-z 3.0)

Statistical Power:

Merge similar cell types if individual populations have low coverage
Use phased genotypes when available (--phased)
Focus on regions with multiple SNPs for better estimates

Interpretation:

Consider biological replication across samples
Validate top hits with orthogonal methods (allele-specific CRISPR, etc.)
Integrate with eQTL data to identify causal variants

Example Workflow#

# 1. Count alleles
wasp2-count count-variants \
  wasp_filtered.bam \
  variants.vcf \
  --region genes.gtf \
  --samples NA12878 \
  --output counts.tsv

# 2. Analyze imbalance
wasp2-analyze find-imbalance \
  counts.tsv \
  --min-count 20 \
  --model single \
  --output imbalance.tsv

# 3. Filter significant results
awk '$5 < 0.05' imbalance.tsv > significant.tsv

Best Practices#

Read Depth#

Minimum 10 reads per region (use --min-count)
Higher depth = more power
Consider downsampling very deep regions

Quality Control#

Use WASP-filtered reads
Remove low-complexity regions
Filter low-quality SNPs

Multiple Testing#

FDR correction is automatic
Consider Bonferroni for very important regions
Validate top hits experimentally

Common Issues#

No Significant Results#

Increase sample size
Check read depth (use deeper sequencing)
Verify heterozygous SNPs present

Many Significant Results#

Check for batch effects
Verify WASP filtering was applied
Consider stricter FDR threshold

Next Steps#

Validate results with qPCR or DNA-seq
Integrate with eQTL data
Perform pathway enrichment analysis

Analysis Module#

Overview#

Purpose#

Statistical Models#

Beta-Binomial Model#

Dispersion Parameter#

CLI Usage#

Basic Analysis#

Options#

Parameters#

--min-count#

--pseudocount#

--model#

--phased#

Output Format#

Statistical Columns#

Model Parameters#

Interpreting Results#

Significant Imbalance#

Effect Size#

Single-Cell Analysis#

Single-Cell Comparative Imbalance#

Overview#

Statistical Model#

Input Format: Count Matrix (.h5ad)#

Input Format: Barcode Map (TSV)#

Basic Usage#

Output Format#

Parameters#

Example: Sex Differences Analysis#

Example: snATAC-seq Cell Type Analysis#

Best Practices for Single-Cell Analysis#

Example Workflow#

Best Practices#

Read Depth#

Quality Control#

Multiple Testing#

Common Issues#

No Significant Results#

Many Significant Results#

Next Steps#

This Page

`--min-count`#

`--pseudocount`#

`--model`#

`--phased`#