Comparative Imbalance Analysis Tutorial#

This tutorial provides a comprehensive guide to detecting differential allelic imbalance between cell types, conditions, or biological groups using WASP2’s comparative analysis module.

Overview#

What is Comparative Imbalance Analysis?

While standard allelic imbalance (AI) analysis detects whether a genomic region shows preferential expression of one allele, comparative imbalance analysis asks a different question: Does the degree of imbalance differ between groups?

This is powerful for identifying:

Cell-type-specific regulatory variation - Regions where different cell types show distinct allelic preferences
Condition-dependent effects - Treatment-induced changes in allelic regulation
Sex differences - Chromatin regions with sex-biased allelic accessibility
Developmental dynamics - Stage-specific changes in allelic regulation

Statistical Approach

WASP2 uses a likelihood ratio test (LRT) to compare two hypotheses:

Null Hypothesis (H0):     Both groups share the same allelic imbalance (μ_combined)
Alternative Hypothesis (H1): Groups have different imbalance (μ₁ ≠ μ₂)

Test Statistic:  LRT = -2 × (log L_null - log L_alt)
P-value:         P(χ²(df=1) > LRT)

The test accounts for overdispersion using beta-binomial modeling and applies Benjamini-Hochberg FDR correction for multiple testing.

Prerequisites#

Software:

WASP2 (pip install wasp2)
Python with scanpy/pandas for visualization (optional)
R with Seurat for cell type annotation (optional)

Data Requirements:

AnnData count matrix (.h5ad) with allele counts per cell per SNP
Barcode-to-group mapping file (TSV)
Groups can be: cell types, conditions, sex, treatment status, etc.

Input Data Format#

Count Matrix (.h5ad)#

Your AnnData object should have this structure:

AnnData object (n_snps × n_cells)
├── .obs                 # SNP metadata (rows)
│   ├── index            # SNP identifiers
│   └── [sample_name]    # Genotypes: '0|1', '1|0', '0/1', '1/0'
│
├── .var                 # Cell metadata (columns)
│   └── group            # Cell type/group assignment
│
├── .layers
│   ├── "ref"            # Reference allele counts (sparse matrix)
│   └── "alt"            # Alternate allele counts (sparse matrix)
│
└── .uns
    ├── feature          # DataFrame: SNP → region mapping
    └── samples          # List of sample names

Create counts from BAM + VCF:

wasp2-count count-variants-sc \
  aligned.bam \
  phased_variants.vcf.gz \
  barcodes.txt \
  --samples SAMPLE_ID \
  --feature peaks.bed \
  --out_file allele_counts.h5ad

Barcode Map (TSV)#

A two-column tab-separated file (no header) mapping cell barcodes to groups:

AAACGAACAGTCAGTT-1    excitatory_neurons
AAACGAAGTCGCTCTA-1    inhibitory_neurons
AAACGAAGTGAACCTA-1    excitatory_neurons
AAAGGATCATCGATGT-1    astrocytes
AAAGGATGTGCAACGA-1    microglia

Important: Barcodes must exactly match those in the count matrix (including any -1 suffix).

Tutorial 1: Cell Type Comparison#

This tutorial demonstrates comparing allelic imbalance between neuronal subtypes.

Step 1: Prepare Input Files#

Export cell type annotations from Seurat:

library(Seurat)

# Load your analyzed Seurat object
seurat_obj <- readRDS("brain_snATAC.rds")

# Create barcode-to-celltype mapping
barcode_df <- data.frame(
  barcode = colnames(seurat_obj),
  celltype = Idents(seurat_obj)
)

# Write without header
write.table(
  barcode_df,
  "barcode_celltype_map.tsv",
  sep = "\t", quote = FALSE,
  row.names = FALSE, col.names = FALSE
)

Verify the barcode file:

# Check format
head barcode_celltype_map.tsv
# AAACGAACAGTCAGTT-1 excitatory_neurons
# AAACGAAGTCGCTCTA-1 inhibitory_neurons

# Count cells per type
cut -f2 barcode_celltype_map.tsv | sort | uniq -c | sort -rn
#   2500 excitatory_neurons
#   1800 inhibitory_neurons
#   1200 astrocytes
#    800 oligodendrocytes

Step 2: Run Per-Group Imbalance Analysis#

First, analyze imbalance within each cell type:

wasp2-analyze find-imbalance-sc \
  allele_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample SAMPLE_ID \
  --phased \
  --min 10 \
  -z 3

This produces per-celltype result files (e.g., ai_results_excitatory_neurons.tsv).

Step 3: Run Comparative Analysis#

Compare imbalance between specific cell types:

# Compare excitatory vs inhibitory neurons
wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample SAMPLE_ID \
  --groups "excitatory_neurons,inhibitory_neurons" \
  --phased \
  --min 15

Compare all pairwise combinations:

# Omit --groups to compare all cell types
wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_celltype_map.tsv \
  --sample SAMPLE_ID \
  --phased \
  --min 15

This produces output files for each pairwise comparison:

ai_results_excitatory_neurons_inhibitory_neurons.tsv
ai_results_excitatory_neurons_astrocytes.tsv
ai_results_inhibitory_neurons_astrocytes.tsv
…

Step 4: Interpret Results#

Output columns explained:

Column	Description
`region`	Genomic region (peak or gene) identifier
`num_snps`	Number of shared heterozygous SNPs used for comparison
`combined_mu`	Reference allele frequency under null hypothesis (shared between groups)
`mu1`	Reference allele frequency in group 1 (e.g., excitatory neurons)
`mu2`	Reference allele frequency in group 2 (e.g., inhibitory neurons)
`null_ll`	Log-likelihood under null hypothesis (shared μ)
`alt_ll`	Log-likelihood under alternative hypothesis (separate μ values)
`pval`	Likelihood ratio test p-value
`fdr_pval`	FDR-corrected p-value (Benjamini-Hochberg)

Filtering significant results:

# Significant differential imbalance (FDR < 0.05)
awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_excitatory_neurons_inhibitory_neurons.tsv \
  > significant_differential_AI.tsv

# Large effect size (>15% difference in allele frequency)
awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_differential_AI.tsv \
  > large_effect_differential_AI.tsv

Interpret μ values:

mu < 0.5: Alternate allele favored
mu > 0.5: Reference allele favored
|mu1 - mu2| > 0.1: Meaningful difference (~20% shift in allele preference)

Tutorial 2: Sex Differences Analysis#

Identify regions with sex-biased allelic imbalance in chromatin accessibility.

Step 1: Create Sex-Labeled Barcode Map#

import pandas as pd
import scanpy as sc

# Load your annotated data
adata = sc.read_h5ad("processed_snATAC.h5ad")

# Create barcode-to-sex mapping
barcode_df = pd.DataFrame({
    'barcode': adata.obs_names,
    'sex': adata.obs['donor_sex']  # 'male' or 'female'
})

# Write without header
barcode_df.to_csv('barcode_sex_map.tsv', sep='\t', header=False, index=False)

Step 2: Run Comparative Analysis#

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_sex_map.tsv \
  --sample SAMPLE_ID \
  --groups "male,female" \
  --phased \
  --min 20 \
  --out_file ai_results_sex_comparison.tsv

Step 3: Identify Sex-Biased Regions#

# Extract significant sex differences
awk -F'\t' 'NR==1 || $9 < 0.01' ai_results_sex_comparison.tsv > sex_biased_regions.tsv

# Count by chromosome (expect enrichment on X)
cut -f1 sex_biased_regions.tsv | grep -E "^chr" | cut -d: -f1 | sort | uniq -c

Tutorial 3: Treatment vs Control#

Compare allelic imbalance before and after drug treatment.

Step 1: Prepare Condition Labels#

import pandas as pd

# Load metadata with treatment status
metadata = pd.read_csv("sample_metadata.csv")

# Create barcode-to-condition mapping
barcode_df = pd.DataFrame({
    'barcode': metadata['cell_barcode'],
    'condition': metadata['treatment_status']  # 'treated' or 'control'
})

barcode_df.to_csv('barcode_treatment_map.tsv', sep='\t', header=False, index=False)

Step 2: Run Analysis#

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcode_treatment_map.tsv \
  --sample SAMPLE_ID \
  --groups "treated,control" \
  --min 15 \
  --out_file ai_results_treatment.tsv

Step 3: Identify Treatment-Responsive Regions#

import pandas as pd

# Load results
results = pd.read_csv('ai_results_treated_control.tsv', sep='\t')

# Significant treatment effects
significant = results[results['fdr_pval'] < 0.05]
print(f"Found {len(significant)} treatment-responsive regions")

# Direction of change
treatment_gain = significant[significant['mu1'] > significant['mu2'] + 0.1]
treatment_loss = significant[significant['mu2'] > significant['mu1'] + 0.1]

print(f"Regions with increased ref allele in treatment: {len(treatment_gain)}")
print(f"Regions with decreased ref allele in treatment: {len(treatment_loss)}")

Visualization Examples#

Volcano Plot#

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load results
results = pd.read_csv('ai_results_excitatory_neurons_inhibitory_neurons.tsv', sep='\t')

# Calculate effect size (difference in mu)
results['effect_size'] = results['mu1'] - results['mu2']
results['-log10_pval'] = -np.log10(results['pval'] + 1e-300)

# Create volcano plot
fig, ax = plt.subplots(figsize=(10, 8))

# Non-significant points
ns = results['fdr_pval'] >= 0.05
ax.scatter(results.loc[ns, 'effect_size'],
           results.loc[ns, '-log10_pval'],
           c='gray', alpha=0.5, s=10, label='Not significant')

# Significant points
sig = results['fdr_pval'] < 0.05
ax.scatter(results.loc[sig, 'effect_size'],
           results.loc[sig, '-log10_pval'],
           c='red', alpha=0.7, s=20, label='FDR < 0.05')

ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.5)
ax.axvline(0, color='black', linestyle='-', alpha=0.3)

ax.set_xlabel('Effect Size (μ₁ - μ₂)', fontsize=12)
ax.set_ylabel('-log₁₀(p-value)', fontsize=12)
ax.set_title('Differential Allelic Imbalance:\nExcitatory vs Inhibitory Neurons', fontsize=14)
ax.legend()

plt.tight_layout()
plt.savefig('differential_AI_volcano.png', dpi=150)

Heatmap of Top Hits#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load results from multiple comparisons
comparisons = [
    ('excitatory', 'inhibitory'),
    ('excitatory', 'astrocyte'),
    ('inhibitory', 'astrocyte'),
]

# Collect mu values for top regions
all_results = {}
for g1, g2 in comparisons:
    df = pd.read_csv(f'ai_results_{g1}_{g2}.tsv', sep='\t')
    all_results[(g1, g2)] = df.set_index('region')

# Find regions significant in any comparison
sig_regions = set()
for df in all_results.values():
    sig_regions.update(df[df['fdr_pval'] < 0.05].index[:20])  # Top 20 each

# Build heatmap matrix (mu values per cell type)
celltypes = ['excitatory', 'inhibitory', 'astrocyte']
heatmap_data = pd.DataFrame(index=list(sig_regions), columns=celltypes)

for region in sig_regions:
    for g1, g2 in comparisons:
        if region in all_results[(g1, g2)].index:
            row = all_results[(g1, g2)].loc[region]
            heatmap_data.loc[region, g1] = row['mu1']
            heatmap_data.loc[region, g2] = row['mu2']

# Plot heatmap
fig, ax = plt.subplots(figsize=(8, 12))
sns.heatmap(heatmap_data.astype(float), cmap='RdBu_r', center=0.5,
            vmin=0, vmax=1, ax=ax, cbar_kws={'label': 'Ref Allele Frequency (μ)'})
ax.set_title('Cell-Type-Specific Allelic Imbalance', fontsize=14)
plt.tight_layout()
plt.savefig('differential_AI_heatmap.png', dpi=150)

Command-Line Reference#

Full Parameter List#

wasp2-analyze compare-imbalance --help

Usage: wasp2-analyze compare-imbalance [OPTIONS] ADATA BARCODE_MAP

Arguments:
  ADATA        AnnData file with allele counts (.h5ad)
  BARCODE_MAP  TSV file mapping barcodes to groups

Options:
  --groups TEXT        Comma-separated groups to compare (default: all)
  --min INTEGER        Minimum allele count per region per group (default: 10)
  --pseudocount INT    Pseudocount for zero counts (default: 1)
  --sample TEXT        Sample name for genotype filtering
  --phased             Use phased genotype information
  -z, --z_cutoff INT   Remove outlier SNPs above z-score threshold
  --out_file TEXT      Output file path

Best Practices#

Data Quality#

Use WASP-filtered BAMs to remove mapping bias artifacts
Require sufficient counts (--min 15 or higher for robust estimates)
Apply z-score filtering (-z 3) to remove outliers from CNVs or mapping artifacts

Statistical Power#

Merge similar groups if individual populations have low cell counts
Use phased genotypes when available for improved power
Focus on regions with multiple SNPs for more reliable estimates

Interpretation#

Biological replication - Validate across independent samples
Effect size matters - Consider the absolute difference between μ₁ and μ₂ alongside p-values
Integrate with eQTL data - Connect to known regulatory variants
Orthogonal validation - Confirm top hits with targeted methods

Common Issues#

Low Power / Few Significant Results#

Increase sequencing depth
Merge similar cell types to increase counts per group
Lower --min threshold (with caution)
Use phased genotypes if available

Too Many Significant Results#

Check for batch effects between groups
Verify WASP filtering was applied
Use stricter FDR threshold (e.g., 0.01)
Check that groups have similar sequencing depth

Memory Issues#

Process chromosomes separately:

for chr in chr{1..22}; do
  wasp2-count count-variants-sc \
    sample.bam variants.vcf.gz barcodes.tsv \
    --region peaks_${chr}.bed \
    --out_file counts_${chr}.h5ad

  wasp2-analyze compare-imbalance \
    counts_${chr}.h5ad \
    barcode_celltype_map.tsv \
    --out_file results_${chr}.tsv
done