ATAC-seq Allelic Imbalance Analysis Tutorial#

Estimated time: ~30 minutes

This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:

Prepare ATAC-seq peak files for analysis
Count alleles at heterozygous SNPs within accessibility peaks
Detect significant allelic imbalance using beta-binomial statistical testing
Visualize results with volcano plots and effect size distributions
Integrate with caQTL/eQTL data for biological interpretation

Background#

Allelic Imbalance in Chromatin Accessibility

ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates allelic imbalance in chromatin accessibility.

Such imbalance often reflects:

cis-regulatory variants affecting transcription factor binding
Chromatin accessibility QTLs (caQTLs)
Haplotype-specific enhancer activity

WASP2 uses a beta-binomial model to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data.

Prerequisites#

Software#

WASP2 (pip install wasp2)
Python 3.10+ with pandas, matplotlib, numpy
samtools (for BAM operations)
tabix (for VCF indexing)

Input Data#

File	Description	Format
`sample.bam`	ATAC-seq aligned reads	BAM (indexed)
`variants.vcf.gz`	Phased heterozygous variants	VCF (indexed)
`peaks.bed`	ATAC-seq peaks from MACS2/SEACR	BED or narrowPeak

Note: For best results, use WASP-filtered BAM files to correct reference mapping bias. See the mapping documentation for details.

Setup#

[ ]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Define paths (modify these for your data)
DATA_DIR = Path("data")
RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

# Input files
BAM_FILE = DATA_DIR / "atac_sample.bam"
VCF_FILE = DATA_DIR / "phased_variants.vcf.gz"
PEAKS_FILE = DATA_DIR / "atac_peaks.narrowPeak"
SAMPLE_ID = "SAMPLE1"  # Sample name in VCF

print("WASP2 ATAC-seq Tutorial")
print("=" * 40)

Section 1: Data Loading and Preparation#

1.1 Inspect Peak File Format#

ATAC-seq peaks from MACS2 are typically in narrowPeak format (BED6+4). WASP2 accepts both BED and narrowPeak formats.

[ ]:

# Load and inspect peaks
peak_columns = ['chr', 'start', 'end', 'name', 'score', 'strand',
                'signalValue', 'pValue', 'qValue', 'peak']
peaks_df = pd.read_csv(PEAKS_FILE, sep='\t', header=None, names=peak_columns)

print(f"Total peaks: {len(peaks_df):,}")
print(f"\nPeak size distribution:")
peaks_df['size'] = peaks_df['end'] - peaks_df['start']
print(peaks_df['size'].describe())

print(f"\nFirst 5 peaks:")
peaks_df.head()

1.2 Verify BAM and VCF Files#

Check that your input files are properly formatted and indexed.

[ ]:

%%bash -s "$BAM_FILE" "$VCF_FILE" "$SAMPLE_ID"
BAM_FILE=$1
VCF_FILE=$2
SAMPLE_ID=$3

echo "=== BAM File Check ==="
echo "File: $BAM_FILE"
samtools view -H "$BAM_FILE" 2>/dev/null | head -5 || echo "Note: Using example paths"

echo ""
echo "=== VCF File Check ==="
echo "File: $VCF_FILE"
echo "Checking for sample: $SAMPLE_ID"
bcftools query -l "$VCF_FILE" 2>/dev/null | head -5 || echo "Note: Using example paths"

echo ""
echo "=== Index Check ==="
ls -la "${BAM_FILE}.bai" 2>/dev/null || echo "BAM index (.bai): Using example paths"
ls -la "${VCF_FILE}.tbi" 2>/dev/null || echo "VCF index (.tbi): Using example paths"

Section 2: Allele Counting at Peaks#

WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The --region parameter restricts counting to SNPs overlapping your peaks.

2.1 Run Allele Counting#

Key Parameters:

--region: Peak file to restrict SNPs to accessible regions
--samples: Sample ID for genotype filtering (heterozygous sites only)
--out_file: Output path for count results

[ ]:

# Define output file
COUNTS_FILE = RESULTS_DIR / "atac_allele_counts.tsv"

# Build the command
count_cmd = f"""
wasp2-count count-variants \\
    {BAM_FILE} \\
    {VCF_FILE} \\
    --region {PEAKS_FILE} \\
    --samples {SAMPLE_ID} \\
    --out_file {COUNTS_FILE}
"""

print("Command to run:")
print(count_cmd)

[ ]:

%%bash -s "$BAM_FILE" "$VCF_FILE" "$PEAKS_FILE" "$SAMPLE_ID" "$COUNTS_FILE"
# Uncomment to run (requires actual data files)
# wasp2-count count-variants \
#     "$1" \
#     "$2" \
#     --region "$3" \
#     --samples "$4" \
#     --out_file "$5"

echo "Note: Uncomment the command above to run with your data"
echo "For this tutorial, we'll use simulated example output."

2.2 Inspect Count Results#

The output contains per-SNP allele counts with peak annotations.

[ ]:

# For demonstration, create example data
# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\t') for real data)

np.random.seed(42)
n_snps = 5000

# Simulate realistic ATAC-seq allele counts
counts_df = pd.DataFrame({
    'chr': np.random.choice(['chr1', 'chr2', 'chr3', 'chr4', 'chr5'], n_snps),
    'pos': np.random.randint(1e6, 2e8, n_snps),
    'ref': np.random.choice(['A', 'C', 'G', 'T'], n_snps),
    'alt': np.random.choice(['A', 'C', 'G', 'T'], n_snps),
    'region_id': [f'peak_{i}' for i in np.random.randint(0, 1500, n_snps)],
})

# Generate counts with some true imbalanced regions
total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5
imbalance_prob = np.where(
    np.random.random(n_snps) < 0.1,  # 10% truly imbalanced
    np.random.choice([0.3, 0.7], n_snps),  # Imbalanced allele freq
    0.5  # Balanced
)
counts_df['ref_count'] = np.random.binomial(total_depth, imbalance_prob)
counts_df['alt_count'] = total_depth - counts_df['ref_count']
counts_df['other_count'] = 0

print(f"Total SNPs counted: {len(counts_df):,}")
print(f"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}")
print(f"\nCount statistics:")
print(counts_df[['ref_count', 'alt_count']].describe())
counts_df.head(10)

2.3 Quality Control: Count Distribution#

ATAC-seq typically has lower coverage per peak than RNA-seq genes. Check the distribution to set appropriate filtering thresholds.

[ ]:

# Calculate total counts per SNP
counts_df['total'] = counts_df['ref_count'] + counts_df['alt_count']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Total count distribution
ax = axes[0]
ax.hist(counts_df['total'], bins=50, edgecolor='black', alpha=0.7)
ax.axvline(10, color='red', linestyle='--', label='min_count=10')
ax.axvline(5, color='orange', linestyle='--', label='min_count=5')
ax.set_xlabel('Total Read Count per SNP')
ax.set_ylabel('Number of SNPs')
ax.set_title('Read Depth Distribution')
ax.legend()
ax.set_xlim(0, 100)

# Allele ratio distribution
ax = axes[1]
ratio = counts_df['ref_count'] / counts_df['total']
ax.hist(ratio[counts_df['total'] >= 10], bins=50, edgecolor='black', alpha=0.7)
ax.axvline(0.5, color='red', linestyle='--', label='Expected (0.5)')
ax.set_xlabel('Reference Allele Frequency')
ax.set_ylabel('Number of SNPs')
ax.set_title('Allele Ratio Distribution (depth ≥10)')
ax.legend()

plt.tight_layout()
plt.savefig(RESULTS_DIR / 'count_qc.png', dpi=150)
plt.show()

# Summary statistics
print(f"\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100*(counts_df['total'] >= 5).mean():.1f}%)")
print(f"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100*(counts_df['total'] >= 10).mean():.1f}%)")
print(f"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100*(counts_df['total'] >= 20).mean():.1f}%)")

Section 3: Statistical Testing (Beta-Binomial)#

WASP2’s find-imbalance command uses a beta-binomial model to test for allelic imbalance:

Null hypothesis (H₀): Reference allele frequency = 0.5 (balanced)
Alternative (H₁): Reference allele frequency ≠ 0.5 (imbalanced)

The beta-binomial distribution accounts for overdispersion - the extra variability beyond binomial sampling that’s common in sequencing data.

3.1 Run Imbalance Detection#

[ ]:

# Define output file
IMBALANCE_FILE = RESULTS_DIR / "atac_imbalance_results.tsv"

# Build the command
# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)
# The --model single option uses a single dispersion parameter for all regions
analysis_cmd = f"""
wasp2-analyze find-imbalance \\
    {COUNTS_FILE} \\
    --min 5 \\
    --model single \\
    --output {IMBALANCE_FILE}
"""

print("Command to run:")
print(analysis_cmd)

[ ]:

%%bash -s "$COUNTS_FILE" "$IMBALANCE_FILE"
# Uncomment to run (requires actual count file)
# wasp2-analyze find-imbalance \
#     "$1" \
#     --min 5 \
#     --model single \
#     --output "$2"

echo "Note: Uncomment the command above to run with your data"

3.2 Simulate Results for Demonstration#

For this tutorial, we simulate realistic analysis results.

[ ]:

# Aggregate counts by peak (region)
peak_counts = counts_df.groupby('region_id').agg({
    'chr': 'first',
    'pos': ['min', 'max'],
    'ref_count': 'sum',
    'alt_count': 'sum',
}).reset_index()
peak_counts.columns = ['region', 'chr', 'start', 'end', 'ref_count', 'alt_count']

# Calculate statistics
peak_counts['total'] = peak_counts['ref_count'] + peak_counts['alt_count']
peak_counts['mu'] = peak_counts['ref_count'] / peak_counts['total']
# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions
peak_counts['effect_size'] = np.log2(
    (peak_counts['ref_count'] + 1) / (peak_counts['alt_count'] + 1)
)

# Simulate p-values (truly imbalanced peaks get low p-values)
# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits
# overdispersion, which is why WASP2 uses the beta-binomial model.
np.random.seed(42)
is_imbalanced = np.abs(peak_counts['mu'] - 0.5) > 0.15
peak_counts['p_value'] = np.where(
    is_imbalanced,
    10 ** (-np.random.uniform(2, 10, len(peak_counts))),  # Significant
    np.random.uniform(0.05, 1, len(peak_counts))  # Not significant
)

# FDR correction (Benjamini-Hochberg)
# Note: This manual BH implementation is for demonstration.
# WASP2 internally uses scipy.stats.false_discovery_control()
from scipy import stats
peak_counts = peak_counts.sort_values('p_value')
n_tests = len(peak_counts)
peak_counts['rank'] = range(1, n_tests + 1)
peak_counts['fdr_pval'] = np.minimum(
    peak_counts['p_value'] * n_tests / peak_counts['rank'],
    1.0
)
peak_counts['fdr_pval'] = peak_counts['fdr_pval'][::-1].cummin()[::-1]

# Filter to testable peaks
results_df = peak_counts[peak_counts['total'] >= 5].copy()
results_df = results_df.drop('rank', axis=1)

print(f"Peaks tested: {len(results_df):,}")
print(f"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}")
print(f"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}")
results_df.head(10)

Section 4: Result Interpretation and Visualization#

4.1 Volcano Plot#

The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance.

[ ]:

fig, ax = plt.subplots(figsize=(10, 8))

# Calculate -log10(p-value) for plotting
results_df['neg_log10_pval'] = -np.log10(results_df['p_value'].clip(lower=1e-50))

# Define significance thresholds
sig_mask = results_df['fdr_pval'] < 0.05
effect_mask = np.abs(results_df['effect_size']) > 0.5  # |log2FC| > 0.5

# Plot non-significant points
ns = ~sig_mask
ax.scatter(results_df.loc[ns, 'effect_size'],
           results_df.loc[ns, 'neg_log10_pval'],
           c='lightgray', s=15, alpha=0.6, label=f'Not significant (n={ns.sum():,})')

# Plot significant but small effect
sig_small = sig_mask & ~effect_mask
ax.scatter(results_df.loc[sig_small, 'effect_size'],
           results_df.loc[sig_small, 'neg_log10_pval'],
           c='steelblue', s=25, alpha=0.7, label=f'FDR<0.05, small effect (n={sig_small.sum():,})')

# Plot significant and large effect
sig_large = sig_mask & effect_mask
ax.scatter(results_df.loc[sig_large, 'effect_size'],
           results_df.loc[sig_large, 'neg_log10_pval'],
           c='firebrick', s=40, alpha=0.8, label=f'FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})')

# Add threshold lines
ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.3, linewidth=1)
ax.axvline(0.5, color='gray', linestyle=':', alpha=0.5)
ax.axvline(-0.5, color='gray', linestyle=':', alpha=0.5)
ax.axvline(0, color='black', linestyle='-', alpha=0.2)

ax.set_xlabel('Effect Size (log₂ Ref/Alt)', fontsize=12)
ax.set_ylabel('-log₁₀(p-value)', fontsize=12)
ax.set_title('ATAC-seq Allelic Imbalance\nVolcano Plot', fontsize=14)
ax.legend(loc='upper right', fontsize=9)

plt.tight_layout()
plt.savefig(RESULTS_DIR / 'volcano_plot.png', dpi=150)
plt.show()

4.2 Effect Size Distribution#

[ ]:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# All peaks
ax = axes[0]
ax.hist(results_df['effect_size'], bins=50, edgecolor='black', alpha=0.7, color='steelblue')
ax.axvline(0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Effect Size (log₂ Ref/Alt)')
ax.set_ylabel('Number of Peaks')
ax.set_title('All Tested Peaks')

# Significant peaks only
ax = axes[1]
sig_effects = results_df.loc[sig_mask, 'effect_size']
ax.hist(sig_effects, bins=30, edgecolor='black', alpha=0.7, color='firebrick')
ax.axvline(0, color='black', linestyle='--', linewidth=2)
ax.set_xlabel('Effect Size (log₂ Ref/Alt)')
ax.set_ylabel('Number of Peaks')
ax.set_title(f'Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})')

plt.tight_layout()
plt.savefig(RESULTS_DIR / 'effect_size_distribution.png', dpi=150)
plt.show()

# Summary statistics
print("Effect size statistics (significant peaks):")
print(f"  Mean: {sig_effects.mean():.3f}")
print(f"  Median: {sig_effects.median():.3f}")
print(f"  Std: {sig_effects.std():.3f}")
print(f"  Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}")
print(f"  Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}")

4.3 Top Imbalanced Peaks#

[ ]:

# Get top hits by significance
top_hits = results_df[results_df['fdr_pval'] < 0.05].nsmallest(20, 'fdr_pval')

print("Top 20 Peaks with Allelic Imbalance")
print("=" * 80)
display_cols = ['region', 'chr', 'ref_count', 'alt_count', 'mu', 'effect_size', 'p_value', 'fdr_pval']
top_hits[display_cols].round(4)

Section 5: QTL Overlap Analysis#

Peaks with allelic imbalance often harbor chromatin accessibility QTLs (caQTLs) or overlap with expression QTLs (eQTLs). Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.

5.1 Prepare BED File for Overlap Analysis#

[ ]:

# Export significant peaks as BED for overlap analysis
sig_peaks = results_df[results_df['fdr_pval'] < 0.05][['chr', 'start', 'end', 'region', 'effect_size', 'fdr_pval']].copy()
sig_peaks.to_csv(RESULTS_DIR / 'significant_peaks.bed', sep='\t', header=False, index=False)

print(f"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}")
print("\nUse this file for overlap analysis with:")
print("  - GTEx eQTLs (https://gtexportal.org)")
print("  - ENCODE cCREs (https://www.encodeproject.org)")
print("  - Published caQTL datasets")

5.2 Example: GTEx eQTL Overlap#

This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships.

[ ]:

# Example overlap analysis with bedtools (requires eQTL BED file)
overlap_cmd = """
# Download GTEx eQTLs for your tissue of interest
# Example: Brain cortex significant eQTLs

# Intersect imbalanced peaks with eQTL positions
bedtools intersect \\
    -a results/significant_peaks.bed \\
    -b gtex_brain_cortex_eqtls.bed \\
    -wa -wb \\
    > results/peak_eqtl_overlap.bed

# Count overlaps
wc -l results/peak_eqtl_overlap.bed
"""

print("Example bedtools command for eQTL overlap:")
print(overlap_cmd)

[ ]:

# Simulate overlap results for demonstration
np.random.seed(123)

# Create simulated eQTL overlap data
n_overlap = 150
overlap_df = pd.DataFrame({
    'peak': [f'peak_{i}' for i in np.random.choice(range(1500), n_overlap, replace=False)],
    'eqtl_gene': [f'GENE{i}' for i in np.random.randint(1, 500, n_overlap)],
    'eqtl_pval': 10 ** (-np.random.uniform(3, 15, n_overlap)),
    'tissue': np.random.choice(['Brain_Cortex', 'Brain_Hippocampus', 'Liver', 'Heart'], n_overlap)
})

print(f"Peaks overlapping eQTLs: {len(overlap_df)}")
print(f"\nOverlap by tissue:")
print(overlap_df['tissue'].value_counts())

# Enrichment analysis
n_sig_peaks = sig_mask.sum()
n_total_peaks = len(results_df)
n_eqtl_overlap = len(overlap_df)

# Fisher's exact test for enrichment
from scipy.stats import fisher_exact

# Assume 10% of all peaks overlap eQTLs by chance
expected_overlap = int(n_sig_peaks * 0.10)
contingency = [
    [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],
    [expected_overlap, n_sig_peaks - expected_overlap]
]
odds_ratio, p_value = fisher_exact(contingency)

print(f"\nEnrichment Analysis:")
print(f"  Imbalanced peaks: {n_sig_peaks}")
print(f"  Overlapping eQTLs: {n_eqtl_overlap}")
print(f"  Expected by chance: ~{expected_overlap}")
print(f"  Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x")

5.3 Visualization: eQTL Overlap#

[ ]:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pie chart: Overlap proportion
ax = axes[0]
overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]
labels = ['Overlap with eQTL', 'No eQTL overlap']
colors = ['#e74c3c', '#95a5a6']
ax.pie(overlap_counts, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
ax.set_title('Imbalanced Peaks Overlapping eQTLs')

# Bar chart: Overlap by tissue
ax = axes[1]
tissue_counts = overlap_df['tissue'].value_counts()
colors = plt.cm.Set2(range(len(tissue_counts)))
bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor='black')
ax.set_xlabel('Tissue')
ax.set_ylabel('Number of Overlapping eQTLs')
ax.set_title('eQTL Overlap by Tissue')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig(RESULTS_DIR / 'eqtl_overlap.png', dpi=150)
plt.show()

Section 6: Downstream Analysis Hints#

6.1 Motif Enrichment Analysis#

Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis.

[ ]:

motif_cmd = """
# Extract sequences around imbalanced SNPs
bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\
bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa

# Run HOMER motif analysis
findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200

# Alternative: MEME-ChIP
meme-chip -oc meme_results imbalanced_seqs.fa
"""

print("Example commands for motif enrichment analysis:")
print(motif_cmd)

6.2 Gene Ontology Enrichment#

Identify biological processes associated with genes near imbalanced peaks.

[ ]:

go_cmd = """
# Annotate peaks with nearest genes using GREAT or bedtools
bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed

# Extract gene list
cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt

# Use DAVID, Enrichr, or clusterProfiler for GO enrichment
# Web interface: https://david.ncifcrf.gov/
# Web interface: https://maayanlab.cloud/Enrichr/
"""

print("Example commands for GO enrichment analysis:")
print(go_cmd)

6.3 Single-Cell ATAC-seq Extension#

For single-cell ATAC-seq (scATAC-seq), use WASP2’s single-cell workflow to detect cell-type-specific allelic imbalance.

[ ]:

sc_cmd = """
# Count alleles in single-cell ATAC-seq
wasp2-count count-variants-sc \\
    scatac_possorted.bam \\
    variants.vcf.gz \\
    barcodes.tsv \\
    --samples SAMPLE_ID \\
    --feature peaks.bed \\
    --out_file scatac_counts.h5ad

# Detect imbalance per cell type
wasp2-analyze find-imbalance-sc \\
    scatac_counts.h5ad \\
    barcode_celltype_map.tsv \\
    --sample SAMPLE_ID \\
    --min 5 \\
    --phased

# Compare imbalance between cell types
wasp2-analyze compare-imbalance \\
    scatac_counts.h5ad \\
    barcode_celltype_map.tsv \\
    --groups "excitatory,inhibitory" \\
    --sample SAMPLE_ID
"""

print("Commands for single-cell ATAC-seq analysis:")
print(sc_cmd)

Summary#

In this tutorial, you learned to:

Prepare ATAC-seq data - Load peaks and verify input file formats
Count alleles - Use wasp2-count count-variants with peak regions
Detect imbalance - Apply beta-binomial testing with wasp2-analyze find-imbalance
Visualize results - Create volcano plots and effect size distributions
Integrate with QTLs - Overlap with eQTL databases for biological validation

Key Takeaways#

ATAC-seq has lower coverage per peak than RNA-seq; use --min-count 5 instead of 10
FDR correction is essential for multiple testing across thousands of peaks
Consider effect size alongside significance for biological relevance
QTL overlap helps validate findings and identify causal variants

Next Steps#

Comparative Imbalance Tutorial - Compare imbalance between conditions
Single-Cell Tutorial - Cell-type-specific analysis
Statistical Methods - Deep dive into the beta-binomial model

Troubleshooting#

Common Issues#

Low SNP counts in peaks:

Ensure VCF contains heterozygous variants for your sample
Check that peak coordinates use the same reference genome as VCF
Verify --samples matches the sample name in VCF header

Memory errors with large datasets:

Process chromosomes separately with --region chr1_peaks.bed, etc.
Use WASP2_RUST_THREADS=4 to limit parallel processing

No significant results:

Check read depth (may need deeper sequencing)
Verify WASP filtering was applied to remove mapping bias
Consider lowering --min-count threshold (with caution)

Diagnostic Commands#

# Check VCF sample names
bcftools query -l variants.vcf.gz

# Count heterozygous SNPs in your sample
bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l

# Check BAM read depth at a peak
samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'

[ ]:

# Save final results
results_df.to_csv(RESULTS_DIR / 'final_imbalance_results.tsv', sep='\t', index=False)
print(f"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}")
print(f"\nAnalysis complete!")