Comparative Imbalance Analysis Tutorial ======================================= This tutorial provides a comprehensive guide to detecting **differential allelic imbalance** between cell types, conditions, or biological groups using WASP2's comparative analysis module. Overview -------- **What is Comparative Imbalance Analysis?** While standard allelic imbalance (AI) analysis detects whether a genomic region shows preferential expression of one allele, comparative imbalance analysis asks a different question: **Does the degree of imbalance differ between groups?** This is powerful for identifying: * **Cell-type-specific regulatory variation** - Regions where different cell types show distinct allelic preferences * **Condition-dependent effects** - Treatment-induced changes in allelic regulation * **Sex differences** - Chromatin regions with sex-biased allelic accessibility * **Developmental dynamics** - Stage-specific changes in allelic regulation **Statistical Approach** WASP2 uses a **likelihood ratio test (LRT)** to compare two hypotheses: .. code-block:: text Null Hypothesis (H0): Both groups share the same allelic imbalance (μ_combined) Alternative Hypothesis (H1): Groups have different imbalance (μ₁ ≠ μ₂) Test Statistic: LRT = -2 × (log L_null - log L_alt) P-value: P(χ²(df=1) > LRT) The test accounts for overdispersion using beta-binomial modeling and applies Benjamini-Hochberg FDR correction for multiple testing. Prerequisites ------------- **Software:** * WASP2 (``pip install wasp2``) * Python with scanpy/pandas for visualization (optional) * R with Seurat for cell type annotation (optional) **Data Requirements:** * AnnData count matrix (``.h5ad``) with allele counts per cell per SNP * Barcode-to-group mapping file (TSV) * Groups can be: cell types, conditions, sex, treatment status, etc. Input Data Format ----------------- Count Matrix (.h5ad) ~~~~~~~~~~~~~~~~~~~~ Your AnnData object should have this structure: .. code-block:: text AnnData object (n_snps × n_cells) ├── .obs # SNP metadata (rows) │ ├── index # SNP identifiers │ └── [sample_name] # Genotypes: '0|1', '1|0', '0/1', '1/0' │ ├── .var # Cell metadata (columns) │ └── group # Cell type/group assignment │ ├── .layers │ ├── "ref" # Reference allele counts (sparse matrix) │ └── "alt" # Alternate allele counts (sparse matrix) │ └── .uns ├── feature # DataFrame: SNP → region mapping └── samples # List of sample names **Create counts from BAM + VCF:** .. code-block:: bash wasp2-count count-variants-sc \ aligned.bam \ phased_variants.vcf.gz \ barcodes.txt \ --samples SAMPLE_ID \ --feature peaks.bed \ --out_file allele_counts.h5ad Barcode Map (TSV) ~~~~~~~~~~~~~~~~~ A two-column tab-separated file (no header) mapping cell barcodes to groups: .. code-block:: text AAACGAACAGTCAGTT-1 excitatory_neurons AAACGAAGTCGCTCTA-1 inhibitory_neurons AAACGAAGTGAACCTA-1 excitatory_neurons AAAGGATCATCGATGT-1 astrocytes AAAGGATGTGCAACGA-1 microglia **Important:** Barcodes must exactly match those in the count matrix (including any ``-1`` suffix). Tutorial 1: Cell Type Comparison -------------------------------- This tutorial demonstrates comparing allelic imbalance between neuronal subtypes. Step 1: Prepare Input Files ~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Export cell type annotations from Seurat:** .. code-block:: r library(Seurat) # Load your analyzed Seurat object seurat_obj <- readRDS("brain_snATAC.rds") # Create barcode-to-celltype mapping barcode_df <- data.frame( barcode = colnames(seurat_obj), celltype = Idents(seurat_obj) ) # Write without header write.table( barcode_df, "barcode_celltype_map.tsv", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE ) **Verify the barcode file:** .. code-block:: bash # Check format head barcode_celltype_map.tsv # AAACGAACAGTCAGTT-1 excitatory_neurons # AAACGAAGTCGCTCTA-1 inhibitory_neurons # Count cells per type cut -f2 barcode_celltype_map.tsv | sort | uniq -c | sort -rn # 2500 excitatory_neurons # 1800 inhibitory_neurons # 1200 astrocytes # 800 oligodendrocytes Step 2: Run Per-Group Imbalance Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, analyze imbalance within each cell type: .. code-block:: bash wasp2-analyze find-imbalance-sc \ allele_counts.h5ad \ barcode_celltype_map.tsv \ --sample SAMPLE_ID \ --phased \ --min 10 \ -z 3 This produces per-celltype result files (e.g., ``ai_results_excitatory_neurons.tsv``). Step 3: Run Comparative Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compare imbalance between specific cell types: .. code-block:: bash # Compare excitatory vs inhibitory neurons wasp2-analyze compare-imbalance \ allele_counts.h5ad \ barcode_celltype_map.tsv \ --sample SAMPLE_ID \ --groups "excitatory_neurons,inhibitory_neurons" \ --phased \ --min 15 **Compare all pairwise combinations:** .. code-block:: bash # Omit --groups to compare all cell types wasp2-analyze compare-imbalance \ allele_counts.h5ad \ barcode_celltype_map.tsv \ --sample SAMPLE_ID \ --phased \ --min 15 This produces output files for each pairwise comparison: * ``ai_results_excitatory_neurons_inhibitory_neurons.tsv`` * ``ai_results_excitatory_neurons_astrocytes.tsv`` * ``ai_results_inhibitory_neurons_astrocytes.tsv`` * ... Step 4: Interpret Results ~~~~~~~~~~~~~~~~~~~~~~~~~ **Output columns explained:** .. list-table:: :header-rows: 1 :widths: 15 85 * - Column - Description * - ``region`` - Genomic region (peak or gene) identifier * - ``num_snps`` - Number of shared heterozygous SNPs used for comparison * - ``combined_mu`` - Reference allele frequency under null hypothesis (shared between groups) * - ``mu1`` - Reference allele frequency in group 1 (e.g., excitatory neurons) * - ``mu2`` - Reference allele frequency in group 2 (e.g., inhibitory neurons) * - ``null_ll`` - Log-likelihood under null hypothesis (shared μ) * - ``alt_ll`` - Log-likelihood under alternative hypothesis (separate μ values) * - ``pval`` - Likelihood ratio test p-value * - ``fdr_pval`` - FDR-corrected p-value (Benjamini-Hochberg) **Filtering significant results:** .. code-block:: bash # Significant differential imbalance (FDR < 0.05) awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_excitatory_neurons_inhibitory_neurons.tsv \ > significant_differential_AI.tsv # Large effect size (>15% difference in allele frequency) awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_differential_AI.tsv \ > large_effect_differential_AI.tsv **Interpret μ values:** * ``mu < 0.5``: Alternate allele favored * ``mu > 0.5``: Reference allele favored * ``|mu1 - mu2| > 0.1``: Meaningful difference (~20% shift in allele preference) Tutorial 2: Sex Differences Analysis ------------------------------------ Identify regions with sex-biased allelic imbalance in chromatin accessibility. Step 1: Create Sex-Labeled Barcode Map ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import pandas as pd import scanpy as sc # Load your annotated data adata = sc.read_h5ad("processed_snATAC.h5ad") # Create barcode-to-sex mapping barcode_df = pd.DataFrame({ 'barcode': adata.obs_names, 'sex': adata.obs['donor_sex'] # 'male' or 'female' }) # Write without header barcode_df.to_csv('barcode_sex_map.tsv', sep='\t', header=False, index=False) Step 2: Run Comparative Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash wasp2-analyze compare-imbalance \ allele_counts.h5ad \ barcode_sex_map.tsv \ --sample SAMPLE_ID \ --groups "male,female" \ --phased \ --min 20 \ --out_file ai_results_sex_comparison.tsv Step 3: Identify Sex-Biased Regions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Extract significant sex differences awk -F'\t' 'NR==1 || $9 < 0.01' ai_results_sex_comparison.tsv > sex_biased_regions.tsv # Count by chromosome (expect enrichment on X) cut -f1 sex_biased_regions.tsv | grep -E "^chr" | cut -d: -f1 | sort | uniq -c Tutorial 3: Treatment vs Control -------------------------------- Compare allelic imbalance before and after drug treatment. Step 1: Prepare Condition Labels ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import pandas as pd # Load metadata with treatment status metadata = pd.read_csv("sample_metadata.csv") # Create barcode-to-condition mapping barcode_df = pd.DataFrame({ 'barcode': metadata['cell_barcode'], 'condition': metadata['treatment_status'] # 'treated' or 'control' }) barcode_df.to_csv('barcode_treatment_map.tsv', sep='\t', header=False, index=False) Step 2: Run Analysis ~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash wasp2-analyze compare-imbalance \ allele_counts.h5ad \ barcode_treatment_map.tsv \ --sample SAMPLE_ID \ --groups "treated,control" \ --min 15 \ --out_file ai_results_treatment.tsv Step 3: Identify Treatment-Responsive Regions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import pandas as pd # Load results results = pd.read_csv('ai_results_treated_control.tsv', sep='\t') # Significant treatment effects significant = results[results['fdr_pval'] < 0.05] print(f"Found {len(significant)} treatment-responsive regions") # Direction of change treatment_gain = significant[significant['mu1'] > significant['mu2'] + 0.1] treatment_loss = significant[significant['mu2'] > significant['mu1'] + 0.1] print(f"Regions with increased ref allele in treatment: {len(treatment_gain)}") print(f"Regions with decreased ref allele in treatment: {len(treatment_loss)}") Visualization Examples ---------------------- Volcano Plot ~~~~~~~~~~~~ .. code-block:: python import pandas as pd import matplotlib.pyplot as plt import numpy as np # Load results results = pd.read_csv('ai_results_excitatory_neurons_inhibitory_neurons.tsv', sep='\t') # Calculate effect size (difference in mu) results['effect_size'] = results['mu1'] - results['mu2'] results['-log10_pval'] = -np.log10(results['pval'] + 1e-300) # Create volcano plot fig, ax = plt.subplots(figsize=(10, 8)) # Non-significant points ns = results['fdr_pval'] >= 0.05 ax.scatter(results.loc[ns, 'effect_size'], results.loc[ns, '-log10_pval'], c='gray', alpha=0.5, s=10, label='Not significant') # Significant points sig = results['fdr_pval'] < 0.05 ax.scatter(results.loc[sig, 'effect_size'], results.loc[sig, '-log10_pval'], c='red', alpha=0.7, s=20, label='FDR < 0.05') ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.5) ax.axvline(0, color='black', linestyle='-', alpha=0.3) ax.set_xlabel('Effect Size (μ₁ - μ₂)', fontsize=12) ax.set_ylabel('-log₁₀(p-value)', fontsize=12) ax.set_title('Differential Allelic Imbalance:\nExcitatory vs Inhibitory Neurons', fontsize=14) ax.legend() plt.tight_layout() plt.savefig('differential_AI_volcano.png', dpi=150) Heatmap of Top Hits ~~~~~~~~~~~~~~~~~~~ .. code-block:: python import pandas as pd import seaborn as sns import matplotlib.pyplot as plt # Load results from multiple comparisons comparisons = [ ('excitatory', 'inhibitory'), ('excitatory', 'astrocyte'), ('inhibitory', 'astrocyte'), ] # Collect mu values for top regions all_results = {} for g1, g2 in comparisons: df = pd.read_csv(f'ai_results_{g1}_{g2}.tsv', sep='\t') all_results[(g1, g2)] = df.set_index('region') # Find regions significant in any comparison sig_regions = set() for df in all_results.values(): sig_regions.update(df[df['fdr_pval'] < 0.05].index[:20]) # Top 20 each # Build heatmap matrix (mu values per cell type) celltypes = ['excitatory', 'inhibitory', 'astrocyte'] heatmap_data = pd.DataFrame(index=list(sig_regions), columns=celltypes) for region in sig_regions: for g1, g2 in comparisons: if region in all_results[(g1, g2)].index: row = all_results[(g1, g2)].loc[region] heatmap_data.loc[region, g1] = row['mu1'] heatmap_data.loc[region, g2] = row['mu2'] # Plot heatmap fig, ax = plt.subplots(figsize=(8, 12)) sns.heatmap(heatmap_data.astype(float), cmap='RdBu_r', center=0.5, vmin=0, vmax=1, ax=ax, cbar_kws={'label': 'Ref Allele Frequency (μ)'}) ax.set_title('Cell-Type-Specific Allelic Imbalance', fontsize=14) plt.tight_layout() plt.savefig('differential_AI_heatmap.png', dpi=150) Command-Line Reference ---------------------- Full Parameter List ~~~~~~~~~~~~~~~~~~~ .. code-block:: bash wasp2-analyze compare-imbalance --help Usage: wasp2-analyze compare-imbalance [OPTIONS] ADATA BARCODE_MAP Arguments: ADATA AnnData file with allele counts (.h5ad) BARCODE_MAP TSV file mapping barcodes to groups Options: --groups TEXT Comma-separated groups to compare (default: all) --min INTEGER Minimum allele count per region per group (default: 10) --pseudocount INT Pseudocount for zero counts (default: 1) --sample TEXT Sample name for genotype filtering --phased Use phased genotype information -z, --z_cutoff INT Remove outlier SNPs above z-score threshold --out_file TEXT Output file path Best Practices -------------- Data Quality ~~~~~~~~~~~~ * **Use WASP-filtered BAMs** to remove mapping bias artifacts * **Require sufficient counts** (``--min 15`` or higher for robust estimates) * **Apply z-score filtering** (``-z 3``) to remove outliers from CNVs or mapping artifacts Statistical Power ~~~~~~~~~~~~~~~~~ * **Merge similar groups** if individual populations have low cell counts * **Use phased genotypes** when available for improved power * **Focus on regions with multiple SNPs** for more reliable estimates Interpretation ~~~~~~~~~~~~~~ * **Biological replication** - Validate across independent samples * **Effect size matters** - Consider the absolute difference between μ₁ and μ₂ alongside p-values * **Integrate with eQTL data** - Connect to known regulatory variants * **Orthogonal validation** - Confirm top hits with targeted methods Common Issues ------------- Low Power / Few Significant Results ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * Increase sequencing depth * Merge similar cell types to increase counts per group * Lower ``--min`` threshold (with caution) * Use phased genotypes if available Too Many Significant Results ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * Check for batch effects between groups * Verify WASP filtering was applied * Use stricter FDR threshold (e.g., 0.01) * Check that groups have similar sequencing depth Memory Issues ~~~~~~~~~~~~~ Process chromosomes separately: .. code-block:: bash for chr in chr{1..22}; do wasp2-count count-variants-sc \ sample.bam variants.vcf.gz barcodes.tsv \ --region peaks_${chr}.bed \ --out_file counts_${chr}.h5ad wasp2-analyze compare-imbalance \ counts_${chr}.h5ad \ barcode_celltype_map.tsv \ --out_file results_${chr}.tsv done See Also -------- * :doc:`/user_guide/analysis` - Statistical methods and parameters * :doc:`/user_guide/single_cell` - Single-cell data formats * :doc:`/tutorials/scrna_seq` - Basic scRNA-seq tutorial