RNA-seq Allelic Imbalance Tutorial#
This tutorial demonstrates a complete workflow for detecting allele-specific expression (ASE) in bulk RNA-seq data using WASP2.
Estimated time: ~30 minutes
Overview#
The tutorial covers the complete RNA-seq allelic imbalance analysis pipeline:
Data Loading - BAM, VCF, and gene annotations (GTF)
Allele Counting - Count reads at heterozygous SNPs within genes/exons
Statistical Testing - Beta-binomial model for allelic imbalance detection
ASE Visualization - Volcano plots and allele ratio distributions
Imprinting Detection - Identify monoallelic expression patterns
eQTL Integration - Connect ASE signals to regulatory variants
Prerequisites#
Software:
WASP2 (
pip install wasp2)Python packages: pandas, numpy, matplotlib, seaborn, scipy
Data:
Aligned BAM file (coordinate-sorted, indexed)
Phased VCF file with heterozygous variants
Gene annotation file (GTF format, e.g., GENCODE)
Workflow Summary#
Step 1: Count Alleles at Genes#
Use wasp2-count count-variants to count allele-specific reads at heterozygous SNPs:
wasp2-count count-variants \
sample.bam \
variants.vcf.gz \
--samples SAMPLE_ID \
--region genes.gtf \
--out_file allele_counts.tsv
This produces a TSV file with columns:
chr,pos: SNP locationref,alt: Allelesref_count,alt_count: Read counts per alleleother_count: Reads supporting other alleles (non-ref, non-alt)gene_id,gene_name: Overlapping gene annotationfeature: Feature type (exon, intron, etc.) when using GTF
Step 2: Test for Allelic Imbalance#
Use wasp2-analyze find-imbalance to detect significant ASE:
wasp2-analyze find-imbalance \
allele_counts.tsv \
--min 10 \
--pseudocount 1 \
--phased \
--out_file ai_results.tsv
The beta-binomial model tests for deviation from 50:50 allele ratios, accounting for biological overdispersion.
Output columns:
region: Gene identifierref_count,alt_count: Aggregated countsp_value: Likelihood ratio test p-valuefdr_pval: FDR-corrected p-valueeffect_size: Log2 fold change (ref/alt)
Step 3: Identify Significant ASE#
Filter results by significance and effect size:
# Significant ASE (FDR < 0.05, |log2FC| > 1)
# Column indices: $5 = fdr_pval, $6 = effect_size
awk -F'\t' 'NR==1 || ($5 < 0.05 && ($6 > 1 || $6 < -1))' ai_results.tsv > significant_ase.tsv
Step 4: Detect Imprinting Patterns#
Monoallelic expression (>90:10 allele ratio) may indicate genomic imprinting:
import pandas as pd
results = pd.read_csv('ai_results.tsv', sep='\t')
total = results['ref_count'] + results['alt_count']
results['ref_ratio'] = results['ref_count'] / total
# Monoallelic genes
monoallelic = results[
(results['ref_ratio'] > 0.9) | (results['ref_ratio'] < 0.1)
]
Step 5: Integrate with eQTL Data#
Cross-reference ASE results with eQTL databases (e.g., GTEx):
import pandas as pd
ase = pd.read_csv('ai_results.tsv', sep='\t')
eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t')
# Merge on gene ID
integrated = ase.merge(eqtl, left_on='region', right_on='gene_id')
# Check direction concordance between ASE and eQTL
# ASE effect_size > 0 means REFERENCE allele is more expressed
# eQTL slope > 0 means ALTERNATE allele INCREASES expression
# Therefore, concordance = OPPOSITE signs (ref high in ASE = alt low in eQTL)
integrated['concordant'] = (
(integrated['effect_size'] > 0) != (integrated['slope'] > 0)
)
Key Concepts#
Beta-Binomial Model#
WASP2 uses a beta-binomial distribution to model allele counts:
Accounts for overdispersion (biological variation beyond binomial sampling)
Models technical noise from PCR amplification and sequencing
Aggregates information across multiple SNPs per gene
The null hypothesis is equal expression from both alleles (p = 0.5).
Effect Size Interpretation#
log2FC > 1: Reference allele 2x more expressed (strong ASE)
log2FC > 2: Reference allele 4x more expressed (very strong ASE)
log2FC near 0: Balanced expression (no ASE)
Significance Thresholds#
FDR < 0.05: Standard significance threshold
FDR < 0.01: Stringent threshold for high-confidence hits
Combine with effect size filters to focus on biologically meaningful results
Troubleshooting#
Low SNP Counts#
If few heterozygous SNPs are detected:
Verify VCF contains heterozygous genotypes:
Phased format: GT =
0|1or1|0(pipe separator)Unphased format: GT =
0/1or1/0(slash separator)Use
--phasedflag only with phased genotypes
Check sample ID matches VCF sample column
Ensure BAM and VCF use the same reference genome
No Significant Results#
Increase sequencing depth (more reads = more power)
Lower
--minthreshold (but interpret with caution)Check for batch effects or technical artifacts
Too Many Significant Results#
Verify WASP mapping bias correction was applied
Check for copy number variation (CNV) artifacts
Use stricter FDR threshold (e.g., 0.01)
See Also#
Counting Module - Detailed counting options
Analysis Module - Statistical methods and parameters
10X scRNA-seq Tutorial - Single-cell RNA-seq tutorial
Comparative Imbalance Analysis Tutorial - Comparing ASE between groups