WASP Mapping Bias Correction#

This document describes the WASP algorithm for correcting reference mapping bias in allele-specific analysis.

The Problem: Reference Mapping Bias #

Standard read aligners map sequencing reads against a reference genome. When a read originates from the alternate allele at a heterozygous site, it carries a mismatch relative to the reference:

Reference:  ...ACGT[A]CGTA...
Read (ref): ...ACGT[A]CGTA...  → Maps perfectly (0 mismatches)
Read (alt): ...ACGT[G]CGTA...  → Maps with 1 mismatch

This asymmetry causes reference mapping bias: reads carrying the reference allele are more likely to map successfully and with higher quality, leading to inflated reference allele counts.

The effect is particularly pronounced when:

The variant is near other polymorphisms (haplotype effects)
The read has low overall quality
The aligner uses strict mismatch penalties
The region has repetitive sequence

Uncorrected mapping bias inflates reference allele counts, causing false positive ASE signals and biased effect sizes in QTL mapping.

The WASP Algorithm #

WASP (WASP Allele-Specific Pipeline) [vandeGeijn2015] corrects mapping bias through a remap-and-filter strategy:

Algorithm Overview #

Identify overlapping reads: Find reads that overlap heterozygous SNPs
Swap alleles: For each overlapping read, create a version with the alternate allele swapped to the reference (and vice versa)
Remap: Align the swapped reads to the reference genome
Filter: Keep only reads that map to the same location after swapping

Mathematical Justification #

Let \(M(r)\) be the mapping location of read \(r\), and let \(r'\) be the allele-swapped version of \(r\).

A read passes the WASP filter if and only if:

\[M(r) = M(r')\]

This criterion ensures that the read would have mapped identically regardless of which allele it carried, eliminating differential mappability.

Theorem: After WASP filtering, the probability of mapping is equal for reads from either allele:

\[P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele})\]

Implementation Details #

Step 1: Create Reads for Remapping #

WASP2 identifies reads overlapping variants using interval trees (coitrees) for efficient coordinate queries:

wasp2-map make-reads sample.bam variants.vcf --samples SAMPLE1

For each read overlapping a heterozygous site:

Extract the original read sequence
Identify all variant positions within the read
Generate haplotype combinations with swapped alleles
Write swapped reads to FASTQ for remapping

Haplotype Generation

When a read overlaps multiple heterozygous sites, all combinations must be tested. For \(n\) het sites, there are \(2^n\) haplotypes:

Read overlaps 2 het sites: A/G and C/T

Original:    ...A...C...
Haplotype 1: ...G...C...  (swap first)
Haplotype 2: ...A...T...  (swap second)
Haplotype 3: ...G...T...  (swap both)

WASP2 caps the number of haplotypes per read (default: 64) to prevent combinatorial explosion with highly polymorphic regions.

Step 2: Remap with Original Aligner #

The swapped reads must be remapped using the same aligner and parameters as the original mapping:

bwa mem -M genome.fa swapped_r1.fq swapped_r2.fq | \
  samtools view -bS - > remapped.bam
samtools sort -o remapped.sorted.bam remapped.bam
samtools index remapped.sorted.bam

Critical: Using different alignment parameters will invalidate the WASP correction.

Step 3: Filter Remapped Reads #

The WASP filter compares original and remapped positions:

wasp2-map filter-remapped original_to_remap.bam remapped.bam output.bam

A read passes if:

The remapped read exists (didn’t fail to map)
The mapping position matches the original within tolerance
For paired-end reads, both mates satisfy the above

Same-Locus Test

For SNPs, exact position matching is required:

\[|M_{\text{original}} - M_{\text{remapped}}| = 0\]

For indels, a small tolerance (slop) accommodates alignment ambiguity:

\[|M_{\text{original}} - M_{\text{remapped}}| \leq \text{slop}\]

Paired-End Considerations #

For paired-end reads, WASP2 requires both mates to pass:

Both mates must remap successfully
Both mates must map to the same location
Insert size must remain consistent

This is more stringent than single-end filtering but ensures the read pair as a unit is unbiased.

Rust Acceleration #

WASP2 implements the filtering step in Rust for performance:

from wasp2_rust import filter_bam_wasp

kept, filtered, total = filter_bam_wasp(
    to_remap_bam="original.bam",
    remapped_bam="remapped.bam",
    remap_keep_bam="output.bam",
    threads=4,
    same_locus_slop=0,  # Exact matching for SNPs
)

The Rust implementation provides:

Parallel BAM I/O: Multi-threaded reading and writing
Streaming comparison: Memory-efficient position matching
htslib integration: Native BAM format support

Performance Characteristics #

WASP Filter Performance#
Reads to Filter	Python Implementation	Rust Implementation
1 million	~5 minutes	~30 seconds
10 million	~50 minutes	~5 minutes
100 million	~8 hours	~50 minutes

Benchmarks with coordinate-sorted BAM, single thread.

Expected Filter Rates #

Typical filter rates depend on data type and variant density:

Typical Filter Rates by Data Type#
Data Type	Filter Rate	Notes
RNA-seq	5-15%	Higher near splice junctions
ATAC-seq	2-8%	Lower due to shorter reads
ChIP-seq	3-10%	Depends on peak locations
WGS	1-5%	Lowest filter rate

Limitations and Considerations #

Indel Handling #

The original WASP algorithm was designed for SNPs. Indels present challenges:

Alignment ambiguity at indel boundaries
Multiple valid alignments for the same read
Gap penalties interact with variant detection

WASP2 supports indel mode with configurable parameters:

wasp2-map make-reads sample.bam variants.vcf \
  --include-indels --max-indel-len 10

Structural Variants #

WASP is not designed for structural variants (large deletions, inversions, translocations). These require specialized methods.

Reference Panel Quality #

The effectiveness of WASP depends on having accurate variant calls:

Missing variants leave mapping bias uncorrected
False positive variants cause unnecessary read filtering
Imputation errors can introduce systematic biases

Use high-quality variant calls from the same sample or a well-matched reference panel.

Pipeline Integration #

The typical WASP2 workflow:

# Step 1: Initial mapping
bwa mem -M genome.fa reads_r1.fq reads_r2.fq | \
  samtools sort -o sample.bam -

# Step 2: Create swapped reads
wasp2-map make-reads sample.bam variants.vcf \
  --samples SAMPLE1 --out-dir wasp_temp/

# Step 3: Remap swapped reads (SAME ALIGNER!)
bwa mem -M genome.fa wasp_temp/swapped_r1.fq wasp_temp/swapped_r2.fq | \
  samtools sort -o remapped.bam -

# Step 4: Filter biased reads
wasp2-map filter-remapped \
  wasp_temp/to_remap.bam remapped.bam wasp_filtered.bam

# Step 5: Count alleles on filtered BAM
wasp2-count count-variants wasp_filtered.bam variants.vcf \
  --samples SAMPLE1 --regions peaks.bed