Quickstart: WASP Mapping Filter#

Learn WASP2’s mapping bias correction in 5 minutes.

Overview #

Goal: Understand and apply the WASP mapping filter to remove reference bias from your alignment data.

Time: ~5 minutes to read, ~30 minutes to run on typical data

Prerequisites:

WASP2 installed (pip install wasp2)
Aligned BAM file (coordinate-sorted)
VCF file with heterozygous variants

The Problem: Reference Mapping Bias #

When reads are aligned to a reference genome, there’s an inherent asymmetry:

Reference:  ...ACGT[A]CGTA...  (reference allele: A)
Read (ref): ...ACGT[A]CGTA...  → Perfect match (0 mismatches)
Read (alt): ...ACGT[G]CGTA...  → 1 mismatch penalty

Result: Reads carrying the alternate allele are more likely to:

Fail to map entirely
Map with lower quality scores
Map to incorrect locations

This causes inflated reference allele counts, leading to false positive ASE signals.

The Solution: WASP Remap-and-Filter #

WASP corrects this by testing whether each read would map identically regardless of which allele it carries:

Identify: Find reads overlapping heterozygous SNPs
Swap: Create versions with alleles swapped (ref→alt, alt→ref)
Remap: Align swapped reads with the same aligner
Filter: Keep only reads that map to the same location after swapping

After filtering, the probability of mapping is equal for both alleles:

\[P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele})\]

Quick Workflow #

Step 1: Create Swapped Reads #

Identify reads overlapping heterozygous SNPs and generate allele-swapped versions:

wasp2-map make-reads sample.bam variants.vcf.gz \
  --samples SAMPLE1 \
  --out-dir wasp_output/

This produces (where sample is your BAM file prefix):

wasp_output/sample_to_remap.bam: Original reads needing remapping
wasp_output/sample_keep.bam: Reads not overlapping variants (kept as-is)
wasp_output/sample_swapped_alleles_r1.fq: Allele-swapped read 1
wasp_output/sample_swapped_alleles_r2.fq: Allele-swapped read 2
wasp_output/sample_wasp_data_files.json: Metadata for filter step

Step 2: Remap Swapped Reads #

Critical: Use the same aligner and parameters as your original mapping!

# Example with BWA (replace 'sample' with your BAM file prefix)
bwa mem -M -t 8 genome.fa \
  wasp_output/sample_swapped_alleles_r1.fq \
  wasp_output/sample_swapped_alleles_r2.fq | \
  samtools sort -o wasp_output/sample_remapped.bam -

samtools index wasp_output/sample_remapped.bam

Using different alignment parameters will invalidate the WASP correction.

Step 3: Filter Remapped Reads #

The WASP filter compares original and remapped positions:

wasp2-map filter-remapped \
  wasp_output/sample_to_remap.bam \
  wasp_output/sample_remapped.bam \
  wasp_output/sample_wasp_filtered.bam

Understanding Filter Statistics #

The WASP filter reports three key metrics:

Interpreting Filter Rates #

95-99% kept: Good - typical for most data types
90-95% kept: Acceptable - may indicate difficult regions
<90% kept: Investigate - check data quality or variant calls

Before/After Example #

At a site with mapping bias:

Example: Before and After WASP#
Metric	Before WASP	After WASP
Reference reads	150	95
Alternate reads	80	85
Ref fraction	0.65	0.53

The biased site (0.65 ref fraction) is corrected to near-balanced (0.53).

Complete Workflow Script #

#!/bin/bash
set -e

# Input files
BAM="sample.bam"
VCF="variants.vcf.gz"
SAMPLE="SAMPLE1"
GENOME="genome.fa"
OUTDIR="wasp_output"

# Extract BAM prefix (filename without .bam extension)
PREFIX=$(basename $BAM .bam)

mkdir -p $OUTDIR

# Step 1: Create allele-swapped reads
echo "Step 1: Creating swapped reads..."
wasp2-map make-reads $BAM $VCF \
  --samples $SAMPLE \
  --out-dir $OUTDIR/

# Step 2: Remap with same aligner
echo "Step 2: Remapping swapped reads..."
bwa mem -M -t 8 $GENOME \
  $OUTDIR/${PREFIX}_swapped_alleles_r1.fq \
  $OUTDIR/${PREFIX}_swapped_alleles_r2.fq | \
  samtools sort -o $OUTDIR/${PREFIX}_remapped.bam -
samtools index $OUTDIR/${PREFIX}_remapped.bam

# Step 3: Filter biased reads
echo "Step 3: Filtering biased reads..."
wasp2-map filter-remapped \
  $OUTDIR/${PREFIX}_to_remap.bam \
  $OUTDIR/${PREFIX}_remapped.bam \
  $OUTDIR/${PREFIX}_wasp_filtered.bam

# Step 4: Merge with non-overlapping reads
echo "Step 4: Merging final BAM..."
samtools merge -f $OUTDIR/${PREFIX}_final.bam \
  $OUTDIR/${PREFIX}_wasp_filtered.bam \
  $OUTDIR/${PREFIX}_keep.bam
samtools index $OUTDIR/${PREFIX}_final.bam

echo "Done! WASP-filtered BAM: $OUTDIR/${PREFIX}_final.bam"

Rust Acceleration #

WASP2 includes a high-performance Rust backend that accelerates the filter step:

Performance Comparison#
Dataset Size	Python	Rust
1M reads	~5 minutes	~30 seconds
10M reads	~50 minutes	~5 minutes
100M reads	~8 hours	~50 minutes

The Rust backend is used automatically when available.

Next Steps #

After WASP filtering:

Count alleles on the filtered BAM:

wasp2-count count-variants wasp_filtered.bam variants.vcf

Analyze allelic imbalance:

wasp2-analyze find-imbalance counts.tsv

Summary #

Key Takeaways#
Concept	Key Point
Problem	Reference bias inflates ref allele counts
Solution	WASP remap-and-filter removes biased reads
Workflow	make-reads → remap → filter-remapped
Expected	90-99% reads pass filter
Result	Unbiased allele counts for ASE analysis