Quickstart: WASP Mapping Filter ================================ Learn WASP2's mapping bias correction in 5 minutes. .. contents:: Contents :local: :depth: 2 Overview -------- **Goal:** Understand and apply the WASP mapping filter to remove reference bias from your alignment data. **Time:** ~5 minutes to read, ~30 minutes to run on typical data **Prerequisites:** * WASP2 installed (``pip install wasp2``) * Aligned BAM file (coordinate-sorted) * VCF file with heterozygous variants The Problem: Reference Mapping Bias ----------------------------------- When reads are aligned to a reference genome, there's an inherent asymmetry: .. code-block:: text Reference: ...ACGT[A]CGTA... (reference allele: A) Read (ref): ...ACGT[A]CGTA... → Perfect match (0 mismatches) Read (alt): ...ACGT[G]CGTA... → 1 mismatch penalty **Result**: Reads carrying the alternate allele are more likely to: - Fail to map entirely - Map with lower quality scores - Map to incorrect locations This causes **inflated reference allele counts**, leading to false positive ASE signals. The Solution: WASP Remap-and-Filter ----------------------------------- WASP corrects this by testing whether each read would map identically regardless of which allele it carries: 1. **Identify**: Find reads overlapping heterozygous SNPs 2. **Swap**: Create versions with alleles swapped (ref→alt, alt→ref) 3. **Remap**: Align swapped reads with the same aligner 4. **Filter**: Keep only reads that map to the **same location** after swapping After filtering, the probability of mapping is equal for both alleles: .. math:: P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele}) Quick Workflow -------------- Step 1: Create Swapped Reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Identify reads overlapping heterozygous SNPs and generate allele-swapped versions: .. code-block:: bash wasp2-map make-reads sample.bam variants.vcf.gz \ --samples SAMPLE1 \ --out-dir wasp_output/ This produces (where ``sample`` is your BAM file prefix): * ``wasp_output/sample_to_remap.bam``: Original reads needing remapping * ``wasp_output/sample_keep.bam``: Reads not overlapping variants (kept as-is) * ``wasp_output/sample_swapped_alleles_r1.fq``: Allele-swapped read 1 * ``wasp_output/sample_swapped_alleles_r2.fq``: Allele-swapped read 2 * ``wasp_output/sample_wasp_data_files.json``: Metadata for filter step Step 2: Remap Swapped Reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Critical**: Use the **same aligner and parameters** as your original mapping! .. code-block:: bash # Example with BWA (replace 'sample' with your BAM file prefix) bwa mem -M -t 8 genome.fa \ wasp_output/sample_swapped_alleles_r1.fq \ wasp_output/sample_swapped_alleles_r2.fq | \ samtools sort -o wasp_output/sample_remapped.bam - samtools index wasp_output/sample_remapped.bam Using different alignment parameters will invalidate the WASP correction. Step 3: Filter Remapped Reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The WASP filter compares original and remapped positions: .. code-block:: bash wasp2-map filter-remapped \ wasp_output/sample_to_remap.bam \ wasp_output/sample_remapped.bam \ wasp_output/sample_wasp_filtered.bam Understanding Filter Statistics ------------------------------- The WASP filter reports three key metrics: .. table:: WASP Filter Metrics :widths: 20 50 30 ================= ============================================ ============== Metric Description Typical Value ================= ============================================ ============== **Kept reads** Reads that passed the filter 90-99% **Removed (moved)** Reads that mapped to different locations 1-8% **Removed (missing)** Reads that failed to remap <1% ================= ============================================ ============== Interpreting Filter Rates ~~~~~~~~~~~~~~~~~~~~~~~~~ * **95-99% kept**: Good - typical for most data types * **90-95% kept**: Acceptable - may indicate difficult regions * **<90% kept**: Investigate - check data quality or variant calls Before/After Example -------------------- At a site with mapping bias: .. table:: Example: Before and After WASP :widths: 30 35 35 =============== =============== =============== Metric Before WASP After WASP =============== =============== =============== Reference reads 150 95 Alternate reads 80 85 Ref fraction 0.65 0.53 =============== =============== =============== The biased site (0.65 ref fraction) is corrected to near-balanced (0.53). Complete Workflow Script ------------------------ .. code-block:: bash #!/bin/bash set -e # Input files BAM="sample.bam" VCF="variants.vcf.gz" SAMPLE="SAMPLE1" GENOME="genome.fa" OUTDIR="wasp_output" # Extract BAM prefix (filename without .bam extension) PREFIX=$(basename $BAM .bam) mkdir -p $OUTDIR # Step 1: Create allele-swapped reads echo "Step 1: Creating swapped reads..." wasp2-map make-reads $BAM $VCF \ --samples $SAMPLE \ --out-dir $OUTDIR/ # Step 2: Remap with same aligner echo "Step 2: Remapping swapped reads..." bwa mem -M -t 8 $GENOME \ $OUTDIR/${PREFIX}_swapped_alleles_r1.fq \ $OUTDIR/${PREFIX}_swapped_alleles_r2.fq | \ samtools sort -o $OUTDIR/${PREFIX}_remapped.bam - samtools index $OUTDIR/${PREFIX}_remapped.bam # Step 3: Filter biased reads echo "Step 3: Filtering biased reads..." wasp2-map filter-remapped \ $OUTDIR/${PREFIX}_to_remap.bam \ $OUTDIR/${PREFIX}_remapped.bam \ $OUTDIR/${PREFIX}_wasp_filtered.bam # Step 4: Merge with non-overlapping reads echo "Step 4: Merging final BAM..." samtools merge -f $OUTDIR/${PREFIX}_final.bam \ $OUTDIR/${PREFIX}_wasp_filtered.bam \ $OUTDIR/${PREFIX}_keep.bam samtools index $OUTDIR/${PREFIX}_final.bam echo "Done! WASP-filtered BAM: $OUTDIR/${PREFIX}_final.bam" Rust Acceleration ----------------- WASP2 includes a high-performance Rust backend that accelerates the filter step: .. table:: Performance Comparison :widths: 30 35 35 ============= =============== =============== Dataset Size Python Rust ============= =============== =============== 1M reads ~5 minutes ~30 seconds 10M reads ~50 minutes ~5 minutes 100M reads ~8 hours ~50 minutes ============= =============== =============== The Rust backend is used automatically when available. Next Steps ---------- After WASP filtering: 1. **Count alleles** on the filtered BAM: .. code-block:: bash wasp2-count count-variants wasp_filtered.bam variants.vcf 2. **Analyze allelic imbalance**: .. code-block:: bash wasp2-analyze find-imbalance counts.tsv See Also -------- * :doc:`/user_guide/mapping` - Detailed mapping module documentation * :doc:`/methods/mapping_filter` - Algorithm details and mathematics * :doc:`/tutorials/scrna_seq` - Single-cell RNA-seq workflow Summary ------- .. table:: Key Takeaways :widths: 25 75 ============ =============================================== Concept Key Point ============ =============================================== **Problem** Reference bias inflates ref allele counts **Solution** WASP remap-and-filter removes biased reads **Workflow** make-reads → remap → filter-remapped **Expected** 90-99% reads pass filter **Result** Unbiased allele counts for ASE analysis ============ ===============================================