Mapping Module API#
The mapping module implements the WASP algorithm for unbiased read remapping to correct reference bias.
filter_remap_reads#
Filter and merge remapped BAM reads using WASP algorithm.
Provides functions for filtering reads that remap to the same locus after allele swapping and merging with non-remapped reads.
- mapping.filter_remap_reads.filt_remapped_reads(to_remap_bam, remapped_bam, filt_out_bam, keep_read_file=None, threads=1, same_locus_slop=0)[source]#
Filter remapped reads using WASP algorithm.
Uses Rust acceleration.
- Parameters:
to_remap_bam (
str) – Original BAM with reads to remapremapped_bam (
str) – Remapped BAM with swapped allelesfilt_out_bam (
str) – Output filtered BAMkeep_read_file (
str|None) – Optional file to write kept read namesthreads (
int) – Number of threads for BAM I/Osame_locus_slop (
int) – Tolerance (bp) for same locus test (for indels)
- Return type:
intersect_variant_data#
Variant intersection and BAM filtering utilities.
Provides functions for converting variants to BED format, filtering BAM files by variant overlap, and creating intersection files for the WASP pipeline.
- mapping.intersect_variant_data.vcf_to_bed(vcf_file, out_bed, samples=None, include_indels=False, max_indel_len=10)[source]#
Convert variant file to BED format.
Supports VCF, VCF.GZ, BCF, and PGEN formats via the VariantSource API.
Note: Parameter name ‘vcf_file’ is kept for backward compatibility, but accepts any supported variant format (VCF, BCF, PGEN).
- Parameters:
- Return type:
- Returns:
Path to output BED file as string
- mapping.intersect_variant_data.process_bam(bam_file, vcf_bed, remap_bam, remap_reads, keep_bam, is_paired=True, threads=1)[source]#
Filter BAM by variant overlap, splitting into remap/keep BAMs.
Uses Rust acceleration (~2x faster than samtools).
- Parameters:
bam_file (
str) – Input BAM file (coordinate-sorted)vcf_bed (
str) – Variant BED file from vcf_to_bedremap_bam (
str) – Output BAM for reads needing remappingremap_reads (
str) – Output file for unique read nameskeep_bam (
str) – Output BAM for reads not needing remappingis_paired (
bool) – Whether reads are paired-endthreads (
int) – Number of threads
- Return type:
- Returns:
Path to remap BAM file
- mapping.intersect_variant_data.intersect_reads(remap_bam, vcf_bed, out_bed, num_samples=1)[source]#
Intersect BAM reads with variant BED file.
Uses Rust/coitrees (15-30x faster than pybedtools).
- Parameters:
- Return type:
- Returns:
Path to output BED file
make_remap_reads#
Generate allele-swapped reads for remapping.
Provides functions for creating FASTQ files with haplotype-swapped reads that need to be remapped to check for mapping bias.
- mapping.make_remap_reads.write_remap_bam(bam_file, intersect_file, r1_out, r2_out, samples, max_seqs=64, include_indels=False, insert_qual=30)[source]#
Rust-accelerated remapping - parses intersect file once, processes chromosomes in parallel.
Uses Rust acceleration (required; no fallback).
- Parameters:
bam_file (
str) – Input BAM fileintersect_file (
str) – Intersect BED filer1_out (
str) – Output FASTQ for read 1r2_out (
str) – Output FASTQ for read 2max_seqs (
int) – Maximum haplotype sequences per read pairinclude_indels (
bool) – Include indels in remapping (not yet supported in Rust)insert_qual (
int) – Quality score for inserted bases (not yet supported in Rust)
- Return type:
remap_utils#
- mapping.remap_utils.get_read_het_data(read_df, read, col_list, max_seqs=None, include_indels=False, insert_qual=30)[source]#
Extract heterozygous variant data from read with indel support.
- Parameters:
read_df (
DataFrame) – DataFrame with variant positions and allelesread (
AlignedSegment) – pysam AlignedSegmentcol_list (
list[str]) – List of column names containing allelesmax_seqs (
int|None) – Maximum number of alternate sequences (unused currently)include_indels (
bool) – Whether to use indel-aware position mappinginsert_qual (
int) – Quality score for inserted bases (Phred scale)
- Return type:
- Returns:
Tuple of (split_seq, split_qual, allele_series) or None if mapping fails split_seq: List of sequence segments between variants split_qual: List of quality score segments allele_series: List of polars Series with allele data
- mapping.remap_utils.make_phased_seqs(split_seq, hap1_alleles, hap2_alleles)[source]#
Create phased sequences by swapping alleles (SNP-only version).
- mapping.remap_utils.make_phased_seqs_with_qual(split_seq, split_qual, hap1_alleles, hap2_alleles, insert_qual=30)[source]#
Create phased sequences with quality scores (indel-aware version).
- Parameters:
- Return type:
- Returns:
Tuple of ((hap1_seq, hap1_qual), (hap2_seq, hap2_qual))
- mapping.remap_utils.make_multi_seqs(split_seq, allele_combos)[source]#
Create multiple sequences for multi-sample analysis (SNP-only version).
- mapping.remap_utils.make_multi_seqs_with_qual(split_seq, split_qual, allele_combos, insert_qual=30)[source]#
Create multiple sequences with quality scores for multi-sample indel support.
- Parameters:
- Return type:
- Returns:
List of (sequence, quality) tuples, one per unique haplotype
run_mapping#
WASP mapping bias correction pipeline.
Main entry points for running the WASP allele-specific read filtering pipeline including the unified single-pass mode and traditional multi-pass mode.
- mapping.run_mapping.run_make_remap_reads_unified(bam_file, variant_file=None, bed_file=None, samples=None, out_dir=None, include_indels=False, max_indel_len=10, max_seqs=64, threads=8, compression_threads=1, use_parallel=True, compress_output=True)[source]#
FAST unified single-pass pipeline for generating remap reads.
This replaces the multi-pass approach (filter + intersect + remap) with a single BAM pass that’s ~39x faster: - Multi-pass: ~347s (filter ~257s + sort ~20s + intersect ~20s + remap ~50s) - Unified: ~9s (single pass with parallel chromosome processing)
REQUIREMENTS: - BAM must be coordinate-sorted - For parallel mode, BAM must have index (.bai file)
NOTE: This produces remap FASTQs only. For the full WASP workflow (which needs keep_bam for final merge), use run_make_remap_reads() or run the filter step separately.
- Parameters:
bam_file (
str) – Path to BAM file (coordinate-sorted)variant_file (
str|None) – Path to variant file (VCF, VCF.GZ, BCF). Required if bed_file not provided.bed_file (
str|None) – Path to pre-existing BED file. If provided, skips VCF conversion.samples (
str|list[str] |None) – Sample(s) to use from variant file. Required if using variant_file.include_indels (
bool) – Include indels in addition to SNPs (only used with variant_file)max_indel_len (
int) – Maximum indel length (bp) to include (only used with variant_file)max_seqs (
int) – Maximum haplotype sequences per read pairthreads (
int) – Number of threads for parallel processingcompression_threads (
int) – Threads per FASTQ file for gzip compressionuse_parallel (
bool) – Use parallel chromosome processing (requires BAM index)
- Returns:
remap_fq1, remap_fq2: Output FASTQ paths
bed_file: BED file used (created or provided)
pairs_processed, pairs_with_variants, haplotypes_written, etc.
- Return type:
Dictionary with pipeline statistics including output paths
Examples
With VCF (converts to BED automatically):
stats = run_make_remap_reads_unified( bam_file="input.bam", variant_file="variants.vcf.gz", samples=["NA12878"], threads=8 )
With pre-existing BED (faster, skips conversion):
stats = run_make_remap_reads_unified( bam_file="input.bam", bed_file="variants.bed", threads=8 )
- mapping.run_mapping.tempdir_decorator(func)[source]#
Checks and makes tempdir for run_make_remap_reads()
- mapping.run_mapping.run_make_remap_reads(bam_file, variant_file, is_paired=None, samples=None, is_phased=None, out_dir=None, temp_loc=None, out_json=None, include_indels=False, max_indel_len=10, insert_qual=30, max_seqs=64, threads=1)[source]#
Parser that parses initial input. Finds intersecting variants and generates swapped allele reads to be remapped.
- Parameters:
bam_file (str) – Path to BAM file
variant_file (str) – Path to variant file (VCF, VCF.GZ, BCF, or PGEN)
is_paired (bool, optional) – Whether reads are paired, defaults to None (auto-detect)
samples (str or List[str], optional) – Sample(s) to use from variant file, defaults to None
is_phased (bool, optional) – Whether variant file is phased, defaults to None (auto-detect)
out_dir (str, optional) – Output directory, defaults to None
temp_loc (str, optional) – Temp directory for intermediary files, defaults to None
out_json (str, optional) – Output JSON file path, defaults to None
include_indels (bool, optional) – Include indels in addition to SNPs, defaults to False
max_indel_len (int, optional) – Maximum indel length (bp) to include, defaults to 10
insert_qual (int, optional) – Quality score for inserted bases (Phred), defaults to 30
max_seqs (int, optional) – Maximum number of alternate sequences per read, defaults to 64
threads (int, optional) – Number of threads for BAM I/O, defaults to 1
- Return type:
- mapping.run_mapping.check_filt_input(func)[source]#
Decorator that parses valid input types for run_wasp_filt()
- Parameters:
func (_type_) – _description_
- Raises:
ValueError – _description_
- Returns:
_description_
- Return type:
_type_
- mapping.run_mapping.run_wasp_filt(remapped_bam, to_remap_bam, keep_bam, wasp_out_bam, remap_keep_bam=None, remap_keep_file=None, threads=1, use_rust=True, same_locus_slop=0)[source]#
Filter reads that remap to the same loc and merges with non-remapped reads to create a complete WASP filtered BAM
- Parameters:
remapped_bam (_type_) – _description_
to_remap_bam (_type_) – _description_
keep_bam (_type_) – _description_
wasp_out_bam (_type_) – _description_
remap_keep_bam (_type_, optional) – _description_, defaults to None
remap_keep_file (_type_, optional) – _description_, defaults to None
threads (int, optional) – Number of threads for BAM I/O, defaults to 1
use_rust (bool, optional) – Deprecated; Rust is now always used. Kept for backward compatibility.
same_locus_slop (int, optional) – Tolerance (bp) for same locus test, defaults to 0
- Return type:
wasp_data_files#
File path management for WASP mapping pipeline.
Provides the WaspDataFiles class for managing input/output paths and auto-detecting file properties.
- class mapping.wasp_data_files.WaspDataFiles(bam_file, variant_file, is_paired=None, samples=None, is_phased=None, out_dir=None, temp_loc=None)[source]#
Bases:
objectManage file paths and auto-detection for WASP mapping pipeline.
CLI Entry Point#
- mapping.__main__.main(ctx, version=False, verbose=False, quiet=False)[source]#
WASP2 read mapping commands for allele swapping and filtering.
- Return type: