Mapping Module API#

The mapping module implements the WASP algorithm for unbiased read remapping to correct reference bias.

filter_remap_reads#

Filter and merge remapped BAM reads using WASP algorithm.

Provides functions for filtering reads that remap to the same locus after allele swapping and merging with non-remapped reads.

mapping.filter_remap_reads.filt_remapped_reads(to_remap_bam, remapped_bam, filt_out_bam, keep_read_file=None, threads=1, same_locus_slop=0)[source]#

Filter remapped reads using WASP algorithm.

Uses Rust acceleration.

Parameters:

to_remap_bam (str) – Original BAM with reads to remap
remapped_bam (str) – Remapped BAM with swapped alleles
filt_out_bam (str) – Output filtered BAM
keep_read_file (str | None) – Optional file to write kept read names
threads (int) – Number of threads for BAM I/O
same_locus_slop (int) – Tolerance (bp) for same locus test (for indels)

Return type:

None

mapping.filter_remap_reads.merge_filt_bam(keep_bam, remapped_filt_bam, out_bam, threads=1)[source]#

Merge filtered BAM files using samtools (faster than pysam).

Both input BAMs are already coordinate-sorted, so samtools merge produces sorted output without needing an explicit sort step.

Parameters:

keep_bam (str) – BAM with reads that didn’t need remapping
remapped_filt_bam (str) – BAM with filtered remapped reads
out_bam (str) – Output merged BAM
threads (int) – Number of threads for samtools

Return type:

None

intersect_variant_data#

Variant intersection and BAM filtering utilities.

Provides functions for converting variants to BED format, filtering BAM files by variant overlap, and creating intersection files for the WASP pipeline.

mapping.intersect_variant_data.vcf_to_bed(vcf_file, out_bed, samples=None, include_indels=False, max_indel_len=10)[source]#

Convert variant file to BED format.

Supports VCF, VCF.GZ, BCF, and PGEN formats via the VariantSource API.

Note: Parameter name ‘vcf_file’ is kept for backward compatibility, but accepts any supported variant format (VCF, BCF, PGEN).

Parameters:

vcf_file (str | Path) – Path to variant file (VCF, VCF.GZ, BCF, or PGEN)
out_bed (str | Path) – Output BED file path
samples (list[str] | None) – Optional list of sample IDs. If provided, filters to het sites.
include_indels (bool) – Include indels in addition to SNPs
max_indel_len (int) – Maximum indel length (bp) to include

Return type:

str

Returns:

Path to output BED file as string

mapping.intersect_variant_data.process_bam(bam_file, vcf_bed, remap_bam, remap_reads, keep_bam, is_paired=True, threads=1)[source]#

Filter BAM by variant overlap, splitting into remap/keep BAMs.

Uses Rust acceleration (~2x faster than samtools).

Parameters:

bam_file (str) – Input BAM file (coordinate-sorted)
vcf_bed (str) – Variant BED file from vcf_to_bed
remap_bam (str) – Output BAM for reads needing remapping
remap_reads (str) – Output file for unique read names
keep_bam (str) – Output BAM for reads not needing remapping
is_paired (bool) – Whether reads are paired-end
threads (int) – Number of threads

Return type:

str

Returns:

Path to remap BAM file

mapping.intersect_variant_data.intersect_reads(remap_bam, vcf_bed, out_bed, num_samples=1)[source]#

Intersect BAM reads with variant BED file.

Uses Rust/coitrees (15-30x faster than pybedtools).

Parameters:

remap_bam (str) – Path to BAM file with reads overlapping variants
vcf_bed (str) – Path to BED file with variant positions
out_bed (str) – Output path for intersection results
num_samples (int) – Number of sample genotype columns in BED file (default 1)

Return type:

str

Returns:

Path to output BED file

mapping.intersect_variant_data.make_intersect_df(intersect_file, samples, is_paired=True)[source]#

Parse intersection file into a typed polars DataFrame.

Parameters:

intersect_file (str) – Path to intersection BED file.
samples (list[str]) – List of sample column names.
is_paired (bool, optional) – Whether reads are paired-end, by default True.

Returns:

Parsed intersection data with alleles split by sample.

Return type:

pl.DataFrame

make_remap_reads#

Generate allele-swapped reads for remapping.

Provides functions for creating FASTQ files with haplotype-swapped reads that need to be remapped to check for mapping bias.

mapping.make_remap_reads.write_remap_bam(bam_file, intersect_file, r1_out, r2_out, samples, max_seqs=64, include_indels=False, insert_qual=30)[source]#

Rust-accelerated remapping - parses intersect file once, processes chromosomes in parallel.

Uses Rust acceleration (required; no fallback).

Parameters:

bam_file (str) – Input BAM file
intersect_file (str) – Intersect BED file
r1_out (str) – Output FASTQ for read 1
r2_out (str) – Output FASTQ for read 2
samples (list[str]) – List of sample IDs
max_seqs (int) – Maximum haplotype sequences per read pair
include_indels (bool) – Include indels in remapping (not yet supported in Rust)
insert_qual (int) – Quality score for inserted bases (not yet supported in Rust)

Return type:

None

remap_utils#

mapping.remap_utils.paired_read_gen(bam, chrom=None)[source]#

Return type:: Generator[tuple[AlignedSegment, AlignedSegment], None, None]

mapping.remap_utils.paired_read_gen_stat(bam, read_stats, chrom=None)[source]#

Return type:: Generator[tuple[AlignedSegment, AlignedSegment], None, None]

mapping.remap_utils.align_pos_gen(read, align_dict, pos_list)[source]#

Return type:: Generator[int, None, None]

mapping.remap_utils.get_read_het_data(read_df, read, col_list, max_seqs=None, include_indels=False, insert_qual=30)[source]#

Extract heterozygous variant data from read with indel support.

Parameters:

read_df (DataFrame) – DataFrame with variant positions and alleles
read (AlignedSegment) – pysam AlignedSegment
col_list (list[str]) – List of column names containing alleles
max_seqs (int | None) – Maximum number of alternate sequences (unused currently)
include_indels (bool) – Whether to use indel-aware position mapping
insert_qual (int) – Quality score for inserted bases (Phred scale)

Return type:

tuple[list[str], list[Any], list[Series]] | None

Returns:

Tuple of (split_seq, split_qual, allele_series) or None if mapping fails split_seq: List of sequence segments between variants split_qual: List of quality score segments allele_series: List of polars Series with allele data

mapping.remap_utils.make_phased_seqs(split_seq, hap1_alleles, hap2_alleles)[source]#

Create phased sequences by swapping alleles (SNP-only version).

Parameters:

split_seq (list[str]) – List of sequence segments
hap1_alleles (Any) – Haplotype 1 alleles
hap2_alleles (Any) – Haplotype 2 alleles

Return type:

tuple[str, str]

Returns:

Tuple of (hap1_seq, hap2_seq) strings

mapping.remap_utils.make_phased_seqs_with_qual(split_seq, split_qual, hap1_alleles, hap2_alleles, insert_qual=30)[source]#

Create phased sequences with quality scores (indel-aware version).

Parameters:

split_seq (list[str]) – List of sequence segments
split_qual (list[ndarray]) – List of quality score arrays
hap1_alleles (Any) – Haplotype 1 alleles
hap2_alleles (Any) – Haplotype 2 alleles
insert_qual (int) – Quality score for inserted bases

Return type:

tuple[tuple[str, ndarray], tuple[str, ndarray]]

Returns:

Tuple of ((hap1_seq, hap1_qual), (hap2_seq, hap2_qual))

mapping.remap_utils.make_multi_seqs(split_seq, allele_combos)[source]#

Create multiple sequences for multi-sample analysis (SNP-only version).

Parameters:

split_seq (list[str]) – List of sequence segments
allele_combos (Any) – List of allele combinations across samples

Return type:

list[str]

Returns:

List of sequence strings, one per unique haplotype

mapping.remap_utils.make_multi_seqs_with_qual(split_seq, split_qual, allele_combos, insert_qual=30)[source]#

Create multiple sequences with quality scores for multi-sample indel support.

Parameters:

split_seq (list[str]) – List of sequence segments
split_qual (list[ndarray]) – List of quality score arrays
allele_combos (Any) – List of allele combinations across samples
insert_qual (int) – Quality score for inserted bases

Return type:

list[tuple[str, ndarray]]

Returns:

List of (sequence, quality) tuples, one per unique haplotype

mapping.remap_utils.write_read(out_bam, read, new_seq, new_name, new_qual=None)[source]#

Write a modified read to output BAM.

Parameters:

out_bam (AlignmentFile) – Output BAM file
read (AlignedSegment) – Original read
new_seq (str) – New sequence
new_name (str) – New read name
new_qual (ndarray | None) – Optional new quality scores (for indels)

Return type:

None

run_mapping#

WASP mapping bias correction pipeline.

Main entry points for running the WASP allele-specific read filtering pipeline including the unified single-pass mode and traditional multi-pass mode.

mapping.run_mapping.run_make_remap_reads_unified(bam_file, variant_file=None, bed_file=None, samples=None, out_dir=None, include_indels=False, max_indel_len=10, max_seqs=64, threads=8, compression_threads=1, use_parallel=True, compress_output=True)[source]#

FAST unified single-pass pipeline for generating remap reads.

This replaces the multi-pass approach (filter + intersect + remap) with a single BAM pass that’s ~39x faster: - Multi-pass: ~347s (filter ~257s + sort ~20s + intersect ~20s + remap ~50s) - Unified: ~9s (single pass with parallel chromosome processing)

REQUIREMENTS: - BAM must be coordinate-sorted - For parallel mode, BAM must have index (.bai file)

NOTE: This produces remap FASTQs only. For the full WASP workflow (which needs keep_bam for final merge), use run_make_remap_reads() or run the filter step separately.

Parameters:

bam_file (str) – Path to BAM file (coordinate-sorted)
variant_file (str | None) – Path to variant file (VCF, VCF.GZ, BCF). Required if bed_file not provided.
bed_file (str | None) – Path to pre-existing BED file. If provided, skips VCF conversion.
samples (str | list[str] | None) – Sample(s) to use from variant file. Required if using variant_file.
out_dir (str | None) – Output directory for FASTQ files
include_indels (bool) – Include indels in addition to SNPs (only used with variant_file)
max_indel_len (int) – Maximum indel length (bp) to include (only used with variant_file)
max_seqs (int) – Maximum haplotype sequences per read pair
threads (int) – Number of threads for parallel processing
compression_threads (int) – Threads per FASTQ file for gzip compression
use_parallel (bool) – Use parallel chromosome processing (requires BAM index)

Returns:

remap_fq1, remap_fq2: Output FASTQ paths
bed_file: BED file used (created or provided)
pairs_processed, pairs_with_variants, haplotypes_written, etc.

Return type:

Dictionary with pipeline statistics including output paths

Examples

With VCF (converts to BED automatically):

stats = run_make_remap_reads_unified(
    bam_file="input.bam",
    variant_file="variants.vcf.gz",
    samples=["NA12878"],
    threads=8
)

With pre-existing BED (faster, skips conversion):

stats = run_make_remap_reads_unified(
    bam_file="input.bam",
    bed_file="variants.bed",
    threads=8
)

mapping.run_mapping.tempdir_decorator(func)[source]#

Checks and makes tempdir for run_make_remap_reads()

Return type:: Callable[..., Any]

mapping.run_mapping.run_make_remap_reads(bam_file, variant_file, is_paired=None, samples=None, is_phased=None, out_dir=None, temp_loc=None, out_json=None, include_indels=False, max_indel_len=10, insert_qual=30, max_seqs=64, threads=1)[source]#

Parser that parses initial input. Finds intersecting variants and generates swapped allele reads to be remapped.

Parameters:

bam_file (str) – Path to BAM file
variant_file (str) – Path to variant file (VCF, VCF.GZ, BCF, or PGEN)
is_paired (bool, optional) – Whether reads are paired, defaults to None (auto-detect)
samples (str or List[str], optional) – Sample(s) to use from variant file, defaults to None
is_phased (bool, optional) – Whether variant file is phased, defaults to None (auto-detect)
out_dir (str, optional) – Output directory, defaults to None
temp_loc (str, optional) – Temp directory for intermediary files, defaults to None
out_json (str, optional) – Output JSON file path, defaults to None
include_indels (bool, optional) – Include indels in addition to SNPs, defaults to False
max_indel_len (int, optional) – Maximum indel length (bp) to include, defaults to 10
insert_qual (int, optional) – Quality score for inserted bases (Phred), defaults to 30
max_seqs (int, optional) – Maximum number of alternate sequences per read, defaults to 64
threads (int, optional) – Number of threads for BAM I/O, defaults to 1

Return type:

None

mapping.run_mapping.check_filt_input(func)[source]#

Decorator that parses valid input types for run_wasp_filt()

Parameters:: func (_type_) – _description_
Raises:: ValueError – _description_
Returns:: _description_
Return type:: _type_

mapping.run_mapping.run_wasp_filt(remapped_bam, to_remap_bam, keep_bam, wasp_out_bam, remap_keep_bam=None, remap_keep_file=None, threads=1, use_rust=True, same_locus_slop=0)[source]#

Filter reads that remap to the same loc and merges with non-remapped reads to create a complete WASP filtered BAM

Parameters:

remapped_bam (_type_) – _description_
to_remap_bam (_type_) – _description_
keep_bam (_type_) – _description_
wasp_out_bam (_type_) – _description_
remap_keep_bam (_type_, optional) – _description_, defaults to None
remap_keep_file (_type_, optional) – _description_, defaults to None
threads (int, optional) – Number of threads for BAM I/O, defaults to 1
use_rust (bool, optional) – Deprecated; Rust is now always used. Kept for backward compatibility.
same_locus_slop (int, optional) – Tolerance (bp) for same locus test, defaults to 0

Return type:

None

wasp_data_files#

File path management for WASP mapping pipeline.

Provides the WaspDataFiles class for managing input/output paths and auto-detecting file properties.

class mapping.wasp_data_files.WaspDataFiles(bam_file, variant_file, is_paired=None, samples=None, is_phased=None, out_dir=None, temp_loc=None)[source]#

Bases: object

Manage file paths and auto-detection for WASP mapping pipeline.

__init__(bam_file, variant_file, is_paired=None, samples=None, is_phased=None, out_dir=None, temp_loc=None)[source]#

write_data(out_file=None)[source]#

Export Relevant Files to JSON Used for parsing post remapping step easily

Parameters:: out_file (str, optional) – name for output file if not using default
Return type:: None

CLI Entry Point#

mapping.__main__.main(ctx, version=False, verbose=False, quiet=False)[source]#

WASP2 read mapping commands for allele swapping and filtering.

Return type:: None

mapping.__main__.make_reads(bam, variants, samples=None, out_dir=None, temp_loc=None, out_json=None, is_paired=None, is_phased=None, include_indels=False, max_indel_len=10, insert_qual=30, max_seqs=64, threads=1)[source]#

Generate reads with swapped alleles for remapping.

Return type:: None

mapping.__main__.filter_remapped(remapped_bam, to_remap_bam=None, keep_bam=None, wasp_data_json=None, out_bam=None, remap_keep_bam=None, remap_keep_file=None, threads=1, use_rust=True, same_locus_slop=0)[source]#

Filter remapped reads using WASP algorithm.

Return type:: None

Mapping Module API#

filter_remap_reads#

intersect_variant_data#

make_remap_reads#

remap_utils#

run_mapping#

wasp_data_files#

CLI Entry Point#

This Page