Counting Module API#

The counting module provides functions for allele-specific read counting from BAM files.

count_alleles#

Allele counting functions using Rust-accelerated BAM processing.

counting.count_alleles.count_snp_alleles_rust(bam_file, chrom, snp_list, threads=None)[source]#

Rust-accelerated version of count_snp_alleles.

Parameters:
  • bam_file (str) – Path to BAM file

  • chrom (str) – Chromosome name

  • snp_list (Iterator[tuple[int, str, str]]) – Iterator of (pos, ref, alt) tuples

  • threads (int | None, optional) – Number of threads. If None, defaults to 1 or the value of the WASP2_RUST_THREADS environment variable.

Returns:

List of (chrom, pos, ref_count, alt_count, other_count) tuples

Return type:

list[tuple[str, int, int, int, int]]
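To illustrate the shape of the returned tuples, the pure-Python sketch below tallies reference, alternate, and other bases at each SNP. It does not call the Rust backend: `tally_alleles` and the `observed` mapping (position to list of read bases) are hypothetical stand-ins for what a BAM pileup would supply.

```python
from collections.abc import Iterable


def tally_alleles(
    chrom: str,
    snp_list: Iterable[tuple[int, str, str]],
    observed: dict[int, list[str]],
) -> list[tuple[str, int, int, int, int]]:
    """Hypothetical helper: count ref/alt/other bases per SNP."""
    results = []
    for pos, ref, alt in snp_list:
        bases = observed.get(pos, [])
        ref_count = sum(1 for base in bases if base == ref)
        alt_count = sum(1 for base in bases if base == alt)
        # Anything matching neither allele is counted as "other".
        other_count = len(bases) - ref_count - alt_count
        results.append((chrom, pos, ref_count, alt_count, other_count))
    return results


# Assumed toy input: two SNPs on chr1 and the bases observed over each.
counts = tally_alleles(
    "chr1",
    iter([(100, "A", "G"), (250, "C", "T")]),
    {100: ["A", "A", "G", "T"], 250: ["T"]},
)
```

Each output tuple follows the documented order (chrom, pos, ref_count, alt_count, other_count).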

counting.count_alleles.make_count_df(bam_file, df, use_rust=True)[source]#

Make DataFrame containing all intersections and allele counts.

Parameters:
  • bam_file (str) – Path to BAM file.

  • df (pl.DataFrame) – DataFrame of intersections, output from parse_(intersect/gene)_df().

  • use_rust (bool, optional) – Use Rust acceleration if available, by default True.

Returns:

DataFrame of counts joined with input intersections.

Return type:

pl.DataFrame

Raises:

RuntimeError – If Rust BAM counter is not available.

counting.count_alleles.find_read_aln_pos(read, pos)[source]#

Find query position for a given reference position using binary search.

Parameters:
  • read (pysam.AlignedSegment) – Aligned read from BAM file.

  • pos (int) – Reference position (0-based).

Returns:

Query position if found, None otherwise.

Return type:

int | None
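The lookup can be sketched without pysam as a binary search over (query_pos, ref_pos) pairs sorted by reference position, such as those `AlignedSegment.get_aligned_pairs(matches_only=True)` yields. This is a minimal sketch under that assumption, not the module's implementation; `find_query_pos` is a hypothetical name.

```python
from typing import Optional


def find_query_pos(aligned_pairs: list[tuple[int, int]], pos: int) -> Optional[int]:
    """Hypothetical sketch: binary-search pairs sorted by reference position."""
    lo, hi = 0, len(aligned_pairs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        query_pos, ref_pos = aligned_pairs[mid]
        if ref_pos == pos:
            return query_pos
        if ref_pos < pos:
            lo = mid + 1
        else:
            hi = mid - 1
    # The reference position is not covered (e.g. it falls in a deletion).
    return None


# Assumed toy alignment: reference positions 103-104 are deleted in the read.
pairs = [(0, 100), (1, 101), (2, 102), (3, 105), (4, 106)]
hit = find_query_pos(pairs, 105)
miss = find_query_pos(pairs, 103)
```

A position inside a deletion or outside the alignment has no matching pair, which corresponds to the documented None result.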

count_alleles_sc#

Single-cell allele counting functions.

class counting.count_alleles_sc.CountStatsSC[source]#

Bases: object

Container for mutable single-cell counting statistics.

Tracks allele counts and metadata per chromosome during counting.

__init__()[source]#

stats_to_df()[source]#

Convert statistics to a pandas DataFrame.

Return type:

DataFrame

counting.count_alleles_sc.make_count_matrix(bam_file, df, bc_dict, include_samples=None, include_features=None)[source]#

Create sparse count matrix from BAM and variant data.

Parameters:
  • bam_file (str) – Path to BAM file with cell barcodes.

  • df (pl.DataFrame) – DataFrame with variant positions from intersection.

  • bc_dict (dict[str, int]) – Mapping of cell barcodes to integer indices.

  • include_samples (list[str] | None, optional) – Sample columns to include from variant data, by default None.

  • include_features (list[str] | None, optional) – Feature columns to include, by default None.

Returns:

AnnData object with count matrices in layers (ref, alt, other).

Return type:

ad.AnnData

counting.count_alleles_sc.count_bc_snp_alleles(bam, bc_dict, chrom, snp_list, sc_counts)[source]#

Count alleles at SNP positions for each cell barcode.

Parameters:
  • bam (AlignmentFile) – Open BAM file handle.

  • bc_dict (dict[str, int]) – Mapping of cell barcodes to indices.

  • chrom (str) – Chromosome to process.

  • snp_list (Iterator[tuple[int, int, str, str]]) – Iterator of (index, pos, ref, alt) tuples.

  • sc_counts (CountStatsSC) – Statistics container to update with counts.

Return type:

None
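The per-barcode bookkeeping can be sketched with plain dictionaries keyed by (SNP index, barcode index). This is an illustration only: `count_by_barcode` and the `reads_at_pos` mapping are hypothetical, and skipping reads whose barcode is missing from `bc_dict` is an assumption about the intended behavior.

```python
from collections import defaultdict


def count_by_barcode(bc_dict, snp_list, reads_at_pos):
    """Hypothetical sketch: tally ref/alt/other per (snp_index, barcode_index)."""
    ref_counts = defaultdict(int)
    alt_counts = defaultdict(int)
    other_counts = defaultdict(int)
    for index, pos, ref, alt in snp_list:
        for barcode, base in reads_at_pos.get(pos, []):
            bc_index = bc_dict.get(barcode)
            if bc_index is None:
                continue  # assumption: barcodes outside bc_dict are ignored
            key = (index, bc_index)
            if base == ref:
                ref_counts[key] += 1
            elif base == alt:
                alt_counts[key] += 1
            else:
                other_counts[key] += 1
    return dict(ref_counts), dict(alt_counts), dict(other_counts)


# Assumed toy input: (barcode, base) pairs observed over one SNP at position 100.
bc_dict = {"AAAC": 0, "GGGT": 1}
reads = {100: [("AAAC", "A"), ("AAAC", "G"), ("GGGT", "A"), ("TTTT", "A"), ("GGGT", "C")]}
ref, alt, other = count_by_barcode(bc_dict, [(0, 100, "A", "G")], reads)
```

Keys of this form map directly onto row and column indices of a sparse matrix, one matrix per allele class.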

filter_variant_data#

Variant data filtering and conversion utilities.

counting.filter_variant_data.vcf_to_bed(vcf_file, out_bed, samples=None, include_gt=True, include_indels=False, max_indel_len=10)[source]#

Convert variant file to BED format.

Supports VCF, VCF.GZ, BCF, and PGEN formats via the VariantSource API. This unified implementation replaces an earlier duplicate implementation.

Note: Parameter name ‘vcf_file’ is kept for backward compatibility, but accepts any supported variant format (VCF, BCF, PGEN).

Parameters:
  • vcf_file (str | Path) – Path to variant file (VCF, VCF.GZ, BCF, or PGEN)

  • out_bed (str | Path) – Output BED file path

  • samples (list[str] | None) – Optional list of sample IDs. If provided, filters to heterozygous sites.

  • include_gt (bool) – Include genotype column in output (default True)

  • include_indels (bool) – Include indels in addition to SNPs (default False)

  • max_indel_len (int) – Maximum indel length in bp (default 10)

Returns:

Path to output BED file as string.

Return type:

str
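The coordinate shift and heterozygous filter can be sketched for a single VCF data line. This is a sketch under stated assumptions, not the VariantSource-based implementation: `vcf_line_to_bed` is a hypothetical name, the six-column output layout is an assumed example, and only biallelic SNPs with one sample column are handled.

```python
from typing import Optional


def vcf_line_to_bed(line: str, sample_col: int = 9) -> Optional[str]:
    """Hypothetical sketch: one VCF data line to one BED line, het SNPs only."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    if len(ref) != 1 or len(alt) != 1:
        return None  # skip indels and multi-allelic records in this sketch
    genotype = fields[sample_col].split(":")[0]
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles or alleles[0] == alleles[1]:
        return None  # keep only called heterozygous genotypes
    start = int(pos) - 1  # VCF is 1-based; BED start is 0-based
    return "\t".join([chrom, str(start), pos, ref, alt, genotype])


het = vcf_line_to_bed("chr1\t101\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1")
hom = vcf_line_to_bed("chr1\t202\trs2\tC\tT\t.\tPASS\t.\tGT\t1/1")
```

For a SNP, the BED interval is `[pos - 1, pos)`, so the VCF position reappears as the BED end coordinate.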

counting.filter_variant_data.gtf_to_bed(gtf_file, out_bed, feature, attribute)[source]#

Convert GTF/GFF3 file to BED format.

Parameters:
  • gtf_file (str | Path) – Path to GTF/GFF3 file.

  • out_bed (str | Path) – Output BED file path.

  • feature (str) – Feature type to extract (e.g., ‘gene’, ‘exon’).

  • attribute (str) – Attribute to use as region name.

Returns:

Path to output BED file.

Return type:

str | Path
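As a rough illustration of what the feature and attribute parameters select, the sketch below converts one GTF line to a four-column BED line. `gtf_line_to_bed` is a hypothetical helper, it handles only GTF-style `key "value";` attributes (not GFF3 `key=value`), and the output layout is an assumption.

```python
import re
from typing import Optional


def gtf_line_to_bed(line: str, feature: str, attribute: str) -> Optional[str]:
    """Hypothetical sketch: one GTF line to a chrom/start/end/name BED line."""
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != feature:
        return None  # not the requested feature type
    match = re.search(rf'\b{re.escape(attribute)} "([^"]*)"', fields[8])
    if match is None:
        return None  # requested attribute is absent
    start = int(fields[3]) - 1  # GTF is 1-based inclusive; BED is 0-based half-open
    return f"{fields[0]}\t{start}\t{fields[4]}\t{match.group(1)}"


# Assumed toy record; the identifiers are placeholders.
gtf_line = 'chr1\tsrc\tgene\t1001\t2000\t.\t+\t.\tgene_id "G1"; gene_name "ABC1";'
bed_line = gtf_line_to_bed(gtf_line, "gene", "gene_name")
```

Only the start coordinate shifts; the inclusive GTF end equals the exclusive BED end.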

counting.filter_variant_data.intersect_vcf_region(vcf_file, region_file, out_file)[source]#

Perform bedtools intersection of variants with regions.

Parameters:
  • vcf_file (str | Path) – Path to variant BED file.

  • region_file (str | Path) – Path to region BED file.

  • out_file (str | Path) – Output intersection file path.

Return type:

None

counting.filter_variant_data.parse_intersect_region_new(intersect_file, samples=None, use_region_names=False, region_col=None)[source]#

Parse intersection file to DataFrame with typed columns.

Parameters:
  • intersect_file (str | Path) – Path to bedtools intersection output.

  • samples (list[str] | None, optional) – Sample column names to include, by default None.

  • use_region_names (bool, optional) – Use region names from fourth column, by default False.

  • region_col (str | None, optional) – Column name for region. If None, ‘region’ is used.

Returns:

Parsed intersection data with typed columns.

Return type:

pl.DataFrame

counting.filter_variant_data.parse_intersect_region(intersect_file, use_region_names=False, region_col=None)[source]#

Parse intersection file to DataFrame (legacy version).

Parameters:
  • intersect_file (str | Path) – Path to bedtools intersection output.

  • use_region_names (bool, optional) – Use region names from fourth column, by default False.

  • region_col (str | None, optional) – Column name for region. If None, ‘region’ is used.

Returns:

Parsed intersection data.

Return type:

pl.DataFrame

Raises:

ValueError – If BED format is not recognized.

parse_gene_data#

Gene annotation parsing and data management.

class counting.parse_gene_data.ParsedGeneData(gene_df, feature, attribute, parent_attribute)[source]#

Bases: NamedTuple

Parsed gene data from GTF/GFF3 file.

gene_df: DataFrame#

Alias for field number 0

feature: str#

Alias for field number 1

attribute: str#

Alias for field number 2

parent_attribute: str#

Alias for field number 3
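The "alias for field number" entries reflect standard NamedTuple behavior: each field is reachable by name or by index. The sketch below mirrors the four documented fields in a stand-in class (`ParsedGeneDataSketch` is hypothetical, and `gene_df` is typed as `Any` so the example needs no DataFrame library).

```python
from typing import Any, NamedTuple


class ParsedGeneDataSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented field order."""

    gene_df: Any  # a DataFrame in the documented class
    feature: str
    attribute: str
    parent_attribute: str


# Placeholder values chosen only for illustration.
parsed = ParsedGeneDataSketch([], "gene", "gene_name", "Parent")

# Fields can be read by name, by index, or by unpacking.
gene_df, feature, attribute, parent_attribute = parsed
```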

class counting.parse_gene_data.WaspGeneData(gene_file, feature=None, attribute=None, parent_attribute=None)[source]#

Bases: object

Container for gene annotation file paths and configuration.

gene_file#

Path to the gene annotation file.

Type:

str | Path

feature#

Feature type to extract.

Type:

str | None

attribute#

Attribute for region names.

Type:

str | None

parent_attribute#

Parent attribute for hierarchical features.

Type:

str | None

__init__(gene_file, feature=None, attribute=None, parent_attribute=None)[source]#

update_data(data)[source]#

Update attributes with parsed data.

Parameters:

data (ParsedGeneData) – Parsed gene data to update from.

Return type:

None

counting.parse_gene_data.parse_gene_file(gene_file, feature=None, attribute=None, parent_attribute=None)[source]#

Parse GTF/GFF3 gene annotation file.

Parameters:
  • gene_file (str | Path) – Path to GTF/GFF3 file.

  • feature (str | None, optional) – Feature type to extract (auto-detected if None).

  • attribute (str | None, optional) – Attribute for region names (auto-detected if None).

  • parent_attribute (str | None, optional) – Parent attribute for hierarchical features (auto-detected if None).

Returns:

Named tuple with (gene_df, feature, attribute, parent_attribute).

Return type:

ParsedGeneData

counting.parse_gene_data.make_gene_data(gene_file, out_bed, feature=None, attribute=None, parent_attribute=None)[source]#

Parse gene file and create BED for intersection.

Parameters:
  • gene_file (str | Path) – Path to GTF/GFF3 file.

  • out_bed (str | Path) – Output BED file path.

  • feature (str | None, optional) – Feature type to extract.

  • attribute (str | None, optional) – Attribute for region names.

  • parent_attribute (str | None, optional) – Parent attribute for hierarchical features.

Returns:

Container with parsed gene data and configuration.

Return type:

WaspGeneData

counting.parse_gene_data.parse_intersect_genes(intersect_file, attribute=None, parent_attribute=None)[source]#

Parse gene intersection file (legacy version).

Parameters:
  • intersect_file (str | Path) – Path to bedtools intersection output.

  • attribute (str | None, optional) – Attribute column name. If None, ‘ID’ is used.

  • parent_attribute (str | None, optional) – Parent attribute column name. If None, ‘Parent’ is used.

Returns:

Parsed intersection data.

Return type:

pl.DataFrame

counting.parse_gene_data.parse_intersect_genes_new(intersect_file, attribute=None, parent_attribute=None)[source]#

Parse gene intersection file with typed columns.

Parameters:
  • intersect_file (str | Path) – Path to bedtools intersection output.

  • attribute (str | None, optional) – Attribute column name. If None, ‘ID’ is used.

  • parent_attribute (str | None, optional) – Parent attribute column name. If None, ‘Parent’ is used.

Returns:

Parsed intersection data with typed columns.

Return type:

pl.DataFrame

run_counting#

class counting.run_counting.WaspCountFiles(bam_file, variant_file, region_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None, precomputed_vcf_bed=None, precomputed_intersect=None)[source]#

Bases: object

Container for WASP counting pipeline file paths and configuration.

Manages input/output file paths and parsing logic for the variant counting pipeline.

bam_file#

Path to the BAM alignment file.

variant_file#

Path to the variant file (VCF, BCF, or PGEN).

region_file#

Optional path to a region file (BED, GTF, or GFF3).

samples#

List of sample IDs to process, or None for all samples.

use_region_names#

Whether to use region names from the region file.

out_file#

Output file path for count results.

temp_loc#

Directory for temporary files.

is_gene_file#

Whether the region file is a gene annotation file.

gtf_bed#

Path to converted GTF/GFF3 BED file, if applicable.

variant_prefix#

Prefix extracted from variant filename.

vcf_bed#

Path to variant BED file.

skip_vcf_to_bed#

Whether to skip VCF-to-BED conversion.

region_type#

Type of regions (‘regions’ or ‘genes’).

intersect_file#

Path to intersected variant-region file.

skip_intersect#

Whether to skip intersection step.

__init__(bam_file, variant_file, region_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None, precomputed_vcf_bed=None, precomputed_intersect=None)[source]#

counting.run_counting.tempdir_decorator(func)[source]#

Decorator that creates a temporary directory for the wrapped function.

If ‘temp_loc’ is not provided in kwargs, creates a temporary directory and passes it to the function. The directory is cleaned up after execution.

Parameters:

func (Callable[P, T]) – The function to wrap.

Returns:

Wrapped function with automatic temporary directory management.

Return type:

Callable[P, T]
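The pattern described above can be sketched with `functools.wraps` and `tempfile.TemporaryDirectory`. This is a minimal sketch, not the module's source: `tempdir_sketch` and `write_scratch` are hypothetical names, and treating `temp_loc=None` the same as an absent argument is an assumption.

```python
import functools
import os
import tempfile


def tempdir_sketch(func):
    """Hypothetical sketch: supply a temporary directory when temp_loc is absent."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get("temp_loc") is not None:
            return func(*args, **kwargs)  # caller manages the directory
        with tempfile.TemporaryDirectory() as tmpdir:
            kwargs["temp_loc"] = tmpdir
            # The directory is removed when the with-block exits.
            return func(*args, **kwargs)

    return wrapper


@tempdir_sketch
def write_scratch(name, temp_loc=None):
    path = os.path.join(temp_loc, name)
    with open(path, "w") as handle:
        handle.write("scratch")
    return path, os.path.exists(path)


scratch_path, existed_during_call = write_scratch("demo.txt")
```

The scratch file exists while the wrapped function runs and is gone once the call returns, so results that must persist need to be written outside `temp_loc`.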

counting.run_counting.run_count_variants(bam_file, variant_file, region_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None, gene_feature=None, gene_attribute=None, gene_parent=None, use_rust=True, precomputed_vcf_bed=None, precomputed_intersect=None, include_indels=False)[source]#

Run the WASP variant counting pipeline.

Counts allele-specific reads at heterozygous variant positions within optional genomic regions.

Parameters:
  • bam_file (str) – Path to the BAM alignment file.

  • variant_file (str) – Path to the variant file (VCF, BCF, or PGEN).

  • region_file (str | None) – Optional path to a region file (BED, GTF, or GFF3).

  • samples (str | list[str] | None) – Sample ID(s) to process. Can be a single ID, comma-separated string, path to a file with one sample per line, or list of IDs.

  • use_region_names (bool) – Whether to use region names from the region file.

  • out_file (str | None) – Output file path. Defaults to ‘counts.tsv’ in current directory.

  • temp_loc (str | None) – Directory for temporary files. Auto-created if not provided.

  • gene_feature (str | None) – GTF/GFF3 feature type to extract (e.g., ‘gene’, ‘exon’).

  • gene_attribute (str | None) – GTF/GFF3 attribute for region names (e.g., ‘gene_name’).

  • gene_parent (str | None) – GTF/GFF3 parent attribute for hierarchical features.

  • use_rust (bool) – Whether to use the Rust backend for counting (faster).

  • precomputed_vcf_bed (str | None) – Path to pre-computed variant BED file (skips conversion).

  • precomputed_intersect (str | None) – Path to pre-computed intersection file.

  • include_indels (bool) – Whether to include indels in variant counting.

Returns:

None. Results are written to out_file.

Return type:

None
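The samples parameter accepts several input forms. One way to reduce them to a single list is sketched below; `normalize_samples` is a hypothetical helper, and the order of checks (list, then existing file, then comma-separated string) is an assumption rather than the pipeline's documented logic.

```python
import os
import tempfile


def normalize_samples(samples):
    """Hypothetical sketch: reduce the accepted sample inputs to a list or None."""
    if samples is None:
        return None
    if isinstance(samples, list):
        return samples
    if os.path.isfile(samples):
        # File with one sample ID per line; blank lines are skipped.
        with open(samples) as handle:
            return [line.strip() for line in handle if line.strip()]
    # Single ID or comma-separated string.
    return [item.strip() for item in samples.split(",") if item.strip()]


# Assumed toy sample file, written and removed for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("sampleA\nsampleB\n")
from_file = normalize_samples(handle.name)
os.remove(handle.name)
```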

run_counting_sc#

Single-cell variant counting pipeline.

class counting.run_counting_sc.WaspCountSC(bam_file, variant_file, barcode_file, feature_file, samples=None, use_region_names=False, out_file=None, temp_loc=None)[source]#

Bases: object

Container for single-cell WASP counting pipeline configuration.

bam_file#

Path to the BAM alignment file.

Type:

str

variant_file#

Path to the variant file (VCF, BCF, or PGEN).

Type:

str

barcode_file#

Path to cell barcode file.

Type:

str

feature_file#

Optional path to feature/region file.

Type:

str | None

samples#

List of sample IDs to process.

Type:

list[str] | None

out_file#

Output file path for AnnData.

Type:

str

__init__(bam_file, variant_file, barcode_file, feature_file, samples=None, use_region_names=False, out_file=None, temp_loc=None)[source]#

counting.run_counting_sc.run_count_variants_sc(bam_file, variant_file, barcode_file, feature_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None)[source]#

Run single-cell variant counting pipeline.

Parameters:
  • bam_file (str) – Path to the BAM alignment file with cell barcodes.

  • variant_file (str) – Path to the variant file (VCF, BCF, or PGEN).

  • barcode_file (str) – Path to cell barcode file (one barcode per line).

  • feature_file (str | None, optional) – Path to feature/region file (BED, GTF, or GFF3).

  • samples (str | list[str] | None, optional) – Sample ID(s) to process.

  • use_region_names (bool, optional) – Whether to use region names from the feature file.

  • out_file (str | None, optional) – Output file path. Defaults to ‘allele_counts.h5ad’.

  • temp_loc (str | None, optional) – Directory for temporary files.

Returns:

None. Results are written to out_file as AnnData.

Return type:

None
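A barcode file with one barcode per line maps naturally onto the `dict[str, int]` index used during counting. The sketch below shows one way to build such a mapping; `read_barcodes` is a hypothetical helper, and skipping blank lines and repeated barcodes is an assumption.

```python
import io


def read_barcodes(handle):
    """Hypothetical sketch: map each barcode to its first-seen row index."""
    bc_dict = {}
    for line in handle:
        barcode = line.strip()
        if barcode and barcode not in bc_dict:
            bc_dict[barcode] = len(bc_dict)
    return bc_dict


# An in-memory file stands in for the barcode file on disk.
bc_dict = read_barcodes(io.StringIO("AAACCTGA-1\nAAACGGGT-1\n\nAAACCTGA-1\n"))
```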

CLI Entry Point#

counting.__main__.main(ctx, version=False, verbose=False, quiet=False)[source]#

WASP2 allele counting commands.

Return type:

None

counting.__main__.count_variants(bam, variants, samples=None, region_file=None, out_file=None, temp_loc=None, use_region_names=False, gene_feature=None, gene_attribute=None, gene_parent=None, use_rust=True, vcf_bed=None, intersect_bed=None, include_indels=False)[source]#

Return type:

None

counting.__main__.count_variants_sc(bam, variants, barcodes, samples=None, feature_file=None, out_file=None, temp_loc=None)[source]#

Return type:

None