Counting Module API#
The counting module provides functions for allele-specific read counting from BAM files.
count_alleles#
Allele counting functions using Rust-accelerated BAM processing.
- counting.count_alleles.count_snp_alleles_rust(bam_file, chrom, snp_list, threads=None)[source]#
Rust-accelerated version of count_snp_alleles
- Parameters:
- Returns:
List of (chrom, pos, ref_count, alt_count, other_count) tuples
- Return type:
- counting.count_alleles.make_count_df(bam_file, df, use_rust=True)[source]#
Make DataFrame containing all intersections and allele counts.
- Parameters:
- Returns:
DataFrame of counts joined with input intersections.
- Return type:
pl.DataFrame
- Raises:
RuntimeError – If Rust BAM counter is not available.
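As a rough illustration of how the (chrom, pos, ref_count, alt_count, other_count) tuples documented for count_snp_alleles_rust might be consumed downstream, the sketch below computes a per-site reference allele fraction. The count values are made up, and `ref_fraction` is a hypothetical helper, not part of the counting module.

```python
from typing import Optional


def ref_fraction(ref_count: int, alt_count: int) -> Optional[float]:
    """Return ref / (ref + alt), or None when no ref/alt reads were seen."""
    informative = ref_count + alt_count
    if informative == 0:
        return None
    return ref_count / informative


# Made-up counts in the documented (chrom, pos, ref, alt, other) order.
raw_counts = [("chr1", 1000, 6, 2, 0), ("chr1", 2000, 0, 0, 3)]

# Key each site by (chrom, pos); reads matching neither allele are ignored.
fractions = {
    (chrom, pos): ref_fraction(ref, alt)
    for chrom, pos, ref, alt, _other in raw_counts
}
```

Sites covered only by reads matching neither allele yield None here rather than a division error; whether such sites should be dropped or kept is a choice left to the analysis.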
count_alleles_sc#
Single-cell allele counting functions.
- class counting.count_alleles_sc.CountStatsSC[source]#
Bases: object
Container for mutable single-cell counting statistics.
Tracks allele counts and metadata per chromosome during counting.
- counting.count_alleles_sc.make_count_matrix(bam_file, df, bc_dict, include_samples=None, include_features=None)[source]#
Create sparse count matrix from BAM and variant data.
- Parameters:
bam_file (str) – Path to BAM file with cell barcodes.
df (pl.DataFrame) – DataFrame with variant positions from intersection.
bc_dict (dict[str, int]) – Mapping of cell barcodes to integer indices.
include_samples (list[str] | None, optional) – Sample columns to include from variant data, by default None.
include_features (list[str] | None, optional) – Feature columns to include, by default None.
- Returns:
AnnData object with count matrices in layers (ref, alt, other).
- Return type:
ad.AnnData
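The bc_dict argument maps cell barcodes to integer indices. A minimal sketch of building such a mapping from barcode lines is shown below. `build_bc_dict` is a hypothetical helper, and skipping blank lines and duplicate barcodes is an assumption rather than documented behaviour of the module.

```python
import io
from typing import Iterable


def build_bc_dict(lines: Iterable[str]) -> dict[str, int]:
    """Map each distinct barcode to a 0-based index in first-seen order."""
    bc_dict: dict[str, int] = {}
    for line in lines:
        barcode = line.strip()
        # Assumption: blank lines and repeated barcodes are ignored.
        if barcode and barcode not in bc_dict:
            bc_dict[barcode] = len(bc_dict)
    return bc_dict


# In-memory stand-in for a barcode file with one barcode per line.
barcode_text = io.StringIO("AAACCTGA-1\nAAACGGGT-1\n\nAAACCTGA-1\n")
bc_dict = build_bc_dict(barcode_text)
```

Because the indices are assigned contiguously from zero, they could serve directly as row or column positions in a sparse matrix.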
- counting.count_alleles_sc.count_bc_snp_alleles(bam, bc_dict, chrom, snp_list, sc_counts)[source]#
Count alleles at SNP positions for each cell barcode.
- Parameters:
bam (AlignmentFile) – Open BAM file handle.
bc_dict (dict[str, int]) – Mapping of cell barcodes to indices.
chrom (str) – Chromosome to process.
snp_list (Iterator[tuple[int, int, str, str]]) – Iterator of (index, pos, ref, alt) tuples.
sc_counts (CountStatsSC) – Statistics container to update with counts.
- Return type:
filter_variant_data#
Variant data filtering and conversion utilities.
- counting.filter_variant_data.vcf_to_bed(vcf_file, out_bed, samples=None, include_gt=True, include_indels=False, max_indel_len=10)[source]#
Convert variant file to BED format.
Supports VCF, VCF.GZ, BCF, and PGEN formats via the VariantSource API. This is the unified version that replaces the duplicate implementation.
Note: Parameter name ‘vcf_file’ is kept for backward compatibility, but accepts any supported variant format (VCF, BCF, PGEN).
- Parameters:
vcf_file (str | Path) – Path to variant file (VCF, VCF.GZ, BCF, or PGEN).
samples (list[str] | None) – Optional list of sample IDs. If provided, filters to het sites.
include_gt (bool) – Include genotype column in output (default True).
include_indels (bool) – Include indels in addition to SNPs (default False).
max_indel_len (int) – Maximum indel length in bp (default 10).
- Returns:
Path to output BED file as string
- Return type:
str
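The samples parameter is documented as filtering to het sites. The sketch below shows one way a VCF-style genotype string could be tested for heterozygosity. `is_het` is a hypothetical helper written for illustration; it is not the VariantSource API, and how the module itself treats missing or non-diploid genotypes is not stated in this reference.

```python
import re


def is_het(genotype: str) -> bool:
    """Return True for a diploid genotype with two different called alleles."""
    # VCF separates alleles with "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", genotype)
    if len(alleles) != 2 or "." in alleles:
        return False  # not diploid, or at least one allele is missing
    return alleles[0] != alleles[1]


genotypes = ["0/1", "1|0", "0/0", "1/1", "./.", "1/2"]
het_flags = [is_het(gt) for gt in genotypes]
```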
- counting.filter_variant_data.gtf_to_bed(gtf_file, out_bed, feature, attribute)[source]#
Convert GTF/GFF3 file to BED format.
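GTF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so a conversion has to shift the start position by one. The sketch below converts a single GTF record under that rule. `gtf_line_to_bed` is a hypothetical helper, it handles only GTF-style `key "value";` attributes (not GFF3 `key=value`), and it is not claimed to match what gtf_to_bed does internally.

```python
import re
from typing import Optional


def gtf_line_to_bed(line: str, feature: str, attribute: str) -> Optional[tuple]:
    """Convert one GTF record to (chrom, start, end, name), or None if skipped."""
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != feature:
        return None  # malformed record or a different feature type
    # GTF attributes look like: gene_id "G1"; gene_name "ABC";
    match = re.search(rf'{re.escape(attribute)} "([^"]*)"', fields[8])
    if match is None:
        return None  # requested attribute is absent
    # GTF is 1-based inclusive; BED is 0-based half-open, so only start shifts.
    return (fields[0], int(fields[3]) - 1, int(fields[4]), match.group(1))


record = 'chr1\ttest\tgene\t101\t200\t.\t+\t.\tgene_id "G1"; gene_name "ABC";'
bed_row = gtf_line_to_bed(record, feature="gene", attribute="gene_name")
```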
- counting.filter_variant_data.intersect_vcf_region(vcf_file, region_file, out_file)[source]#
Perform bedtools intersection of variants with regions.
- counting.filter_variant_data.parse_intersect_region_new(intersect_file, samples=None, use_region_names=False, region_col=None)[source]#
Parse intersection file to DataFrame with typed columns.
- Parameters:
intersect_file (str | Path) – Path to bedtools intersection output.
samples (list[str] | None, optional) – Sample column names to include, by default None.
use_region_names (bool, optional) – Use region names from fourth column, by default False.
region_col (str | None, optional) – Column name for region, by default ‘region’.
- Returns:
Parsed intersection data with typed columns.
- Return type:
pl.DataFrame
- counting.filter_variant_data.parse_intersect_region(intersect_file, use_region_names=False, region_col=None)[source]#
Parse intersection file to DataFrame (legacy version).
- Parameters:
- Returns:
Parsed intersection data.
- Return type:
pl.DataFrame
- Raises:
ValueError – If BED format is not recognized.
parse_gene_data#
Gene annotation parsing and data management.
- class counting.parse_gene_data.ParsedGeneData(gene_df, feature, attribute, parent_attribute)[source]#
Bases: NamedTuple
Parsed gene data from GTF/GFF3 file.
- gene_df: DataFrame#
Alias for field number 0
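"Alias for field number 0" is the standard description of a named tuple field: the attribute and the positional index refer to the same value. The sketch below illustrates this with a stand-in class. `ParsedGeneDataSketch` is hypothetical; its field order follows the documented signature, while the field types (including `Any` in place of a DataFrame) are assumptions made to keep the example dependency-free.

```python
from typing import Any, NamedTuple, Optional


class ParsedGeneDataSketch(NamedTuple):
    """Stand-in with the same field order as the documented signature."""
    gene_df: Any  # a DataFrame in the real class; Any avoids a dependency here
    feature: Optional[str]
    attribute: Optional[str]
    parent_attribute: Optional[str]


parsed = ParsedGeneDataSketch([], "gene", "gene_name", None)

# Field 0 and the gene_df attribute refer to the same object.
same_object = parsed[0] is parsed.gene_df

# Named tuples unpack positionally, in declaration order.
gene_df, feature, attribute, parent_attribute = parsed
```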
- class counting.parse_gene_data.WaspGeneData(gene_file, feature=None, attribute=None, parent_attribute=None)[source]#
Bases: object
Container for gene annotation file paths and configuration.
- update_data(data)[source]#
Update attributes with parsed data.
- Parameters:
data (ParsedGeneData) – Parsed gene data to update from.
- Return type:
- counting.parse_gene_data.parse_gene_file(gene_file, feature=None, attribute=None, parent_attribute=None)[source]#
Parse GTF/GFF3 gene annotation file.
- Parameters:
gene_file (str | Path) – Path to GTF/GFF3 file.
feature (str | None, optional) – Feature type to extract (auto-detected if None).
attribute (str | None, optional) – Attribute for region names (auto-detected if None).
parent_attribute (str | None, optional) – Parent attribute for hierarchical features (auto-detected if None).
- Returns:
Named tuple with (gene_df, feature, attribute, parent_attribute).
- Return type:
ParsedGeneData
- counting.parse_gene_data.make_gene_data(gene_file, out_bed, feature=None, attribute=None, parent_attribute=None)[source]#
Parse gene file and create BED for intersection.
- Parameters:
- Returns:
Container with parsed gene data and configuration.
- Return type:
- counting.parse_gene_data.parse_intersect_genes(intersect_file, attribute=None, parent_attribute=None)[source]#
Parse gene intersection file (legacy version).
- Parameters:
- Returns:
Parsed intersection data.
- Return type:
pl.DataFrame
run_counting#
- class counting.run_counting.WaspCountFiles(bam_file, variant_file, region_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None, precomputed_vcf_bed=None, precomputed_intersect=None)[source]#
Bases: object
Container for WASP counting pipeline file paths and configuration.
Manages input/output file paths and parsing logic for the variant counting pipeline.
- bam_file#
Path to the BAM alignment file.
- variant_file#
Path to the variant file (VCF, BCF, or PGEN).
- region_file#
Optional path to a region file (BED, GTF, or GFF3).
- samples#
List of sample IDs to process, or None for all samples.
- use_region_names#
Whether to use region names from the region file.
- out_file#
Output file path for count results.
- temp_loc#
Directory for temporary files.
- is_gene_file#
Whether the region file is a gene annotation file.
- gtf_bed#
Path to converted GTF/GFF3 BED file, if applicable.
- variant_prefix#
Prefix extracted from variant filename.
- vcf_bed#
Path to variant BED file.
- skip_vcf_to_bed#
Whether to skip VCF-to-BED conversion.
- region_type#
Type of regions (‘regions’ or ‘genes’).
- intersect_file#
Path to intersected variant-region file.
- skip_intersect#
Whether to skip intersection step.
- counting.run_counting.tempdir_decorator(func)[source]#
Decorator that creates a temporary directory for the wrapped function.
If ‘temp_loc’ is not provided in kwargs, creates a temporary directory and passes it to the function. The directory is cleaned up after execution.
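The behaviour described above can be sketched with the standard library. This is an illustration of the pattern, not the module's source: `tempdir_sketch` and `write_scratch` are hypothetical names, and treating `temp_loc=None` the same as a missing keyword is an assumption. A `temp_loc` passed positionally is not detected by this sketch.

```python
import functools
import os
import tempfile


def tempdir_sketch(func):
    """Supply a temporary 'temp_loc' when the caller does not pass one."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get("temp_loc") is not None:
            return func(*args, **kwargs)  # caller chose the directory
        # The directory and its contents are removed when the block exits.
        with tempfile.TemporaryDirectory() as tmpdir:
            kwargs["temp_loc"] = tmpdir
            return func(*args, **kwargs)
    return wrapper


@tempdir_sketch
def write_scratch(name, temp_loc=None):
    """Toy pipeline step: write a scratch file and report where it went."""
    path = os.path.join(temp_loc, name)
    with open(path, "w") as handle:
        handle.write("scratch")
    return path, os.path.exists(path)


scratch_path, existed_during_call = write_scratch("counts.tmp")
exists_after_call = os.path.exists(scratch_path)
```

The scratch file exists while the wrapped function runs and is gone once it returns, which is why anything that must outlive the call has to be written outside `temp_loc`.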
- counting.run_counting.run_count_variants(bam_file, variant_file, region_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None, gene_feature=None, gene_attribute=None, gene_parent=None, use_rust=True, precomputed_vcf_bed=None, precomputed_intersect=None, include_indels=False)[source]#
Run the WASP variant counting pipeline.
Counts allele-specific reads at heterozygous variant positions within optional genomic regions.
- Parameters:
bam_file (str) – Path to the BAM alignment file.
variant_file (str) – Path to the variant file (VCF, BCF, or PGEN).
region_file (str | None) – Optional path to a region file (BED, GTF, or GFF3).
samples (str | list[str] | None) – Sample ID(s) to process. Can be a single ID, comma-separated string, path to a file with one sample per line, or list of IDs.
use_region_names (bool) – Whether to use region names from the region file.
out_file (str | None) – Output file path. Defaults to ‘counts.tsv’ in current directory.
temp_loc (str | None) – Directory for temporary files. Auto-created if not provided.
gene_feature (str | None) – GTF/GFF3 feature type to extract (e.g., ‘gene’, ‘exon’).
gene_attribute (str | None) – GTF/GFF3 attribute for region names (e.g., ‘gene_name’).
gene_parent (str | None) – GTF/GFF3 parent attribute for hierarchical features.
use_rust (bool) – Whether to use the Rust backend for counting (faster).
precomputed_vcf_bed (str | None) – Path to pre-computed variant BED file (skips conversion).
precomputed_intersect (str | None) – Path to pre-computed intersection file.
include_indels (bool) – Whether to include indels in variant counting.
- Returns:
None. Results are written to out_file.
- Return type:
None
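The samples parameter accepts several shapes: a single ID, a comma-separated string, a path to a file with one sample per line, or a list of IDs. The sketch below shows one way such an argument could be normalised to a list. `normalize_samples` is a hypothetical helper, and the order in which the cases are checked (list, then existing file, then comma-separated string) is an assumption, not the pipeline's documented logic.

```python
import os
import tempfile
from typing import Optional, Union


def normalize_samples(samples: Union[str, list, None]) -> Optional[list]:
    """Turn the accepted input shapes into a list of sample IDs (or None)."""
    if samples is None:
        return None
    if isinstance(samples, list):
        return list(samples)
    if os.path.isfile(samples):
        # One sample ID per line; blank lines are skipped.
        with open(samples) as handle:
            return [line.strip() for line in handle if line.strip()]
    # A single ID is just a comma-separated string with one element.
    return [s.strip() for s in samples.split(",") if s.strip()]


# Exercise the file case with a throwaway sample list.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "samples.txt")
    with open(sample_path, "w") as handle:
        handle.write("NA12878\nNA12891\n")
    from_file = normalize_samples(sample_path)

from_single = normalize_samples("NA12878")
from_csv = normalize_samples("NA12878,NA12891")
from_list = normalize_samples(["NA12878"])
```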
run_counting_sc#
Single-cell variant counting pipeline.
- class counting.run_counting_sc.WaspCountSC(bam_file, variant_file, barcode_file, feature_file, samples=None, use_region_names=False, out_file=None, temp_loc=None)[source]#
Bases: object
Container for single-cell WASP counting pipeline configuration.
- counting.run_counting_sc.run_count_variants_sc(bam_file, variant_file, barcode_file, feature_file=None, samples=None, use_region_names=False, out_file=None, temp_loc=None)[source]#
Run single-cell variant counting pipeline.
- Parameters:
bam_file (str) – Path to the BAM alignment file with cell barcodes.
variant_file (str) – Path to the variant file (VCF, BCF, or PGEN).
barcode_file (str) – Path to cell barcode file (one barcode per line).
feature_file (str | None, optional) – Path to feature/region file (BED, GTF, or GFF3).
samples (str | list[str] | None, optional) – Sample ID(s) to process.
use_region_names (bool, optional) – Whether to use region names from the feature file.
out_file (str | None, optional) – Output file path. Defaults to ‘allele_counts.h5ad’.
temp_loc (str | None, optional) – Directory for temporary files.
- Returns:
Results are written to out_file as AnnData.
- Return type:
None
CLI Entry Point#
- counting.__main__.main(ctx, version=False, verbose=False, quiet=False)[source]#
WASP2 allele counting commands.
- Return type: