Allele Counting Algorithm#

This document describes how WASP2 assigns reads to reference and alternate alleles at heterozygous variant sites.

Overview#

The allele counting algorithm forms the foundation of allele-specific analysis. For each heterozygous SNP, WASP2 examines aligned reads and counts how many support the reference allele, alternate allele, or neither.

Biological Rationale#

In a diploid organism with a heterozygous site (genotype A/G), reads originating from each chromosome should carry the corresponding allele. Under the null hypothesis of no allelic imbalance, we expect equal representation of both alleles:

\[E[\text{ref\_count}] = E[\text{alt\_count}] = \frac{N}{2}\]

where \(N\) is the total number of reads covering the variant.

Deviations from this expectation may indicate:

  • Allele-specific expression (ASE): Differential transcription between alleles

  • Allele-specific chromatin accessibility: Differential regulatory activity

  • Allele-specific binding: Differential protein-DNA interactions

  • Technical artifacts: Mapping bias, amplification bias

Algorithm Details#

Position-Based Alignment#

For each variant position, WASP2 queries the BAM file to retrieve all reads overlapping that genomic coordinate:

  1. Coordinate lookup: Use BAM index to efficiently retrieve reads at position

  2. CIGAR parsing: Walk through the CIGAR string to find the query position corresponding to the reference position

  3. Base extraction: Extract the nucleotide at the query position

# Simplified pseudocode
for read in bam.fetch(chrom, pos, pos + 1):
    query_pos = find_aligned_position(read, pos)
    if query_pos is not None:
        base = read.query_sequence[query_pos]
        if base == ref_allele:
            ref_count += 1
        elif base == alt_allele:
            alt_count += 1
        else:
            other_count += 1

CIGAR-Aware Position Finding#

The alignment between query (read) and reference positions is not always 1:1 due to insertions, deletions, and soft clipping. WASP2 uses the aligned pairs from the CIGAR string:

CIGAR Operations and Position Mapping#

CIGAR Operation

Consumes Reference

Consumes Query

M (match/mismatch)

Yes

Yes

I (insertion)

No

Yes

D (deletion)

Yes

No

S (soft clip)

No

Yes

H (hard clip)

No

No

N (skip)

Yes

No

For a read with CIGAR 50M2D30M:

  • Positions 0-49 in the reference align to positions 0-49 in the query

  • Positions 50-51 in the reference have no query alignment (deletion)

  • Positions 52-81 in the reference align to positions 50-79 in the query

Handling Edge Cases#

Deletions spanning the variant

If the variant position falls within a deletion, the read cannot be assigned to either allele and is counted as “other”.

Insertions adjacent to the variant

Insertions can shift the query position. The algorithm correctly handles this by using aligned pairs rather than simple arithmetic.

Soft-clipped reads

Soft-clipped bases at read ends do not consume reference positions but are present in the query sequence. The algorithm accounts for this.

Rust Acceleration#

WASP2 uses a Rust-accelerated BAM counter for performance:

from wasp2_rust import BamCounter

counter = BamCounter("sample.bam")
regions = [("chr1", 12345, "A", "G"), ("chr1", 12400, "C", "T")]
counts = counter.count_alleles(regions, min_qual=0, threads=4)

The Rust implementation provides:

  • Parallel processing: Count multiple regions simultaneously

  • Memory efficiency: Stream through BAM without loading all reads

  • htslib integration: Direct access to BAM index for efficient queries

Performance Characteristics#

Counting Performance#

Dataset Size

Python (pysam)

Rust (wasp2_rust)

10,000 SNPs

~45 seconds

~5 seconds

100,000 SNPs

~7 minutes

~40 seconds

1,000,000 SNPs

~70 minutes

~6 minutes

Benchmarks on typical ATAC-seq data with 100M reads, single thread.

Quality Filtering#

Optional base quality filtering can exclude low-confidence base calls:

\[Q = -10 \log_{10}(P_{\text{error}})\]

By default, WASP2 uses min_qual=0 (no filtering) to match legacy WASP behavior. For stringent analysis, min_qual=20 (1% error rate) is recommended.

Output Format#

The counting algorithm produces a table with the following columns:

Count Output Columns#

Column

Type

Description

chrom

string

Chromosome name

pos

int

1-based genomic position

ref

string

Reference allele

alt

string

Alternate allele

ref_count

int

Reads supporting reference allele

alt_count

int

Reads supporting alternate allele

other_count

int

Reads with neither allele (errors, indels)

region

string

Associated region (peak, gene) if provided

Region Assignment#

When a BED file of regions (peaks, genes) is provided, variants are associated with overlapping regions:

  1. Intersection: Use bedtools or coitrees to find variant-region overlaps

  2. Aggregation: Sum counts across all variants within each region

  3. Statistical testing: Perform imbalance analysis at the region level

This aggregation increases statistical power for detecting allelic imbalance, as individual SNPs often have low coverage.

See Also#