Allele Counting Algorithm
=========================

This document describes how WASP2 assigns reads to reference and alternate
alleles at heterozygous variant sites.

.. contents:: Contents
   :local:
   :depth: 2

Overview
--------

The allele counting algorithm forms the foundation of allele-specific analysis.
For each heterozygous SNP, WASP2 examines aligned reads and counts how many
support the reference allele, alternate allele, or neither.

Biological Rationale
--------------------

In a diploid organism with a heterozygous site (genotype A/G), reads originating
from each chromosome should carry the corresponding allele. Under the null
hypothesis of no allelic imbalance, we expect equal representation of both
alleles:

.. math::

   E[\text{ref\_count}] = E[\text{alt\_count}] = \frac{N}{2}

where :math:`N` is the total number of reads covering the variant.

Deviations from this expectation may indicate:

- **Allele-specific expression (ASE)**: Differential transcription between alleles
- **Allele-specific chromatin accessibility**: Differential regulatory activity
- **Allele-specific binding**: Differential protein-DNA interactions
- **Technical artifacts**: Mapping bias, amplification bias

Algorithm Details
-----------------

Position-Based Alignment
^^^^^^^^^^^^^^^^^^^^^^^^

For each variant position, WASP2 queries the BAM file to retrieve all reads
overlapping that genomic coordinate:

1. **Coordinate lookup**: Use BAM index to efficiently retrieve reads at position
2. **CIGAR parsing**: Walk through the CIGAR string to find the query position
   corresponding to the reference position
3. **Base extraction**: Extract the nucleotide at the query position

.. code-block:: python

   # Simplified pseudocode
   for read in bam.fetch(chrom, pos, pos + 1):
       query_pos = find_aligned_position(read, pos)
       if query_pos is not None:
           base = read.query_sequence[query_pos]
           if base == ref_allele:
               ref_count += 1
           elif base == alt_allele:
               alt_count += 1
           else:
               other_count += 1

CIGAR-Aware Position Finding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The alignment between query (read) and reference positions is not always 1:1
due to insertions, deletions, and soft clipping. WASP2 uses the aligned pairs
from the CIGAR string:

.. table:: CIGAR Operations and Position Mapping
   :widths: 20 40 40

   ==================== =========================== =======================
   CIGAR Operation      Consumes Reference          Consumes Query
   ==================== =========================== =======================
   M (match/mismatch)   Yes                         Yes
   I (insertion)        No                          Yes
   D (deletion)         Yes                         No
   S (soft clip)        No                          Yes
   H (hard clip)        No                          No
   N (skip)             Yes                         No
   ==================== =========================== =======================

For a read with CIGAR ``50M2D30M``:

- Positions 0-49 in the reference align to positions 0-49 in the query
- Positions 50-51 in the reference have no query alignment (deletion)
- Positions 52-81 in the reference align to positions 50-79 in the query

Handling Edge Cases
^^^^^^^^^^^^^^^^^^^

**Deletions spanning the variant**
  If the variant position falls within a deletion, the read cannot be assigned
  to either allele and is counted as "other".

**Insertions adjacent to the variant**
  Insertions can shift the query position. The algorithm correctly handles
  this by using aligned pairs rather than simple arithmetic.

**Soft-clipped reads**
  Soft-clipped bases at read ends do not consume reference positions but
  are present in the query sequence. The algorithm accounts for this.

Rust Acceleration
-----------------

WASP2 uses a Rust-accelerated BAM counter for performance:

.. code-block:: python

   from wasp2_rust import BamCounter

   counter = BamCounter("sample.bam")
   regions = [("chr1", 12345, "A", "G"), ("chr1", 12400, "C", "T")]
   counts = counter.count_alleles(regions, min_qual=0, threads=4)

The Rust implementation provides:

- **Parallel processing**: Count multiple regions simultaneously
- **Memory efficiency**: Stream through BAM without loading all reads
- **htslib integration**: Direct access to BAM index for efficient queries

Performance Characteristics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. table:: Counting Performance
   :widths: 30 35 35

   =================== ======================== ========================
   Dataset Size        Python (pysam)           Rust (wasp2_rust)
   =================== ======================== ========================
   10,000 SNPs         ~45 seconds              ~5 seconds
   100,000 SNPs        ~7 minutes               ~40 seconds
   1,000,000 SNPs      ~70 minutes              ~6 minutes
   =================== ======================== ========================

*Benchmarks on typical ATAC-seq data with 100M reads, single thread.*

Quality Filtering
-----------------

Optional base quality filtering can exclude low-confidence base calls:

.. math::

   Q = -10 \log_{10}(P_{\text{error}})

By default, WASP2 uses ``min_qual=0`` (no filtering) to match legacy WASP
behavior. For stringent analysis, ``min_qual=20`` (1% error rate) is recommended.

Output Format
-------------

The counting algorithm produces a table with the following columns:

.. table:: Count Output Columns
   :widths: 20 20 60

   ============= ======== ============================================
   Column        Type     Description
   ============= ======== ============================================
   chrom         string   Chromosome name
   pos           int      1-based genomic position
   ref           string   Reference allele
   alt           string   Alternate allele
   ref_count     int      Reads supporting reference allele
   alt_count     int      Reads supporting alternate allele
   other_count   int      Reads with neither allele (errors, indels)
   region        string   Associated region (peak, gene) if provided
   ============= ======== ============================================

Region Assignment
-----------------

When a BED file of regions (peaks, genes) is provided, variants are associated
with overlapping regions:

1. **Intersection**: Use bedtools or coitrees to find variant-region overlaps
2. **Aggregation**: Sum counts across all variants within each region
3. **Statistical testing**: Perform imbalance analysis at the region level

This aggregation increases statistical power for detecting allelic imbalance,
as individual SNPs often have low coverage.

See Also
--------

- :doc:`statistical_models` - Statistical testing of allele counts