Counting Module
===============

Overview
--------

The counting module quantifies allele-specific read counts at heterozygous SNP positions. It's the first step in allelic imbalance analysis.

Purpose
~~~~~~~

* Count reads supporting reference vs alternate alleles
* Filter by sample genotype (heterozygous sites)
* Annotate with genomic regions (genes, peaks)
* Support single-cell RNA-seq

When to Use
~~~~~~~~~~~

Use counting when you want to quantify allele-specific expression and have:

* Aligned reads (BAM file)
* Variant calls (VCF file)

CLI Usage
---------

Basic Command
~~~~~~~~~~~~~

.. code-block:: bash

   wasp2-count count-variants BAM_FILE VCF_FILE

Full Options
~~~~~~~~~~~~

.. code-block:: bash

   wasp2-count count-variants \
     input.bam \
     variants.vcf \
     --samples sample1,sample2 \
     --region genes.gtf \
     --out_file counts.tsv

Input Requirements
------------------

BAM File
~~~~~~~~

* Aligned reads (single-end or paired-end)
* Indexed (.bai file in same directory)
* Sorted by coordinate
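
The requirements above can be checked, and fixed if needed, before counting. The following is a minimal sketch that assumes ``samtools`` is on ``PATH``; the helper names ``has_bam_index`` and ``prepare_bam`` and the file names are hypothetical and are not part of WASP2.

.. code-block:: bash

   # Return 0 if an index sits next to the BAM (sample.bam.bai or sample.bai).
   has_bam_index() {
       [ -f "${1}.bai" ] || [ -f "${1%.bam}.bai" ]
   }

   # Sort by coordinate, then index the sorted file (requires samtools).
   # Usage: prepare_bam INPUT.bam OUTPUT.sorted.bam
   prepare_bam() {
       samtools sort -o "$2" "$1" && samtools index "$2"
   }

   # Example with placeholder file names:
   #   has_bam_index sample.bam || prepare_bam sample.bam sample.sorted.bam

After ``prepare_bam``, pass the sorted file to ``count-variants`` rather than the original BAM.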

VCF File
~~~~~~~~

* Variant calls with genotype information
* Heterozygous SNPs (GT=0|1 or 1|0)
* Can include sample-specific genotypes
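
To see how many usable sites a VCF holds for one sample, the heterozygous SNP records can be counted directly. This is a rough sketch and not how WASP2 itself selects sites: it assumes an uncompressed VCF, counts only records with single-base ``REF`` and ``ALT``, and treats unphased ``0/1`` the same as phased ``0|1``. The function name ``count_het_snps`` is hypothetical.

.. code-block:: bash

   # Count single-base SNP records where SAMPLE is heterozygous.
   # Usage: count_het_snps variants.vcf SAMPLE
   count_het_snps() {
       awk -F'\t' -v sample="$2" '
           /^##/ { next }                        # skip meta-information lines
           /^#CHROM/ {                           # find the sample column
               for (i = 10; i <= NF; i++) if ($i == sample) col = i
               next
           }
           col && length($4) == 1 && length($5) == 1 {
               split($col, f, ":")               # GT is the first FORMAT field
               gt = f[1]
               gsub(/\|/, "/", gt)               # ignore phasing for the test
               if (gt == "0/1" || gt == "1/0") n++
           }
           END { print n + 0 }
       ' "$1"
   }

A result of ``0`` for a sample that should have heterozygous sites usually points to a misspelled sample name or a VCF without a ``GT`` field.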

Optional: Region File
~~~~~~~~~~~~~~~~~~~~~

Annotate SNPs overlapping genes/peaks:

* GTF/GFF3 format (genes)
* BED format (peaks, regions)
* narrowPeak format (ATAC-seq, ChIP-seq)

Parameters
----------

``--samples`` / ``-s``
~~~~~~~~~~~~~~~~~~~~~~

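Restrict counting to SNPs that are heterozygous in the specified samples, given as a comma-separated list or as a file with one sample name per line: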

.. code-block:: bash

   --samples sample1,sample2,sample3
   # or
   --samples samples.txt  # one per line

``--region`` / ``-r``
~~~~~~~~~~~~~~~~~~~~~

Annotate SNPs with overlapping regions:

.. code-block:: bash

   --region genes.gtf      # Gene annotations
   --region peaks.bed      # ATAC-seq peaks
   --region regions.gff3   # Custom regions

``--out_file`` / ``-o``
~~~~~~~~~~~~~~~~~~~~~~~

Output file path (default: ``counts.tsv``):

.. code-block:: bash

   --out_file my_counts.tsv

Output Format
-------------

Tab-separated file with columns:

Basic Columns
~~~~~~~~~~~~~

* ``chr``: Chromosome
* ``pos``: SNP position (1-based)
* ``ref``: Reference allele
* ``alt``: Alternate allele
* ``ref_count``: Reads supporting reference
* ``alt_count``: Reads supporting alternate
* ``other_count``: Reads supporting other alleles
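
As a quick look at the counts before running the analysis step, the reference allele fraction ``ref_count / (ref_count + alt_count)`` can be computed per SNP. The sketch below assumes the first line of the output file is a header carrying the column names listed above; the function name ``ref_fraction`` and the default minimum of 10 reads are illustrative choices, not WASP2 defaults.

.. code-block:: bash

   # Print chr, pos, total reads and reference fraction for covered SNPs.
   # Usage: ref_fraction counts.tsv [MIN_TOTAL]
   ref_fraction() {
       awk -F'\t' -v OFS='\t' -v min="${2:-10}" '
           NR == 1 {                             # map header names to columns
               for (i = 1; i <= NF; i++) idx[$i] = i
               print "chr", "pos", "total", "ref_fraction"
               next
           }
           {
               ref = $idx["ref_count"]
               alt = $idx["alt_count"]
               total = ref + alt
               if (total > 0 && total >= min)
                   printf "%s\t%s\t%d\t%.3f\n", $idx["chr"], $idx["pos"], total, ref / total
           }
       ' "$1"
   }

This is only a descriptive summary. Testing whether a fraction departs from 0.5 is the job of the analysis step.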

Optional Columns (with ``--region``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``gene_id``: Overlapping gene
* ``gene_name``: Gene symbol
* ``feature``: Feature type (exon, intron, etc.)

Example Workflow
----------------

1. Basic Counting
~~~~~~~~~~~~~~~~~

.. code-block:: bash

   wasp2-count count-variants sample.bam variants.vcf

2. Filter by Sample
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   wasp2-count count-variants \
     sample.bam \
     variants.vcf \
     --samples NA12878

3. Annotate with Genes
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   wasp2-count count-variants \
     sample.bam \
     variants.vcf \
     --samples NA12878 \
     --region genes.gtf \
     --out_file counts_annotated.tsv

Single-Cell Counting
--------------------

For single-cell RNA-seq:

.. code-block:: bash

   wasp2-count count-variants-sc \
     sc_rnaseq.bam \
     variants.vcf \
     --barcode_map barcodes.tsv

Output includes cell-type-specific counts.

Common Issues
-------------

Low Count Numbers
~~~~~~~~~~~~~~~~~

* Check BAM file coverage (``samtools depth``)
* Verify VCF contains heterozygous SNPs
* Ensure BAM and VCF use same reference genome
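
A frequent symptom of mismatched references is different contig naming, for example ``chr1`` in the BAM and ``1`` in the VCF. The sketch below compares the two name sets. It assumes ``samtools`` is available for reading the BAM header and that the VCF is uncompressed; the helper names and output file names are hypothetical.

.. code-block:: bash

   # Print contig names from SAM header text on stdin (@SQ lines, SN: tag).
   sam_contigs() {
       awk -F'\t' '$1 == "@SQ" {
           for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
       }' | sort -u
   }

   # Print contig names used by the records of an uncompressed VCF.
   vcf_contigs() {
       grep -v '^#' "$1" | cut -f1 | sort -u
   }

   # Example with placeholder file names (needs samtools):
   #   samtools view -H sample.bam | sam_contigs > bam_contigs.txt
   #   vcf_contigs variants.vcf > vcf_contigs.txt
   #   comm -13 bam_contigs.txt vcf_contigs.txt   # VCF contigs absent from the BAM

Matching names do not prove that both files use the same genome build, but names that do not match mean no reads can be assigned to those variants.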

No Output SNPs
~~~~~~~~~~~~~~

* Check whether the ``--samples`` filter is too restrictive
* Verify VCF has genotype information (GT field)
* Ensure BAM file is indexed
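
An empty result is often caused by a sample name that does not appear in the VCF header. As a sketch, assuming an uncompressed VCF and a ``samples.txt`` file with one name per line, the requested names can be checked against the header. The function names ``vcf_samples`` and ``missing_samples`` are hypothetical.

.. code-block:: bash

   # Print the sample names from the #CHROM header line, one per line.
   vcf_samples() {
       awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
   }

   # Print names listed in SAMPLES_FILE that are absent from the VCF header.
   # Usage: missing_samples variants.vcf samples.txt
   missing_samples() {
       vcf_samples "$1" | grep -vxFf - "$2" || true
   }

Any name printed by ``missing_samples`` should be corrected or removed before rerunning with ``--samples``.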

Next Steps
----------

After counting:

* :doc:`analysis` - Detect allelic imbalance
* :doc:`mapping` - Correct reference bias with WASP