Analysis Module
===============

Overview
--------

The analysis module detects statistically significant allelic imbalance using beta-binomial models.

Purpose
-------

* Detect allelic imbalance at genomic regions
* Control for biological and technical variation
* Support single-cell and bulk RNA-seq
* Compare imbalance between groups/conditions

Statistical Models
------------------

Beta-Binomial Model
~~~~~~~~~~~~~~~~~~~

WASP2 uses beta-binomial distribution to model:
* Overdispersion (variation beyond binomial)
* Biological variability between regions
* Technical noise in sequencing

The model:
* Null hypothesis: Equal expression from both alleles (p=0.5)
* Alternative: Allelic imbalance (p ≠ 0.5)
* FDR correction for multiple testing

Dispersion Parameter
~~~~~~~~~~~~~~~~~~~~

Two models:
1. **Single**: One dispersion parameter for all regions
2. **Linear**: Dispersion varies with read depth

CLI Usage
---------

Basic Analysis
~~~~~~~~~~~~~~

.. code-block:: bash

   wasp2-analyze find-imbalance counts.tsv

Options
~~~~~~~

.. code-block:: bash

   wasp2-analyze find-imbalance \
     counts.tsv \
     --min-count 10 \
     --pseudocount 1 \
     --model single \
     --output results.tsv

Parameters
----------

``--min-count``
~~~~~~~~~~~~~~~

Minimum total read count per region (default: 10):

.. code-block:: bash

   --min-count 20  # More stringent

``--pseudocount``
~~~~~~~~~~~~~~~~~

Pseudocount added to avoid zero counts (default: 1):

.. code-block:: bash

   --pseudocount 0  # No pseudocount

``--model``
~~~~~~~~~~~

Dispersion model (default: single):

.. code-block:: bash

   --model linear  # Depth-dependent dispersion

``--phased``
~~~~~~~~~~~~

Use phased genotype information:

.. code-block:: bash

   --phased  # Requires phased VCF

Output Format
-------------

Tab-separated file with columns:

Statistical Columns
~~~~~~~~~~~~~~~~~~~

* ``region``: Genomic region identifier
* ``ref_count``: Total reference allele counts
* ``alt_count``: Total alternate allele counts
* ``p_value``: Likelihood ratio test p-value
* ``fdr_pval``: FDR-corrected p-value
* ``effect_size``: Log2 fold-change (ref/alt)

Model Parameters
~~~~~~~~~~~~~~~~

* ``dispersion``: Beta-binomial dispersion parameter
* ``log_likelihood_null``: Null model log-likelihood
* ``log_likelihood_alt``: Alternative model log-likelihood

Interpreting Results
--------------------

Significant Imbalance
~~~~~~~~~~~~~~~~~~~~~

FDR < 0.05 indicates significant imbalance:

* **Biological**: cis-regulatory variation, ASE
* **Technical**: mapping bias (check WASP), PCR artifacts

Effect Size
~~~~~~~~~~~

* log2FC > 1: Strong imbalance (2-fold difference)
* log2FC > 2: Very strong imbalance (4-fold difference)

Single-Cell Analysis
--------------------

For single-cell data, WASP2 detects allelic imbalance within specific cell populations
using aggregated counts across cells of the same type.

.. code-block:: bash

   wasp2-analyze find-imbalance-sc \
     adata.h5ad \
     barcode_map.tsv \
     --sample donor1 \
     --min-count 10

Output: Per-celltype TSV files (``ai_results_[CELLTYPE].tsv``).

Single-Cell Comparative Imbalance
---------------------------------

Overview
~~~~~~~~

The comparative imbalance analysis detects **differential allelic imbalance** between
cell types, conditions, or biological groups. This is useful for identifying:

* Cell-type-specific regulatory variation
* Sex differences in chromatin accessibility
* Condition-dependent allelic effects (e.g., treatment vs control)
* Developmental stage-specific imbalance

Statistical Model
~~~~~~~~~~~~~~~~~

The analysis uses a **likelihood ratio test (LRT)** comparing two hypotheses:

* **Null (H0)**: Both groups share the same allelic imbalance (μ_combined)
* **Alternative (H1)**: Groups have different imbalance (μ₁ ≠ μ₂)

The test statistic follows a chi-squared distribution with 1 degree of freedom:

.. code-block:: text

   LRT = -2 × (log L_null - log L_alt)
   p-value = P(χ²(df=1) > LRT)

Input Format: Count Matrix (.h5ad)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The count matrix must be an AnnData object with the following structure:

.. code-block:: text

   AnnData object (n_obs × n_vars)
   ├── .obs                 # SNP metadata (rows)
   │   ├── index            # SNP identifiers (0, 1, 2, ...)
   │   └── [sample_name]    # Genotypes: '0|1', '1|0', '0/1', '1/0'
   │
   ├── .var                 # Cell metadata (columns)
   │   └── group            # Cell type/group assignment
   │
   ├── .layers
   │   ├── "ref"            # Reference allele counts (sparse matrix)
   │   └── "alt"            # Alternate allele counts (sparse matrix)
   │
   └── .uns
       ├── feature          # DataFrame: SNP → region mapping
       └── samples          # List of sample names

**Example count matrix creation:**

.. code-block:: bash

   # Generate counts from BAM + variants + barcodes
   wasp2-count count-variants-sc \
     aligned.bam \
     variants.vcf.gz \
     barcodes.txt \
     --samples NA12878 \
     --feature peaks.bed \
     --out_file allele_counts.h5ad

Input Format: Barcode Map (TSV)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A two-column TSV file (no header) mapping cell barcodes to groups:

.. code-block:: text

   AAACGAACAGTCAGTT-1    excitatory_neurons
   AAACGAAGTCGCTCTA-1    inhibitory_neurons
   AAACGAAGTGAACCTA-1    excitatory_neurons
   AAAGGATCATCGATGT-1    astrocytes
   AAAGGATGTGCAACGA-1    microglia

**Requirements:**

* Tab-separated (``\t``)
* No header row
* Barcodes must match those in the count matrix
* Groups can be cell types, conditions, sex, or any categorical variable

Basic Usage
~~~~~~~~~~~

Compare imbalance between all groups:

.. code-block:: bash

   wasp2-analyze compare-imbalance \
     allele_counts.h5ad \
     barcode_map.tsv

Compare specific groups only:

.. code-block:: bash

   wasp2-analyze compare-imbalance \
     allele_counts.h5ad \
     barcode_map.tsv \
     --groups "excitatory_neurons,inhibitory_neurons"

Output Format
~~~~~~~~~~~~~

Results are written to ``ai_results_[GROUP1]_[GROUP2].tsv``:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Column
     - Description
   * - ``region``
     - Genomic region identifier (peak or gene)
   * - ``num_snps``
     - Number of shared heterozygous SNPs in region
   * - ``combined_mu``
     - Reference allele frequency under null hypothesis (shared)
   * - ``mu1``
     - Reference allele frequency in group 1
   * - ``mu2``
     - Reference allele frequency in group 2
   * - ``null_ll``
     - Log-likelihood under null (shared μ)
   * - ``alt_ll``
     - Log-likelihood under alternative (separate μ values)
   * - ``pval``
     - Likelihood ratio test p-value
   * - ``fdr_pval``
     - FDR-corrected p-value (Benjamini-Hochberg)

**Interpreting results:**

* ``fdr_pval < 0.05``: Significant differential imbalance
* ``|mu1 - mu2| > 0.1``: Meaningful effect size (~20% difference)
* ``mu < 0.5``: Alternate allele favored; ``mu > 0.5``: Reference allele favored

Parameters
~~~~~~~~~~

``--groups``
   Comma-separated list of groups to compare. If omitted, compares all pairwise
   combinations found in the barcode map.

``--min``
   Minimum total allele count per region per group (default: 10). Higher values
   increase statistical power but reduce the number of testable regions.

``--pseudocount``
   Pseudocount added to avoid zero counts (default: 1). Affects dispersion estimation.

``--sample``
   Sample name for heterozygous SNP filtering. Required if multiple samples are
   present in the count matrix.

``--phased``
   Use phased genotype information from VCF. Requires genotypes in ``0|1`` or ``1|0``
   format. Improves power when haplotype phase is known.

``-z/--z_cutoff``
   Remove SNPs with counts exceeding this z-score threshold. Useful for removing
   outliers caused by mapping artifacts or copy number variation.

Example: Sex Differences Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Identify chromatin accessibility regions with sex-biased allelic imbalance:

**Step 1: Prepare barcode map with sex labels**

.. code-block:: text

   # barcode_sex_map.tsv
   AAACGAACAGTCAGTT-1    male
   AAACGAAGTCGCTCTA-1    female
   AAACGAAGTGAACCTA-1    male
   AAAGGATCATCGATGT-1    female

**Step 2: Run comparative analysis**

.. code-block:: bash

   wasp2-analyze compare-imbalance \
     allele_counts.h5ad \
     barcode_sex_map.tsv \
     --groups "male,female" \
     --min 20 \
     --out_file ai_results_sex_comparison.tsv

**Step 3: Filter significant results**

.. code-block:: bash

   # Extract regions with significant sex differences
   awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_male_female.tsv > significant_sex_diff.tsv

   # Find regions with large effect size
   awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_sex_diff.tsv

Example: snATAC-seq Cell Type Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Complete workflow for analyzing cell-type-specific chromatin accessibility imbalance:

**Step 1: Count alleles from snATAC-seq BAM**

.. code-block:: bash

   # Extract valid barcodes from Cell Ranger output
   zcat filtered_barcodes.tsv.gz > barcodes.txt

   # Count alleles at heterozygous SNPs overlapping peaks
   wasp2-count count-variants-sc \
     possorted_bam.bam \
     phased_variants.vcf.gz \
     barcodes.txt \
     --samples sample1 \
     --feature atac_peaks.bed \
     --out_file snATAC_counts.h5ad

**Step 2: Create barcode-to-celltype mapping**

Export cell type annotations from your clustering analysis (e.g., Seurat, ArchR):

.. code-block:: r

   # R/Seurat example
   write.table(
     data.frame(barcode = Cells(seurat_obj),
                celltype = Idents(seurat_obj)),
     "barcode_celltype_map.tsv",
     sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
   )

**Step 3: Run single-cell imbalance analysis**

.. code-block:: bash

   # Per-celltype analysis
   wasp2-analyze find-imbalance-sc \
     snATAC_counts.h5ad \
     barcode_celltype_map.tsv \
     --sample sample1 \
     --phased \
     --min 10 \
     -z 3.0

**Step 4: Compare imbalance between cell types**

.. code-block:: bash

   # Compare specific cell types
   wasp2-analyze compare-imbalance \
     snATAC_counts.h5ad \
     barcode_celltype_map.tsv \
     --sample sample1 \
     --groups "excitatory,inhibitory,astrocyte" \
     --phased \
     --min 15

   # This produces:
   # - ai_results_excitatory_inhibitory.tsv
   # - ai_results_excitatory_astrocyte.tsv
   # - ai_results_inhibitory_astrocyte.tsv

**Step 5: Identify cell-type-specific regulatory regions**

.. code-block:: bash

   # Find peaks with differential imbalance between excitatory and inhibitory neurons
   awk -F'\t' '$9 < 0.01 && ($4 > 0.6 || $4 < 0.4)' \
     ai_results_excitatory_inhibitory.tsv > neuron_subtype_specific_peaks.tsv

Best Practices for Single-Cell Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Data Quality:**

* Use WASP-filtered BAM files to remove mapping bias
* Require ≥10 total counts per region per group (``--min 10``)
* Apply z-score filtering to remove outliers (``-z 3.0``)

**Statistical Power:**

* Merge similar cell types if individual populations have low coverage
* Use phased genotypes when available (``--phased``)
* Focus on regions with multiple SNPs for better estimates

**Interpretation:**

* Consider biological replication across samples
* Validate top hits with orthogonal methods (allele-specific CRISPR, etc.)
* Integrate with eQTL data to identify causal variants

Example Workflow
----------------

.. code-block:: bash

   # 1. Count alleles
   wasp2-count count-variants \
     wasp_filtered.bam \
     variants.vcf \
     --region genes.gtf \
     --samples NA12878 \
     --output counts.tsv

   # 2. Analyze imbalance
   wasp2-analyze find-imbalance \
     counts.tsv \
     --min-count 20 \
     --model single \
     --output imbalance.tsv

   # 3. Filter significant results
   awk '$5 < 0.05' imbalance.tsv > significant.tsv

Best Practices
--------------

Read Depth
~~~~~~~~~~

* Minimum 10 reads per region (use ``--min-count``)
* Higher depth = more power
* Consider downsampling very deep regions

Quality Control
~~~~~~~~~~~~~~~

* Use WASP-filtered reads
* Remove low-complexity regions
* Filter low-quality SNPs

Multiple Testing
~~~~~~~~~~~~~~~~

* FDR correction is automatic
* Consider Bonferroni for very important regions
* Validate top hits experimentally

Common Issues
-------------

No Significant Results
~~~~~~~~~~~~~~~~~~~~~~

* Increase sample size
* Check read depth (use deeper sequencing)
* Verify heterozygous SNPs present

Many Significant Results
~~~~~~~~~~~~~~~~~~~~~~~~

* Check for batch effects
* Verify WASP filtering was applied
* Consider stricter FDR threshold

Next Steps
----------

* Validate results with qPCR or DNA-seq
* Integrate with eQTL data
* Perform pathway enrichment analysis