RNA-seq Allelic Imbalance Tutorial
===================================

This tutorial demonstrates a complete workflow for detecting allele-specific expression (ASE)
in bulk RNA-seq data using WASP2.

**Estimated time:** ~30 minutes

Overview
--------

The tutorial covers the complete RNA-seq allelic imbalance analysis pipeline:

1. **Data Loading** - BAM, VCF, and gene annotations (GTF)
2. **Allele Counting** - Count reads at heterozygous SNPs within genes/exons
3. **Statistical Testing** - Beta-binomial model for allelic imbalance detection
4. **ASE Visualization** - Volcano plots and allele ratio distributions
5. **Imprinting Detection** - Identify monoallelic expression patterns
6. **eQTL Integration** - Connect ASE signals to regulatory variants

Prerequisites
-------------

**Software:**

* WASP2 (``pip install wasp2``)
* Python packages: pandas, numpy, matplotlib, seaborn, scipy

**Data:**

* Aligned BAM file (coordinate-sorted, indexed)
* Phased VCF file with heterozygous variants
* Gene annotation file (GTF format, e.g., GENCODE)

Workflow Summary
----------------

Step 1: Count Alleles at Genes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``wasp2-count count-variants`` to count allele-specific reads at heterozygous SNPs:

.. code-block:: bash

   wasp2-count count-variants \
       sample.bam \
       variants.vcf.gz \
       --samples SAMPLE_ID \
       --region genes.gtf \
       --out_file allele_counts.tsv

This produces a TSV file with columns:

* ``chr``, ``pos``: SNP location
* ``ref``, ``alt``: Alleles
* ``ref_count``, ``alt_count``: Read counts per allele
* ``other_count``: Reads supporting other alleles (non-ref, non-alt)
* ``gene_id``, ``gene_name``: Overlapping gene annotation
* ``feature``: Feature type (exon, intron, etc.) when using GTF

Step 2: Test for Allelic Imbalance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``wasp2-analyze find-imbalance`` to detect significant ASE:

.. code-block:: bash

   wasp2-analyze find-imbalance \
       allele_counts.tsv \
       --min 10 \
       --pseudocount 1 \
       --phased \
       --out_file ai_results.tsv

The beta-binomial model tests for deviation from 50:50 allele ratios, accounting for
biological overdispersion.

**Output columns:**

* ``region``: Gene identifier
* ``ref_count``, ``alt_count``: Aggregated counts
* ``p_value``: Likelihood ratio test p-value
* ``fdr_pval``: FDR-corrected p-value
* ``effect_size``: Log2 fold change (ref/alt)

Step 3: Identify Significant ASE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filter results by significance and effect size:

.. code-block:: bash

   # Significant ASE (FDR < 0.05, |log2FC| > 1)
   # Column indices: $5 = fdr_pval, $6 = effect_size
   awk -F'\t' 'NR==1 || ($5 < 0.05 && ($6 > 1 || $6 < -1))' ai_results.tsv > significant_ase.tsv

Step 4: Detect Imprinting Patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Monoallelic expression (>90:10 allele ratio) may indicate genomic imprinting:

.. code-block:: python

   import pandas as pd

   results = pd.read_csv('ai_results.tsv', sep='\t')
   total = results['ref_count'] + results['alt_count']
   results['ref_ratio'] = results['ref_count'] / total

   # Monoallelic genes
   monoallelic = results[
       (results['ref_ratio'] > 0.9) | (results['ref_ratio'] < 0.1)
   ]

Step 5: Integrate with eQTL Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cross-reference ASE results with eQTL databases (e.g., GTEx):

.. code-block:: python

   import pandas as pd

   ase = pd.read_csv('ai_results.tsv', sep='\t')
   eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t')

   # Merge on gene ID
   integrated = ase.merge(eqtl, left_on='region', right_on='gene_id')

   # Check direction concordance between ASE and eQTL
   # ASE effect_size > 0 means REFERENCE allele is more expressed
   # eQTL slope > 0 means ALTERNATE allele INCREASES expression
   # Therefore, concordance = OPPOSITE signs (ref high in ASE = alt low in eQTL)
   integrated['concordant'] = (
       (integrated['effect_size'] > 0) != (integrated['slope'] > 0)
   )

Key Concepts
------------

Beta-Binomial Model
~~~~~~~~~~~~~~~~~~~

WASP2 uses a beta-binomial distribution to model allele counts:

* Accounts for **overdispersion** (biological variation beyond binomial sampling)
* Models **technical noise** from PCR amplification and sequencing
* Aggregates information across **multiple SNPs** per gene

The null hypothesis is equal expression from both alleles (p = 0.5).

Effect Size Interpretation
~~~~~~~~~~~~~~~~~~~~~~~~~~

* **log2FC > 1**: Reference allele 2x more expressed (strong ASE)
* **log2FC > 2**: Reference allele 4x more expressed (very strong ASE)
* **log2FC near 0**: Balanced expression (no ASE)

Significance Thresholds
~~~~~~~~~~~~~~~~~~~~~~~

* **FDR < 0.05**: Standard significance threshold
* **FDR < 0.01**: Stringent threshold for high-confidence hits
* Combine with effect size filters to focus on biologically meaningful results

Troubleshooting
---------------

Low SNP Counts
~~~~~~~~~~~~~~

If few heterozygous SNPs are detected:

* Verify VCF contains heterozygous genotypes:

  - **Phased format**: GT = ``0|1`` or ``1|0`` (pipe separator)
  - **Unphased format**: GT = ``0/1`` or ``1/0`` (slash separator)
  - Use ``--phased`` flag only with phased genotypes

* Check sample ID matches VCF sample column
* Ensure BAM and VCF use the same reference genome

No Significant Results
~~~~~~~~~~~~~~~~~~~~~~

* Increase sequencing depth (more reads = more power)
* Lower ``--min`` threshold (but interpret with caution)
* Check for batch effects or technical artifacts

Too Many Significant Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Verify WASP mapping bias correction was applied
* Check for copy number variation (CNV) artifacts
* Use stricter FDR threshold (e.g., 0.01)

See Also
--------

* :doc:`/user_guide/counting` - Detailed counting options
* :doc:`/user_guide/analysis` - Statistical methods and parameters
* :doc:`/tutorials/scrna_seq` - Single-cell RNA-seq tutorial
* :doc:`/tutorials/comparative_imbalance` - Comparing ASE between groups