{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ATAC-seq Allelic Imbalance Analysis Tutorial\n",
    "\n",
    "**Estimated time:** ~30 minutes\n",
    "\n",
    "This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:\n",
    "\n",
    "1. Prepare ATAC-seq peak files for analysis\n",
    "2. Count alleles at heterozygous SNPs within accessibility peaks\n",
    "3. Detect significant allelic imbalance using beta-binomial statistical testing\n",
    "4. Visualize results with volcano plots and effect size distributions\n",
    "5. Integrate with caQTL/eQTL data for biological interpretation\n",
    "\n",
    "## Background\n",
    "\n",
    "**Allelic Imbalance in Chromatin Accessibility**\n",
    "\n",
    "ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates **allelic imbalance in chromatin accessibility**.\n",
    "\n",
    "Such imbalance often reflects:\n",
    "- *cis*-regulatory variants affecting transcription factor binding\n",
    "- Chromatin accessibility QTLs (caQTLs)\n",
    "- Haplotype-specific enhancer activity\n",
    "\n",
    "WASP2 uses a **beta-binomial model** to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Prerequisites\n\n### Software\n\n- **WASP2** (`pip install wasp2`)\n- **Python 3.10+** with pandas, matplotlib, numpy\n- **samtools** (for BAM operations)\n- **tabix** (for VCF indexing)\n\n### Input Data\n\n| File | Description | Format |\n|------|-------------|--------|\n| `sample.bam` | ATAC-seq aligned reads | BAM (indexed) |\n| `variants.vcf.gz` | Phased heterozygous variants | VCF (indexed) |\n| `peaks.bed` | ATAC-seq peaks from MACS2/SEACR | BED or narrowPeak |\n\n**Note:** For best results, use WASP-filtered BAM files to correct reference mapping bias. See the [mapping documentation](../user_guide/mapping.rst) for details."
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "\n",
    "# Configure plotting\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (10, 6)\n",
    "plt.rcParams['font.size'] = 11\n",
    "\n",
    "# Define paths (modify these for your data)\n",
    "DATA_DIR = Path(\"data\")\n",
    "RESULTS_DIR = Path(\"results\")\n",
    "RESULTS_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "# Input files\n",
    "BAM_FILE = DATA_DIR / \"atac_sample.bam\"\n",
    "VCF_FILE = DATA_DIR / \"phased_variants.vcf.gz\"\n",
    "PEAKS_FILE = DATA_DIR / \"atac_peaks.narrowPeak\"\n",
    "SAMPLE_ID = \"SAMPLE1\"  # Sample name in VCF\n",
    "\n",
    "print(\"WASP2 ATAC-seq Tutorial\")\n",
    "print(\"=\" * 40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1: Data Loading and Preparation\n",
    "\n",
    "### 1.1 Inspect Peak File Format\n",
    "\n",
    "ATAC-seq peaks from MACS2 are typically in **narrowPeak** format (BED6+4). WASP2 accepts both BED and narrowPeak formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load and inspect peaks\n",
    "peak_columns = ['chr', 'start', 'end', 'name', 'score', 'strand', \n",
    "                'signalValue', 'pValue', 'qValue', 'peak']\n",
    "peaks_df = pd.read_csv(PEAKS_FILE, sep='\\t', header=None, names=peak_columns)\n",
    "\n",
    "print(f\"Total peaks: {len(peaks_df):,}\")\n",
    "print(f\"\\nPeak size distribution:\")\n",
    "peaks_df['size'] = peaks_df['end'] - peaks_df['start']\n",
    "print(peaks_df['size'].describe())\n",
    "\n",
    "print(f\"\\nFirst 5 peaks:\")\n",
    "peaks_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Verify BAM and VCF Files\n",
    "\n",
    "Check that your input files are properly formatted and indexed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$SAMPLE_ID\"\n",
    "BAM_FILE=$1\n",
    "VCF_FILE=$2\n",
    "SAMPLE_ID=$3\n",
    "\n",
    "echo \"=== BAM File Check ===\"\n",
    "echo \"File: $BAM_FILE\"\n",
    "samtools view -H \"$BAM_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"=== VCF File Check ===\"\n",
    "echo \"File: $VCF_FILE\"\n",
    "echo \"Checking for sample: $SAMPLE_ID\"\n",
    "bcftools query -l \"$VCF_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"=== Index Check ===\"\n",
    "ls -la \"${BAM_FILE}.bai\" 2>/dev/null || echo \"BAM index (.bai): Using example paths\"\n",
    "ls -la \"${VCF_FILE}.tbi\" 2>/dev/null || echo \"VCF index (.tbi): Using example paths\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2: Allele Counting at Peaks\n",
    "\n",
    "WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The `--region` parameter restricts counting to SNPs overlapping your peaks.\n",
    "\n",
    "### 2.1 Run Allele Counting\n",
    "\n",
    "**Key Parameters:**\n",
    "- `--region`: Peak file to restrict SNPs to accessible regions\n",
    "- `--samples`: Sample ID for genotype filtering (heterozygous sites only)\n",
    "- `--out_file`: Output path for count results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define output file\n",
    "COUNTS_FILE = RESULTS_DIR / \"atac_allele_counts.tsv\"\n",
    "\n",
    "# Build the command\n",
    "count_cmd = f\"\"\"\n",
    "wasp2-count count-variants \\\\\n",
    "    {BAM_FILE} \\\\\n",
    "    {VCF_FILE} \\\\\n",
    "    --region {PEAKS_FILE} \\\\\n",
    "    --samples {SAMPLE_ID} \\\\\n",
    "    --out_file {COUNTS_FILE}\n",
    "\"\"\"\n",
    "\n",
    "print(\"Command to run:\")\n",
    "print(count_cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$PEAKS_FILE\" \"$SAMPLE_ID\" \"$COUNTS_FILE\"\n",
    "# Uncomment to run (requires actual data files)\n",
    "# wasp2-count count-variants \\\n",
    "#     \"$1\" \\\n",
    "#     \"$2\" \\\n",
    "#     --region \"$3\" \\\n",
    "#     --samples \"$4\" \\\n",
    "#     --out_file \"$5\"\n",
    "\n",
    "echo \"Note: Uncomment the command above to run with your data\"\n",
    "echo \"For this tutorial, we'll use simulated example output.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Inspect Count Results\n",
    "\n",
    "The output contains per-SNP allele counts with peak annotations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For demonstration, create example data\n",
    "# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\\t') for real data)\n",
    "\n",
    "np.random.seed(42)\n",
    "n_snps = 5000\n",
    "\n",
    "# Simulate realistic ATAC-seq allele counts\n",
    "counts_df = pd.DataFrame({\n",
    "    'chr': np.random.choice(['chr1', 'chr2', 'chr3', 'chr4', 'chr5'], n_snps),\n",
    "    'pos': np.random.randint(1e6, 2e8, n_snps),\n",
    "    'ref': np.random.choice(['A', 'C', 'G', 'T'], n_snps),\n",
    "    'alt': np.random.choice(['A', 'C', 'G', 'T'], n_snps),\n",
    "    'region_id': [f'peak_{i}' for i in np.random.randint(0, 1500, n_snps)],\n",
    "})\n",
    "\n",
    "# Generate counts with some true imbalanced regions\n",
    "total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5\n",
    "imbalance_prob = np.where(\n",
    "    np.random.random(n_snps) < 0.1,  # 10% truly imbalanced\n",
    "    np.random.choice([0.3, 0.7], n_snps),  # Imbalanced allele freq\n",
    "    0.5  # Balanced\n",
    ")\n",
    "counts_df['ref_count'] = np.random.binomial(total_depth, imbalance_prob)\n",
    "counts_df['alt_count'] = total_depth - counts_df['ref_count']\n",
    "counts_df['other_count'] = 0\n",
    "\n",
    "print(f\"Total SNPs counted: {len(counts_df):,}\")\n",
    "print(f\"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}\")\n",
    "print(f\"\\nCount statistics:\")\n",
    "print(counts_df[['ref_count', 'alt_count']].describe())\n",
    "counts_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Quality Control: Count Distribution\n",
    "\n",
    "ATAC-seq typically has **lower coverage per peak** than RNA-seq genes. Check the distribution to set appropriate filtering thresholds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate total counts per SNP\n",
    "counts_df['total'] = counts_df['ref_count'] + counts_df['alt_count']\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Total count distribution\n",
    "ax = axes[0]\n",
    "ax.hist(counts_df['total'], bins=50, edgecolor='black', alpha=0.7)\n",
    "ax.axvline(10, color='red', linestyle='--', label='min_count=10')\n",
    "ax.axvline(5, color='orange', linestyle='--', label='min_count=5')\n",
    "ax.set_xlabel('Total Read Count per SNP')\n",
    "ax.set_ylabel('Number of SNPs')\n",
    "ax.set_title('Read Depth Distribution')\n",
    "ax.legend()\n",
    "ax.set_xlim(0, 100)\n",
    "\n",
    "# Allele ratio distribution\n",
    "ax = axes[1]\n",
    "ratio = counts_df['ref_count'] / counts_df['total']\n",
    "ax.hist(ratio[counts_df['total'] >= 10], bins=50, edgecolor='black', alpha=0.7)\n",
    "ax.axvline(0.5, color='red', linestyle='--', label='Expected (0.5)')\n",
    "ax.set_xlabel('Reference Allele Frequency')\n",
    "ax.set_ylabel('Number of SNPs')\n",
    "ax.set_title('Allele Ratio Distribution (depth ≥10)')\n",
    "ax.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / 'count_qc.png', dpi=150)\n",
    "plt.show()\n",
    "\n",
    "# Summary statistics\n",
    "print(f\"\\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100*(counts_df['total'] >= 5).mean():.1f}%)\")\n",
    "print(f\"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100*(counts_df['total'] >= 10).mean():.1f}%)\")\n",
    "print(f\"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100*(counts_df['total'] >= 20).mean():.1f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 3: Statistical Testing (Beta-Binomial)\n",
    "\n",
    "WASP2's `find-imbalance` command uses a **beta-binomial model** to test for allelic imbalance:\n",
    "\n",
    "- **Null hypothesis (H₀):** Reference allele frequency = 0.5 (balanced)\n",
    "- **Alternative (H₁):** Reference allele frequency ≠ 0.5 (imbalanced)\n",
    "\n",
    "The beta-binomial distribution accounts for **overdispersion** - the extra variability beyond binomial sampling that's common in sequencing data.\n",
    "\n",
    "### 3.1 Run Imbalance Detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Define output file\nIMBALANCE_FILE = RESULTS_DIR / \"atac_imbalance_results.tsv\"\n\n# Build the command\n# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)\n# The --model single option uses a single dispersion parameter for all regions\nanalysis_cmd = f\"\"\"\nwasp2-analyze find-imbalance \\\\\n    {COUNTS_FILE} \\\\\n    --min 5 \\\\\n    --model single \\\\\n    --output {IMBALANCE_FILE}\n\"\"\"\n\nprint(\"Command to run:\")\nprint(analysis_cmd)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "%%bash -s \"$COUNTS_FILE\" \"$IMBALANCE_FILE\"\n# Uncomment to run (requires actual count file)\n# wasp2-analyze find-imbalance \\\n#     \"$1\" \\\n#     --min 5 \\\n#     --model single \\\n#     --output \"$2\"\n\necho \"Note: Uncomment the command above to run with your data\""
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Simulate Results for Demonstration\n",
    "\n",
    "For this tutorial, we simulate realistic analysis results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Aggregate counts by peak (region)\npeak_counts = counts_df.groupby('region_id').agg({\n    'chr': 'first',\n    'pos': ['min', 'max'],\n    'ref_count': 'sum',\n    'alt_count': 'sum',\n}).reset_index()\npeak_counts.columns = ['region', 'chr', 'start', 'end', 'ref_count', 'alt_count']\n\n# Calculate statistics\npeak_counts['total'] = peak_counts['ref_count'] + peak_counts['alt_count']\npeak_counts['mu'] = peak_counts['ref_count'] / peak_counts['total']\n# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions\npeak_counts['effect_size'] = np.log2(\n    (peak_counts['ref_count'] + 1) / (peak_counts['alt_count'] + 1)\n)\n\n# Simulate p-values (truly imbalanced peaks get low p-values)\n# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits\n# overdispersion, which is why WASP2 uses the beta-binomial model.\nnp.random.seed(42)\nis_imbalanced = np.abs(peak_counts['mu'] - 0.5) > 0.15\npeak_counts['p_value'] = np.where(\n    is_imbalanced,\n    10 ** (-np.random.uniform(2, 10, len(peak_counts))),  # Significant\n    np.random.uniform(0.05, 1, len(peak_counts))  # Not significant\n)\n\n# FDR correction (Benjamini-Hochberg)\n# Note: This manual BH implementation is for demonstration.\n# WASP2 internally uses scipy.stats.false_discovery_control()\nfrom scipy import stats\npeak_counts = peak_counts.sort_values('p_value')\nn_tests = len(peak_counts)\npeak_counts['rank'] = range(1, n_tests + 1)\npeak_counts['fdr_pval'] = np.minimum(\n    peak_counts['p_value'] * n_tests / peak_counts['rank'],\n    1.0\n)\npeak_counts['fdr_pval'] = peak_counts['fdr_pval'][::-1].cummin()[::-1]\n\n# Filter to testable peaks\nresults_df = peak_counts[peak_counts['total'] >= 5].copy()\nresults_df = results_df.drop('rank', axis=1)\n\nprint(f\"Peaks tested: {len(results_df):,}\")\nprint(f\"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}\")\nprint(f\"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}\")\nresults_df.head(10)"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 4: Result Interpretation and Visualization\n",
    "\n",
    "### 4.1 Volcano Plot\n",
    "\n",
    "The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "\n",
    "# Calculate -log10(p-value) for plotting\n",
    "results_df['neg_log10_pval'] = -np.log10(results_df['p_value'].clip(lower=1e-50))\n",
    "\n",
    "# Define significance thresholds\n",
    "sig_mask = results_df['fdr_pval'] < 0.05\n",
    "effect_mask = np.abs(results_df['effect_size']) > 0.5  # |log2FC| > 0.5\n",
    "\n",
    "# Plot non-significant points\n",
    "ns = ~sig_mask\n",
    "ax.scatter(results_df.loc[ns, 'effect_size'], \n",
    "           results_df.loc[ns, 'neg_log10_pval'],\n",
    "           c='lightgray', s=15, alpha=0.6, label=f'Not significant (n={ns.sum():,})')\n",
    "\n",
    "# Plot significant but small effect\n",
    "sig_small = sig_mask & ~effect_mask\n",
    "ax.scatter(results_df.loc[sig_small, 'effect_size'],\n",
    "           results_df.loc[sig_small, 'neg_log10_pval'],\n",
    "           c='steelblue', s=25, alpha=0.7, label=f'FDR<0.05, small effect (n={sig_small.sum():,})')\n",
    "\n",
    "# Plot significant and large effect\n",
    "sig_large = sig_mask & effect_mask\n",
    "ax.scatter(results_df.loc[sig_large, 'effect_size'],\n",
    "           results_df.loc[sig_large, 'neg_log10_pval'],\n",
    "           c='firebrick', s=40, alpha=0.8, label=f'FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})')\n",
    "\n",
    "# Add threshold lines\n",
    "ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.3, linewidth=1)\n",
    "ax.axvline(0.5, color='gray', linestyle=':', alpha=0.5)\n",
    "ax.axvline(-0.5, color='gray', linestyle=':', alpha=0.5)\n",
    "ax.axvline(0, color='black', linestyle='-', alpha=0.2)\n",
    "\n",
    "ax.set_xlabel('Effect Size (log₂ Ref/Alt)', fontsize=12)\n",
    "ax.set_ylabel('-log₁₀(p-value)', fontsize=12)\n",
    "ax.set_title('ATAC-seq Allelic Imbalance\\nVolcano Plot', fontsize=14)\n",
    "ax.legend(loc='upper right', fontsize=9)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / 'volcano_plot.png', dpi=150)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Effect Size Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# All peaks\n",
    "ax = axes[0]\n",
    "ax.hist(results_df['effect_size'], bins=50, edgecolor='black', alpha=0.7, color='steelblue')\n",
    "ax.axvline(0, color='red', linestyle='--', linewidth=2)\n",
    "ax.set_xlabel('Effect Size (log₂ Ref/Alt)')\n",
    "ax.set_ylabel('Number of Peaks')\n",
    "ax.set_title('All Tested Peaks')\n",
    "\n",
    "# Significant peaks only\n",
    "ax = axes[1]\n",
    "sig_effects = results_df.loc[sig_mask, 'effect_size']\n",
    "ax.hist(sig_effects, bins=30, edgecolor='black', alpha=0.7, color='firebrick')\n",
    "ax.axvline(0, color='black', linestyle='--', linewidth=2)\n",
    "ax.set_xlabel('Effect Size (log₂ Ref/Alt)')\n",
    "ax.set_ylabel('Number of Peaks')\n",
    "ax.set_title(f'Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / 'effect_size_distribution.png', dpi=150)\n",
    "plt.show()\n",
    "\n",
    "# Summary statistics\n",
    "print(\"Effect size statistics (significant peaks):\")\n",
    "print(f\"  Mean: {sig_effects.mean():.3f}\")\n",
    "print(f\"  Median: {sig_effects.median():.3f}\")\n",
    "print(f\"  Std: {sig_effects.std():.3f}\")\n",
    "print(f\"  Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}\")\n",
    "print(f\"  Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Top Imbalanced Peaks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get top hits by significance\n",
    "top_hits = results_df[results_df['fdr_pval'] < 0.05].nsmallest(20, 'fdr_pval')\n",
    "\n",
    "print(\"Top 20 Peaks with Allelic Imbalance\")\n",
    "print(\"=\" * 80)\n",
    "display_cols = ['region', 'chr', 'ref_count', 'alt_count', 'mu', 'effect_size', 'p_value', 'fdr_pval']\n",
    "top_hits[display_cols].round(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 5: QTL Overlap Analysis\n",
    "\n",
    "Peaks with allelic imbalance often harbor **chromatin accessibility QTLs (caQTLs)** or overlap with **expression QTLs (eQTLs)**. Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.\n",
    "\n",
    "### 5.1 Prepare BED File for Overlap Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export significant peaks as BED for overlap analysis\n",
    "sig_peaks = results_df[results_df['fdr_pval'] < 0.05][['chr', 'start', 'end', 'region', 'effect_size', 'fdr_pval']].copy()\n",
    "sig_peaks.to_csv(RESULTS_DIR / 'significant_peaks.bed', sep='\\t', header=False, index=False)\n",
    "\n",
    "print(f\"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}\")\n",
    "print(\"\\nUse this file for overlap analysis with:\")\n",
    "print(\"  - GTEx eQTLs (https://gtexportal.org)\")\n",
    "print(\"  - ENCODE cCREs (https://www.encodeproject.org)\")\n",
    "print(\"  - Published caQTL datasets\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Example: GTEx eQTL Overlap\n",
    "\n",
    "This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example overlap analysis with bedtools (requires eQTL BED file)\n",
    "overlap_cmd = \"\"\"\n",
    "# Download GTEx eQTLs for your tissue of interest\n",
    "# Example: Brain cortex significant eQTLs\n",
    "\n",
    "# Intersect imbalanced peaks with eQTL positions\n",
    "bedtools intersect \\\\\n",
    "    -a results/significant_peaks.bed \\\\\n",
    "    -b gtex_brain_cortex_eqtls.bed \\\\\n",
    "    -wa -wb \\\\\n",
    "    > results/peak_eqtl_overlap.bed\n",
    "\n",
    "# Count overlaps\n",
    "wc -l results/peak_eqtl_overlap.bed\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example bedtools command for eQTL overlap:\")\n",
    "print(overlap_cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate overlap results for demonstration\n",
    "np.random.seed(123)\n",
    "\n",
    "# Create simulated eQTL overlap data\n",
    "n_overlap = 150\n",
    "overlap_df = pd.DataFrame({\n",
    "    'peak': [f'peak_{i}' for i in np.random.choice(range(1500), n_overlap, replace=False)],\n",
    "    'eqtl_gene': [f'GENE{i}' for i in np.random.randint(1, 500, n_overlap)],\n",
    "    'eqtl_pval': 10 ** (-np.random.uniform(3, 15, n_overlap)),\n",
    "    'tissue': np.random.choice(['Brain_Cortex', 'Brain_Hippocampus', 'Liver', 'Heart'], n_overlap)\n",
    "})\n",
    "\n",
    "print(f\"Peaks overlapping eQTLs: {len(overlap_df)}\")\n",
    "print(f\"\\nOverlap by tissue:\")\n",
    "print(overlap_df['tissue'].value_counts())\n",
    "\n",
    "# Enrichment analysis\n",
    "n_sig_peaks = sig_mask.sum()\n",
    "n_total_peaks = len(results_df)\n",
    "n_eqtl_overlap = len(overlap_df)\n",
    "\n",
    "# Fisher's exact test for enrichment\n",
    "from scipy.stats import fisher_exact\n",
    "\n",
    "# Assume 10% of all peaks overlap eQTLs by chance\n",
    "expected_overlap = int(n_sig_peaks * 0.10)\n",
    "contingency = [\n",
    "    [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],\n",
    "    [expected_overlap, n_sig_peaks - expected_overlap]\n",
    "]\n",
    "odds_ratio, p_value = fisher_exact(contingency)\n",
    "\n",
    "print(f\"\\nEnrichment Analysis:\")\n",
    "print(f\"  Imbalanced peaks: {n_sig_peaks}\")\n",
    "print(f\"  Overlapping eQTLs: {n_eqtl_overlap}\")\n",
    "print(f\"  Expected by chance: ~{expected_overlap}\")\n",
    "print(f\"  Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Visualization: eQTL Overlap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Pie chart: Overlap proportion\n",
    "ax = axes[0]\n",
    "overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]\n",
    "labels = ['Overlap with eQTL', 'No eQTL overlap']\n",
    "colors = ['#e74c3c', '#95a5a6']\n",
    "ax.pie(overlap_counts, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)\n",
    "ax.set_title('Imbalanced Peaks Overlapping eQTLs')\n",
    "\n",
    "# Bar chart: Overlap by tissue\n",
    "ax = axes[1]\n",
    "tissue_counts = overlap_df['tissue'].value_counts()\n",
    "colors = plt.cm.Set2(range(len(tissue_counts)))\n",
    "bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor='black')\n",
    "ax.set_xlabel('Tissue')\n",
    "ax.set_ylabel('Number of Overlapping eQTLs')\n",
    "ax.set_title('eQTL Overlap by Tissue')\n",
    "ax.tick_params(axis='x', rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / 'eqtl_overlap.png', dpi=150)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 6: Downstream Analysis Hints\n",
    "\n",
    "### 6.1 Motif Enrichment Analysis\n",
    "\n",
    "Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "motif_cmd = \"\"\"\n",
    "# Extract sequences around imbalanced SNPs\n",
    "bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\\\\n",
    "bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa\n",
    "\n",
    "# Run HOMER motif analysis\n",
    "findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200\n",
    "\n",
    "# Alternative: MEME-ChIP\n",
    "meme-chip -oc meme_results imbalanced_seqs.fa\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example commands for motif enrichment analysis:\")\n",
    "print(motif_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 Gene Ontology Enrichment\n",
    "\n",
    "Identify biological processes associated with genes near imbalanced peaks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "go_cmd = \"\"\"\n",
    "# Annotate peaks with nearest genes using GREAT or bedtools\n",
    "bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed\n",
    "\n",
    "# Extract gene list\n",
    "cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt\n",
    "\n",
    "# Use DAVID, Enrichr, or clusterProfiler for GO enrichment\n",
    "# Web interface: https://david.ncifcrf.gov/\n",
    "# Web interface: https://maayanlab.cloud/Enrichr/\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example commands for GO enrichment analysis:\")\n",
    "print(go_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3 Single-Cell ATAC-seq Extension\n",
    "\n",
    "For single-cell ATAC-seq (scATAC-seq), use WASP2's single-cell workflow to detect cell-type-specific allelic imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc_cmd = \"\"\"\n",
    "# Count alleles in single-cell ATAC-seq\n",
    "wasp2-count count-variants-sc \\\\\n",
    "    scatac_possorted.bam \\\\\n",
    "    variants.vcf.gz \\\\\n",
    "    barcodes.tsv \\\\\n",
    "    --samples SAMPLE_ID \\\\\n",
    "    --feature peaks.bed \\\\\n",
    "    --out_file scatac_counts.h5ad\n",
    "\n",
    "# Detect imbalance per cell type\n",
    "wasp2-analyze find-imbalance-sc \\\\\n",
    "    scatac_counts.h5ad \\\\\n",
    "    barcode_celltype_map.tsv \\\\\n",
    "    --sample SAMPLE_ID \\\\\n",
    "    --min 5 \\\\\n",
    "    --phased\n",
    "\n",
    "# Compare imbalance between cell types\n",
    "wasp2-analyze compare-imbalance \\\\\n",
    "    scatac_counts.h5ad \\\\\n",
    "    barcode_celltype_map.tsv \\\\\n",
    "    --groups \"excitatory,inhibitory\" \\\\\n",
    "    --sample SAMPLE_ID\n",
    "\"\"\"\n",
    "\n",
    "print(\"Commands for single-cell ATAC-seq analysis:\")\n",
    "print(sc_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned to:\n",
    "\n",
    "1. **Prepare ATAC-seq data** - Load peaks and verify input file formats\n",
    "2. **Count alleles** - Use `wasp2-count count-variants` with peak regions\n",
    "3. **Detect imbalance** - Apply beta-binomial testing with `wasp2-analyze find-imbalance`\n",
    "4. **Visualize results** - Create volcano plots and effect size distributions\n",
    "5. **Integrate with QTLs** - Overlap with eQTL databases for biological validation\n",
    "\n",
    "### Key Takeaways\n",
    "\n",
    "- ATAC-seq has **lower coverage per peak** than RNA-seq; use `--min-count 5` instead of 10\n",
    "- **FDR correction** is essential for multiple testing across thousands of peaks\n",
    "- Consider **effect size** alongside significance for biological relevance\n",
    "- **QTL overlap** helps validate findings and identify causal variants\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "- [Comparative Imbalance Tutorial](./comparative_imbalance.rst) - Compare imbalance between conditions\n",
    "- [Single-Cell Tutorial](./scrna_seq.rst) - Cell-type-specific analysis\n",
    "- [Statistical Methods](../methods/statistical_models.rst) - Deep dive into the beta-binomial model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Troubleshooting\n",
    "\n",
    "### Common Issues\n",
    "\n",
    "**Low SNP counts in peaks:**\n",
    "- Ensure VCF contains heterozygous variants for your sample\n",
    "- Check that peak coordinates use the same reference genome as VCF\n",
    "- Verify `--samples` matches the sample name in VCF header\n",
    "\n",
    "**Memory errors with large datasets:**\n",
    "- Process chromosomes separately with `--region chr1_peaks.bed`, etc.\n",
    "- Use `WASP2_RUST_THREADS=4` to limit parallel processing\n",
    "\n",
    "**No significant results:**\n",
    "- Check read depth (may need deeper sequencing)\n",
    "- Verify WASP filtering was applied to remove mapping bias\n",
    "- Consider lowering `--min-count` threshold (with caution)\n",
    "\n",
    "### Diagnostic Commands\n",
    "\n",
    "```bash\n",
    "# Check VCF sample names\n",
    "bcftools query -l variants.vcf.gz\n",
    "\n",
    "# Count heterozygous SNPs in your sample\n",
    "bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l\n",
    "\n",
    "# Check BAM read depth at a peak\n",
    "samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save final results\n",
    "results_df.to_csv(RESULTS_DIR / 'final_imbalance_results.tsv', sep='\\t', index=False)\n",
    "print(f\"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}\")\n",
    "print(f\"\\nAnalysis complete!\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}