{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ATAC-seq Allelic Imbalance Analysis Tutorial\n", "\n", "**Estimated time:** ~30 minutes\n", "\n", "This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:\n", "\n", "1. Prepare ATAC-seq peak files for analysis\n", "2. Count alleles at heterozygous SNPs within accessibility peaks\n", "3. Detect significant allelic imbalance using beta-binomial statistical testing\n", "4. Visualize results with volcano plots and effect size distributions\n", "5. Integrate with caQTL/eQTL data for biological interpretation\n", "\n", "## Background\n", "\n", "**Allelic Imbalance in Chromatin Accessibility**\n", "\n", "ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates **allelic imbalance in chromatin accessibility**.\n", "\n", "Such imbalance often reflects:\n", "- *cis*-regulatory variants affecting transcription factor binding\n", "- Chromatin accessibility QTLs (caQTLs)\n", "- Haplotype-specific enhancer activity\n", "\n", "WASP2 uses a **beta-binomial model** to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": "## Prerequisites\n\n### Software\n\n- **WASP2** (`pip install wasp2`)\n- **Python 3.10+** with pandas, matplotlib, numpy\n- **samtools** (for BAM operations)\n- **tabix** (for VCF indexing)\n\n### Input Data\n\n| File | Description | Format |\n|------|-------------|--------|\n| `sample.bam` | ATAC-seq aligned reads | BAM (indexed) |\n| `variants.vcf.gz` | Phased heterozygous variants | VCF (indexed) |\n| `peaks.bed` | ATAC-seq peaks from MACS2/SEACR | BED or narrowPeak |\n\n**Note:** For best results, use WASP-filtered BAM files to correct reference mapping bias. See the [mapping documentation](../user_guide/mapping.rst) for details." }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from pathlib import Path\n", "\n", "# Configure plotting\n", "plt.style.use('seaborn-v0_8-whitegrid')\n", "plt.rcParams['figure.figsize'] = (10, 6)\n", "plt.rcParams['font.size'] = 11\n", "\n", "# Define paths (modify these for your data)\n", "DATA_DIR = Path(\"data\")\n", "RESULTS_DIR = Path(\"results\")\n", "RESULTS_DIR.mkdir(exist_ok=True)\n", "\n", "# Input files\n", "BAM_FILE = DATA_DIR / \"atac_sample.bam\"\n", "VCF_FILE = DATA_DIR / \"phased_variants.vcf.gz\"\n", "PEAKS_FILE = DATA_DIR / \"atac_peaks.narrowPeak\"\n", "SAMPLE_ID = \"SAMPLE1\" # Sample name in VCF\n", "\n", "print(\"WASP2 ATAC-seq Tutorial\")\n", "print(\"=\" * 40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 1: Data Loading and Preparation\n", "\n", "### 1.1 Inspect Peak File Format\n", "\n", "ATAC-seq peaks from MACS2 are typically in **narrowPeak** format (BED6+4). WASP2 accepts both BED and narrowPeak formats." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load and inspect peaks\n", "peak_columns = ['chr', 'start', 'end', 'name', 'score', 'strand', \n", " 'signalValue', 'pValue', 'qValue', 'peak']\n", "peaks_df = pd.read_csv(PEAKS_FILE, sep='\\t', header=None, names=peak_columns)\n", "\n", "print(f\"Total peaks: {len(peaks_df):,}\")\n", "print(f\"\\nPeak size distribution:\")\n", "peaks_df['size'] = peaks_df['end'] - peaks_df['start']\n", "print(peaks_df['size'].describe())\n", "\n", "print(f\"\\nFirst 5 peaks:\")\n", "peaks_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Verify BAM and VCF Files\n", "\n", "Check that your input files are properly formatted and indexed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$SAMPLE_ID\"\n", "BAM_FILE=$1\n", "VCF_FILE=$2\n", "SAMPLE_ID=$3\n", "\n", "echo \"=== BAM File Check ===\"\n", "echo \"File: $BAM_FILE\"\n", "samtools view -H \"$BAM_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", "\n", "echo \"\"\n", "echo \"=== VCF File Check ===\"\n", "echo \"File: $VCF_FILE\"\n", "echo \"Checking for sample: $SAMPLE_ID\"\n", "bcftools query -l \"$VCF_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", "\n", "echo \"\"\n", "echo \"=== Index Check ===\"\n", "ls -la \"${BAM_FILE}.bai\" 2>/dev/null || echo \"BAM index (.bai): Using example paths\"\n", "ls -la \"${VCF_FILE}.tbi\" 2>/dev/null || echo \"VCF index (.tbi): Using example paths\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 2: Allele Counting at Peaks\n", "\n", "WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The `--region` parameter restricts counting to SNPs overlapping your peaks.\n", "\n", "### 2.1 Run Allele Counting\n", "\n", "**Key Parameters:**\n", "- `--region`: Peak file to restrict SNPs to accessible regions\n", "- `--samples`: Sample ID for genotype filtering (heterozygous sites only)\n", "- `--out_file`: Output path for count results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define output file\n", "COUNTS_FILE = RESULTS_DIR / \"atac_allele_counts.tsv\"\n", "\n", "# Build the command\n", "count_cmd = f\"\"\"\n", "wasp2-count count-variants \\\\\n", " {BAM_FILE} \\\\\n", " {VCF_FILE} \\\\\n", " --region {PEAKS_FILE} \\\\\n", " --samples {SAMPLE_ID} \\\\\n", " --out_file {COUNTS_FILE}\n", "\"\"\"\n", "\n", "print(\"Command to run:\")\n", "print(count_cmd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$PEAKS_FILE\" \"$SAMPLE_ID\" \"$COUNTS_FILE\"\n", "# Uncomment to run (requires actual data files)\n", "# wasp2-count count-variants \\\n", "# \"$1\" \\\n", "# \"$2\" \\\n", "# --region \"$3\" \\\n", "# --samples \"$4\" \\\n", "# --out_file \"$5\"\n", "\n", "echo \"Note: Uncomment the command above to run with your data\"\n", "echo \"For this tutorial, we'll use simulated example output.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Inspect Count Results\n", "\n", "The output contains per-SNP allele counts with peak annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For demonstration, create example data\n", "# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\\t') for real data)\n", "\n", "np.random.seed(42)\n", "n_snps = 5000\n", "\n", "# Simulate realistic ATAC-seq allele counts\n", "counts_df = pd.DataFrame({\n", " 'chr': np.random.choice(['chr1', 'chr2', 'chr3', 'chr4', 'chr5'], n_snps),\n", " 'pos': np.random.randint(1e6, 2e8, n_snps),\n", " 'ref': np.random.choice(['A', 'C', 'G', 'T'], n_snps),\n", " 'alt': np.random.choice(['A', 'C', 'G', 'T'], n_snps),\n", " 'region_id': [f'peak_{i}' for i in np.random.randint(0, 1500, n_snps)],\n", "})\n", "\n", "# Generate counts with some true imbalanced regions\n", "total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5\n", "imbalance_prob = np.where(\n", " np.random.random(n_snps) < 0.1, # 10% truly imbalanced\n", " np.random.choice([0.3, 0.7], n_snps), # Imbalanced allele freq\n", " 0.5 # Balanced\n", ")\n", "counts_df['ref_count'] = np.random.binomial(total_depth, imbalance_prob)\n", "counts_df['alt_count'] = total_depth - counts_df['ref_count']\n", "counts_df['other_count'] = 0\n", "\n", "print(f\"Total SNPs counted: {len(counts_df):,}\")\n", "print(f\"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}\")\n", "print(f\"\\nCount statistics:\")\n", "print(counts_df[['ref_count', 'alt_count']].describe())\n", "counts_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Quality Control: Count Distribution\n", "\n", "ATAC-seq typically has **lower coverage per peak** than RNA-seq genes. Check the distribution to set appropriate filtering thresholds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate total counts per SNP\n", "counts_df['total'] = counts_df['ref_count'] + counts_df['alt_count']\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# Total count distribution\n", "ax = axes[0]\n", "ax.hist(counts_df['total'], bins=50, edgecolor='black', alpha=0.7)\n", "ax.axvline(10, color='red', linestyle='--', label='min_count=10')\n", "ax.axvline(5, color='orange', linestyle='--', label='min_count=5')\n", "ax.set_xlabel('Total Read Count per SNP')\n", "ax.set_ylabel('Number of SNPs')\n", "ax.set_title('Read Depth Distribution')\n", "ax.legend()\n", "ax.set_xlim(0, 100)\n", "\n", "# Allele ratio distribution\n", "ax = axes[1]\n", "ratio = counts_df['ref_count'] / counts_df['total']\n", "ax.hist(ratio[counts_df['total'] >= 10], bins=50, edgecolor='black', alpha=0.7)\n", "ax.axvline(0.5, color='red', linestyle='--', label='Expected (0.5)')\n", "ax.set_xlabel('Reference Allele Frequency')\n", "ax.set_ylabel('Number of SNPs')\n", "ax.set_title('Allele Ratio Distribution (depth ≥10)')\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / 'count_qc.png', dpi=150)\n", "plt.show()\n", "\n", "# Summary statistics\n", "print(f\"\\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100*(counts_df['total'] >= 5).mean():.1f}%)\")\n", "print(f\"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100*(counts_df['total'] >= 10).mean():.1f}%)\")\n", "print(f\"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100*(counts_df['total'] >= 20).mean():.1f}%)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 3: Statistical Testing (Beta-Binomial)\n", "\n", "WASP2's `find-imbalance` command uses a **beta-binomial model** to test for allelic imbalance:\n", "\n", "- **Null hypothesis (H₀):** Reference allele frequency = 0.5 (balanced)\n", "- **Alternative (H₁):** Reference allele frequency ≠ 0.5 (imbalanced)\n", "\n", "The beta-binomial distribution accounts for **overdispersion** - the extra variability beyond binomial sampling that's common in sequencing data.\n", "\n", "### 3.1 Run Imbalance Detection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Define output file\nIMBALANCE_FILE = RESULTS_DIR / \"atac_imbalance_results.tsv\"\n\n# Build the command\n# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)\n# The --model single option uses a single dispersion parameter for all regions\nanalysis_cmd = f\"\"\"\nwasp2-analyze find-imbalance \\\\\n {COUNTS_FILE} \\\\\n --min 5 \\\\\n --model single \\\\\n --output {IMBALANCE_FILE}\n\"\"\"\n\nprint(\"Command to run:\")\nprint(analysis_cmd)" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "%%bash -s \"$COUNTS_FILE\" \"$IMBALANCE_FILE\"\n# Uncomment to run (requires actual count file)\n# wasp2-analyze find-imbalance \\\n# \"$1\" \\\n# --min 5 \\\n# --model single \\\n# --output \"$2\"\n\necho \"Note: Uncomment the command above to run with your data\"" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Simulate Results for Demonstration\n", "\n", "For this tutorial, we simulate realistic analysis results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Aggregate counts by peak (region)\npeak_counts = counts_df.groupby('region_id').agg({\n 'chr': 'first',\n 'pos': ['min', 'max'],\n 'ref_count': 'sum',\n 'alt_count': 'sum',\n}).reset_index()\npeak_counts.columns = ['region', 'chr', 'start', 'end', 'ref_count', 'alt_count']\n\n# Calculate statistics\npeak_counts['total'] = peak_counts['ref_count'] + peak_counts['alt_count']\npeak_counts['mu'] = peak_counts['ref_count'] / peak_counts['total']\n# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions\npeak_counts['effect_size'] = np.log2(\n (peak_counts['ref_count'] + 1) / (peak_counts['alt_count'] + 1)\n)\n\n# Simulate p-values (truly imbalanced peaks get low p-values)\n# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits\n# overdispersion, which is why WASP2 uses the beta-binomial model.\nnp.random.seed(42)\nis_imbalanced = np.abs(peak_counts['mu'] - 0.5) > 0.15\npeak_counts['p_value'] = np.where(\n is_imbalanced,\n 10 ** (-np.random.uniform(2, 10, len(peak_counts))), # Significant\n np.random.uniform(0.05, 1, len(peak_counts)) # Not significant\n)\n\n# FDR correction (Benjamini-Hochberg)\n# Note: This manual BH implementation is for demonstration.\n# WASP2 internally uses scipy.stats.false_discovery_control()\nfrom scipy import stats\npeak_counts = peak_counts.sort_values('p_value')\nn_tests = len(peak_counts)\npeak_counts['rank'] = range(1, n_tests + 1)\npeak_counts['fdr_pval'] = np.minimum(\n peak_counts['p_value'] * n_tests / peak_counts['rank'],\n 1.0\n)\npeak_counts['fdr_pval'] = peak_counts['fdr_pval'][::-1].cummin()[::-1]\n\n# Filter to testable peaks\nresults_df = peak_counts[peak_counts['total'] >= 5].copy()\nresults_df = results_df.drop('rank', axis=1)\n\nprint(f\"Peaks tested: {len(results_df):,}\")\nprint(f\"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}\")\nprint(f\"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}\")\nresults_df.head(10)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 4: Result Interpretation and Visualization\n", "\n", "### 4.1 Volcano Plot\n", "\n", "The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 8))\n", "\n", "# Calculate -log10(p-value) for plotting\n", "results_df['neg_log10_pval'] = -np.log10(results_df['p_value'].clip(lower=1e-50))\n", "\n", "# Define significance thresholds\n", "sig_mask = results_df['fdr_pval'] < 0.05\n", "effect_mask = np.abs(results_df['effect_size']) > 0.5 # |log2FC| > 0.5\n", "\n", "# Plot non-significant points\n", "ns = ~sig_mask\n", "ax.scatter(results_df.loc[ns, 'effect_size'], \n", " results_df.loc[ns, 'neg_log10_pval'],\n", " c='lightgray', s=15, alpha=0.6, label=f'Not significant (n={ns.sum():,})')\n", "\n", "# Plot significant but small effect\n", "sig_small = sig_mask & ~effect_mask\n", "ax.scatter(results_df.loc[sig_small, 'effect_size'],\n", " results_df.loc[sig_small, 'neg_log10_pval'],\n", " c='steelblue', s=25, alpha=0.7, label=f'FDR<0.05, small effect (n={sig_small.sum():,})')\n", "\n", "# Plot significant and large effect\n", "sig_large = sig_mask & effect_mask\n", "ax.scatter(results_df.loc[sig_large, 'effect_size'],\n", " results_df.loc[sig_large, 'neg_log10_pval'],\n", " c='firebrick', s=40, alpha=0.8, label=f'FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})')\n", "\n", "# Add threshold lines\n", "ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.3, linewidth=1)\n", "ax.axvline(0.5, color='gray', linestyle=':', alpha=0.5)\n", "ax.axvline(-0.5, color='gray', linestyle=':', alpha=0.5)\n", "ax.axvline(0, color='black', linestyle='-', alpha=0.2)\n", "\n", "ax.set_xlabel('Effect Size (log₂ Ref/Alt)', fontsize=12)\n", "ax.set_ylabel('-log₁₀(p-value)', fontsize=12)\n", "ax.set_title('ATAC-seq Allelic Imbalance\\nVolcano Plot', fontsize=14)\n", "ax.legend(loc='upper right', fontsize=9)\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / 'volcano_plot.png', dpi=150)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Effect Size Distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# All peaks\n", "ax = axes[0]\n", "ax.hist(results_df['effect_size'], bins=50, edgecolor='black', alpha=0.7, color='steelblue')\n", "ax.axvline(0, color='red', linestyle='--', linewidth=2)\n", "ax.set_xlabel('Effect Size (log₂ Ref/Alt)')\n", "ax.set_ylabel('Number of Peaks')\n", "ax.set_title('All Tested Peaks')\n", "\n", "# Significant peaks only\n", "ax = axes[1]\n", "sig_effects = results_df.loc[sig_mask, 'effect_size']\n", "ax.hist(sig_effects, bins=30, edgecolor='black', alpha=0.7, color='firebrick')\n", "ax.axvline(0, color='black', linestyle='--', linewidth=2)\n", "ax.set_xlabel('Effect Size (log₂ Ref/Alt)')\n", "ax.set_ylabel('Number of Peaks')\n", "ax.set_title(f'Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})')\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / 'effect_size_distribution.png', dpi=150)\n", "plt.show()\n", "\n", "# Summary statistics\n", "print(\"Effect size statistics (significant peaks):\")\n", "print(f\" Mean: {sig_effects.mean():.3f}\")\n", "print(f\" Median: {sig_effects.median():.3f}\")\n", "print(f\" Std: {sig_effects.std():.3f}\")\n", "print(f\" Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}\")\n", "print(f\" Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Top Imbalanced Peaks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get top hits by significance\n", "top_hits = results_df[results_df['fdr_pval'] < 0.05].nsmallest(20, 'fdr_pval')\n", "\n", "print(\"Top 20 Peaks with Allelic Imbalance\")\n", "print(\"=\" * 80)\n", "display_cols = ['region', 'chr', 'ref_count', 'alt_count', 'mu', 'effect_size', 'p_value', 'fdr_pval']\n", "top_hits[display_cols].round(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 5: QTL Overlap Analysis\n", "\n", "Peaks with allelic imbalance often harbor **chromatin accessibility QTLs (caQTLs)** or overlap with **expression QTLs (eQTLs)**. Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.\n", "\n", "### 5.1 Prepare BED File for Overlap Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export significant peaks as BED for overlap analysis\n", "sig_peaks = results_df[results_df['fdr_pval'] < 0.05][['chr', 'start', 'end', 'region', 'effect_size', 'fdr_pval']].copy()\n", "sig_peaks.to_csv(RESULTS_DIR / 'significant_peaks.bed', sep='\\t', header=False, index=False)\n", "\n", "print(f\"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}\")\n", "print(\"\\nUse this file for overlap analysis with:\")\n", "print(\" - GTEx eQTLs (https://gtexportal.org)\")\n", "print(\" - ENCODE cCREs (https://www.encodeproject.org)\")\n", "print(\" - Published caQTL datasets\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 Example: GTEx eQTL Overlap\n", "\n", "This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example overlap analysis with bedtools (requires eQTL BED file)\n", "overlap_cmd = \"\"\"\n", "# Download GTEx eQTLs for your tissue of interest\n", "# Example: Brain cortex significant eQTLs\n", "\n", "# Intersect imbalanced peaks with eQTL positions\n", "bedtools intersect \\\\\n", " -a results/significant_peaks.bed \\\\\n", " -b gtex_brain_cortex_eqtls.bed \\\\\n", " -wa -wb \\\\\n", " > results/peak_eqtl_overlap.bed\n", "\n", "# Count overlaps\n", "wc -l results/peak_eqtl_overlap.bed\n", "\"\"\"\n", "\n", "print(\"Example bedtools command for eQTL overlap:\")\n", "print(overlap_cmd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simulate overlap results for demonstration\n", "np.random.seed(123)\n", "\n", "# Create simulated eQTL overlap data\n", "n_overlap = 150\n", "overlap_df = pd.DataFrame({\n", " 'peak': [f'peak_{i}' for i in np.random.choice(range(1500), n_overlap, replace=False)],\n", " 'eqtl_gene': [f'GENE{i}' for i in np.random.randint(1, 500, n_overlap)],\n", " 'eqtl_pval': 10 ** (-np.random.uniform(3, 15, n_overlap)),\n", " 'tissue': np.random.choice(['Brain_Cortex', 'Brain_Hippocampus', 'Liver', 'Heart'], n_overlap)\n", "})\n", "\n", "print(f\"Peaks overlapping eQTLs: {len(overlap_df)}\")\n", "print(f\"\\nOverlap by tissue:\")\n", "print(overlap_df['tissue'].value_counts())\n", "\n", "# Enrichment analysis\n", "n_sig_peaks = sig_mask.sum()\n", "n_total_peaks = len(results_df)\n", "n_eqtl_overlap = len(overlap_df)\n", "\n", "# Fisher's exact test for enrichment\n", "from scipy.stats import fisher_exact\n", "\n", "# Assume 10% of all peaks overlap eQTLs by chance\n", "expected_overlap = int(n_sig_peaks * 0.10)\n", "contingency = [\n", " [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],\n", " [expected_overlap, n_sig_peaks - expected_overlap]\n", "]\n", "odds_ratio, p_value = fisher_exact(contingency)\n", "\n", "print(f\"\\nEnrichment Analysis:\")\n", "print(f\" Imbalanced peaks: {n_sig_peaks}\")\n", "print(f\" Overlapping eQTLs: {n_eqtl_overlap}\")\n", "print(f\" Expected by chance: ~{expected_overlap}\")\n", "print(f\" Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 Visualization: eQTL Overlap" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# Pie chart: Overlap proportion\n", "ax = axes[0]\n", "overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]\n", "labels = ['Overlap with eQTL', 'No eQTL overlap']\n", "colors = ['#e74c3c', '#95a5a6']\n", "ax.pie(overlap_counts, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)\n", "ax.set_title('Imbalanced Peaks Overlapping eQTLs')\n", "\n", "# Bar chart: Overlap by tissue\n", "ax = axes[1]\n", "tissue_counts = overlap_df['tissue'].value_counts()\n", "colors = plt.cm.Set2(range(len(tissue_counts)))\n", "bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor='black')\n", "ax.set_xlabel('Tissue')\n", "ax.set_ylabel('Number of Overlapping eQTLs')\n", "ax.set_title('eQTL Overlap by Tissue')\n", "ax.tick_params(axis='x', rotation=45)\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / 'eqtl_overlap.png', dpi=150)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 6: Downstream Analysis Hints\n", "\n", "### 6.1 Motif Enrichment Analysis\n", "\n", "Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "motif_cmd = \"\"\"\n", "# Extract sequences around imbalanced SNPs\n", "bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\\\\n", "bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa\n", "\n", "# Run HOMER motif analysis\n", "findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200\n", "\n", "# Alternative: MEME-ChIP\n", "meme-chip -oc meme_results imbalanced_seqs.fa\n", "\"\"\"\n", "\n", "print(\"Example commands for motif enrichment analysis:\")\n", "print(motif_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 Gene Ontology Enrichment\n", "\n", "Identify biological processes associated with genes near imbalanced peaks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "go_cmd = \"\"\"\n", "# Annotate peaks with nearest genes using GREAT or bedtools\n", "bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed\n", "\n", "# Extract gene list\n", "cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt\n", "\n", "# Use DAVID, Enrichr, or clusterProfiler for GO enrichment\n", "# Web interface: https://david.ncifcrf.gov/\n", "# Web interface: https://maayanlab.cloud/Enrichr/\n", "\"\"\"\n", "\n", "print(\"Example commands for GO enrichment analysis:\")\n", "print(go_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3 Single-Cell ATAC-seq Extension\n", "\n", "For single-cell ATAC-seq (scATAC-seq), use WASP2's single-cell workflow to detect cell-type-specific allelic imbalance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc_cmd = \"\"\"\n", "# Count alleles in single-cell ATAC-seq\n", "wasp2-count count-variants-sc \\\\\n", " scatac_possorted.bam \\\\\n", " variants.vcf.gz \\\\\n", " barcodes.tsv \\\\\n", " --samples SAMPLE_ID \\\\\n", " --feature peaks.bed \\\\\n", " --out_file scatac_counts.h5ad\n", "\n", "# Detect imbalance per cell type\n", "wasp2-analyze find-imbalance-sc \\\\\n", " scatac_counts.h5ad \\\\\n", " barcode_celltype_map.tsv \\\\\n", " --sample SAMPLE_ID \\\\\n", " --min 5 \\\\\n", " --phased\n", "\n", "# Compare imbalance between cell types\n", "wasp2-analyze compare-imbalance \\\\\n", " scatac_counts.h5ad \\\\\n", " barcode_celltype_map.tsv \\\\\n", " --groups \"excitatory,inhibitory\" \\\\\n", " --sample SAMPLE_ID\n", "\"\"\"\n", "\n", "print(\"Commands for single-cell ATAC-seq analysis:\")\n", "print(sc_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this tutorial, you learned to:\n", "\n", "1. **Prepare ATAC-seq data** - Load peaks and verify input file formats\n", "2. **Count alleles** - Use `wasp2-count count-variants` with peak regions\n", "3. **Detect imbalance** - Apply beta-binomial testing with `wasp2-analyze find-imbalance`\n", "4. **Visualize results** - Create volcano plots and effect size distributions\n", "5. **Integrate with QTLs** - Overlap with eQTL databases for biological validation\n", "\n", "### Key Takeaways\n", "\n", "- ATAC-seq has **lower coverage per peak** than RNA-seq; use `--min-count 5` instead of 10\n", "- **FDR correction** is essential for multiple testing across thousands of peaks\n", "- Consider **effect size** alongside significance for biological relevance\n", "- **QTL overlap** helps validate findings and identify causal variants\n", "\n", "### Next Steps\n", "\n", "- [Comparative Imbalance Tutorial](./comparative_imbalance.rst) - Compare imbalance between conditions\n", "- [Single-Cell Tutorial](./scrna_seq.rst) - Cell-type-specific analysis\n", "- [Statistical Methods](../methods/statistical_models.rst) - Deep dive into the beta-binomial model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "### Common Issues\n", "\n", "**Low SNP counts in peaks:**\n", "- Ensure VCF contains heterozygous variants for your sample\n", "- Check that peak coordinates use the same reference genome as VCF\n", "- Verify `--samples` matches the sample name in VCF header\n", "\n", "**Memory errors with large datasets:**\n", "- Process chromosomes separately with `--region chr1_peaks.bed`, etc.\n", "- Use `WASP2_RUST_THREADS=4` to limit parallel processing\n", "\n", "**No significant results:**\n", "- Check read depth (may need deeper sequencing)\n", "- Verify WASP filtering was applied to remove mapping bias\n", "- Consider lowering `--min-count` threshold (with caution)\n", "\n", "### Diagnostic Commands\n", "\n", "```bash\n", "# Check VCF sample names\n", "bcftools query -l variants.vcf.gz\n", "\n", "# Count heterozygous SNPs in your sample\n", "bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l\n", "\n", "# Check BAM read depth at a peak\n", "samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save final results\n", "results_df.to_csv(RESULTS_DIR / 'final_imbalance_results.tsv', sep='\\t', index=False)\n", "print(f\"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}\")\n", "print(f\"\\nAnalysis complete!\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }