10X scRNA-seq Tutorial#
This tutorial walks through a complete WASP2 workflow for detecting allele-specific expression in 10X Genomics single-cell RNA-seq data.
Overview#
Goal: Identify genes with allele-specific expression (ASE) in different cell types from 10X Chromium scRNA-seq data.
Input Data:
Cell Ranger output (BAM + filtered barcodes)
Phased VCF file with heterozygous variants
Cell type annotations from Seurat/Scanpy
Prerequisites#
Software:
WASP2 (
pip install wasp2)Cell Ranger output (v3+)
R with Seurat or Python with Scanpy
Data:
Aligned BAM file with cell barcodes (CB tag)
Phased VCF for your sample
Completed cell type annotation
Step 1: Prepare Input Data#
Cell Ranger Output Structure#
After running Cell Ranger, your output directory contains:
cellranger_output/
└── outs/
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
└── filtered_feature_bc_matrix/
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
The BAM file contains cell barcodes in the CB tag:
# View CB tags in BAM
samtools view possorted_genome_bam.bam | head -1 | tr '\t' '\n' | grep CB
# Output: CB:Z:AAACCCAAGAAACACT-1
Step 2: Generate Barcode File#
From Seurat Analysis#
After running Seurat clustering and annotation:
library(Seurat)
# Load your analyzed Seurat object
seurat_obj <- readRDS("seurat_analyzed.rds")
# Check available metadata columns
head(seurat_obj@meta.data)
# Extract barcodes and cell types
barcode_df <- data.frame(
barcode = colnames(seurat_obj),
cell_type = seurat_obj$celltype_annotation # Your annotation column
)
# Preview the data
head(barcode_df)
#> barcode cell_type
#> 1 AAACCCAAGAAACACT-1 B_cell
#> 2 AAACCCAAGAAACTGT-1 B_cell
#> 3 AAACCCAAGAAAGCGA-1 CD4_T_cell
# Write barcode file (no header, tab-separated)
write.table(
barcode_df,
file = "barcodes_celltype.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE,
col.names = FALSE
)
From Scanpy Analysis#
After running Scanpy clustering and annotation:
import scanpy as sc
import pandas as pd
# Load your analyzed AnnData object
adata = sc.read_h5ad("scanpy_analyzed.h5ad")
# Check available annotations
print(adata.obs.columns)
# Extract barcodes and cell types
barcode_df = pd.DataFrame({
'barcode': adata.obs_names,
'cell_type': adata.obs['leiden_annotation'] # Your annotation column
})
# Preview the data
print(barcode_df.head())
# barcode cell_type
# 0 AAACCCAAGAAACACT-1 B_cell
# 1 AAACCCAAGAAACTGT-1 B_cell
# 2 AAACCCAAGAAAGCGA-1 CD4_T_cell
# Write barcode file (no header, tab-separated)
barcode_df.to_csv(
'barcodes_celltype.tsv',
sep='\t',
header=False,
index=False
)
Verify Barcode Format#
# Check barcode file format
head barcodes_celltype.tsv
# AAACCCAAGAAACACT-1 B_cell
# AAACCCAAGAAACTGT-1 B_cell
# Count cells per type
cut -f2 barcodes_celltype.tsv | sort | uniq -c | sort -rn
# 1500 CD4_T_cell
# 1200 B_cell
# 800 Monocyte
# ...
Step 3: Count Allele-Specific Reads#
Run the single-cell allele counting:
wasp2-count count-variants-sc \
cellranger_output/outs/possorted_genome_bam.bam \
phased_variants.vcf.gz \
barcodes_celltype.tsv \
--region genes.gtf \
--samples SAMPLE_ID \
--out_file allele_counts.h5ad
Parameters:
barcodes_celltype.tsv: Your barcode file with cell type annotations (positional)--region: Gene annotation file (GTF/GFF) or peak file (BED)--samples: Sample ID matching VCF sample column--out_file: Output AnnData file
Step 4: Analyze Allelic Imbalance#
Cell-Type-Specific Analysis#
Analyze imbalance within each cell type:
wasp2-analyze find-imbalance-sc \
allele_counts.h5ad \
barcodes_celltype.tsv \
--sample SAMPLE_ID \
--out_file imbalance_by_celltype.tsv
Output columns:
region: Gene or genomic regioncell_type: Cell type from barcode fileref_count: Total reference allele countsalt_count: Total alternate allele countsp_value: Statistical significancefdr_pval: FDR-corrected p-valueeffect_size: Log2 fold change
Compare Between Cell Types#
Find differential allelic imbalance between cell types:
wasp2-analyze compare-imbalance \
allele_counts.h5ad \
barcodes_celltype.tsv \
--groups "CD4_T_cell,CD8_T_cell" \
--out_file differential_imbalance.tsv
Step 5: Interpret Results#
Load and explore results in Python:
import pandas as pd
import matplotlib.pyplot as plt
# Load results
results = pd.read_csv('imbalance_by_celltype.tsv', sep='\t')
# Filter significant results (FDR < 0.05)
significant = results[results['fdr_pval'] < 0.05]
print(f"Found {len(significant)} significant ASE events")
# Top genes per cell type
top_genes = (significant
.groupby('cell_type')
.apply(lambda x: x.nsmallest(10, 'fdr_pval'))
.reset_index(drop=True))
print(top_genes[['region', 'cell_type', 'effect_size', 'fdr_pval']])
# Visualize effect sizes
fig, ax = plt.subplots(figsize=(10, 6))
significant.boxplot(column='effect_size', by='cell_type', ax=ax)
plt.title('Allelic Imbalance by Cell Type')
plt.ylabel('Log2 Fold Change (Ref/Alt)')
plt.savefig('ase_by_celltype.png')
Example Output#
region cell_type ref_count alt_count fdr_pval effect_size
ENSG00000123456 B_cell 245 89 0.001 1.46
ENSG00000234567 CD4_T_cell 156 312 0.003 -1.00
ENSG00000345678 Monocyte 423 198 0.012 1.09
Troubleshooting#
No Cells Matched#
If you see “0 barcodes matched”:
# Check BAM barcode format
samtools view your.bam | head -1000 | grep -o 'CB:Z:[^\t]*' | head
# Compare with your barcode file
head barcodes.tsv
# Common issues:
# - Missing -1 suffix in barcode file
# - Barcode file has header (should not)
# - Different barcode versions (v2 vs v3)
Quick Diagnostic:
# Compare BAM barcodes with file
samtools view your.bam | head -10000 | grep -o 'CB:Z:[^\t]*' | cut -d: -f3 | sort -u > bam_bc.txt
cut -f1 barcodes.tsv | sort -u > file_bc.txt
comm -12 bam_bc.txt file_bc.txt | wc -l # Should be > 0
# Fix suffix mismatch if needed
awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv
Low Read Counts#
Single-cell data is sparse. Consider:
Using pseudo-bulk aggregation by cell type
Lowering
--min-countthresholdFocusing on highly expressed genes
Memory Issues#
For large datasets, process chromosomes separately by filtering your region file:
# Process autosomes separately (add chrX, chrY if needed)
for chr in chr{1..22}; do
# Extract regions for this chromosome
grep "^${chr}\t" genes.bed > genes_${chr}.bed
wasp2-count count-variants-sc \
sample.bam \
variants.vcf.gz \
barcodes.tsv \
--region genes_${chr}.bed \
--out_file counts_${chr}.h5ad
done
Next Steps#
Integrate with eQTL databases (GTEx, eQTLGen)
Correlate ASE with gene expression levels
Validate top hits with allele-specific primers
Compare across conditions or timepoints
See Also#
Single-Cell Analysis - Barcode file format reference
Analysis Module - Statistical methods
Seurat - R toolkit for scRNA-seq
Scanpy - Python toolkit for scRNA-seq