Single-Cell Analysis#
Overview#
WASP2 provides specialized tools for allele-specific analysis in single-cell RNA-seq (scRNA-seq) data. This guide covers the barcode file format requirements and single-cell-specific workflows.
Barcode File Format#
WASP2 uses a two-column TSV (tab-separated values) format for barcode files. This format maps cell barcodes to cell type annotations.
Format Specification#
BARCODE<TAB>CELLTYPE
Requirements:
No header row
Tab-separated (\t) delimiter
Column 1: Cell barcode (string)
Column 2: Cell type annotation (string)
Example:
CACCCAAGTGAGTTGG-1 Oligodendrocytes
GCTTAAGCCGCGGCAT-1 Oligodendrocytes
GTCACGGGTGGCCTAG-1 Endothelial
AACCATGGTCACCTAA-1 Microglia
TGAGCCGAGAAACGCC-1 Astrocytes
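The format above is simple enough to read with the standard library. The following is a minimal sketch, not WASP2's own parser; the function name `read_barcode_map` is hypothetical, and it assumes a headerless file with at least two tab-separated columns.

```python
import csv
import io


def read_barcode_map(handle):
    """Return {barcode: cell_type} from a headerless two-column TSV stream."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        if len(row) < 2:
            raise ValueError(f"expected BARCODE<TAB>CELLTYPE, got: {row!r}")
        mapping[row[0]] = row[1]
    return mapping


# In-memory demo; for a real file pass open("barcodes.tsv", newline="").
example = io.StringIO(
    "CACCCAAGTGAGTTGG-1\tOligodendrocytes\n"
    "GTCACGGGTGGCCTAG-1\tEndothelial\n"
)
cell_types = read_barcode_map(example)
print(cell_types["GTCACGGGTGGCCTAG-1"])  # Endothelial
```

Raising on a single-field row makes space-delimited files fail early instead of silently producing an empty cell type.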
10X Genomics Barcodes#
10X Chromium barcodes follow a specific format:
16 nucleotides followed by a -N suffix (e.g., CACCCAAGTGAGTTGG-1)
The suffix indicates the GEM well (-1 for a single sample; -2, -3, etc. for aggregated samples)
Barcodes come from the 10X whitelist (~6.8 million for v3 chemistry, ~737,000 for v2)
Chemistry Versions:
| Chemistry | Barcode Length | Notes |
|---|---|---|
| 10X v2 | 16 bp | ~737,000 valid barcodes, older whitelist |
| 10X v3/v3.1 | 16 bp | ~6.8 million valid barcodes, improved capture |
| 10X Multiome | 16 bp | Same 16 bp format as v3; paired ATAC+GEX with a dedicated Multiome whitelist |
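The barcode structure described above (16 nucleotides plus a numeric GEM well suffix) can be split apart with a short regular expression. This is an illustrative sketch; `split_10x_barcode` is a hypothetical helper, not part of WASP2, and it does not check membership in any whitelist.

```python
import re

# 16 nucleotides, a hyphen, then the numeric GEM well suffix.
BARCODE_RE = re.compile(r"([ACGT]{16})-(\d+)")


def split_10x_barcode(barcode):
    """Return (sequence, gem_well) or raise ValueError if the format is off."""
    match = BARCODE_RE.fullmatch(barcode)
    if match is None:
        raise ValueError(f"not a 10X-style barcode: {barcode!r}")
    return match.group(1), int(match.group(2))


print(split_10x_barcode("CACCCAAGTGAGTTGG-1"))  # ('CACCCAAGTGAGTTGG', 1)
```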
PBMC Example (10X v3):
AAACCCAAGAAACACT-1 B_cell
AAACCCAAGAAAGCGA-1 CD4_T_cell
AAACCCAAGAACAACT-1 CD8_T_cell
AAACCCAAGAACCAAG-1 Monocyte
AAACCCAAGAACGATA-1 NK_cell
Multi-Sample Aggregated Example:
When using Cell Ranger aggr to combine multiple samples, barcodes are distinguished by suffix:
AAACCCAAGAAACACT-1 B_cell sample1
AAACCCAAGAAACTGT-1 B_cell sample1
AAACCCAAGAAACACT-2 B_cell sample2
AAACCCAAGAAACTGT-2 B_cell sample2
AAACCCAAGAAACACT-3 CD4_T_cell sample3
Note
For multi-sample experiments, WASP2 uses only the first two columns (barcode, cell_type). The third column (sample origin) is optional metadata for your reference.
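To see what "only the first two columns" means in practice, the sketch below keeps the barcode and cell type from each three-column line and tallies cells per GEM well suffix. It is an illustration of the file layout only, under the assumption that every line has at least two tab-separated fields; the helper name `first_two_columns` is hypothetical.

```python
from collections import Counter

lines = [
    "AAACCCAAGAAACACT-1\tB_cell\tsample1",
    "AAACCCAAGAAACTGT-1\tB_cell\tsample1",
    "AAACCCAAGAAACACT-2\tB_cell\tsample2",
    "AAACCCAAGAAACACT-3\tCD4_T_cell\tsample3",
]


def first_two_columns(line):
    """Keep barcode and cell type; any extra columns are ignored."""
    barcode, cell_type = line.rstrip("\n").split("\t")[:2]
    return barcode, cell_type


pairs = [first_two_columns(line) for line in lines]
# rpartition splits on the last hyphen, so the final piece is the GEM well.
cells_per_well = Counter(barcode.rpartition("-")[2] for barcode, _ in pairs)
print(dict(cells_per_well))  # {'1': 2, '2': 1, '3': 1}
```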
Barcode Format Validation#
Before running WASP2, validate your barcode file format:
# Check file structure (should show TAB separator)
head -5 barcodes.tsv | cat -A
# Expected output (^I = TAB):
# AAACCCAAGAAACACT-1^IB_cell$
# Verify barcode format matches 10X pattern
head -1 barcodes.tsv | cut -f1 | grep -E '^[ACGT]{16}-[0-9]+$'
# Should return the barcode if valid
# Count barcodes per cell type
cut -f2 barcodes.tsv | sort | uniq -c | sort -rn
# Check for common issues
# 1. No header row (first line should be a barcode, not "barcode")
head -1 barcodes.tsv
# 2. Correct delimiter (TAB not space/comma): list lines with fewer than two TAB-separated fields
awk -F'\t' 'NF < 2 {print "Line " NR ": " $0}' barcodes.tsv | head # Should print nothing
Python Validation Script:
import re

def validate_10x_barcode_file(filepath):
    """Validate 10X scRNA-seq barcode file format."""
    pattern = re.compile(r'^[ACGT]{16}-\d+$')
    errors = []
    n_barcodes = 0
    with open(filepath) as f:
        for i, line in enumerate(f, 1):
            parts = line.rstrip('\r\n').split('\t')
            barcode = parts[0]
            # Check for empty lines (split() always returns at least one field)
            if not barcode.strip():
                errors.append(f"Line {i}: Empty line")
                continue
            # Check for header (common mistake)
            if i == 1 and barcode.lower() in ('barcode', 'cell_barcode', 'cb'):
                errors.append("Line 1: Appears to be a header row, remove it")
                continue
            # Check barcode format
            if not pattern.match(barcode):
                errors.append(f"Line {i}: Invalid barcode format '{barcode}'")
                continue
            n_barcodes += 1
    if errors:
        print(f"Found {len(errors)} errors:")
        for err in errors[:10]:  # Show first 10
            print(f"  {err}")
        return False
    print(f"Validation passed: {n_barcodes} barcodes")
    return True

# Usage
validate_10x_barcode_file('barcodes.tsv')
Cell Ranger Output#
When using Cell Ranger output, barcodes can be found in:
cellranger_output/
└── outs/
    └── filtered_feature_bc_matrix/
        └── barcodes.tsv.gz
This file contains only the barcode column. To create a WASP2-compatible barcode file, you need to add cell type annotations from your downstream analysis.
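One way to combine the Cell Ranger barcode list with annotations is sketched below. It assumes you already hold a `{barcode: cell_type}` dictionary from your own clustering; the function name `annotate_barcodes` and its arguments are hypothetical, not a WASP2 API.

```python
import gzip


def annotate_barcodes(barcodes_gz, annotations, out_path):
    """Write BARCODE<TAB>CELLTYPE for every barcode that has an annotation.

    `annotations` is a {barcode: cell_type} dict supplied by the caller;
    barcodes without an entry are skipped and counted.
    """
    written = skipped = 0
    with gzip.open(barcodes_gz, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            barcode = line.strip()
            if not barcode:
                continue
            cell_type = annotations.get(barcode)
            if cell_type is None:
                skipped += 1
                continue
            dst.write(f"{barcode}\t{cell_type}\n")
            written += 1
    return written, skipped


# Example call (paths are placeholders):
# annotate_barcodes("outs/filtered_feature_bc_matrix/barcodes.tsv.gz",
#                   {"AAACCCAAGAAACACT-1": "B_cell"}, "barcodes.tsv")
```

Returning the skipped count makes it easy to notice when most barcodes lack an annotation, which often points to a suffix mismatch.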
Generating Barcode Files#
From Seurat (R)#
After clustering and cell type annotation in Seurat:
# Assuming 'seurat_obj' has cell type labels in metadata
library(Seurat)
# Extract barcodes and cell types
barcode_df <- data.frame(
  barcode = colnames(seurat_obj),
  cell_type = seurat_obj$cell_type  # Your annotation column
)
# Write TSV without header
write.table(
  barcode_df,
  file = "barcodes.tsv",
  sep = "\t",
  quote = FALSE,
  row.names = FALSE,
  col.names = FALSE
)
From Scanpy (Python)#
After clustering and cell type annotation in Scanpy:
import pandas as pd
# Assuming 'adata' has cell type labels in obs
barcode_df = pd.DataFrame({
    'barcode': adata.obs_names,
    'cell_type': adata.obs['cell_type']  # Your annotation column
})
# Write TSV without header
barcode_df.to_csv(
    'barcodes.tsv',
    sep='\t',
    header=False,
    index=False
)
Simple Barcode List#
If you only need to filter by barcodes without cell type annotation, you can use a single-column file:
CACCCAAGTGAGTTGG-1
GCTTAAGCCGCGGCAT-1
GTCACGGGTGGCCTAG-1
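If your own scripts need to accept both layouts, a loader can treat the second column as optional. The sketch below is an assumption about how one might do this in helper code; it does not describe how WASP2 itself reads the file, and `load_barcodes` is a hypothetical name.

```python
def load_barcodes(lines):
    """Return (barcodes, cell_types) from one- or two-column lines.

    cell_types is None when no line carries a second column.
    """
    barcodes, cell_types = [], {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        barcodes.append(fields[0])
        if len(fields) > 1:
            cell_types[fields[0]] = fields[1]
    return barcodes, (cell_types or None)


only_barcodes, no_types = load_barcodes(["CACCCAAGTGAGTTGG-1\n", "GCTTAAGCCGCGGCAT-1\n"])
print(only_barcodes, no_types)
```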
Common Format Variations#
Cell Ranger Raw Barcodes:
# Extract filtered barcodes (single-column, add cell types later)
zcat cellranger_output/outs/filtered_feature_bc_matrix/barcodes.tsv.gz > barcodes_raw.txt
Barcode Suffix Handling:
Some tools strip the -1 suffix. Ensure that the barcodes in the BAM file and in the barcode file use the same format:
# Compare formats
samtools view sample.bam | head -1000 | grep -o 'CB:Z:[^[:space:]]*' | cut -d: -f3 | head
cut -f1 barcodes.tsv | head
# Add suffix if missing
awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv
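The same comparison can be automated once both barcode lists are in memory, for example from the two shell commands above. The sketch below is a hypothetical helper (`barcode_overlap` is not a WASP2 function) that reports exact matches and matches after stripping any -N suffix, which helps distinguish a pure suffix mismatch from unrelated barcode sets.

```python
def barcode_overlap(bam_barcodes, file_barcodes):
    """Summarize how well two barcode collections agree."""

    def strip_suffix(barcode):
        # "ACGT...-1" -> "ACGT..."; barcodes without a hyphen are unchanged.
        return barcode.rsplit("-", 1)[0]

    bam, listed = set(bam_barcodes), set(file_barcodes)
    exact = bam & listed
    loose = {strip_suffix(b) for b in bam} & {strip_suffix(b) for b in listed}
    return {
        "exact_matches": len(exact),
        "matches_ignoring_suffix": len(loose),
        "suffix_mismatch_likely": not exact and bool(loose),
    }


report = barcode_overlap(
    ["CACCCAAGTGAGTTGG-1", "GCTTAAGCCGCGGCAT-1"],
    ["CACCCAAGTGAGTTGG", "GCTTAAGCCGCGGCAT"],
)
print(report)
```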
Single-Cell CLI Usage#
Count Alleles#
wasp2-count count-variants-sc \
    sample.bam \
    variants.vcf.gz \
    barcodes.tsv \
    --region peaks.bed \
    --samples NA12878 \
    --out_file allele_counts.h5ad
Analyze Imbalance#
wasp2-analyze find-imbalance-sc \
    allele_counts.h5ad \
    barcodes.tsv \
    --sample NA12878 \
    --out_file imbalance_results.tsv
Output Format#
The single-cell counting module outputs an AnnData (.h5ad) file containing:
Layers:
X: Total allele counts (ref + alt + other)
ref: Reference allele counts
alt: Alternate allele counts
other: Other allele counts
Observations (obs):
SNP information (chrom, pos, ref, alt)
Aggregate counts per SNP
Variables (var):
Cell barcodes
Unstructured (uns):
Sample information
Count statistics
Feature-SNP mapping (if regions provided)
Best Practices#
Quality Filtering#
Filter low-quality cells before generating the barcode file
Remove doublets and dead cells
Use cells with sufficient UMI counts (>1000 for most protocols)
Cell Type Annotation#
Use consistent cell type naming (no spaces or special characters)
Consider hierarchical annotations (e.g., T_cell, CD4_T_cell)
Document your annotation sources and markers
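Labels exported from annotation tools often contain spaces or symbols. A small normalizer such as the sketch below can enforce the naming advice above before the barcode file is written. The choice of allowed characters (letters, digits, underscore, dot) is an assumption made for illustration, and `sanitize_cell_type` is a hypothetical helper.

```python
import re


def sanitize_cell_type(label):
    """Replace runs of characters other than letters, digits, '_' or '.' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_.]+", "_", label.strip())
    return cleaned.strip("_")


print(sanitize_cell_type("CD4+ T cell"))       # CD4_T_cell
print(sanitize_cell_type("T_cell.CD4.Naive"))  # T_cell.CD4.Naive
```

Distinct labels can collapse to the same name after cleaning (for example "CD4+" and "CD4-"), so it is worth checking that the number of unique labels is unchanged.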
Barcode Matching#
Ensure barcodes match exactly (including the -1 suffix)
Verify that the barcode format matches the BAM file CB tags
Check for barcode format differences between tools
Example Files#
WASP2 includes example barcode files in the tests/data/ directory:
barcodes_10x_scrna.tsv - Standard PBMC cell types (B_cell, CD4_T_cell, etc.)
barcodes_example.tsv - Brain tissue cell types (Neurons, Astrocytes, etc.)
barcodes_10x_multi_sample.tsv - Multi-sample aggregated experiment with -1, -2, -3 suffixes
barcodes_10x_hierarchical.tsv - Hierarchical cell type naming (T_cell.CD4.Naive, etc.)
These files can be used as templates or for testing your WASP2 installation.
Comparative Analysis#
After detecting allelic imbalance within individual cell populations, you can compare imbalance between groups to identify cell-type-specific or condition-dependent regulatory variation.
Quick example:
# Compare imbalance between two cell types
wasp2-analyze compare-imbalance \
    allele_counts.h5ad \
    barcode_celltype_map.tsv \
    --groups "excitatory_neurons,inhibitory_neurons" \
    --sample SAMPLE_ID \
    --phased
This identifies genomic regions where allelic imbalance differs significantly between the specified groups, using a likelihood ratio test with FDR correction.
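To make the statistical idea concrete, the sketch below shows a generic likelihood ratio test p-value followed by Benjamini-Hochberg adjustment. It is not WASP2's implementation: the single degree of freedom, the choice of Benjamini-Hochberg as the FDR procedure, and the function names are all assumptions made for illustration.

```python
import math


def lrt_pvalue(loglik_null, loglik_alt):
    """P-value of a likelihood ratio test, assuming 1 degree of freedom."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Chi-square survival function for df=1: P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-region log-likelihoods under the null and alternative models.
pvals = [lrt_pvalue(null, alt) for null, alt in [(-120.0, -112.5), (-98.0, -97.6)]]
print(benjamini_hochberg(pvals))
```

A region would then be reported as differentially imbalanced when its adjusted p-value falls below the chosen FDR threshold.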
For comprehensive coverage of comparative analysis, see:
Comparative Imbalance Analysis Tutorial - Detailed comparative analysis tutorial
Analysis Module - Statistical methods for comparative imbalance
See Also
10X scRNA-seq Tutorial - Complete 10X scRNA-seq tutorial
Analysis Module - Statistical analysis methods
Counting Module - General allele counting