Building High-Quality Reference Data for Spatial Deconvolution

A practical guide to maximizing deconvolution accuracy through better reference data.

Why Reference Quality Matters

Spatial deconvolution is fundamentally a regression problem: we estimate cell type proportions by decomposing spatial gene expression into a linear combination of reference signatures. The quality of these signatures directly determines the upper bound of deconvolution accuracy.
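To make the regression framing concrete, here is a minimal sketch using `scipy.optimize.nnls` on synthetic data. The signature matrix, mixing proportions, and noise level are invented for illustration, and real methods add normalization and noise models on top of this core step.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical reference: 200 genes x 3 cell types (columns are signatures).
n_genes, n_types = 200, 3
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))

# Simulate one spot as a known mixture of the three types plus small noise.
true_props = np.array([0.6, 0.3, 0.1])
spot = signatures @ true_props + rng.normal(0.0, 0.01, size=n_genes)

# Solve min ||spot - signatures @ beta||^2 subject to beta >= 0.
beta, residual_norm = nnls(signatures, spot)

# Normalize the non-negative weights so they sum to 1 (proportions).
est_props = beta / beta.sum()
print(np.round(est_props, 3))
```

With well-separated signatures like these, the estimated proportions land close to the simulated ones. The rest of this guide is about keeping real references in that regime.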

First Principle: A deconvolution algorithm can only distinguish cell types that are distinguishable in the reference data.

This guide distills empirical lessons from extensive benchmarking into actionable recommendations.

Quick Checklist

Before running deconvolution, verify your reference data passes these checks:

Check	Threshold	Command
No Unknown cells	0 cells	`adata.obs['cell_type'].str.contains('Unknown|UND|Unassigned').sum()`
Cells per type	≥500 (ideal: ≥2000)	`adata.obs['cell_type'].value_counts()`
Marker expression	≥80% cells express core markers	`(adata[mask, marker].X > 0).mean()`
Marker fold-change	≥5× vs other types	`type_mean / other_mean`
Signature correlation	<0.95 between types	`np.corrcoef(signatures)`

If any check fails, your deconvolution results for that cell type will be unreliable.
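The first two checks only need the cell metadata, so they can be sketched with pandas alone. The helper name `check_reference_labels` and the toy table are hypothetical; in practice you would pass `adata.obs`.

```python
import pandas as pd

def check_reference_labels(obs, cell_type_key="cell_type", min_cells=500):
    """Return (n_unknown, low_count_types) for a cell-metadata table."""
    labels = obs[cell_type_key].astype(str)
    # Count cells whose label looks like an unassigned category.
    n_unknown = int(labels.str.contains("Unknown|UND|Unassigned").sum())
    # Cell types that fall below the minimum cell count.
    counts = labels.value_counts()
    low_count_types = counts[counts < min_cells].to_dict()
    return n_unknown, low_count_types

# Toy metadata standing in for adata.obs (labels and counts are made up).
obs = pd.DataFrame(
    {"cell_type": ["Astrocytes"] * 600 + ["Microglia"] * 120 + ["Unknown"] * 30}
)
n_unknown, low_count_types = check_reference_labels(obs)
print(n_unknown, low_count_types)
```

A non-zero `n_unknown` or a non-empty `low_count_types` means the reference needs attention before deconvolution.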

The Unknown Cell Problem

First Principle: Deconvolution assumes each cell type has a distinct signature. Unknown cells violate this assumption—their signature approximates the average of other types, making them match any mixed spot equally well.

Why Unknown Absorbs Proportions: A Causal Chain

Step 1: Unknown ≈ Linear Combination of Others

Unknown cells are defined by exclusion—they didn't cluster with any specific type. Their mean expression is therefore a weighted average:

sig_Unknown ≈ α₁·sig_Type1 + α₂·sig_Type2 + ... + αₖ·sig_TypeK

In zebrafish embryo data: R² = 0.96 when fitting Unknown from other types.
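You can run the same diagnostic on your own reference. The sketch below fits one signature from the others with non-negative least squares and reports R². The data are synthetic and the helper `r2_from_other_signatures` is a hypothetical name, not part of any library.

```python
import numpy as np
from scipy.optimize import nnls

def r2_from_other_signatures(target, others):
    """R^2 of an NNLS fit of `target` (genes,) from `others` (genes x types)."""
    weights, _ = nnls(others, target)
    fitted = others @ weights
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, weights

rng = np.random.default_rng(1)
others = rng.gamma(2.0, 1.0, size=(300, 4))   # 4 specific signatures
# Synthetic "Unknown": a weighted average of the specific types plus noise.
alpha = np.array([0.4, 0.3, 0.2, 0.1])
sig_unknown = others @ alpha + rng.normal(0.0, 0.05, size=300)

r2, weights = r2_from_other_signatures(sig_unknown, others)
print(round(r2, 3), np.round(weights, 2))
```

An R² close to 1 for any signature suggests that column is largely redundant with the rest of the reference.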

Step 2: This Creates Equivalent Solutions

Deconvolution solves min ||Y - Xβ||² subject to β ≥ 0. When one column is a linear combination of others, infinitely many β achieve the same minimum residual:

β = [0.30, 0.25, 0.25, 0.20, 0.00]  →  residual = r
β = [0.15, 0.12, 0.12, 0.10, 0.51]  →  residual = r  (same!)
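The equivalence can be checked numerically. This sketch assumes the idealized case where Unknown is an exact linear combination of the specific signatures; all values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
specific = rng.gamma(2.0, 1.0, size=(100, 4))   # 4 specific signatures
alpha = np.array([0.30, 0.25, 0.25, 0.20])       # mixing weights
unknown = specific @ alpha                       # exact linear combination
X = np.column_stack([specific, unknown])         # last column = Unknown

# A spot generated from the specific types only, plus noise.
beta_a = np.array([0.30, 0.25, 0.25, 0.20, 0.00])
y = X @ beta_a + rng.normal(0.0, 0.1, size=100)

# Shift weight t onto Unknown and remove t * alpha from the specific types.
t = 0.5
beta_b = np.append(beta_a[:4] - t * alpha, t)

res_a = np.linalg.norm(y - X @ beta_a)
res_b = np.linalg.norm(y - X @ beta_b)
print(np.round(beta_b, 3), res_a, res_b)
```

Both coefficient vectors are non-negative and give the same residual, so the least-squares objective alone cannot choose between them.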

Step 3: NNLS Prefers Unknown

Among equivalent solutions, NNLS favors Unknown because:

Unknown signature correlates highest with average spot expression (0.76 vs 0.73 for specific types)
Spatial spots are mixtures; Unknown is also a mixture
The algorithm finds a "shortcut": one universal signature instead of decomposing into specific types

The Residual Paradox

This is the key insight that reveals the problem:

Condition	UND Proportion	Residual
With Unknown	68%	53.1
Without Unknown	0%	53.3
Difference	−68%	+0.3%

Unknown absorbs 68% of proportions but improves residual by only 0.3%.

This indicates that Unknown adds almost no explanatory power: it is a mathematical shortcut, not a biological signal. Removing Unknown forces NNLS to distribute that weight across the specific cell types, with nearly identical fit quality.
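A with/without comparison like the one in the table can be sketched as follows. The data are synthetic and `residual_with_and_without` is a hypothetical helper, so the numbers will not match the table above.

```python
import numpy as np
from scipy.optimize import nnls

def residual_with_and_without(X, y, drop_col):
    """Fit NNLS with all columns, then without `drop_col`; return both residual norms."""
    _, res_full = nnls(X, y)
    _, res_reduced = nnls(np.delete(X, drop_col, axis=1), y)
    return res_full, res_reduced

rng = np.random.default_rng(3)
specific = rng.gamma(2.0, 1.0, size=(300, 4))
# Synthetic Unknown: near-average of the specific types plus a little noise.
unknown = specific.mean(axis=1) + rng.normal(0.0, 0.05, size=300)
X = np.column_stack([specific, unknown])
y = specific @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0.0, 0.5, size=300)

res_full, res_reduced = residual_with_and_without(X, y, drop_col=4)
rel_change = (res_reduced - res_full) / res_full
print(f"with: {res_full:.2f}  without: {res_reduced:.2f}  change: {rel_change:.2%}")
```

If dropping a column barely changes the residual while that column was absorbing a large share of the proportions, treat it as redundant.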

Solution

# Always filter before deconvolution
ref = ref[~ref.obs['cell_type'].isin([
    'Unknown', 'UND', 'Unassigned', 'Ambiguous', 'Mixed', 'Doublet'
])].copy()

This can affect any method that models a spot as a non-negative combination of reference signatures (RCTD, cell2location, CARD, SPOTlight, CIBERSORT, etc.): it is a property of the mathematics, not of any specific implementation.

Cell Count Requirements

Signature stability depends directly on sample size:

┌─────────────────────────────────────────────────────────────────┐
│  Cells per Type    │  Signature Quality    │  Recommendation    │
├─────────────────────────────────────────────────────────────────┤
│  > 2000            │  Excellent            │  ✓ Use directly    │
│  500 - 2000        │  Good                 │  ✓ Acceptable      │
│  200 - 500         │  Marginal             │  ⚠ Verify markers  │
│  < 200             │  Unstable             │  ✗ Merge or remove │
└─────────────────────────────────────────────────────────────────┘

Why this matters: With <200 cells, random sampling noise dominates the mean expression profile. The resulting signature captures noise rather than biology.
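The effect of cell count on signature noise can be illustrated with a small simulation. It assumes pure Poisson sampling around a fixed mean profile, which understates real variability (no biological heterogeneity or batch effects), so read the numbers as a lower bound on the noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes = 500
# Hypothetical true mean expression per gene for one cell type.
true_mean = rng.gamma(shape=0.5, scale=2.0, size=n_genes)

def signature_error(n_cells):
    """Relative error of the mean profile estimated from n_cells simulated cells."""
    counts = rng.poisson(true_mean, size=(n_cells, n_genes))
    estimate = counts.mean(axis=0)
    return np.linalg.norm(estimate - true_mean) / np.linalg.norm(true_mean)

errors = {n: signature_error(n) for n in (50, 200, 500, 2000)}
for n, err in errors.items():
    print(f"{n:>5} cells: relative error {err:.3f}")
```

The error shrinks roughly with the square root of the cell count, which is why small populations produce unstable signatures.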

The Dual-Marker Annotation Strategy

Standard annotation using only positive markers is insufficient. High-quality annotation requires both positive and negative markers:

import scanpy as sc

def annotate_cell_type(adata, positive_markers, negative_markers):
    """
    Score cells for one type with dual-marker validation.

    A cell scores highly for the type only if it:
    1. Expresses positive markers (identity)
    2. Does NOT express negative markers (exclusion)

    Returns the combined per-cell score (a pandas Series).
    """
    # Score positive markers
    sc.tl.score_genes(adata, positive_markers, score_name='pos_score')

    # Score negative markers (should be low)
    sc.tl.score_genes(adata, negative_markers, score_name='neg_score')

    # Combined score: reward positive, penalize negative
    # (the 0.5 weight on the negative score is a tunable choice)
    adata.obs['final_score'] = adata.obs['pos_score'] - 0.5 * adata.obs['neg_score']

    return adata.obs['final_score']

Recommended Markers by Cell Type

Cell Type	Positive Markers	Negative Markers
Astrocytes	Aqp4, Aldh1l1, Slc1a2, Gfap, Glul	Mbp, Cx3cr1, Slc17a7
Oligodendrocytes	Mbp, Plp1, Mog, Mag, Mobp	Aqp4, Slc17a7, Gad1
Microglia	Cx3cr1, P2ry12, Tmem119, Csf1r	Mbp, Aqp4, Slc17a7
OPCs	Pdgfra, Cspg4, Sox10, Olig1	Mbp, Mog, Aqp4
Excitatory neurons	Slc17a7, Camk2a, Satb2	Gad1, Gad2, Mbp
Inhibitory neurons	Gad1, Gad2, Slc32a1	Slc17a7, Mbp
Endothelial	Pecam1, Cldn5, Vwf	Slc17a7, Gad1, Mbp
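The table maps directly onto the `marker_dict` format used by the QC pipeline later in this guide. The sketch below encodes the first three rows and adds a sanity check; `find_marker_conflicts` is a hypothetical helper, not a library function.

```python
# First three rows of the table, in {cell_type: {'positive': [...], 'negative': [...]}} form.
marker_dict = {
    "Astrocytes": {
        "positive": ["Aqp4", "Aldh1l1", "Slc1a2", "Gfap", "Glul"],
        "negative": ["Mbp", "Cx3cr1", "Slc17a7"],
    },
    "Oligodendrocytes": {
        "positive": ["Mbp", "Plp1", "Mog", "Mag", "Mobp"],
        "negative": ["Aqp4", "Slc17a7", "Gad1"],
    },
    "Microglia": {
        "positive": ["Cx3cr1", "P2ry12", "Tmem119", "Csf1r"],
        "negative": ["Mbp", "Aqp4", "Slc17a7"],
    },
}

def find_marker_conflicts(marker_dict):
    """Return {cell_type: [genes]} for genes listed as both positive and negative."""
    conflicts = {}
    for cell_type, markers in marker_dict.items():
        overlap = set(markers.get("positive", [])) & set(markers.get("negative", []))
        if overlap:
            conflicts[cell_type] = sorted(overlap)
    return conflicts

print(find_marker_conflicts(marker_dict))
```

An empty result means no gene is listed as both a positive and a negative marker for the same type.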

Signature Quality Control

Step 1: Compute Signature Correlation Matrix

High inter-signature correlation (>0.95) indicates the algorithm cannot distinguish those cell types:

import numpy as np
from scipy.stats import pearsonr

def check_signature_separability(adata, cell_type_key):
    """Compute pairwise signature correlations."""
    cell_types = adata.obs[cell_type_key].unique()

    # Compute mean expression per cell type
    signatures = {}
    for ct in cell_types:
        mask = adata.obs[cell_type_key] == ct
        signatures[ct] = np.asarray(adata[mask].X.mean(axis=0)).flatten()

    # Correlation matrix
    print("Signature correlations (flag if > 0.95):")
    for ct1 in cell_types:
        for ct2 in cell_types:
            if ct1 < ct2:
                r, _ = pearsonr(signatures[ct1], signatures[ct2])
                flag = "⚠️" if r > 0.95 else ""
                print(f"  {ct1} vs {ct2}: r={r:.3f} {flag}")

Step 2: Validate Marker Expression

For each cell type, verify its markers are actually expressed:

def validate_markers(adata, cell_type, markers, cell_type_key):
    """Check marker expression in annotated cells."""
    mask = adata.obs[cell_type_key] == cell_type

    print(f"\n{cell_type} marker validation:")
    for marker in markers:
        if marker not in adata.var_names:
            print(f"  {marker}: NOT IN DATA")
            continue

        expr = adata[mask, marker].X
        if hasattr(expr, 'toarray'):
            expr = expr.toarray()

        pct_expressing = (expr > 0).mean() * 100
        mean_in_type = expr.mean()
        mean_in_others = adata[~mask, marker].X.mean()
        fold_change = mean_in_type / (mean_in_others + 1e-6)

        # Thresholds match the Quick Checklist: >=80% expressing and >=5x fold-change
        status = "✓" if pct_expressing >= 80 and fold_change >= 5 else "⚠️"
        print(f"  {status} {marker}: {pct_expressing:.0f}% expressing, FC={fold_change:.1f}x")

Handling Problematic Cell Types

Problem: High Signature Correlation

Symptom: Two cell types have correlation >0.95
Solution: Merge into a single category or use only marker genes

# Option 1: Merge similar types
adata.obs['cell_type_merged'] = adata.obs['cell_type'].replace({
    'CD4_naive': 'CD4_T',
    'CD4_memory': 'CD4_T',
})

# Option 2: Marker-only signature (reduces correlation)
marker_genes = ['CD4', 'CD8A', 'CD19', 'CD14']  # ... extend with your curated list
adata_markers = adata[:, marker_genes].copy()  # copy so the subset is not a view
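To see why a marker-only signature can lower correlation, consider two synthetic signatures that share a strong baseline across most genes and differ only at a handful of markers. All values are invented, and the effect size in real data depends on how the marker list is curated.

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes = 2000
# Shared housekeeping-like baseline dominates most genes in both types.
baseline = rng.gamma(2.0, 5.0, size=n_genes)
sig_a = baseline + rng.normal(0.0, 0.2, size=n_genes)
sig_b = baseline + rng.normal(0.0, 0.2, size=n_genes)

# The first 10 genes act as markers: high in one type, low in the other.
marker_idx = np.arange(10)
sig_a[:5], sig_b[:5] = 20.0, 0.5
sig_a[5:10], sig_b[5:10] = 0.5, 20.0

r_all = np.corrcoef(sig_a, sig_b)[0, 1]
r_markers = np.corrcoef(sig_a[marker_idx], sig_b[marker_idx])[0, 1]
print(f"all genes: r={r_all:.3f}   markers only: r={r_markers:.3f}")
```

Across all genes the shared baseline pushes the correlation above the 0.95 flag. Restricted to the markers, the two signatures are clearly separated.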

Problem: Low Cell Count

Symptom: Cell type has <200 cells
Solution:

Merge with related type (e.g., T cell subtypes → T cells)
Exclude from analysis (better than unreliable estimates)
Augment from external datasets (same tissue, same species)
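The first two options can be combined in one pass over the labels. This is a sketch with made-up labels and counts; `merge_or_drop_rare_types` is a hypothetical helper name.

```python
import pandas as pd

def merge_or_drop_rare_types(labels, merge_map, min_cells=200):
    """Apply merge_map, then drop cells whose (merged) type still has < min_cells."""
    merged = labels.replace(merge_map)
    counts = merged.value_counts()
    keep = merged.isin(counts[counts >= min_cells].index)
    return merged[keep]

labels = pd.Series(
    ["CD4_naive"] * 150 + ["CD4_memory"] * 120 + ["B_cell"] * 400 + ["pDC"] * 40
)
merge_map = {"CD4_naive": "CD4_T", "CD4_memory": "CD4_T"}

clean = merge_or_drop_rare_types(labels, merge_map)
print(clean.value_counts().to_dict())
```

Here the two CD4 subtypes pass the threshold only after merging, while the 40-cell population is excluded. The returned index can be used to subset the AnnData object.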

Problem: Marker Genes Not Expressed

Symptom: Core markers show <50% expression or FC <2×
Solution: Re-annotate using alternative markers or cluster-based refinement

Complete QC Pipeline

import pandas as pd
import scanpy as sc

def prepare_reference(adata_raw, cell_type_key, marker_dict):
    """
    Full QC pipeline for reference data preparation.

    Parameters
    ----------
    adata_raw : AnnData
        Raw single-cell reference data
    cell_type_key : str
        Column in .obs with cell type annotations
    marker_dict : dict
        {cell_type: {'positive': [...], 'negative': [...]}}

    Returns
    -------
    adata_clean : AnnData
        QC-passed reference data
    """
    adata = adata_raw.copy()

    # Step 1: Filter low-count cell types
    counts = adata.obs[cell_type_key].value_counts()
    valid_types = counts[counts >= 200].index
    # .copy() materializes the subset so score_genes can write to .obs without acting on a view
    adata = adata[adata.obs[cell_type_key].isin(valid_types)].copy()
    print(f"Step 1: Kept {len(valid_types)} cell types with ≥200 cells")

    # Step 2: Re-annotate with strict markers
    scores = {}
    for ct, markers in marker_dict.items():
        pos = [g for g in markers.get('positive', []) if g in adata.var_names]
        neg = [g for g in markers.get('negative', []) if g in adata.var_names]

        if pos:
            sc.tl.score_genes(adata, pos, score_name=f'{ct}_pos')
            scores[ct] = adata.obs[f'{ct}_pos'].values

            if neg:
                sc.tl.score_genes(adata, neg, score_name=f'{ct}_neg')
                scores[ct] = scores[ct] - 0.5 * adata.obs[f'{ct}_neg'].values

    # Step 3: Assign to highest-scoring type (with margin)
    score_df = pd.DataFrame(scores, index=adata.obs_names)
    max_score = score_df.max(axis=1)
    second_score = score_df.apply(lambda x: x.nlargest(2).iloc[1], axis=1)

    confident_mask = (max_score > second_score + 0.5) & (max_score > 0)
    adata.obs['cell_type_strict'] = 'Ambiguous'
    adata.obs.loc[confident_mask, 'cell_type_strict'] = score_df.loc[confident_mask].idxmax(axis=1)

    # Step 4: Remove ambiguous cells
    adata_clean = adata[adata.obs['cell_type_strict'] != 'Ambiguous'].copy()
    print(f"Step 4: Kept {adata_clean.n_obs}/{adata.n_obs} cells after filtering ambiguous")

    # Step 5: Verify signature separability
    check_signature_separability(adata_clean, 'cell_type_strict')

    return adata_clean
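The margin rule in Step 3 is easy to test in isolation on a small score table. The sketch below reproduces the same logic with NumPy and pandas only; `assign_with_margin` and the toy scores are hypothetical, and the 0.5 margin simply mirrors the value used above.

```python
import numpy as np
import pandas as pd

def assign_with_margin(score_df, margin=0.5, fallback="Ambiguous"):
    """Label each cell with its top-scoring type only if it beats the runner-up by `margin`."""
    ordered = np.sort(score_df.to_numpy(), axis=1)   # ascending within each row
    best, runner_up = ordered[:, -1], ordered[:, -2]
    confident = (best > runner_up + margin) & (best > 0)
    return score_df.idxmax(axis=1).where(confident, fallback)

# Toy per-cell scores for two types (requires at least two score columns).
scores = pd.DataFrame(
    {"Astrocytes": [2.0, 0.6, -0.2], "Microglia": [0.1, 0.5, -0.5]},
    index=["cell_1", "cell_2", "cell_3"],
)
labels = assign_with_margin(scores)
print(labels.tolist())
```

Only the first cell clears both conditions. The second is too close to call and the third has no positive score, so both fall back to Ambiguous and would be removed in Step 4.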

Common Pitfalls

Pitfall	Consequence	Prevention
Keeping Unknown/Unassigned cells	Absorbs 60%+ of proportions	Filter before deconvolution
Automatic clustering annotation	Includes mislabeled cells	Validate with markers
Using different species	Gene names/expression don't match	Use same species
Ignoring batch effects	Technical variation in signature	Batch-correct first
Too many subtypes	Highly correlated signatures	Merge related types
Only positive markers	Cross-contamination	Add negative markers

Recommended Reference Data Sources

Tissue	Recommended Source	Notes
Brain	Allen Brain Atlas	Well-curated, multiple regions
Liver	MacParland et al., Nat Commun 2018	Hepatocyte zonation
Kidney	Kidney Cell Atlas (Stewart et al.)	Comprehensive
Immune	Tabula Sapiens / Tabula Muris	Pan-tissue immune
Tumor	Tissue-matched normal + tumor	Avoid cell line artifacts

General rule: Prefer published, peer-reviewed references from the same tissue and species as your spatial data.

Summary

Reference data quality determines deconvolution accuracy. The algorithm is only as good as the signatures it's given.

Key principles:

Sufficient cells (≥500 per type) for stable signatures
Dual markers (positive + negative) for clean annotation
Low correlation (<0.95) between signatures for distinguishability
Validated expression (≥80% expressing, FC ≥5×) for each marker

"Garbage in, garbage out. But gold in, gold out."
