mirpy is a Python library for AIRR-seq and immune repertoire analysis.
It provides composable building blocks for parsing, filtering, comparing, and
characterising T-cell and B-cell receptor repertoires.
- Installation
- Module overview
- Quick start
- Diversity metrics
- Metaclonotypes
- TCRdist
- CDR3 Motif Logos
- ALICE and TCRNET
- Prototype-based embeddings (TCREmp)
- Copilot Agent Workflow
- Resources
- Project status
Requirements:
- Python 3.11+
- a C/C++ build toolchain for compiled extensions
Install from PyPI:
pip install mirpy-libInstall from source (one-shot):
git clone https://github.com/antigenomics/mirpy.git
cd mirpy
pip install .Install from source (editable development mode):
git clone https://github.com/antigenomics/mirpy.git
cd mirpy
./setup.shsetup.sh already installs mirpy in editable mode.
Prefer pip install mirpy-lib for project usage.
Use the cloned repo setup when developing or running docs/notebooks locally.
| Package | Responsibilities |
|---|---|
mir.common |
Clonotypes, repertoires, parsers, segment libraries |
mir.distances |
Aligners, Hamming/Levenshtein search, graph utilities, TCRdist |
mir.basic |
Sampling, segment usage, alphabet helpers, Pgen utilities |
mir.graph |
Edit-distance graphs, neighbourhood enrichment, token graphs, single-cell pairing |
mir.embedding |
Prototype embeddings: TCREmp, PairedTCREmp |
mir.comparative |
Pairwise overlap metrics (Jaccard, D, F, Morisita-Horn), trie-accelerated approximate matching, VDJBet Pgen-matched null distributions |
mir.biomarkers |
ALICE enrichment, TCRNET, CDR3 sequence logos |
mir.utils |
Embedding diagnostics, shared memory, notebook asset helpers |
from mir.common.gene_library import GeneLibrary
lib = GeneLibrary.load_default(
loci={"TRA", "TRB"},
species={"human"},
source="imgt",
)If a requested organism/locus pair is absent from the default local segment file, mirpy downloads the missing V and J segments from IMGT and appends them automatically.
from mir.common.parser import VDJtoolsParser
parser = VDJtoolsParser(sep="\t")
clonotypes = parser.parse("example.tsv")Supported parsers: VDJtoolsParser, AIRRParser, AdaptiveParser,
VDJdbFullPairedParser, and others in mir.common.parser.
from mir.common.repertoire import LocusRepertoire
repertoire = LocusRepertoire(clonotypes=clonotypes, locus="TRB")
print(repertoire.duplicate_count) # total read count
print(repertoire.clonotype_count) # unique clonotypes
# Functional / canonical filtering using IMGT annotations
from mir.common.filter import filter_functional, filter_canonical
from mir.common.gene_library import GeneLibrary
imgt_lib = GeneLibrary.load_default(loci={"TRB"}, species={"human"}, source="imgt")
functional_rep = filter_functional(repertoire, gene_library=imgt_lib)
canonical_rep = filter_canonical(repertoire, gene_library=imgt_lib)from mir.common.pool import pool_samples
# Pool by amino-acid CDR3 + V/J; retain contributing sample IDs
pooled = pool_samples(dataset, rule="aavj", include_sample_ids=True)Supported pooling rules: ntvj, nt, aavj, aa.
For each rule the representative clonotype is selected by frequency
(duplicate_count when weighted=True); duplicate_count is reassigned to
the total sum, and incidence / occurrences metadata are added.
mir.common.diversity implements VDJtools-style summary indices
(Shugay et al. 2015, PMID:26606115) and iNEXT-style Hill diversity profiles
and rarefaction/extrapolation curves (Hsieh et al. 2016).
from mir.common.diversity import summarize_counts
counts = [c.duplicate_count for c in repertoire.clonotypes]
div = summarize_counts(counts)
print(div.abundance) # total read count
print(div.diversity) # observed richness
print(div.chao1) # bias-corrected Chao1 species richness estimator
print(div.shannon) # Shannon entropy Hβ²
print(div.gini_simpson) # Gini-Simpson index (1 β Ξ£pΒ²)
print(div.singletons) # clones seen exactly once
print(div.doubletons) # clones seen exactly twicefrom mir.common.diversity import hill_curve
# Returns a Polars DataFrame with columns q, D_q
profile = hill_curve(counts)
# q=0 β species richness; q=1 β exp(Shannon); q=2 β inverse Simpsonfrom mir.common.diversity import rarefaction_curve
curve = rarefaction_curve(counts)
# Polars DataFrame with m, s_obs, s_est, s_lwr, s_upr, sample coverage CSee notebooks/diversity_analysis.ipynb for a full donor-level workflow
including rarefaction curves, Hill profiles, and Healthy vs MS cohort
comparisons.
A metaclonotype is a lightweight cluster layer over an existing
LocusRepertoire. It stores cluster membership as a Polars DataFrame
(mapping cluster_id β clonotype_id) without rebuilding repertoire objects.
This supports any clustering backend: DBSCAN, ALICE/TCRNET enriched clusters,
TCRdist radius clusters, or pre-computed connected components.
MetaclonotypeClusterConfig + cluster_metaclonotypes dispatch to any
supported backend (ALICE, TCRNET, TCRdist, edit-distance graph, TCREmp, GLIPH):
from mir.biomarkers.metaclonotype_cluster import (
MetaclonotypeClusterConfig,
cluster_metaclonotypes,
cluster_paired_metaclonotypes,
)
# Edit-distance graph, Leiden communities
cfg = MetaclonotypeClusterConfig(method="edit_distance", graph_algo="leiden")
meta = cluster_metaclonotypes(rep, cfg)
# TCRdist radius clusters
cfg_dist = MetaclonotypeClusterConfig(method="tcrdist", locus="TRB", max_distance=24.5)
meta_dist = cluster_metaclonotypes(rep, cfg_dist)Paired-chain metaclonotypes via single-chain-combine (works for all methods):
# Computes per-chain edit-distance clusters, combines IDs as "TRA_cluster.TRB_cluster"
cfg = MetaclonotypeClusterConfig(method="edit_distance", min_cluster_size=1)
meta_paired = cluster_paired_metaclonotypes(paired_locus_rep, cfg)For TCREmp, cluster_paired_metaclonotypes uses the built-in PairedTCREmp
joint embedding by default. See notebooks/metaclonotype_method_compare.ipynb
for a comparison of methods including concordance analysis.
from mir.common.metaclonotype import metaclonotypes_from_labels
# labels is a list of ints; -1 denotes noise/singleton (excluded by default)
meta = metaclonotypes_from_labels(clonotype_ids, labels)
print(meta.n_clusters) # number of clusters
print(meta.cluster_ids[:5]) # sorted cluster IDsfrom mir.common.metaclonotype import metaclonotypes_from_components
# components: list of lists of clonotype IDs
meta = metaclonotypes_from_components(components)from mir.common.metaclonotype import summarize_metaclonotypes
# Returns a Polars DataFrame with cluster_id and aggregated duplicate_count
summary = summarize_metaclonotypes(repertoire, meta)from mir.common.metaclonotype import functional_diversity
# One-call wrapper: summarize β DiversitySummary
div = functional_diversity(repertoire, meta)
print(div.shannon) # Shannon entropy at the cluster level
print(div.chao1) # Chao1 estimator for cluster richness
print(div.gini_simpson) # Gini-Simpson indexfrom mir.common.metaclonotype import functional_overlap_1
# Fraction of metaclonotypes in rep_a that share a CDR3 identity with rep_b
overlap = functional_overlap_1(meta_a, meta_b, repertoire_a, repertoire_b)TcrDist (mir.distances.tcrdist) computes the weighted V-gene + CDR3
alignment distance between TCR clonotypes, following the TCRdist3 metric
(Dash et al. 2017). All V-gene pairwise distances are pre-computed once
from full germline sequences; CDR3 alignment uses BLOSUM62 with a fixed-gap
C extension that releases the GIL for thread parallelism.
from mir.distances.tcrdist import TcrDist
from mir.common.clonotype import Clonotype
# Build once β loads OLGA library and pre-computes V-gene distances (~3β10 s)
td = TcrDist.from_defaults(
"TRB", "human",
w_v=1.0, w_j=0.0, w_cdr3=3.0,
fixed_gaps=(3, 4, -4, -3), # C-accelerated (default)
# fixed_gaps="Mid" β midpoint gap per pair (Python, ~330Γ slower)
# fixed_gaps=None β full BioPython DP (~780Γ slower)
)
cln1 = Clonotype(v_gene="TRBV19*01", j_gene="TRBJ2-7*01", junction_aa="CASSIRSSYEQYF")
cln2 = Clonotype(v_gene="TRBV19*01", j_gene="TRBJ2-7*01", junction_aa="CASSIRASYEQYF")
d = td.dist(cln1, cln2) # single pair
row = td.dist_one_to_many(cln1, refs) # (K,) array
mat = td.dist_matrix(queries, refs, n_jobs=4) # (N, K) matrixfrom mir.basic.pgen import OlgaModel
model = OlgaModel(locus="TRB", species="human")
bg_seqs, _ = model.generate_sequences_counted(10_000, n_jobs=4, seed=42)
bg_clns = [Clonotype(junction_aa=s, locus="TRB") for s in bg_seqs]
# Median background distance for each query clonotype
radii = td.compute_radius(hits, bg_clns, percentile=50, n_jobs=4)
# Cluster around seeds whose radius falls in the bottom quartile
import numpy as np
threshold = float(np.percentile(radii, 25))
meta = td.find_metaclonotypes(rep, max_distance=threshold, n_jobs=4)Performance (Apple M3, TRB, fixed_gaps=(3,4,-4,-3), n_jobs=1):
28 M pairs/s at 1Kβ5K scale; ~76 M pairs/s with n_jobs=8.
See notebooks/tcrdist_analysis.ipynb for an influenza GILGFVFTL worked example.
mir.biomarkers.motif_logo builds IC and selection sequence logos for
CDR3 motifs, following Pogorelyy et al. 2019 (PMID:31194732). The key
idea is to subtract an OLGA-derived background for the same V-gene / J-gene /
CDR3-length bin, which collapses the germline signal and reveals only the
antigen-driven component.
from mir.biomarkers.motif_logo import (
compute_pwm, compute_logo, get_vj_background,
build_terminal_anchored_pwm, load_motif_pwms, plot_logo,
)
motif_pwms = load_motif_pwms("motif_pwms.txt.gz") # OLGA backgrounds
seqs = ["CASSGRSYEQYF", "CASSGRTNEQYF", ...] # CDR3 sequences
bg = get_vj_background(
motif_pwms, v_gene="TRBV19*01", j_gene="TRBJ2-7*01",
length=13, species="HomoSapiens", gene="TRB",
)
pwm = compute_pwm(seqs)
logo = compute_logo(pwm, background=bg) # adds ic_height + bg_height columns
fig, ax = plt.subplots()
plot_logo(logo, ax, height_col="bg_height") # selection logoFor CDR3s of mixed lengths (different J-genes), use the terminal-anchored logo which anchors V-side and J-side blocks independently:
ta_pwm = build_terminal_anchored_pwm(seqs, n_term=8, c_term=7)
ta_logo = compute_logo(ta_pwm, background=bg)For automated per-VJ-len logos from ALICE/TCRNET hit DataFrames use
build_motif_logos_vj. Background data (motif_pwms.txt.gz) is fetched
automatically by the notebook bootstrap helpers in mir.utils.notebook_assets.
See notebooks/motif_logos.ipynb for GILGFVFTL (Influenza A) and HLA-B27 AS
worked examples.
Both modules detect antigen-driven CDR3 clusters, but differ in how they model the background:
| Feature | ALICE | TCRNET |
|---|---|---|
| Background model | OLGA Pgen (analytical or MC pool) | Any MC control β real or synthetic |
| Pgen calls | OLGA 1mm Pgen (10M pool + fallback) | None |
| V/J restriction | match_mode="vj" (default) |
match_mode="none" (default) |
| Statistics | Poisson | Binomial / Beta-Binomial |
| Selection correction | q_factor |
q_factor (needed for synthetic controls) |
ALICE (mir.biomarkers.alice) implements the Pogorelyy et al. 2019
(PMID:31194732) Poisson enrichment test. Default is match_mode="vj" with
OLGA gene-usage conditioning: N and pgen are scaled by P_OLGA(V,J) so
that Ξ» = N_total Γ pgen regardless of gene restriction, while the observed
count k is V/J-filtered. Uses a 10 M-sequence MC pool by default (the
paper uses 100 M, which requires ~17 GB; set mc_n_pool=100_000_000 if memory
allows) with fallback to OLGA analytical 1mm Pgen for rare sequences.
V/J-restricted neighbour counting uses a grouped-trie strategy: one small
trie per (V, J) gene group. This makes match_mode="vj" 1.5β2Γ faster
than match_mode="none" on natural repertoires (benchmark: 300 K sequences,
8 workers β unrestricted 9.9 s, V+J restricted 5.5 s).
TCRNET (mir.biomarkers.tcrnet) is a purely MC-control algorithm.
When used with a real control it captures V/J bias automatically. Pass
q_factor β 3β5 when using a synthetic OLGA pool to correct for the
pre-thymic selection deficit. TCRNET with a 100 M synthetic pool,
match_mode="vj", and q_factor=Q is statistically equivalent to the
original ALICE paper.
TCREmp embeds immune receptor clonotypes as distance vectors to a fixed set
of prototype clonotypes, enabling rapid downstream analysis, dimensionality
reduction, and machine learning.
from mir.embedding.tcremp import TCREmp
from mir.common.clonotype import Clonotype
model = TCREmp.from_defaults("human", "TRB", n_prototypes=1000, junction_method="fixed_gap")
clonotypes = [
Clonotype(v_gene="TRBV10-3*01", j_gene="TRBJ2-7*01", junction_aa="CASSIRSSYEQYF"),
Clonotype(v_gene="TRBV20-1*01", j_gene="TRBJ1-1*01", junction_aa="CSARDSSYEQYF"),
]
X = model.embed(clonotypes) # shape (2, 3000) β float32 arrayColumn layout: [v_1, j_1, junc_1, v_2, j_2, junc_2, β¦, v_K, j_K, junc_K]
where each distance uses d(a, b) = s(a,a) + s(b,b) β 2Β·s(a,b).
For full DP alignment use junction_method="biopython" (~383Γ slower).
For custom prototypes use TCREmp.from_file("prototypes.tsv", ...).
Paired-chain embedding concatenates TRA and TRB embeddings per PairedClonotype:
from mir.embedding.tcremp import PairedTCREmp
paired_model = PairedTCREmp.from_defaults("human", "TRA_TRB", n_prototypes=500)
X_pair = paired_model.embed(paired_clonotypes)n_jobs behaviour:
n_jobs=None(default): auto-select based onlen(clonotypes) Γ n_prototypes.n_jobs=1: force serial.n_jobs>1: force explicit worker count.
from mir.basic.alphabets import (
aa_to_reduced, mask, matches, matches_aa_reduced, NT_MASK, AA_MASK,
)
nt_masked = mask("ATCGAT", (2, 5), NT_MASK)
assert nt_masked == b"ATNNNT"
aa = "CASTIV"
reduced = aa_to_reduced(aa)
# Matching ignores mask symbols: N (nucleotide) or X (amino acid)
assert matches(mask(aa, 0, AA_MASK), aa, AA_MASK)
assert matches_aa_reduced(aa, mask(reduced, 3, AA_MASK))This repository ships a dedicated Copilot custom agent and companion prompt:
- Agent:
.github/agents/mirpy-analysis.agent.md - Companion prompt:
.github/prompts/mirpy-analysis.prompt.md
Use /mirpy-analysis from chat to supply input data paths, optional metadata
schema, and workflow definition. The agent creates dedicated notebooks,
installs/validates dependencies, executes cells sequentially, and reports
outcomes. For large datasets it benchmarks small chunks first and asks before
any run expected to exceed ~10β15 min on 4β8 cores or ~12β16 GB RAM.
- Example notebooks: notebooks/
- API reference: https://antigenomics.github.io/mirpy/modules.html
- Notebook gallery: https://antigenomics.github.io/mirpy/examples.html
- Docs source: docs/
- Agent skill guide (Claude, GitHub Copilot): skills/mirpy/SKILL.md
- Benchmark baselines: benchmarks.md
The library is actively evolving. Some modules are more mature than others, and parts of the public API may still change.
