magistra-predictor

A dual-track framework for GLP-1 side effect estimation, separating clinical evidence from real-world patient reports.

This repository contains the statistical methodology, extraction prompts, and model configuration behind Magistra Health — a free, open, continuously updated platform that estimates GLP-1 medication side effect risk using two parallel data tracks.

The idea in one sentence

We don't blend clinical trial data and patient community reports into a single number; we compute two estimates and show both, so the gap between them is visible instead of hidden.

Why this repo exists

Patients starting semaglutide, tirzepatide, or liraglutide face a gap: clinical trials report one side effect rate, Reddit reports another, and neither is wrong. Existing resources pick one or average them opaquely. We keep them separate.

Representative gaps (reference profile: female, 35, medium dose, first month):

Side effect	Clinical trials	Real-world reports	Gap
Nausea	28%	49%	21 pp
Diarrhoea	11%	31%	20 pp
Fatigue	10%	25%	15 pp
Hair loss	3%	15%	12 pp
Emotional blunting	3%	12%	9 pp

Clinical trials aren't wrong — they measure what they measure. But they systematically miss delayed effects (hair loss), subjective effects (emotional blunting), and effects that aren't pre-specified endpoints.

Try it

Live tool: https://magistra.health/en/predictor
Methodology doc: https://magistra.health/en/methodology
Public API: https://magistra.health/api/data?q=help
LLM-readable summary: https://magistra.health/llms.txt

What's in this repo

├── README.md                    # You are here
├── LICENSE                      # Apache 2.0
├── CITATION.cff                 # Academic citation metadata
├── CONTRIBUTING.md              # How to critique or contribute
├── methodology/
│   ├── side-effects-engine.ts   # Dual-track risk calculator
│   ├── model-config.ts          # Config schema (TypeScript types)
│   ├── model-config.json        # Live model config snapshot
│   ├── analyze-model.mjs        # Daily statistical analysis pipeline
│   └── extraction-prompt.mjs    # LLM extraction prompt
├── examples/
│   ├── api-examples.md          # How to query the public API
│   ├── fetch-predictor.js       # Node.js example
│   └── dual-track-comparison.md # Detailed walkthrough of the gap
└── preprint/
    └── magistra-methodology.md  # Full methodology preprint (v4.0)

Quick start

Query the API

# Get API documentation (URL quoted so the shell does not treat "?" as a glob)
curl "https://magistra.health/api/data?q=help"

# Get overview stats
curl "https://magistra.health/api/data?q=overview"

# Get details for one side effect
curl "https://magistra.health/api/data?q=effect&id=nausea"

# Get a personalised prediction
curl -X POST https://magistra.health/api/predictor/calculate \
  -H "Content-Type: application/json" \
  -d '{
    "molecule": "semaglutide",
    "doseMg": 1.0,
    "doseTier": "medium",
    "sex": "female",
    "age": 35,
    "hasGiHistory": false,
    "hasDiabetes": false,
    "isFirstMonth": true
  }'

Understand the dual-track output

Each effect in the predictor response has two fields: clinical and realWorld. Each contains:

{
  "percentage": 28,
  "confidenceInterval": { "low": 22, "high": 35 },
  "confidenceLevel": "moderate",
  "dataPointCount": 15,
  "basis": "15 clinical studies and regulatory reports",
  "isFallback": false
}

When realWorld.percentage exceeds clinical.percentage by more than 15 percentage points, the gap is flagged as "community reports higher than clinical trials", which suggests the effect may be systematically under-measured in trials.
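The flagging rule can be sketched in a few lines of TypeScript. This is a minimal illustration, not the code in side-effects-engine.ts: the interface mirrors the JSON fields shown above, while `describeGap` and `GAP_THRESHOLD_PP` are hypothetical names introduced here. The sample values are the nausea row from the reference table.

```typescript
// Field names mirror the JSON example above.
interface TrackEstimate {
  percentage: number;
  confidenceInterval: { low: number; high: number };
  confidenceLevel: string;
  dataPointCount: number;
  basis: string;
  isFallback: boolean;
}

// Only the percentage is needed to compute the gap.
type TrackPercentage = Pick<TrackEstimate, "percentage">;

// Threshold from the text: flag when the gap exceeds 15 percentage points.
const GAP_THRESHOLD_PP = 15;

// Hypothetical helper: returns the gap in percentage points and whether it is flagged.
function describeGap(effect: {
  clinical: TrackPercentage;
  realWorld: TrackPercentage;
}): { gapPp: number; flagged: boolean } {
  const gapPp = effect.realWorld.percentage - effect.clinical.percentage;
  return { gapPp, flagged: gapPp > GAP_THRESHOLD_PP };
}

const nauseaGap = describeGap({
  clinical: { percentage: 28 },
  realWorld: { percentage: 49 },
});
console.log(nauseaGap); // { gapPp: 21, flagged: true }
```

Because the comparison is strict, a gap of exactly 15 percentage points (the fatigue row in the table) would not be flagged under this reading of the rule.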

Methodology at a glance

Collection. 18 data sources (PubMed, FDA FAERS, ClinicalTrials.gov, 13 subreddits, Drugs.com, Trustpilot, Indian pharmacovigilance sources) scraped daily.
Extraction. Claude Haiku extracts structured data points (rate, sample size, demographics, dose tier) with explicit confidence labels. Conservative: only explicitly stated rates are recorded.
Dual-track filtering. Data points are split by sourceType into clinical+regulatory (Track C) and user_report+news (Track R). The two tracks are never blended.
Weighted estimation. Weighted mean rate with sample-size and extraction-confidence weights, Winsorized at 5th/95th percentiles when n > 10.
Log-odds modifiers. Sex, age ≥ 65, GI history, diabetes, first month of treatment applied on log-odds scale, with cumulative shift capped at ±2.5 (~12× max cumulative OR) to prevent implausible stacking.
Random-effects confidence intervals. DerSimonian-Laird τ² estimation, delta-method SE on log-odds scale.
Self-evolving config. Daily pipeline computes empirical odds ratios for every parameter × effect combination, applies Benjamini-Hochberg FDR correction across ~180-240 tests, auto-applies only conservative changes (n ≥ 30, p_adj ≤ 0.01, |Δ OR| ≤ 0.3).
Safety. Versioned rollback (30 prior configs retained), canonical profile regression testing, human review queue for larger changes, max 5 auto-applied changes per day.
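
The log-odds modifier step can be illustrated with a short sketch. It assumes each modifier is expressed as an odds ratio and that the cap applies to the summed log odds ratios before they are added to the baseline; the function names and the odds ratios in the usage line are illustrative, not values from model-config.json.

```typescript
// Cap from the text: cumulative shift limited to ±2.5 on the log-odds scale.
const MAX_LOG_ODDS_SHIFT = 2.5; // exp(2.5) is roughly 12.2

const logit = (p: number): number => Math.log(p / (1 - p));
const invLogit = (x: number): number => 1 / (1 + Math.exp(-x));

// Applies odds-ratio modifiers to a baseline rate strictly between 0 and 1.
// The summed log odds ratios are clamped before being added to the baseline.
function applyModifiers(baseRate: number, oddsRatios: number[]): number {
  const rawShift = oddsRatios.reduce(
    (sum, oddsRatio) => sum + Math.log(oddsRatio),
    0,
  );
  const shift = Math.max(
    -MAX_LOG_ODDS_SHIFT,
    Math.min(MAX_LOG_ODDS_SHIFT, rawShift),
  );
  return invLogit(logit(baseRate) + shift);
}

// Illustrative odds ratios only (hypothetical values).
console.log(applyModifiers(0.28, [1.5, 1.2]).toFixed(3)); // 0.412
```

Working on the log-odds scale keeps the result inside (0, 1) however many modifiers stack, and the clamp bounds how far any combination can move the baseline.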

Full details in preprint/magistra-methodology.md or at https://magistra.health/en/methodology.
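
For readers unfamiliar with the FDR step in the self-evolving config, here is a minimal sketch of Benjamini-Hochberg adjusted p-values followed by the auto-apply gate. It is not the code in analyze-model.mjs: `benjaminiHochberg`, `Candidate`, and `canAutoApply` are hypothetical names, and reading |Δ OR| as the absolute difference between the proposed and current odds ratio is an assumption.

```typescript
// Benjamini-Hochberg adjusted p-values, returned in the original input order.
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);
  const adjusted = new Array<number>(m);
  let runningMin = 1;
  // Walk from the largest p-value down so adjusted values stay monotone.
  for (let rank = m; rank >= 1; rank--) {
    const { p, i } = order[rank - 1];
    runningMin = Math.min(runningMin, (p * m) / rank);
    adjusted[i] = runningMin;
  }
  return adjusted;
}

// Hypothetical shape for one proposed modifier change.
interface Candidate {
  n: number; // supporting data points
  pAdj: number; // BH-adjusted p-value
  currentOr: number;
  proposedOr: number;
}

// Thresholds from the text: n >= 30, p_adj <= 0.01, |delta OR| <= 0.3.
function canAutoApply(c: Candidate): boolean {
  const deltaOr = Math.abs(c.proposedOr - c.currentOr);
  return c.n >= 30 && c.pAdj <= 0.01 && deltaOr <= 0.3;
}

const adjustedP = benjaminiHochberg([0.01, 0.04, 0.03, 0.005]);
console.log(adjustedP.map((p) => p.toFixed(3)));
```

Changes that fail the gate would go to the human review queue rather than being applied automatically, per the safety rules listed above.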

Limitations (honest list)

Data volume: Current database has ~217 points across 15 effects. Model health is "degraded" until n ≥ 100 per effect.
Demographic bias: 87% female representation, minimal ethnic diversity, Western-dominated sources.
Hand-coded modifiers: Initial values from published literature; empirical replacement in progress as data accumulates.
No interaction terms. Modifiers are applied additively on the log-odds scale.
No formal calibration yet. Planned at n ≥ 500.
LLM extraction unaudited. Gold-standard audit planned.
Not causal. These are population-average conditional risks.

See CONTRIBUTING.md if you'd like to help fix any of these.

Contributing

We welcome critique, corrections, and collaboration. See CONTRIBUTING.md for details.

Specifically seeking:

Biostatisticians to review random-effects CI implementation and propose replacements for the method-of-moments τ² (we'd rather use REML or Paule-Mandel)
Clinical researchers to critique effect categorization and modifier values
ML researchers to audit LLM extraction accuracy on a gold-standard subset
Pharmacovigilance experts to suggest additional data sources and flag missing effects

Substantive contributors are acknowledged in the public changelog on the methodology page.

Citation

If you use this methodology or data in research, please cite:

Goyal, S. (2026). A Dual-Track Framework for GLP-1 Side Effect Estimation: Separating Clinical Evidence from Real-World Patient Reports (v4.0). Zenodo. https://doi.org/10.5281/zenodo.19559749

@misc{goyal2026magistra,
  author       = {Goyal, Saurabh},
  title        = {A Dual-Track Framework for GLP-1 Side Effect Estimation: Separating Clinical Evidence from Real-World Patient Reports},
  year         = {2026},
  publisher    = {Zenodo},
  version      = {4.0},
  doi          = {10.5281/zenodo.19559749},
  url          = {https://doi.org/10.5281/zenodo.19559749}
}

See CITATION.cff for the machine-readable citation file.

License

Apache 2.0. See LICENSE.

The data in the Magistra database is aggregated from public sources and is available free for research and non-commercial use with attribution. Commercial/bulk access: contact saurabh@magistra.health.

Contact

Saurabh Goyal
Founder, Magistra Health B.V.
saurabh@magistra.health
https://magistra.health

Magistra is a statistical tool, not medical advice. The predictions are population-average conditional risks, not individual outcomes. Always consult a licensed clinician before starting or changing medication.
