Open Model Benchmark Cards

Generate concise, reproducible Markdown benchmark cards from structured JSON.

Quick Start

pip install -e ".[dev]"
python examples/generate_card.py
pytest

Example Output

The demo writes reports/gpt-oss-20b-card.md.
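
As a rough illustration of what turning structured JSON into a Markdown card can look like, here is a minimal sketch. It is not the code in examples/generate_card.py. The function name render_card and the field names model, benchmark, source_url, and metrics are assumptions made for this example, and the input values are placeholders rather than measured results.

```python
import json


def render_card(result: dict) -> str:
    """Render one result object as a small Markdown card (hypothetical layout)."""
    lines = [
        f"# {result['model']}",
        "",
        f"- Benchmark: {result['benchmark']}",
        f"- Source: {result['source_url']}",
        "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    # Sort metric names so the same input always produces the same card.
    for name, value in sorted(result["metrics"].items()):
        lines.append(f"| {name} | {value:.4f} |")
    return "\n".join(lines) + "\n"


# Placeholder input; these are not real scores.
raw = """
{
  "model": "example-model",
  "benchmark": "example-bench",
  "source_url": "https://example.com/results",
  "metrics": {"f1": 0.25, "accuracy": 0.5}
}
"""
card = render_card(json.loads(raw))
print(card)
```

Sorting the metric names is a small design choice that keeps regenerated cards byte-for-byte stable, which makes diffs between runs easier to review.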

Research Brief

See docs/research_brief.md for the reproducibility motivation and extension roadmap.

Portfolio Notes

This project makes benchmark results easier to audit, compare, and reproduce.

Experiment Artifacts

Result schema: examples/result_schema.json
Model comparison: reports/model_comparison.csv
Analysis: reports/model_comparison_analysis.md

CLI

python -m open_model_benchmark_cards.cli examples/result_schema.json

The CLI validates single-model cards and can also render multi-model comparison tables from a list of result objects.
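
The multi-model path described above could be sketched roughly as follows. This is an illustration only, not the CLI's actual implementation: render_comparison and the model and metrics field names are hypothetical, and the scores are placeholders.

```python
def render_comparison(results: list[dict]) -> str:
    """Build a Markdown comparison table from a list of result objects."""
    # Use the union of metric names so models with different metrics still line up.
    metric_names = sorted({name for r in results for name in r["metrics"]})
    header = "| model | " + " | ".join(metric_names) + " |"
    divider = "| --- | " + " | ".join("---" for _ in metric_names) + " |"
    rows = []
    for r in sorted(results, key=lambda item: item["model"]):
        cells = [
            f"{r['metrics'][m]:.4f}" if m in r["metrics"] else "n/a"
            for m in metric_names
        ]
        rows.append(f"| {r['model']} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


# Placeholder records; not taken from the repository's result files.
results = [
    {"model": "model-b", "metrics": {"accuracy": 0.5}},
    {"model": "model-a", "metrics": {"accuracy": 0.75, "f1": 0.5}},
]
table = render_comparison(results)
print(table)
```

Missing metrics are rendered as n/a instead of being filled with a default, so a gap in the data stays visible in the table.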

Full Model Set

The repository includes 18 model-result records in examples/full_model_results.json and a generated report in reports/full_model_comparison_report.md.

Schema Checks

The card generator includes explicit schema checks for benchmark result objects, keeping generated reports consistent across model comparisons.
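
A schema check of this kind might look like the sketch below. The required fields and their types are assumptions chosen for illustration; the actual schema lives in the repository (see examples/result_schema.json) and may differ.

```python
# Hypothetical required fields; the real schema may use other names.
REQUIRED_FIELDS = {
    "model": str,
    "benchmark": str,
    "metrics": dict,
    "limitations": list,
}


def check_result(result: dict) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in result:
            errors.append(f"missing field: {field}")
        elif not isinstance(result[field], expected):
            actual = type(result[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    metrics = result.get("metrics")
    if isinstance(metrics, dict):
        for name, value in metrics.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"metrics.{name}: expected a number")
    return errors


good = {
    "model": "example-model",
    "benchmark": "example-bench",
    "metrics": {"accuracy": 0.5},
    "limitations": ["placeholder limitation"],
}
bad = {"model": "example-model", "metrics": {"accuracy": "high"}}

print(check_result(good))
print(check_result(bad))
```

Collecting every problem instead of raising on the first one lets a single run report all the fields that need fixing.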

Real Public Dataset Experiment

datasets/external/real_benchmark_card_inputs.json contains benchmark-card inputs derived from aizip/Rag-Eval-Dataset-6k. The card captures source URL, metric names, and limitations instead of inventing model scores.

GPU-Backed Real Experiment

This repository now includes a reproducible GPU-backed experiment using local-portfolio-results. The smoke path runs on the local RTX 5090 Laptop GPU through the Transformers conda environment and writes metrics, figures, and a Markdown report.

conda run -n Transformers python scripts/download_data.py --smoke
conda run -n Transformers python scripts/preprocess_data.py --max-samples 384
conda run -n Transformers python scripts/run_experiment.py --device cuda --smoke
conda run -n Transformers python scripts/make_report.py

Main report: reports/benchmark_card_generation_report.md.

Publishable V2 Research Results

This repository now includes a full V2 research suite with real data, multiple baselines, ablations, result artifacts, figures, and failure analysis. The README summarizes the measured run so the project can be judged from results, not just project intent.

Dataset And Scale

The input data consists of experiment indexes from the other 8 V2 repositories, converted into benchmark-card records with artifact and limitation checks.

Full-profile result rows: 8
Experiment profile: full
Experiment index: reports/results/experiment_index.json
Full report: reports/open_model_benchmark_cards_v2_research_report.md

Main Results

repo                            completeness_score  experiments  artifact_count
agent-safety-eval-lab           1.0000              4            6
agent-trace-viewer              1.0000              4            6
llm-eval-cookbook               1.0000              5            6
mcp-tool-security-playground    1.0000              4            6
multilingual-llm-safety-bench   1.0000              4            7
prompt-robustness-suite         1.0000              15           5
rag-eval-observatory            1.0000              4            5
transformer-from-scratch-notes  1.0000              4            5

Analysis

The generator produced benchmark cards for all 8 upstream repos and scored each card for experiment count, dataset path, artifacts, device metadata, and limitations.
Every upstream card currently reaches the schema completeness threshold, giving the portfolio a cross-repo reproducibility index.
The generated cards point back to committed reports and result artifacts, so project claims can be audited instead of trusted as prose.
This repo now closes the loop: it consumes the portfolio's actual experiment indexes and turns them into standardized research cards.
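
One way to picture the completeness score is as the fraction of the five checks listed above that a card passes. The sketch below assumes equal weighting and hypothetical field names (experiments, dataset_path, artifacts, device, limitations); the scoring used in the actual run may be defined differently.

```python
def completeness_score(card: dict) -> float:
    """Fraction of the five completeness checks that a card passes."""
    checks = [
        card.get("experiments", 0) > 0,        # at least one experiment recorded
        bool(card.get("dataset_path")),        # dataset path is present
        len(card.get("artifacts", [])) > 0,    # at least one result artifact
        bool(card.get("device")),              # device metadata is present
        len(card.get("limitations", [])) > 0,  # limitations are documented
    ]
    return sum(checks) / len(checks)


# Placeholder cards; not copied from the experiment index.
complete_card = {
    "experiments": 4,
    "dataset_path": "datasets/example.json",
    "artifacts": ["reports/example_report.md"],
    "device": "cuda",
    "limitations": ["placeholder limitation"],
}
partial_card = {
    "experiments": 4,
    "dataset_path": "datasets/example.json",
    "artifacts": ["reports/example_report.md"],
}

print(f"{completeness_score(complete_card):.4f}")
print(f"{completeness_score(partial_card):.4f}")
```

Under this reading, a score of 1.0000 means every check passed, and a card missing device metadata and limitations would pass three of five checks.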

Failure Analysis

The failure-analysis pass found 0 failure records.

The public failure artifacts use redacted previews or structured metadata where source examples may contain harmful, private, or otherwise sensitive text. This keeps the analysis reproducible without turning the README into a prompt-injection or unsafe-content corpus.
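
A redacted preview can be sketched as a record that keeps a short prefix, the original length, and a hash for matching, while dropping the rest of the text. The function name redacted_preview, the record fields, and the prefix length are assumptions for illustration and are not taken from scripts/analyze_failures.py.

```python
import hashlib


def redacted_preview(text: str, keep: int = 12) -> dict:
    """Return structured metadata about a text without exposing its full content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    preview = text[:keep]
    if len(text) > keep:
        preview += "[redacted]"
    return {
        "preview": preview,      # short prefix only
        "length": len(text),     # lets readers gauge the size of the original
        "sha256": digest[:16],   # truncated hash to match records across runs
    }


sample = "example sensitive source text"
record = redacted_preview(sample, keep=7)
print(record)
```

The truncated hash lets two runs confirm they saw the same source example without either report carrying the text itself. A fixed-length prefix can still leak content for short inputs, so a stricter variant could drop the preview entirely and keep only the metadata.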

Key Artifacts

Figures:

Reproduction

conda run -n Transformers python scripts/run_matrix.py --device cuda --profile full
conda run -n Transformers python scripts/analyze_failures.py
conda run -n Transformers python scripts/make_report.py
conda run -n Transformers python -m pytest
