7 active experiments analysed across 7 workflows. 0 reached statistical significance — all experiments are in early data-collection phase (EXTEND). No variants have reached min_samples yet. All experiments need more runs before conclusions can be drawn.
prompt_style · daily-news.lock.yml
Variants: detailed vs concise · Window: last 30 runs · Analysed: 4 runs with assignments · min_samples: 30 per variant · Issue: #31190
H0: no change in output quality. H1: concise prompt reduces token usage by ≥20% with no significant drop in output completeness score
Experiment : prompt_style
Workflow : daily-news.lock.yml
Hypothesis : H0: no change in output quality. H1: concise reduces token usage ≥20%
Window : last 30 runs | Analysed: 4 runs
min_samples: 30 per variant
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          | n    | Succ %   | Mean dur (s)   | 95% CI (s)         | p-value   | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| detailed         | 3    | 0.0%     | 301            | [259, 343]         | (ref)     | 3/30 (10%)    |
| concise          | 1    | 0.0%     | N/A            | N/A                | N/A       | 1/30 (3%)     |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05 ** p<0.01 *** p<0.001
Note: All runs failing — workflow-level issue, not variant-specific.
Recommendation: EXTEND
Rationale : Insufficient data — all variants need at least 5 runs before analysis.
Recommendation: EXTEND — Gathering more data. Note all runs failing (workflow-level issue unrelated to the experiment).
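The min_samples column and the EXTEND recommendation can be sketched as a small check. This is only an illustration, not the report generator's actual code: the names `progress` and `needs_more_data` are hypothetical, and the percentage is assumed to be truncated rather than rounded, which is consistent with entries such as 2/30 (6%) elsewhere in this report.

```python
def progress(n: int, min_samples: int) -> str:
    """Format runs collected against the target, e.g. '3/30 (10%)'."""
    pct = 100 * n // min_samples  # integer division truncates: 2/30 -> 6%
    return f"{n}/{min_samples} ({pct}%)"


def needs_more_data(counts: dict[str, int], min_samples: int) -> bool:
    """True while any variant is below min_samples (assumed EXTEND rule)."""
    return any(n < min_samples for n in counts.values())


counts = {"detailed": 3, "concise": 1}  # run counts from the table above
for variant, n in counts.items():
    print(f"{variant:<10} {progress(n, 30)}")
print("EXTEND" if needs_more_data(counts, 30) else "enough data to analyse")
```

Under this assumed rule the experiment stays in EXTEND until every variant has reached 30 assigned runs.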
reasoning_depth · daily-fact.lock.yml
Variants: single_pass vs multi_candidate · Window: last 30 runs · Analysed: 2 runs with assignments · min_samples: 30 per variant · Issue: #31324
H0: no change in discussion engagement rate. H1: multi_candidate produces more novel verses with higher reaction counts (expected +20% reactions).
Experiment : reasoning_depth
Workflow : daily-fact.lock.yml
Hypothesis : H1: multi_candidate produces more novel verses (+20% reactions)
Window : last 30 runs | Analysed: 2 runs
min_samples: 30 per variant
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          | n    | Succ %   | Mean dur (s)   | 95% CI (s)         | p-value   | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| single_pass      | 2    | 0.0%     | 210            | [64, 357]          | (ref)     | 2/30 (6%)     |
| multi_candidate  | 0    | N/A      | N/A            | N/A                | N/A       | 0/30 (0%)     |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05 ** p<0.01 *** p<0.001
Note: multi_candidate variant has not been assigned yet. Experiment started 2026-05-11.
Recommendation: EXTEND
Rationale : Insufficient data — all variants need at least 5 runs before analysis.
Recommendation: EXTEND — Very early stage, multi_candidate variant not yet sampled.
output_format · daily-issues-report.lock.yml
Variants: collapsible vs inline · Window: last 30 runs · Analysed: 9 runs with assignments · min_samples: 30 per variant · Issue: #30573
H0: no change in discussion engagement score. H1: inline format produces ≥20% higher engagement.
Experiment : output_format
Workflow : daily-issues-report.lock.yml
Hypothesis : H1: inline format produces ≥20% higher engagement
Window : last 30 runs | Analysed: 9 runs
min_samples: 30 per variant
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          | n    | Succ %   | Mean dur (s)   | 95% CI (s)         | p-value   | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| collapsible      | 3    | 0.0%     | 310            | [261, 360]         | (ref)     | 3/30 (10%)    |
| inline           | 6    | 0.0%     | 390            | [243, 538]         | N/A       | 6/30 (20%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05 ** p<0.01 *** p<0.001
Note: All runs failing (0.0% success rate) — workflow-level issue affecting both variants equally.
Recommendation: EXTEND
Rationale : Insufficient data — all variants need at least 5 runs before analysis.
Recommendation: EXTEND — 20% toward min_samples for inline; workflow failures unrelated to experiment variant.
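The report does not say how the 95% CI column is computed. One common choice is a normal-approximation interval around the mean duration, sketched below. `mean_ci` is a hypothetical helper, and the durations in the usage lines are made up for illustration, not taken from any run.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(durations, level=0.95):
    """Return (mean, low, high) using a normal approximation.

    Returns None for fewer than 2 samples, where the sample standard
    deviation is undefined (shown as N/A in a table).
    """
    n = len(durations)
    if n < 2:
        return None
    m = mean(durations)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * stdev(durations) / sqrt(n)
    return m, m - half_width, m + half_width


# Hypothetical durations in seconds, not real run data.
print(mean_ci([300, 320, 310]))
print(mean_ci([250]))  # None: a single run has no spread to estimate
```

With only a handful of runs per variant, a Student's t interval would be wider and arguably more appropriate than the normal approximation used here.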
prompt_style · issue-arborist.lock.yml
Variants: concise vs detailed · Window: last 30 runs · Analysed: 4 runs with assignments · min_samples: 30 per variant · Issue: #30015
H0: no change in links_created. H1: detailed instructions produce ≥15% more correct sub-issue links.
Experiment : prompt_style
Workflow : issue-arborist.lock.yml
Hypothesis : H1: detailed instructions produce ≥15% more correct sub-issue links
Window : last 30 runs | Analysed: 4 runs
min_samples: 30 per variant
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          | n    | Succ %   | Mean dur (s)   | 95% CI (s)         | p-value   | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| concise          | 2    | 0.0%     | 165            | [89, 241]          | (ref)     | 2/30 (6%)     |
| detailed         | 2    | 0.0%     | 165            | [89, 241]          | N/A       | 2/30 (6%)     |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05 ** p<0.01 *** p<0.001
Note: Experiment started 2026-05-12. Both variants failing identically — workflow-level issue.
Recommendation: EXTEND
Rationale : Insufficient data — all variants need at least 5 runs before analysis.
Recommendation: EXTEND — Very early stage (2 runs per variant).
prompt_style · daily-astrostylelite-markdown-spellcheck.lock.yml
Variants: detailed vs concise · Window: last 30 runs · Analysed: 10 runs with assignments · min_samples: 30 per variant
Concise prompt reduces token consumption ≥20% without degrading fix precision.
Experiment : prompt_style
Workflow : daily-astrostylelite-markdown-spellcheck.lock.yml
Hypothesis : Concise prompt reduces token consumption ≥20% without degrading fix precision
Window : last 30 runs | Analysed: 10 runs
min_samples: 30 per variant
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          | n    | Succ %   | Mean dur (s)   | 95% CI (s)         | p-value   | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| detailed         | 7    | 85.7%    | 395            | [286, 503]         | (ref)     | 7/30 (23%)    |
| concise          | 3    | 100.0%   | 459            | [303, 615]         | 0.4902    | 3/30 (10%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05 ** p<0.01 *** p<0.001
Note: Most balanced experiment so far. concise shows 100% success vs 85.7% for detailed (p=0.49, not significant).
Recommendation: EXTEND
Rationale : Insufficient data — gathering more data (max 7/30 runs reached so far).
Recommendation: EXTEND — Most data collected of all experiments (7/30 for detailed). concise shows slightly higher success rate (100% vs 85.7%) but not significant yet.
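The report does not name the test behind the p-value. As a plausibility check, a pooled two-proportion z-test on success counts gives the same figure, assuming 85.7% corresponds to 6 successes in 7 runs and 100% to 3 in 3. The function below is a sketch under that assumption, not the report generator's code, and `two_proportion_p` is a hypothetical name.

```python
from math import erfc, sqrt


def two_proportion_p(s1, n1, s2, n2):
    """Two-tailed p-value of a pooled two-proportion z-test.

    Returns None when the pooled standard error is zero (for example when
    every run in both variants failed), where the test is undefined.
    """
    if n1 == 0 or n2 == 0:
        return None
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return None
    z = (s2 / n2 - s1 / n1) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


p = two_proportion_p(6, 7, 3, 3)  # detailed 6/7, concise 3/3 (assumed counts)
print(f"{p:.4f}")  # 0.4902
print(two_proportion_p(0, 3, 0, 6))  # None: both variants at 0% success
```

With samples this small the normal approximation is rough. An exact test such as Fisher's would be a more conservative choice and would also show no significant difference here.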
📊 Summary
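+-----------------+------------------------------------------+---------------+-----------------+---------+----------------+
| Experiment      | Workflow                                 | Control       | Best variant    | p-value | Recommendation |
+-----------------+------------------------------------------+---------------+-----------------+---------+----------------+
| prompt_style    | daily-news                               | detailed      | concise         | N/A     | EXTEND         |
| reasoning_depth | daily-fact                               | single_pass   | multi_candidate | N/A     | EXTEND         |
| output_format   | daily-issues-report                      | collapsible   | inline          | N/A     | EXTEND         |
| prompt_style    | issue-arborist                           | concise       | detailed        | N/A     | EXTEND         |
| reasoning_depth | daily-security-red-team                  | single_pass   | iterative       | N/A     | EXTEND         |
| output_format   | deep-report                              | full_briefing | executive_brief | N/A     | EXTEND         |
| prompt_style    | daily-astrostylelite-markdown-spellcheck | detailed      | concise         | 0.49    | EXTEND         |
+-----------------+------------------------------------------+---------------+-----------------+---------+----------------+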
Analysis window: last 30 runs per workflow · Significance threshold: p < 0.05 (two-tailed)
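The star legend used in the tables maps onto these thresholds as follows. This is a minimal sketch: `stars` is a hypothetical helper, and `None` stands in for the N/A entries.

```python
def stars(p):
    """Map a two-tailed p-value to the legend: * <0.05, ** <0.01, *** <0.001."""
    if p is None:  # N/A: no test could be run
        return ""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


print(repr(stars(0.4902)))  # '' (not significant at p < 0.05)
```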
Run: §25909006011
Warning
Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
proxy.golang.org
To allow this domain, add it to the network.allowed list in your workflow frontmatter.
🧪 Daily Experiment Report — 2026-05-15
prompt_style · daily-news.lock.yml
Recommendation: EXTEND. Gathering more data. Note all runs failing (workflow-level issue unrelated to the experiment).
reasoning_depth · daily-fact.lock.yml
Recommendation: EXTEND. Very early stage, multi_candidate variant not yet sampled.
output_format · daily-issues-report.lock.yml
Recommendation: EXTEND. 20% toward min_samples for inline; workflow failures unrelated to experiment variant.
prompt_style · issue-arborist.lock.yml
Recommendation: EXTEND. Very early stage (2 runs per variant).
reasoning_depth · daily-security-red-team.lock.yml
Recommendation: EXTEND. Early positive signal (all runs succeed); need more data.
output_format · deep-report.lock.yml
Recommendation: EXTEND. 3-variant experiment; full_briefing most sampled at 26% of min_samples.
prompt_style · daily-astrostylelite-markdown-spellcheck.lock.yml
Recommendation: EXTEND. Most data collected of all experiments (7/30 for detailed). concise shows slightly higher success rate (100% vs 85.7%) but not significant yet.