[experiments] Daily Experiment Report — 2026-05-15 #32317

2026-05-15T09:06:46Z

github-actions[bot]
Bot May 15, 2026

🧪 Daily Experiment Report — 2026-05-15

7 active experiments analysed across 7 workflows. None reached statistical significance: every experiment is still in the early data-collection phase (EXTEND), and no variant has reached min_samples yet. All experiments need more runs before conclusions can be drawn.

`prompt_style` · `daily-news.lock.yml`

Variants: detailed vs concise · Window: last 30 runs · Analysed: 4 runs with assignments
min_samples: 30 per variant · Issue: #31190

H0: no change in output quality. H1: concise prompt reduces token usage by ≥20% with no significant drop in output completeness score

Experiment : prompt_style
Workflow   : daily-news.lock.yml
Hypothesis : H0: no change in output quality. H1: concise reduces token usage ≥20%
Window     : last 30 runs  |  Analysed: 4 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| detailed         |    3 |    0.0%  |          301   | [259, 343]         |   (ref)   |  3/30 (10%)   |
| concise          |    1 |    0.0%  |          N/A   | N/A                |   N/A     |  1/30 (3%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: All runs failing — workflow-level issue, not variant-specific.

Recommendation: EXTEND
Rationale     : Insufficient data — all variants need at least 5 runs before analysis.

Recommendation: EXTEND — Gathering more data. Note all runs failing (workflow-level issue unrelated to the experiment).
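
The progress column and the EXTEND rule above can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: the function names, the `ANALYSE` label, and the assumption that percentages are truncated (inferred from values such as 1/30 shown as 3%) are all hypothetical.

```python
def progress(n: int, min_samples: int) -> str:
    """Format progress toward min_samples like the table's last column.

    Assumption: the percentage is truncated, not rounded (1/30 -> 3%).
    """
    pct = 100 * n // min_samples
    return f"{n}/{min_samples} ({pct}%)"


def recommend(counts: dict[str, int], min_samples: int, min_runs: int = 5) -> tuple[str, str]:
    """Return (recommendation, rationale) for one experiment.

    min_runs=5 mirrors the "at least 5 runs before analysis" rationale.
    "ANALYSE" is a hypothetical label for the non-EXTEND case.
    """
    too_few = [variant for variant, n in counts.items() if n < min_runs]
    if too_few:
        return "EXTEND", "fewer than %d runs for: %s" % (min_runs, ", ".join(too_few))
    if any(n < min_samples for n in counts.values()):
        return "EXTEND", "no conclusion until every variant reaches min_samples"
    return "ANALYSE", "all variants reached min_samples"


# Counts taken from the prompt_style table for daily-news.lock.yml.
daily_news = {"detailed": 3, "concise": 1}
print(progress(daily_news["detailed"], 30))
print(recommend(daily_news, min_samples=30))
```

With 3 and 1 runs, both variants fall below the 5-run floor, so the sketch returns EXTEND before min_samples is even considered.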

`reasoning_depth` · `daily-fact.lock.yml`

Variants: single_pass vs multi_candidate · Window: last 30 runs · Analysed: 2 runs with assignments
min_samples: 30 per variant · Issue: #31324

H0: no change in discussion engagement rate. H1: multi_candidate produces more novel verses with higher reaction counts (expected +20% reactions).

Experiment : reasoning_depth
Workflow   : daily-fact.lock.yml
Hypothesis : H1: multi_candidate produces more novel verses (+20% reactions)
Window     : last 30 runs  |  Analysed: 2 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| single_pass      |    2 |    0.0%  |          210   | [64, 357]          |   (ref)   |  2/30 (6%)    |
| multi_candidate  |    0 |    N/A   |          N/A   | N/A                |   N/A     |  0/30 (0%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: multi_candidate variant has not been assigned yet. Experiment started 2026-05-11.

Recommendation: EXTEND
Rationale     : Insufficient data — all variants need at least 5 runs before analysis.

Recommendation: EXTEND — Very early stage, multi_candidate variant not yet sampled.

`output_format` · `daily-issues-report.lock.yml`

Variants: collapsible vs inline · Window: last 30 runs · Analysed: 9 runs with assignments
min_samples: 30 per variant · Issue: #30573

H0: no change in discussion engagement score. H1: inline format produces ≥20% higher engagement.

Experiment : output_format
Workflow   : daily-issues-report.lock.yml
Hypothesis : H1: inline format produces ≥20% higher engagement
Window     : last 30 runs  |  Analysed: 9 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| collapsible      |    3 |    0.0%  |          310   | [261, 360]         |   (ref)   |  3/30 (10%)   |
| inline           |    6 |    0.0%  |          390   | [243, 538]         |   N/A     |  6/30 (20%)   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: All runs failing (0.0% success rate) — workflow-level issue affecting both variants equally.

Recommendation: EXTEND
Rationale     : Insufficient data — all variants need at least 5 runs before analysis.

Recommendation: EXTEND — 20% toward min_samples for inline; workflow failures unrelated to experiment variant.

`prompt_style` · `issue-arborist.lock.yml`

Variants: concise vs detailed · Window: last 30 runs · Analysed: 4 runs with assignments
min_samples: 30 per variant · Issue: #30015

H0: no change in links_created. H1: detailed instructions produce ≥15% more correct sub-issue links.

Experiment : prompt_style
Workflow   : issue-arborist.lock.yml
Hypothesis : H1: detailed instructions produce ≥15% more correct sub-issue links
Window     : last 30 runs  |  Analysed: 4 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| concise          |    2 |    0.0%  |          165   | [89, 241]          |   (ref)   |  2/30 (6%)    |
| detailed         |    2 |    0.0%  |          165   | [89, 241]          |   N/A     |  2/30 (6%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: Experiment started 2026-05-12. Both variants failing identically — workflow-level issue.

Recommendation: EXTEND
Rationale     : Insufficient data — all variants need at least 5 runs before analysis.

Recommendation: EXTEND — Very early stage (2 runs per variant).

`reasoning_depth` · `daily-security-red-team.lock.yml`

Variants: single_pass vs iterative · Window: last 30 runs · Analysed: 3 runs with assignments
min_samples: 30 per variant · Issue: #31673

H0: no change in finding quality. H1: iterative reduces false-positive rate by >20%.

Experiment : reasoning_depth
Workflow   : daily-security-red-team.lock.yml
Hypothesis : H1: iterative reduces false-positive rate by >20%
Window     : last 30 runs  |  Analysed: 3 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| single_pass      |    1 | 100.0%   |          N/A   | N/A                |   (ref)   |  1/30 (3%)    |
| iterative        |    2 | 100.0%   |          302   | [-85, 690]         |   N/A     |  2/30 (6%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: All recorded runs succeed. Experiment started 2026-05-12. Both variants at 100% success rate so far.

Recommendation: EXTEND
Rationale     : Insufficient data — all variants need at least 5 runs before analysis.

Recommendation: EXTEND — Early positive signal (all runs succeed); need more data.
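
The negative lower bound in the iterative row ([-85, 690]) is what a Student t interval produces with only two runs. The sketch below assumes the report uses a two-sided 95% t interval, which the report does not state. The durations are hypothetical values chosen to have a mean of 302 s, since the per-run data is not in this report.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only small samples are covered, which is all this sketch needs.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}


def t_interval(samples):
    """Return the 95% t interval for the mean, or None when n < 2."""
    n = len(samples)
    if n < 2:
        return None  # no variance estimate from one run, hence "N/A"
    half_width = T_CRIT_95[n - 1] * stdev(samples) / sqrt(n)
    centre = mean(samples)
    return centre - half_width, centre + half_width


durations = [272.0, 332.0]  # hypothetical run durations in seconds
low, high = t_interval(durations)
print(round(low, 1), round(high, 1))  # roughly -79.2 and 683.2
```

With one degree of freedom the critical value is 12.706, so even a 60 s spread between two runs pushes the lower bound below zero. The interval narrows quickly as runs accumulate.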

`output_format` · `deep-report.lock.yml`

Variants: full_briefing vs executive_brief vs annotated_brief · Window: last 30 runs · Analysed: 7 runs with assignments
min_samples: 15 per variant

H0: no change in discussion engagement or token cost. H1: executive_brief reduces token cost by ≥30%.

Experiment : output_format
Workflow   : deep-report.lock.yml
Hypothesis : H1: executive_brief reduces token cost by ≥30%
Window     : last 30 runs  |  Analysed: 7 runs
min_samples: 15 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| full_briefing    |    4 | 100.0%   |          820   | [624, 1015]        |   (ref)   |  4/15 (26%)   |
| executive_brief  |    2 | 100.0%   |          762   | [-45, 1568]        |   N/A     |  2/15 (13%)   |
| annotated_brief  |    1 |   0.0%   |          N/A   | N/A                | 0.0253 *  |  1/15 (6%)    |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: annotated_brief p-value (0.025) is nominal but based on n=1 vs n=4 — do NOT interpret as significant.
      SPARSE_CELL_RISK: no cell has reached min_samples=15. Recommendation remains EXTEND.

Recommendation: EXTEND
Rationale     : Insufficient data; every variant (including annotated_brief at n=1) needs at least 5 runs before analysis.

Recommendation: EXTEND — 3-variant experiment; full_briefing most sampled at 26% of min_samples.
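
To illustrate why the nominal p-value for annotated_brief should not be trusted, the sketch below runs Fisher's exact test on the same counts (full_briefing 4/4 successes, annotated_brief 0/1). Fisher's test is swapped in here as a small-sample alternative; it is not necessarily the test the report uses, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are variants; columns are (successes, failures).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes landing in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# full_briefing: 4 successes, 0 failures; annotated_brief: 0 successes, 1 failure.
p = fisher_exact_two_sided(4, 0, 0, 1)
print(p)
```

Under these assumptions the exact p-value is 0.2, far from the 0.05 threshold. That agrees with the note above that a single failing run cannot support a significance claim.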

`prompt_style` · `daily-astrostylelite-markdown-spellcheck.lock.yml`

Variants: detailed vs concise · Window: last 30 runs · Analysed: 10 runs with assignments
min_samples: 30 per variant

Concise prompt reduces token consumption ≥20% without degrading fix precision.

Experiment : prompt_style
Workflow   : daily-astrostylelite-markdown-spellcheck.lock.yml
Hypothesis : Concise prompt reduces token consumption ≥20% without degrading fix precision
Window     : last 30 runs  |  Analysed: 10 runs
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| detailed         |    7 |   85.7%  |          395   | [286, 503]         |   (ref)   |  7/30 (23%)   |
| concise          |    3 |  100.0%  |          459   | [303, 615]         |  0.4902   |  3/30 (10%)   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001

Note: Most sampled experiment so far (10 runs, split 7/3). concise shows 100% success vs 85.7% for detailed (p=0.49, not significant).

Recommendation: EXTEND
Rationale     : Insufficient data — gathering more data (max 7/30 runs reached so far).

Recommendation: EXTEND — Most data collected of all experiments (7/30 for detailed). concise shows slightly higher success rate (100% vs 85.7%) but not significant yet.
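
The report does not name the test behind its p-values. The reported 0.4902 is consistent with an uncorrected Pearson chi-square test on success counts, so the sketch below assumes that test. The function name is hypothetical; the counts come from the table (detailed 6 of 7, concise 3 of 3).

```python
from math import erfc, sqrt


def chi2_p_value(s1: int, n1: int, s2: int, n2: int):
    """Pearson chi-square p-value (1 df, no continuity correction).

    s1/n1 and s2/n2 are successes/runs for the reference and the variant.
    Returns None when an expected count is zero, for example when every
    run in both variants succeeds or every run fails.
    """
    observed = [[s1, n1 - s1], [s2, n2 - s2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [s1 + s2, total - (s1 + s2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                return None
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


p = chi2_p_value(6, 7, 3, 3)
print(round(p, 4))
```

With expected counts as low as 0.3 in one cell, the chi-square approximation is unreliable at this sample size. This is another reason to keep collecting runs before drawing conclusions.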

📊 Summary

Experiment	Workflow	Control	Best variant	p-value	Recommendation
prompt_style	daily-news	detailed	concise	N/A	EXTEND
reasoning_depth	daily-fact	single_pass	multi_candidate	N/A	EXTEND
output_format	daily-issues-report	collapsible	inline	N/A	EXTEND
prompt_style	issue-arborist	concise	detailed	N/A	EXTEND
reasoning_depth	daily-security-red-team	single_pass	iterative	N/A	EXTEND
output_format	deep-report	full_briefing	executive_brief	N/A	EXTEND
prompt_style	daily-astrostylelite-markdown-spellcheck	detailed	concise	0.49	EXTEND

Analysis window: last 30 runs per workflow · Significance threshold: p < 0.05 (two-tailed)
Run: §25909006011

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

proxy.golang.org

To allow this domain, add it to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by 🧪 daily-experiment-report · ● 44.3M

expires on May 18, 2026, 9:06 AM UTC
