- Primary:
- Add limitations:
- Add limitation on the possibility of sampling grid errors canceling out
- Mention hardware discretization in extrapolations, re: https://openathena.slack.com/archives/C0884476QSC/p1773257160211219?thread_ts=1773062864.434009&cid=C0884476QSC
- This also applies to the simulations, which run in continuous parameter space
- Mention C=6ND assumption as a limitation
- Discuss Olmo hybrid assumption of constrained scaling exponents for architecture comparisons
- Bootstrap for within-budget resampling on Approach 2 only
- Add https://arxiv.org/abs/2603.03276 as reference on multimodal asymmetry
- Add Gemini Pretraining notes on MoE data scaling asymmetry as additional need for non-symmetric methods beyond multimodal
- Make a reference implementation
- Mention FLOP factor correction and WLS weighting for Approach 2 as possible improvements
- Or at least mention how heavily Approach 2 relies on the C=6ND assumption
- Secondary:
- Add WLS analysis
- Cite https://arxiv.org/pdf/2406.19146 when discussing WLS adjustments based on noise at different budgets
- See 2.3 Data analysis
- Add exp6 validation for proof to appendix
- Consider https://arxiv.org/abs/2603.06603 as another citation for methods that "extend individual terms in isolation (e.g. token scaling terms alone)"
- Cite Gemstones: A Model Suite for Multi-Faceted Scaling Laws on how C=6ND breaks down w/ model shape
- Cite Scaling Laws for Native Multimodal Models on PlantCAD issue for empirical C ~ D^b method (see C. Scaling Laws)
- Mention the demo prompt examples for making your own simulator; examples:
- Add note advising against using logloss given bias in simulations and ml-scalefit reproduction
- Copy intercept-error proof into paper appendix
- Add citations from "Configuration-to-Performance Scaling Law with Neural Ansatz" on other adaptations of functional forms for Chinchilla scaling laws
- Review figures.py for ways to use existing code utilities and then regen (or push back into experiments code)
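
As a possible starting point for the reference implementation and WLS items above, here is a minimal sketch of an Approach 2 style fit. It assumes the C=6ND approximation, a parabola in log N per compute budget, and a log-log line for the optimum versus budget. The function names (`tokens_for_budget`, `isoflop_minimum`, `fit_exponent`) and the choice to apply the weights at the exponent-fitting stage are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def tokens_for_budget(C, N):
    """Token count D implied by the C = 6ND approximation (an assumption)."""
    return C / (6.0 * N)


def isoflop_minimum(N, loss):
    """Fit loss = a*x^2 + b*x + c with x = log N and return N at the vertex."""
    a, b, _ = np.polyfit(np.log(N), loss, 2)
    return np.exp(-b / (2.0 * a))


def fit_exponent(C, N_opt, sigma=None):
    """Fit log N_opt = slope * log C + intercept.

    If sigma (hypothetical per-budget noise estimates) is given, the fit is
    weighted least squares; np.polyfit expects weights of 1/sigma, which it
    applies to the unsquared residuals.
    """
    w = None if sigma is None else 1.0 / np.asarray(sigma, dtype=float)
    slope, intercept = np.polyfit(np.log(C), np.log(N_opt), 1, w=w)
    return slope, intercept
```

In use, `isoflop_minimum` would be called once per compute budget on the (N, loss) pairs trained at that budget, and the resulting optima passed to `fit_exponent`. Passing `sigma=None` gives the ordinary unweighted fit, which makes it easy to compare the two.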