The fuzz/combo suite (`tests/training/test_fuzz_suite.py` +
`_fuzz_combo.py`) exists to find production bugs. Every test failure is a
real prod bug unless you can prove otherwise. NEVER paper over a failing
combo with any of the following without first fixing the underlying
production code:
- Canonicalisation in `_fuzz_combo.py` — the `_canonical_*` methods and `canonical_key` rules must collapse only semantically-equivalent combos (e.g. `imbalance="balanced"` is the same regardless of the imbalance-mode flag, because at 50/50 the mode is meaningless). They must NOT collapse a combo "because it crashes". A canon rule whose justification is "fuzz cXXXX hangs / raises / produces empty val" is prohibited — fix the prod code instead.
- Runtime canonicalisations in `test_fuzz_suite.py` — the `*_eff = ... if condition else ...` rewrites at the top of `test_fuzz_train_mlframe_models_suite` (before the suite call). Same rule: legitimate only when the rewrite preserves semantics; never as a guard against a real prod crash.
- `pytest.mark.xfail` / `pytest.skip` — reserved for genuine third-party / OS / unfixable issues (Windows symlinks, optional dep missing, sklearn API limitation). A test that surfaces an mlframe-internal bug must be FIXED, not xfailed-with-TODO.
- Defensive guards in production (`trainer.py` `_apply_pre_pipeline_transforms`, the `wrappers.py` CV-fold loop, etc.) — `if len(...) == 0: skip` patterns are acceptable only when the empty path is a legitimate user scenario. When the empty arises from an upstream bug, the guard is a band-aid that hides the bug forever — fix upstream.
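The distinction between an allowed and a prohibited canon rule can be sketched as follows (the `canonical_key` signature and the combo-dict keys here are illustrative assumptions, not the real `_fuzz_combo.py` API):

```python
def canonical_key(combo: dict) -> tuple:
    """Collapse a fuzz combo to a canonical form.

    Only semantics-preserving rewrites belong here: a rule must argue
    "these two combos exercise identical prod behaviour", never
    "this combo crashes, so pretend it is another one".
    """
    c = dict(combo)

    # OK: at a 50/50 class split the imbalance-mode flag is a no-op,
    # so every mode value maps to the same canonical combo.
    if c.get("imbalance") == "balanced":
        c["imbalance_mode"] = None

    # NOT OK (never write rules shaped like this):
    # if c.get("model") == "CB" and c.get("n_rows", 0) < 100:
    #     c["text_cols"] = 0   # "because CB hangs" -> masks a prod bug

    return tuple(sorted(c.items()))
```

With the rule above, `canonical_key({"imbalance": "balanced", "imbalance_mode": "oversample"})` and the `"undersample"` variant collapse to one key, while imbalanced combos keep their mode distinct.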
- Bad (2026-04-26): `_canonical_text_col_count` zeroed text columns for CB+small-n+heavy-NaN combos because CB's `occurrence_lower_bound=50` produces an empty TF-IDF dictionary on tiny inner-CV folds. Hid a real prod hang. Good (2026-04-27): replaced with `training/helpers.compute_cb_text_processing`, which scales `occurrence_lower_bound` proportionally to the fit-time row count; wired into `trainer._train_model_with_fallback` and the `feature_selection/wrappers.py` RFECV inner fold. Canon retired.
- Bad (still active): `canonical_key:327-335` collapses `inject_degenerate_cols=True` to `False` on CB+multilabel because CB mis-detects `num_const`/`num_null` as cat features. Hides c0062. Owed fix: explicit `cat_features=` arg in the CB wrapper (or a type-cast guard) so CB doesn't auto-detect numeric columns as categorical when `inject_degenerate_cols=True`.
- Bad (still active): `canonical_key:406` forces `remove_constant_columns_cfg=True` whenever degenerate / all-NaN columns are injected. Hides polars-ds RobustScaler crashing on zero-IQR (c0008/c0116). Owed fix: zero-IQR guard in the polars-ds robust scaler wrapper (skip / clip / fall back).
- Bad (still active): four layered "0-row val" tolerances in `trainer.py` (`_apply_pre_pipeline_transforms` ×2, `_setup_eval_set`, `_compute_split_metrics`). All point at the same upstream defect: outlier detection / aging-limit silently collapsing val to 0 rows. Owed fix: `_apply_outlier_detection_global` should raise on empty val (mirroring the train-side `min_keep` guard at core.py:1021).
- Bad (deferred): two `_rule_cb_*` functions defined in `_fuzz_combo.py` lines 833-868 with TODOs but NOT registered in `KNOWN_XFAIL_RULES`. Either the bugs are fixed (delete dead code) or unfixed (silent fail in fuzz). Both were resolved 2026-04-27.
- Read the traceback. Identify the prod-code line.
- Decide: is this a legitimate user-facing bug (yes → fix in prod) or a genuine third-party limitation (yes → xfail with a detailed reason + open issue / link)?
- If you find yourself reaching for `_canonical_*`, ask: "would a real user with these settings hit this same crash?" If yes, you are masking a prod bug — STOP and fix prod instead.
- Fixing prod often retires multiple canon rules / guards / xfails at once (e.g. fixing the splitter empty-val edge retires 4 trainer guards + 1 runtime canon + 1 prod-config validator gap).
Frames in mlframe can be 100+ GB. Never copy them to work around a bug. Copying a prod DataFrame doubles peak RAM, which on a 200 GB+ workload means OOM — the user observed this in 2026-04-22 prod logs.
Avoid:
- `df.copy()` (pandas) or `df.clone()` (polars) inside hot paths
- `df[cols] = df[cols].astype(...)` when `df` is the caller's frame (pandas broadcast-copies the sub-frame)
- Constructing a fresh `pd.DataFrame(df)` / `pl.DataFrame(df)` to "get a new reference"
- Any fit-transform pattern that returns a mutated input
Prefer:
- Work on views (`.iloc`, column selection, slices)
- Mutate-and-restore: `X[col] = new; try: ... finally: del X[col]`
- `with` / context managers that revert the mutation on exit
- Lazy eval via polars `lazy()` + `.collect()` at the leaf call
- Pass `inplace` options where sklearn / the transformer supports them
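The mutate-and-restore and context-manager items can be folded into one small helper. A minimal sketch (`temp_column` is a hypothetical name for illustration, not an mlframe API):

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def temp_column(X: pd.DataFrame, name: str, values):
    """Temporarily add column `name` to the caller's frame, no copy.

    The column is deleted on exit even if the body raises, so the
    caller's X is never left mutated and never duplicated in RAM.
    """
    X[name] = values
    try:
        yield X
    finally:
        del X[name]
```

Usage would look like `with temp_column(X, "targ_tmp", y): compute_mi(X)`; peak memory stays at one frame, and an exception inside the block still restores the caller's columns.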
Fuzz-caught example: `MRMR.fit` needed to temporarily inject a `targ_<id>`
column into X for MI computation. The original code mutated the caller's X
in place, leaked the injected column into downstream sklearn steps, and
tripped `validate_data` on the next transform. The fix in
`feature_selection/filters.py:~2895` must inject and remove the column in a
try/finally (never call `X.copy()`).
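The prescribed shape of that fix can be sketched like this (a simplified stand-in, not the actual filters.py code; `mi_fn` and the column-naming scheme are assumptions):

```python
import numpy as np
import pandas as pd


def fit_mi_scores(X: pd.DataFrame, y: pd.Series, mi_fn) -> pd.Series:
    """Inject the target as a temp column for MI, restore X on exit.

    The finally block guarantees the caller's frame is clean even if
    mi_fn raises, so nothing leaks into downstream sklearn steps.
    Never X.copy(): on a 100+ GB frame that doubles peak RAM.
    """
    targ_col = f"targ_{id(y)}"        # unlikely to collide with a feature name
    X[targ_col] = np.asarray(y)       # in-place inject, no frame copy
    try:
        return mi_fn(X, targ_col)     # MI computed against the temp column
    finally:
        del X[targ_col]               # caller's X restored, success or raise
```

The try/finally is the whole point: the original bug was exactly the missing cleanup path, not the injection itself.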
If you find a bug that genuinely needs a copy, escalate — the user would rather ship a design change than accept an unconditional copy on a hot DataFrame path.
(Nothing tracked here currently. Polars support for MRMR — both the
selector core and feature engineering — landed 2026-04-22. See tests:
tests/training/test_mrmr_polars_fe.py,
tests/training/test_bizvalue_feature_selection.py::test_mrmr_drops_uninformative_features_on_polars_input,
regression sensors in tests/training/test_fuzz_regression_sensors.py.)