This file tracks releases based on `version=` changes in `setup.py`.
- Root cause fix for decoder FFN weight norm runaway: in pre-norm Transformers, LayerNorm on the FFN input provides no constraint on the output. As FFN weights grow, outputs grow proportionally, inflating gradients across the residual stream in a positive feedback loop.
- `decoder.layers.0.ff.linear1.weight` grew from 29.8 (init) to 88.0 (3.0× init) in 14 epochs with an accelerating growth rate, driving clipping saturation from 0.1% to 30%.
- `GLUFeedForward` (`transformers.py`): new `use_output_norm: bool = False` parameter. When enabled, applies `nn.RMSNorm(d_model)` after `linear2`, decoupling output magnitude from weight growth. Verified: with 3× weight scaling, output norm stays at 1.0× (vs 10.7× without). A minimal sketch follows this list.
- Threaded through all constructor chains: `ffn_output_norm` propagated through `ImprovedTransformerEncoderBlock`, `ImprovedTransformerDecoderBlock`, `ImprovedTransformerDecoder`, and the compatibility wrappers `TransformerEncoderBlock` and `TransformerDecoder`.
- Model constructor (`model.py`): `KokoroModel.__init__` accepts `ffn_output_norm: bool = False` and passes it to all encoder blocks and the decoder.
- Config field (`config.py`): `ffn_output_norm: bool = True`. Disabled by default in the model code for checkpoint compat; set `True` for fresh training runs.
- Trainer wiring (`trainer.py`): passes `config.ffn_output_norm` to `KokoroModel`.
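A minimal sketch of the output-norm idea, assuming a GELU-gated GLU block and a PyTorch version that provides `nn.RMSNorm`; names such as `ff_dim` and the exact gating order are illustrative rather than the project's actual code:

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Sketch only: GLU FFN with an optional RMSNorm on the output."""

    def __init__(self, d_model: int, ff_dim: int, dropout: float = 0.1,
                 use_output_norm: bool = False):
        super().__init__()
        # linear1 produces both the gate and the value halves of the GLU.
        self.linear1 = nn.Linear(d_model, ff_dim * 2)
        self.linear2 = nn.Linear(ff_dim, d_model)
        self.dropout = nn.Dropout(dropout)
        # RMSNorm after linear2 decouples output magnitude from weight norms.
        self.output_norm = nn.RMSNorm(d_model) if use_output_norm else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.linear1(x).chunk(2, dim=-1)
        x = self.linear2(self.dropout(torch.nn.functional.gelu(gate) * value))
        if self.output_norm is not None:
            x = self.output_norm(x)
        return x
```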
- Phone normalization: `_MFA_PHONE_MAP` (26 entries) maps MFA's `russian_mfa` phoneme inventory to the text-processor IPA inventory (dental diacritics, palatal symbols, velarized laterals, etc.). `_normalize_mfa_phone()` applies NFC normalization and combining-character stripping; see the sketch after this list.
- DP sequence alignment (`align_durations()`): Needleman-Wunsch dynamic-programming aligner with transitions for 1:1 match, 2:1 iotation merge (`j` + vowel → single token), 1:2 geminate split, prosody token insertion, `<sil>` insertion, and 1:N `spn` (spoken noise) expansion. Achieves 22,199/22,200 aligned files (1 missing TextGrid).
- SPN recovery: a single `spn` MFA interval can match K consecutive text-processor phones via an `S{k}` transition, recovering all 11,391 spn-affected utterances that were previously rejected.
- `get_aligned_durations()`: new public API used by `dataset.py`, replacing the old `get_phoneme_durations()` path. Handles all alignment edge cases natively.
- Dataset wiring (`dataset.py`): `__getitem__` now calls `self.mfa.get_aligned_durations()` instead of `get_phoneme_durations()` with `strip_outer_silences`.
- `FEATURE_CACHE_VERSION` bumped 6 → 7 (`dataset.py`): invalidates stale cache entries from pre-DP-aligner phoneme sequences.
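A small sketch of the normalization step, with a hypothetical two-entry map standing in for the real 26-entry `_MFA_PHONE_MAP`; the lookup order and the example mappings are assumptions:

```python
import unicodedata

# Hypothetical excerpt; the real _MFA_PHONE_MAP has 26 entries and maps the
# russian_mfa inventory onto the text-processor IPA inventory.
_MFA_PHONE_MAP = {
    "ɫ": "l",  # velarized lateral folded into the plain lateral
    "ʐ": "ʒ",  # retroflex fricative mapped to the postalveolar symbol
}

def _normalize_mfa_phone(phone: str) -> str:
    """Sketch: NFC-compose, strip combining diacritics, then map the result
    through the inventory table (the signature is assumed)."""
    phone = unicodedata.normalize("NFC", phone)
    phone = "".join(ch for ch in phone if not unicodedata.combining(ch))
    return _MFA_PHONE_MAP.get(phone, phone)
```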
- `max_grad_norm` 1.0 → 1.5: at 38% clipping saturation, the global clip was distorting gradient direction on over a third of steps. Per-parameter pre-clips provide the real spike protection.
- `ffn_spike_clip_norm` 5.0 → 3.0: decoder FFN is the dominant gradient contributor (24 params, max RMS 0.50 vs ≤ 0.12 for all other groups). The tighter pre-clip targets the largest source before the global clip.
- `dec_ffn_max_weight_norm: float = 95.0` (was 0.0 / disabled): post-step L2 norm clamp on all 12 decoder FFN weight matrices (6 layers × linear1 + linear2) plus the encoder FFN matrices. Arrests runaway weight growth when weight decay alone is insufficient at the 0.3× LR multiplier. A clamp sketch follows this list.
- `decoder_ffn_weight_decay: float = 0.35`: dedicated weight decay for the decoder FFN param group, wired separately from the shared `ffn_weight_decay` (0.1). Applied via `getattr(config, 'decoder_ffn_weight_decay', ffn_weight_decay)` in the optimizer setup.
- `stop_token_pos_weight` 25 → 17: reduces stop-loss dominance in early training. The stop/mel loss ratio self-corrected from 0.609 to 0.275 by Ep12.
- `stop_head_lr_multiplier` 0.2 → 0.1: further throttles stop head update magnitude at peak LR.
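A sketch of a post-step weight-norm clamp of the kind `dec_ffn_max_weight_norm` describes; the name filter and the run-right-after-`optimizer.step()` placement are assumptions:

```python
import torch

@torch.no_grad()
def clamp_weight_norms(named_params, name_filter, max_norm: float) -> None:
    """Rescale any matching weight matrix whose L2 norm exceeds max_norm."""
    if max_norm <= 0:  # 0.0 means the clamp is disabled
        return
    for name, param in named_params:
        if not name_filter(name):
            continue
        norm = param.norm(p=2)
        if norm > max_norm:
            param.mul_(max_norm / norm)

# Illustrative usage after each optimizer step:
# clamp_weight_norms(model.named_parameters(),
#                    lambda n: "decoder" in n and ".ff.linear" in n,
#                    max_norm=95.0)
```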
- Pitch/energy loss weight fallback (`losses.py`): `getattr` fallbacks for `pitch_loss_weight` and `energy_loss_weight` corrected from `0.1` to `1.0`. The wrong fallback silently scaled pitch/energy gradients to one-tenth of the intended value whenever the config fields were absent.
- SpecAugment time_mask_max fallback (`trainer.py`): default changed from `30` to `20` in both the `getattr` fallback and the method signature, matching the intended config value.
- OneCycleLR step overflow guard (`trainer.py`): when `warmup_steps >= total_steps`, warmup is clamped to `total_steps - 1` to prevent a crash in the `OneCycleLR` constructor.
- `encoder_ffn_spike_clip_norm` fallback (`trainer.py`): corrected from `10.0` to `8.0` to match the config default.
- Energy percentile normalization guard (`variance_predictor.py`): when the sequence length is `T < 3` frames, uses `min`/`max` instead of `quantile(0.05/0.95)` to avoid a `RuntimeError` from insufficient elements. See the sketch after this list.
- MFA duration mismatch log level (`dataset.py`): raised from `DEBUG` to `WARNING` so alignment mismatches are visible in normal logging.
- Precompute features cache version check (`precompute_features.py`): fixed to check `_cache_version` before skipping cached files. Previously skipped stale v6 files without recomputing.
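A sketch of the short-sequence guard, assuming a 1-D tensor of per-frame energies; the exact condition that raised the original `RuntimeError` is not reproduced here, only the fallback shape:

```python
import torch

def robust_energy_range(energy: torch.Tensor):
    """Return (low, high) bounds: 5th/95th percentiles normally, min/max
    when there are fewer than 3 frames."""
    if energy.numel() < 3:
        return energy.min(), energy.max()
    return torch.quantile(energy, 0.05), torch.quantile(energy, 0.95)
```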
- On fresh (non-resume) training starts, stale `events.out.tfevents.*` files are removed from the log directory. Prevents broken TensorBoard plots from overlapping runs.
- FFN output norm migration: `*.ff.output_norm.weight` keys are recognized as expected missing keys during checkpoint loading, enabling a `strict=False` load from pre-`ffn_output_norm` checkpoints without error.
- Generalized missing-key warning: the migration message now covers both the variance_adaptor and ffn_output_norm key families.
| Field | Default | Description |
|---|---|---|
| `ffn_output_norm` | `True` | RMSNorm on FFN output to decouple output magnitude from weight growth |
| `decoder_ffn_weight_decay` | `0.35` | Dedicated weight decay for decoder FFN param group |
| `enable_adaptive_memory` | `True` | MPS adaptive memory management |
| `dec_ffn_max_weight_norm` | `95.0` | Post-step max weight norm for FFN matrices |
- `tests/unit/test_sil_aligned_training_path.py` — fixed the `_mfa()` helper to return an `(alignment, [])` tuple matching the updated `parse_textgrid` signature (it was returning a bare list, causing `ValueError: not enough values to unpack`). Updated `test_dataset_getitem_passes_strip_outer_silences` to assert `get_aligned_durations` (DP aligner) instead of the removed `strip_outer_silences=True` kwarg. 16 test failures resolved.
- `hidden_dim` 768 → 512: cuts total parameters from ~35M to ~16M. With 22K training utterances (~1,600 params/sample at 35M), the model was deep in the overfitting regime. At 16M (~730 params/sample) the capacity/data ratio is far more tractable.
- `encoder_ff_dim`/`decoder_ff_dim` 2048 → 1536: GLU FFN expansion rebalanced from 5.3× to 4×, reducing FFN dominance in the total parameter count.
- `decoder_dropout: float = 0.25` (new config field): dedicated dropout rate for decoder attention/FFN residual connections, separate from `encoder_dropout` (0.15). The decoder is more prone to overfitting due to teacher forcing and benefits from stronger regularization.
- `decoder_input_dropout: float = 0.15` (new config field): dropout on the projected mel input before it enters the decoder (was hardcoded at 0.1).
- Model wiring (`model.py`): `KokoroModel.__init__` accepts `decoder_dropout` (default `None` → falls back to `encoder_dropout` for backward compatibility; see the sketch after this list). Stored as `self._decoder_dropout` and passed to `TransformerDecoder`.
- Trainer wiring (`trainer.py`): passes both `decoder_dropout` and `decoder_input_dropout` from config to `KokoroModel`.
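A small sketch of the backward-compatible fallback, assuming roughly this constructor shape; everything beyond the two dropout arguments is illustrative:

```python
from typing import Optional

import torch.nn as nn

class KokoroModel(nn.Module):
    def __init__(self, encoder_dropout: float = 0.15,
                 decoder_dropout: Optional[float] = None, **kwargs):
        super().__init__()
        # None means "no dedicated decoder rate configured": reuse the encoder
        # rate so older configs and checkpoints behave exactly as before.
        self._decoder_dropout = (decoder_dropout
                                 if decoder_dropout is not None
                                 else encoder_dropout)
```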
- Reduced masking intensity: `spec_augment_time_mask_max` 30 → 10, `spec_augment_num_time_masks` 2 → 1, `spec_augment_freq_mask_max` 10 → 5. Worst-case temporal masking drops from ~43% to ~7%, preventing the catastrophic autoregressive pathway disruption observed in the previous run (Ep13 shock of +0.053 that never recovered).
- Per-sample masking (`trainer.py`): `_apply_spec_augment` now generates independent masks for each sample in the batch instead of applying one mask batch-wide. Improves gradient diversity across the batch. A per-sample sketch follows this list.
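A sketch of per-sample masking, assuming a `(B, T, n_mels)` mel batch; the loop-based form is for clarity and mirrors the config fields above rather than the trainer's actual implementation:

```python
import torch

def apply_spec_augment_per_sample(mel: torch.Tensor,
                                  time_mask_max: int = 10,
                                  freq_mask_max: int = 5) -> torch.Tensor:
    """Draw an independent time mask and frequency mask for every sample."""
    mel = mel.clone()
    batch, n_frames, n_mels = mel.shape
    for b in range(batch):
        # One time mask per sample (spec_augment_num_time_masks = 1).
        t_len = int(torch.randint(0, time_mask_max + 1, (1,)))
        t0 = int(torch.randint(0, max(1, n_frames - t_len), (1,)))
        mel[b, t0:t0 + t_len, :] = 0.0
        # One frequency mask per sample.
        f_len = int(torch.randint(0, freq_mask_max + 1, (1,)))
        f0 = int(torch.randint(0, max(1, n_mels - f_len), (1,)))
        mel[b, :, f0:f0 + f_len] = 0.0
    return mel
```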
- Audio-level speed perturbation (`dataset.py`): randomly resamples training audio by a factor in [0.9, 1.1] before feature extraction, effectively multiplying dataset diversity by ~1.5× without additional data. Applied per-sample in `__getitem__` after audio normalization but before mel computation via `torchaudio.functional.resample`. A sketch follows this list.
- MFA duration rescaling: phoneme durations are scaled by `1/factor` (speed up → shorter durations) with `clamp(min=1)`. The existing frame-sum correction handles any rounding residual.
- Cache bypass: augmented samples skip both cache load and save (stochastic perturbation is incompatible with deterministic caching); unperturbed samples use the cache normally.
- Training-only: `is_training` flag added to `RuslanDataset.__init__`; the trainer passes `is_training=True` for train and `is_training=False` for validation. Speed perturbation is never applied to validation data.
- New config fields: `use_speed_perturbation: bool = True`, `speed_perturb_range: float = 0.1`, `speed_perturb_prob: float = 0.5`.
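A sketch of the perturbation plus duration rescaling, assuming 1-D audio and integer per-phoneme frame durations; this illustrates the mechanism rather than the dataset's exact code:

```python
import torch
import torchaudio

def speed_perturb(audio: torch.Tensor, sample_rate: int,
                  durations: torch.Tensor,
                  perturb_range: float = 0.1, prob: float = 0.5):
    """Randomly speed audio up or down by up to ±perturb_range and rescale
    the per-phoneme durations to match."""
    if torch.rand(1).item() >= prob:
        return audio, durations
    factor = 1.0 + (torch.rand(1).item() * 2 - 1) * perturb_range  # [0.9, 1.1]
    # Resampling to sample_rate / factor and then treating the result as
    # sample_rate plays the audio `factor` times faster.
    audio = torchaudio.functional.resample(
        audio, orig_freq=sample_rate,
        new_freq=int(round(sample_rate / factor)))
    # Faster playback -> proportionally shorter phoneme durations.
    durations = torch.clamp((durations.float() / factor).round().long(), min=1)
    return audio, durations
```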
- `num_epochs` 60 → 100: extends OneCycleLR to ~32,700 steps, adding ~67% more cosine-decay refinement time. Combined with the smaller model, estimated val_mel at Ep100: ~0.70 (central), range 0.65–0.74.
- `early_stopping_patience` 10 → 15: prevents premature stopping during the SpecAugment adaptation window.
- Per-head RMSNorm on Q and K projections (`transformers.py`): `MultiHeadAttentionImproved` now accepts a `qk_norm: bool = False` parameter. When enabled, `nn.RMSNorm(d_k)` is applied independently to Q and K after the linear projection but before RoPE. This decouples attention logit magnitudes from the projection weight norms, breaking the self-reinforcing growth loop (larger `w_o` → larger outputs → larger gradients → larger `w_o`) that caused unbounded decoder attention weight growth across training runs. A minimal sketch follows this list.
- Threaded through all constructor chains: `qk_norm` is propagated through `ImprovedTransformerEncoderBlock`, `ImprovedTransformerDecoderBlock` (both self-attn and cross-attn), `ImprovedTransformerDecoder`, and the compatibility wrappers `TransformerEncoderBlock` and `TransformerDecoder`.
- Model constructor (`model.py`): `KokoroModel.__init__` accepts `qk_norm: bool = False` and passes it to all encoder blocks and the decoder.
- Config field (`config.py`): `TrainingConfig.qk_norm: bool = False` added. Set to `True` to enable QK-norm for new training runs. Incompatible with prior checkpoints (new `q_norm`/`k_norm` parameters in the state dict).
- Trainer wiring (`trainer.py`): `_setup_model()` passes `qk_norm` from config to `KokoroModel`.
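A minimal sketch of per-head QK-RMSNorm, without RoPE or masking; the class and attribute names are illustrative, not `MultiHeadAttentionImproved` itself:

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, qk_norm: bool = False):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q, self.w_k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.w_v, self.w_o = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        # Normalizing Q and K per head bounds the attention logits no matter
        # how large the projection weights grow.
        self.q_norm = nn.RMSNorm(self.d_k) if qk_norm else nn.Identity()
        self.k_norm = nn.RMSNorm(self.d_k) if qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_norm(self.w_q(x).view(b, t, self.n_heads, self.d_k)).transpose(1, 2)
        k = self.k_norm(self.w_k(x).view(b, t, self.n_heads, self.d_k)).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        # (RoPE would be applied to q and k here, after the norm.)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```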
- Perform post-step norm clamping on all decoder layers
- Further tuning
- Separate LR for the stop head
- Added training analysis script
- Spread heavy batches more evenly across the epoch to prevent clustering
- Added post-step max weight-norm clamp for `decoder.layers.0.ff.linear1.weight`
- Misc configuration changes to stabilize convergence
- Improved TensorBoard log purge upon resuming from a checkpoint
- Misc configuration changes to stabilize convergence
- Switched to Xavier init instead of Kaiming
- Implemented RoPE (relative displacement between positions) for MPS
- Misc bug-fixes and configuration adjustments
- Encoder/decoder parameter group split (`trainer.py`): the optimizer now uses two separate AdamW parameter groups. Encoder parameters (`text_embedding`, `stress_embedding`, positional encodings, `transformer_encoder_layers`) are trained at `encoder_lr_multiplier × base_lr` (default 3×); decoder and variance adaptor parameters use `base_lr`. Per-group `max_lr` values are set accordingly in the OneCycleLR schedule and the manual warmup ramp. Previously both groups shared the same LR, starving the encoder of gradient signal. A sketch follows this list.
- Adaptive clip base and explosion floor raised (`trainer.py`): the base gradient clip norm increased from `0.5` to `1.0`; the hard floor applied after an explosion event was raised from `0.05` to `0.3`. The `soft_mel_length` threshold was also raised from `900` to `1400` frames to stop penalising normally-sized Russian sequences.
- `encoder_ffn_spike_clip_norm` loosened (`config.py`): default raised from `10.0` to `100.0`. The old value was zeroing microscopically small but valid encoder FFN gradients at every step.
- `duration_loss_weight` raised (`config.py`): default raised from `0.1` to `0.35`, giving the encoder a stronger alignment signal through the duration predictor path.
- `decoder_input_dropout` reduced (`model.py`): the default in `KokoroModel.__init__` changed from `0.3` to `0.1`. The high value was over-regularising teacher-forced decoding during early training.
- `max_lr_multiplier` raised (`config.py`): default raised from `2.0` to `5.0`, widening the OneCycleLR peak to give the model more room to descend in the first epochs.
- Checkpoint resume multi-group support (`checkpoint_manager.py`): `resume_from_checkpoint` now reconstructs the OneCycleLR with per-group `max_lr` values (`[max_lr * encoder_lr_mult, max_lr]`) so LR schedules are correctly restored after resume.
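A sketch of the two-group setup with per-group `max_lr`; the encoder-name prefixes and the omission of weight-decay handling are simplifying assumptions:

```python
import torch

def build_optimizer_and_scheduler(model, base_lr, encoder_lr_multiplier,
                                  max_lr_multiplier, total_steps):
    enc_prefixes = ("text_embedding", "stress_embedding",
                    "positional_encoding", "transformer_encoder_layers")
    enc = [p for n, p in model.named_parameters() if n.startswith(enc_prefixes)]
    dec = [p for n, p in model.named_parameters() if not n.startswith(enc_prefixes)]
    optimizer = torch.optim.AdamW([
        {"params": enc, "lr": base_lr * encoder_lr_multiplier},
        {"params": dec, "lr": base_lr},
    ])
    # Per-group max_lr keeps the encoder/decoder LR ratio through the cycle.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=[base_lr * encoder_lr_multiplier * max_lr_multiplier,
                base_lr * max_lr_multiplier],
        total_steps=total_steps, pct_start=0.3)
    return optimizer, scheduler
```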
- `spec_augment_start_epoch` raised from 5 to 18 (`config.py`, `trainer.py`): empirical observation showed val_loss regressing from 1.87 → 2.04 across epochs 5–6 when SpecAugment activated at epoch 5 while OneCycleLR was still ramping (peak at ~epoch 15 with `pct_start=0.3`, 50 epochs). Starting augmentation 3 epochs after the LR peak eliminates the ramp-phase double-destabilisation. The fallback default in the trainer is updated to match.
- `GLUFeedForward` extracted as a shared module: the GLU feed-forward block (linear1 → split gate/value → activation(gate) × value → linear2) is now a standalone `nn.Module` used by both encoder and decoder layers. Previously each block inlined duplicate `linear1`/`linear2`/`dropout_ff`/`_init_weights` logic.
- `_build_activation` factory: activation selection for FFN blocks moved to a single `_build_activation(name)` factory function instead of being duplicated across the encoder and decoder block constructors. Raises `ValueError` for unrecognised names. A small sketch follows this list.
- `use_prenorm` removed: `ImprovedTransformerEncoderBlock` and the decoder equivalents no longer accept a `use_prenorm` parameter; pre-norm is now always applied (matching the 0.0.27 change that made pre-norm GELU the fixed architecture). The parameter was dead code.
- `dropout_ff` deduplicated: internal feed-forward dropout is now owned by `GLUFeedForward`; the redundant `self.dropout_ff` attribute on the block classes is removed.
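A sketch of the factory pattern; the supported activation names are assumptions (GELU is the project's default elsewhere in these notes):

```python
import torch.nn as nn

def _build_activation(name: str) -> nn.Module:
    """Build the FFN activation by name, raising on anything unrecognised."""
    activations = {"gelu": nn.GELU, "relu": nn.ReLU, "silu": nn.SiLU}
    try:
        return activations[name.lower()]()
    except KeyError:
        raise ValueError(f"Unrecognised FFN activation: {name!r}") from None
```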
| Field | Default | Description |
|---|---|---|
| `encoder_lr_multiplier` | `3.0` | LR multiplier for the encoder parameter group relative to `learning_rate` |
| `spec_augment_start_epoch` | `18` | Epoch at which SpecAugment is first applied |
| `max_lr_multiplier` | `5.0` | OneCycleLR peak multiplier (was 2.0) |
- `tests/unit/test_optimizer_param_groups.py` — new, 19 tests: verifies the two-group structure, encoder LR multiplier, decoder base LR, fallback to a single group when no encoder params are found, and per-group `max_lr` in the scheduler.
- `tests/unit/test_spec_augment.py` — added `TestSpecAugmentEpochGate` class (15 tests): gate suppresses augmentation before `spec_augment_start_epoch`, gate opens exactly at the threshold epoch, `use_spec_augment=False` always suppresses regardless of epoch.
- `tests/unit/test_config_pitch_extraction_defaults.py` — updated `test_training_config_convergence_fix_defaults` to assert the new defaults (`spec_augment_start_epoch=18`, `max_lr_multiplier=5.0`, `duration_loss_weight=0.35`, `encoder_lr_multiplier=3.0`, `encoder_ffn_spike_clip_norm=100.0`).
- `tests/unit/test_trainer_adaptive_stabilization.py` — updated two clip-norm threshold assertions to match the new values (0.5 → 1.0, 0.05 → 0.3).
- `tests/unit/test_transformers.py` — expanded coverage for `GLUFeedForward` and `_build_activation`, removed tests for the deleted `use_prenorm` parameter.
- Moved encoder/decoder to pre-norm GELU
- Fixed bugs in forward training, duration predictor
- Only advance scheduler/EMA on successful optimizer steps
- Handle oversized sequences rather than skipping them
- Refactored trainer
- Decoder input dropout now honors the constructor config (`model.py`): `_prepare_training_decoder_inputs` previously applied `torch.nn.functional.dropout(..., p=0.3, ...)` with a hardcoded probability, ignoring `KokoroModel(decoder_input_dropout=...)`. This made architectural tuning ineffective and could over-regularize teacher-forced decoding. Fixed by using `self.decoder_input_dropout` at the dropout call site.
- `tests/unit/test_decoder_helpers.py` — added `test_prepare_training_decoder_inputs_uses_configured_decoder_input_dropout`, which monkeypatches `torch.nn.functional.dropout` and asserts that `_prepare_training_decoder_inputs` passes the configured `decoder_input_dropout` value (and `training=True`).
- Increased default model capacity for new training runs: `hidden_dim` 512 → 768, `encoder_ff_dim` 2048 → 3072, `decoder_ff_dim` 2048 → 3072.
- Updated both training defaults (`training/config.py`) and model construction defaults (`model_loader.py`) so train/infer paths stay aligned.
- Trainer model instantiation now forwards architecture fields (`n_encoder_layers`, `n_heads`, `encoder_ff_dim`, `encoder_dropout`, `n_decoder_layers`, `decoder_ff_dim`, `max_decoder_seq_len`) from `TrainingConfig` instead of relying on constructor defaults.
- OneCycle warmup handoff continuity (`trainer.py`): when manual warmup is enabled, OneCycleLR now uses `div_factor=max_lr_multiplier` so the LR starts exactly at `learning_rate` after warmup, avoiding a warmup→scheduler LR jump. See the sketch after this list.
- Resume-time scheduler consistency (`trainer.py`): OneCycle reconstruction now uses the same effective `div_factor` that was used at creation.
- Stale metadata FF-dim mismatch guard (`inference.py`): if the checkpoint metadata `encoder_ff_dim`/`decoder_ff_dim` disagree with the actual `linear1.weight` shapes, inference auto-corrects to the weight-derived values and logs a warning, preventing shape-load failures from stale config metadata.
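A sketch of why `div_factor=max_lr_multiplier` removes the handoff jump: OneCycleLR starts at `max_lr / div_factor`, so the initial scheduler LR lands exactly on `learning_rate`, which is where the manual warmup ramp ends. The values below are illustrative:

```python
import torch

learning_rate = 1e-4
max_lr_multiplier = 2.0
max_lr = learning_rate * max_lr_multiplier

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    total_steps=1000,
    pct_start=0.3,
    div_factor=max_lr_multiplier,  # initial LR = max_lr / div_factor = learning_rate
)
assert abs(scheduler.get_last_lr()[0] - learning_rate) < 1e-9
```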
- `tests/unit/test_onecycle_warmup_continuity.py` — validates smooth LR continuity across linear warmup into the OneCycle phase.
- Autoregressive inference extracted (`model/generator.py`): introduced `KokoroGenerator` to encapsulate the generation loop, stop criteria, cache precompute/updates, and inference-time housekeeping.
- Encode/expand path unified (`model.py`): added an `_encode_and_expand` helper used by both training and inference to reduce divergence between code paths.
- Training decoder input prep unified (`model.py`): added a `_prepare_training_decoder_inputs` helper for shift/pad, projection, dropout, positional encoding, and mask handling.
- Duration adaptor interface unified (`model.py`): both variance-enabled and fallback duration paths now flow through a common adaptor interface (`VarianceAdaptorWrapper`/`SimpleDurationAdaptor`) to keep behavior consistent.
- Legacy variance-adaptor checkpoint compatibility (`trainer.py`): EMA checkpoint loading now handles older key layouts by performing a partial non-strict load and mapping compatible parameters into the current structure.
- Stop-token loss rebalanced (`training/config.py`): default `stop_token_loss_weight` reduced from `1.0` to `0.5`.
- `tests/unit/test_encode_and_expand.py`
- `tests/unit/test_generator.py`
- `tests/unit/test_model_refactors.py`
- `tests/unit/test_model_log_memory_refactor.py`
- `tests/unit/test_spec_augment.py`
- `tests/unit/test_decoder_helpers.py` (expanded for decoder helper coverage)
- OneCycle tuning (`training/config.py`): reduced aggressiveness (`max_lr_multiplier` 3.0 → 2.0, `pct_start` 0.4 → 0.3) for more stable early training.
- Localized gradient clipping expanded (`trainer.py`): added dedicated FFN clipping controls (`ffn_spike_clip_norm`, `encoder_ffn_spike_clip_norm`) and extended the pre-clip logic to target known spike-prone `linear1`/`linear2` weights.
- TensorBoard resume cleanup (`trainer.py`): the writer is reopened with `purge_step` on resume to hide stale events beyond the resume step. See the sketch after this list.
- Scheduler resume reconstruction (`trainer.py`): OneCycleLR is rebuilt from the current config after checkpoint resume to avoid stale schedule boundaries from old optimizer metadata.
- Histogram logging (`trainer.py`): added epoch-level parameter histogram logging for better training diagnostics.
- Checkpoint metrics (`trainer.py`): `val_loss` is now included in the checkpoint payload.
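A small sketch of the resume-time writer reopen; the helper name and call site are assumptions, but `purge_step` is a standard `SummaryWriter` argument that discards previously written events at or beyond that step:

```python
from torch.utils.tensorboard import SummaryWriter

def reopen_writer(log_dir: str, resume_step: int) -> SummaryWriter:
    # Events with step >= resume_step from the interrupted run are purged,
    # so TensorBoard plots do not show a stale tail after resuming.
    return SummaryWriter(log_dir=log_dir, purge_step=resume_step)
```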
- Decoder pre-net dropout introduced (`model.py`): the training path applies dropout to the projected decoder inputs (p=0.3) to reduce teacher-forcing over-reliance.
- Inference collapse guard (`model.py`): added an energy-based early-stop fallback when recent generated frames indicate a prolonged near-silence collapse.
- `tests/unit/test_trainer_adaptive_stabilization.py`
- `tests/unit/test_trainer_checkpoint_step_counters.py`
- `tests/unit/test_trainer_loss_stability.py`
`KokoroModel` now accepts an optional `stress_indices` tensor that is embedded via a dedicated `nn.Embedding` table and added to the encoder input in parallel with the phoneme embedding. This gives the model an explicit signal for which syllable carries lexical stress in each word.
- `model.py`: Added `stress_embedding: nn.Embedding(vocab_size, d_model)` to `KokoroModel.__init__`. `encode_text` sums `phoneme_embed + stress_embed` before the positional encoding layer. `stress_indices` defaults to `None` (zero-vector contribution), so the change is fully backward compatible with checkpoints that pre-date it. A sketch follows this list.
- `dataset.py`: `RuslanDataset.__getitem__` now calls `audio_utils.get_stress_indices_with_sil` to produce a per-phoneme stress index tensor and stores it in the feature cache under the key `stress_indices`. `collate_fn` pads and stacks the new field.
- `trainer.py`: all six `model(...)` call sites in `train_epoch`, `_run_single_batch`, and `validate_epoch` forward `stress_indices=stress_indices` from the batch.
- `inference.py`: `text_to_speech` constructs `stress_indices` from `RussianPhonemeProcessor.process_text` and passes it to the model.
- `model_loader.py`: checkpoint metadata is extended to record stress embedding presence; missing keys are patched in at load time for smooth migration from old checkpoints.
- `audio_utils.py`: new `get_stress_indices_with_sil` helper builds a per-phoneme integer tensor from `StressInfo`, inserting `0` for silence tokens.
- `FEATURE_CACHE_VERSION` bumped to 6 to invalidate cache entries that do not contain `stress_indices`.
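A sketch of the additive stress embedding; the sizes and the `None` handling follow the notes above, while the surrounding class is illustrative:

```python
from typing import Optional

import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.stress_embedding = nn.Embedding(vocab_size, d_model)

    def encode_text(self, phoneme_ids: torch.Tensor,
                    stress_indices: Optional[torch.Tensor] = None) -> torch.Tensor:
        x = self.text_embedding(phoneme_ids)
        if stress_indices is not None:
            # Parallel additive signal marking which syllable carries stress.
            x = x + self.stress_embedding(stress_indices)
        # When stress_indices is None the contribution is simply absent,
        # so pre-existing checkpoints and call sites keep working.
        return x  # positional encoding would be applied after this
```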
- Iotated vowel `j`-prefix dropped in `apply_vowel_reduction` (`russian_phoneme_processor.py`): unstressed iotated vowels (`ja`, `je`, `jo`) were reduced to bare `ɐ`/`ɪ`/`ə`, silently discarding the `j`. For example, the initial `я` in unstressed `язы́к` produced `ɐ` instead of `jɐ`. Fixed by tracking `is_iotated` before stripping the base, then prepending `j` to the reduced form when a reduction actually occurred. Non-reducible iotated vowels (e.g. `ju`) are left untouched. New reduced-iotated phonemes `jɐ`, `jɪ`, `jə` added to `_multi_char_phonemes`, `_build_vocab`, and the `from_dict` forward-compatibility patch.
- `logging.basicConfig` removed from module scope (`russian_phoneme_processor.py`): the module was unconditionally installing a `StreamHandler` on the root logger at import time, hijacking log configuration in any host application. Removed; the module-level `logger = logging.getLogger(__name__)` is retained.
- `@lru_cache` memory leak on instance methods (`russian_phoneme_processor.py`): Python's `functools.lru_cache` applied as a decorator to instance methods keeps a strong reference to `self` in every cache key, preventing garbage collection for the lifetime of the process. Replaced with per-instance caches created in `__init__` (`self.normalize_text = lru_cache(1000)(self._normalize_text_impl)`), so the cache is released when the instance is collected. A sketch of the pattern follows this list.
- Combining marks stripped too late in `apply_consonant_assimilation` (`russian_phoneme_processor.py`): NFD stress diacritics embedded in a word (e.g. `здра́вствуйте`) were stripped only after all Cyrillic `str.replace` cluster patterns, causing every cluster simplification (`вств`→`ств`, `тся`→`ца`, `стн`→`сн`, `сч`→`щ`, etc.) to silently fail on marked input. The `re.sub(r'[\u0300-\u036f]', '', word)` call is now the first operation after `word.lower()`.
- `_int_to_words` missing billions tier (`russian_phoneme_processor.py`): numbers ≥ 1 000 000 000 fell into the thousands branch, producing nonsensical output (e.g. `1 000 000 000` → "одна тысяча миллионов"). Added a dedicated billions block with the correct Russian declension (миллиард/миллиарда/миллиардов).
- `get_stress_indices_with_sil` crash on `stress_info=None` (`audio_utils.py`): `DummyProcessor` returns 3-tuples with `None` as the stress field during testing. The vowel-count comparison `vowel_count == stress_info.position` raised `AttributeError`. Fixed by defaulting `stress_position = stress_info.position if stress_info is not None else -1`.
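A sketch of the per-instance cache pattern; the surrounding class and the `_normalize_text_impl` body are illustrative:

```python
from functools import lru_cache

class PhonemeProcessorSketch:
    def __init__(self):
        # Wrapping the bound method at __init__ time ties the cache's lifetime
        # to this instance instead of to the class, so the instance (and its
        # cache) can be garbage-collected normally.
        self.normalize_text = lru_cache(maxsize=1000)(self._normalize_text_impl)

    def _normalize_text_impl(self, text: str) -> str:
        return text.lower().strip()

    def clear_cache(self) -> None:
        self.normalize_text.cache_clear()
```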
- `tests/unit/test_phoneme_processor_fixes.py` — 23 tests across four classes:
  - `TestNoRootLoggerHijack` — confirms the root logger has zero handlers after module import.
  - `TestPerInstanceLRUCache` — `weakref` GC check, two-instance cache isolation, `clear_cache` scoping.
  - `TestStressMarkStrippedBeforeAssimilation` — cluster simplifications fire correctly on words with embedded combining marks; end-to-end IPA for `здравствуйте`.
  - `TestIotatedJPrefixPreservedInReduction` — each iotated vowel in each reduction tier, `ju` non-reduction, vocab/tokenizer presence, and end-to-end word tests (`язык`, `яблоко`).
- Energy axis layout fix (`dataset.py`): `torchaudio.MelSpectrogram` returns `(n_mels, T)` after squeeze; `EnergyExtractor.extract_energy_from_mel` expects `(..., n_mels)` on the last axis. Without the transpose, `mean(dim=-1)` averaged over the time axis and produced 80 per-band scalars instead of `T` per-frame energy values. Fixed by passing `mel_spec_linear[:, :num_mel_frames].T`. A small sketch follows this list.
- `mel_spec_linear` clip sync fix (`dataset.py`): `mel_spec` was clipped to `max_seq_length` frames but `mel_spec_linear` was not, leaving the two tensors out of sync. Both are now clipped together at the source.
- `FEATURE_CACHE_VERSION` bumped to 3 (`dataset.py`): all cache entries written before the axis-layout and clip fixes contain corrupted energy values. Bumping the version forces automatic re-computation of any stale cache entry rather than silently serving wrong data.
- `duration_target` expansion bug (`model.py`): `VarianceAdaptor.LengthRegulator` casts durations to `.long()` immediately. The old code passed `torch.log1p(phoneme_durations)` (e.g. `1.79` for a 5-frame phoneme), which truncated to `1` frame per phoneme. Fixed by passing the raw integer frame counts `phoneme_durations.float()`; the log-domain target for the duration MSE loss is computed separately in the trainer.
- Auto-recovery attribute path fix (`trainer.py`): `self.model.pitch_predictor._init_weights()` was a silent no-op because `KokoroModel` has no top-level `pitch_predictor`. Fixed to `self.model.variance_adaptor.pitch_predictor._init_weights()`.
- `forward_inference` discarded pitch/energy embeddings (`model.py`): the inference path called the `variance_adaptor` only to obtain predicted durations, then discarded `adapted_output` and re-ran `_length_regulate` on the bare encoder output, dropping all pitch and energy embeddings. Fixed by using `adapted_encoder_output` directly, matching training behavior.
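A small sketch of the transpose contract, using a mean over the last axis as in the note above; the tensor names and sizes are illustrative:

```python
import torch

n_mels, num_mel_frames = 80, 213
mel_spec_linear = torch.rand(n_mels, 400)           # (n_mels, T), longer than the clip

framewise = mel_spec_linear[:, :num_mel_frames].T   # (T, n_mels) after slice + transpose
energy = framewise.mean(dim=-1)                      # one energy value per frame
assert energy.shape == (num_mel_frames,)

# Pre-fix behaviour for contrast: reducing the un-transposed tensor collapses
# the time axis and yields n_mels per-band scalars instead.
wrong = mel_spec_linear[:, :num_mel_frames].mean(dim=-1)
assert wrong.shape == (n_mels,)
```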
- `tests/unit/test_dataset_energy_axis_layout.py` — 21 tests covering the `(n_mels, T)` → `(T, n_mels)` transpose contract, the pre-clip slice pattern `[:, :num_mel_frames].T`, per-frame energy content correctness, batch layout, and documentation of the pre-fix wrong-shape behaviour as a regression canary.
- `tests/unit/test_model_inference_adaptor_output.py` — fixed `test_variance_adaptor_called_exactly_once_during_inference`: replaced `patch.object(model, 'variance_adaptor', ...)` (rejected by `nn.Module.__setattr__`) with `patch.object(model.variance_adaptor, 'forward', ...)`.
- Added unit tests for `<sil>` token handling and fallback behavior
- Added diagnostic logging for duration predictions vs. targets in the trainer (`_calculate_losses`), gated by `config.verbose`, for local debugging of duration-loss convergence
- Fixed unit test expectations to match the trainer's current log-duration target computation (uses +1.0 in the log target), preventing false failures in CI/local runs
- Improved preprocessing with better silence support
- Lowered the default `MPS_MAX_FRAMES_PER_BATCH`
- Cleanup
- Pre-allocate chunk slices to reduce memory pressure
- Inference improvements
- Minor GPU memory optimizations
- Implemented adaptive bucketed batching
- Vectorized expansion in length regulator and variance adaptor
- Improved pitch extractor
- Implemented length regulator
- Improved checkpointing
- Improved phoneme processor
- Vectorized average pitch by duration
- Fixed pitch and energy normalization bugs
- Improved feature cache (more work needed)
- Added auto EMA decay calculation
- Make DataLoader workers configurable via `TrainingConfig.num_workers` and wire `prefetch_factor`/`persistent_workers` appropriately.
- Auto-tune inference controls per-checkpoint (`stop_threshold`, `min_len_ratio`, `max_len`, `min_len_floor`) from `model_metadata.inference_controls` with safe bounds and explicit-override behavior.
- Add epoch-level feature-cache hit/miss delta summaries and a final cumulative "FEATURE CACHE SUMMARY" at training completion for improved observability.
- Add/adjust unit tests covering metadata strictness, inference auto-tuning, and cache telemetry.
- Save and restore model metadata with checkpoints. BREAKING CHANGE
- Data loader optimizations.
- Variance predictor rework.
- Improved checkpointing, inference, and userland tooling.
- AdamW optimizer enabled with MPS.
- Documentation and unit-test cleanup.
- Increased frame budget.
- Minor transformer improvements.
- MPS memory cleanup.
- Moved to 0.0.x versioning.
- Refactored code.
- Historical transition version recorded in `setup.py` history before moving to the 0.0.x versioning.