Support embeddings with different sizes and improve evaluation script by andreasvc · Pull Request #14 · SapienzaNLP/xcore

andreasvc · 2026-04-22T07:20:59Z

- The attention() class had a hard-coded input dimension of 2048, which works for DeBERTa but not for other models such as mmbert; the input dimension is now specified with a parameter.
- Set the encoder model to training mode after loading, because from_pretrained returns the model in evaluation mode: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForPreTraining.from_pretrained
- mmbert stores token embeddings in "self.encoder.tok_embeddings" instead of "self.encoder.word_embeddings"; both are now supported.
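The embedding lookup in the last point can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `find_token_embeddings` and the stand-in objects are hypothetical, and only the two attribute names come from the description above.

```python
from types import SimpleNamespace

# Attribute names taken from the PR description; the lookup order is an assumption.
EMBEDDING_ATTRIBUTES = ("word_embeddings", "tok_embeddings")


def find_token_embeddings(module):
    """Return the first token-embedding attribute found on `module`.

    `module` stands for whatever object holds the embeddings (hypothetical here).
    """
    for name in EMBEDDING_ATTRIBUTES:
        layer = getattr(module, name, None)
        if layer is not None:
            return layer
    raise AttributeError(
        f"none of {EMBEDDING_ATTRIBUTES} found on {type(module).__name__}"
    )


# Stand-ins for two encoders that use different attribute names.
deberta_like = SimpleNamespace(word_embeddings="deberta-embeddings")
mmbert_like = SimpleNamespace(tok_embeddings="mmbert-embeddings")

deberta_layer = find_token_embeddings(deberta_like)
mmbert_layer = find_token_embeddings(mmbert_like)
```

Raising an error when neither attribute exists makes an unsupported encoder fail at load time rather than later during training.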

- Suppress warnings about multiprocessing for data loaders; loading the data is not a bottleneck, so the warnings are unnecessary.
- Depending on the GPU, you may get warnings about setting matmul_precision. Added code to set matmul precision to medium, which seems like a good trade-off (but your mileage may vary on different hardware).
- Disable evaluation on the test set during training. Apparently, the code has bugs and is not expected to work. Therefore it is now disabled, to avoid giving the impression that there is an actual issue with training the model. Evaluation is supposed to be performed with the evaluate.py script.
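One way to silence a specific warning without hiding others is a message filter from the standard `warnings` module. The sketch below is an assumption about how this could be done, not the code from this PR: the function name is hypothetical, and the pattern assumes the warning text contains "does not have many workers".

```python
import warnings

# Assumed fragment of the data loader warning text; adjust to the actual message.
WORKER_WARNING_PATTERN = r".*does not have many workers.*"


def suppress_dataloader_worker_warnings() -> None:
    """Ignore only warnings whose message matches the worker-count pattern."""
    warnings.filterwarnings("ignore", message=WORKER_WARNING_PATTERN)


# Demonstration: the matching warning is dropped, an unrelated one is kept.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    suppress_dataloader_worker_warnings()
    warnings.warn("The 'train_dataloader' does not have many workers.")
    warnings.warn("an unrelated warning")

remaining = [str(w.message) for w in caught]
```

Filtering by message rather than by category keeps other warnings from the same library visible.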

- Store the .conll output of the model, which is useful if you want to run other evaluation scripts on the output.
- Write output to a separate directory and use filenames of the form '{subset}_{modality}', e.g. 'test_output' and 'test_gold', to clearly indicate the type of file. The output is written to a directory based on the dataset: 'experiments/xcore/myexperiment/wandb/run-2026{...}/files/{dataset}'. A model can therefore be evaluated on multiple datasets.
- Pretty-print evaluation results.
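The naming scheme can be illustrated with `pathlib`. This is a sketch under stated assumptions, not the code from this PR: the function name `evaluation_paths`, the example directory `run/files`, and the dataset name `mydataset` are all hypothetical.

```python
from pathlib import Path

# The two modalities named in the PR description.
MODALITIES = ("output", "gold")


def evaluation_paths(files_dir, dataset, subset):
    """Map each modality to '<files_dir>/<dataset>/<subset>_<modality>'."""
    out_dir = Path(files_dir) / dataset
    return {modality: out_dir / f"{subset}_{modality}" for modality in MODALITIES}


# Hypothetical run directory and dataset name, for illustration only.
paths = evaluation_paths("run/files", "mydataset", "test")
```

Because the dataset name is part of the directory, evaluating the same model on a second dataset writes to a sibling directory instead of overwriting earlier results.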

andreasvc added 4 commits April 20, 2026 14:35

Disable fast tokenizer to avoid warnings

e1aa3d3

andreasvc wants to merge 4 commits into SapienzaNLP:master from andreasvc:master
