feat(text-metrics): split text_score pair#647
Conversation
Adds text_score and oneig_text_score metrics together with shared OCR text utilities and benchmark wiring for Long Text Bench and OneIG Text Rendering. Made-with: Cursor
2627d78 to
c51653e
Compare
3cdc2bb to
946b068
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 946b068. Configure here.
| """ | ||
| out = text or "" | ||
| for keyword in _OCR_HALLUCINATION_KEYWORDS: | ||
| out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "") |
There was a problem hiding this comment.
Hallucination keyword replacement order creates dead code
Low Severity
In clean_oneig_ocr_hallucinations, the chained .replace() calls have the wrong order. The first call out.replace(keyword, "") removes all occurrences of keyword, so the subsequent .replace(f"\n{keyword}", "") and .replace(f"{keyword}\n", "") can never find a match — they are dead code. The intent was to cleanly remove adjacent newlines together with the keyword, but the current order leaves orphan newlines behind instead. The practical impact is mitigated by downstream whitespace normalization in preprocess_string_oneig, but the logic is inverted from its clear intent.
Reviewed by Cursor Bugbot for commit 946b068. Configure here.
begumcig
left a comment
There was a problem hiding this comment.
I really like this implementation David! Perhaps we do not need one of the metrics in this PR, and I asked some questions to understand the preprocessing a little better, but almost ready to merge!
| Preprocessed string (ground truth or OCR). | ||
| """ | ||
| raw = s or "" | ||
| cleaned = re.sub( |
There was a problem hiding this comment.
can't we use normalize_text_simple here?
| ) | ||
| if contains_chinese(cleaned): | ||
| pattern = re.compile( | ||
| r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]", |
There was a problem hiding this comment.
are we doing this to remove the spaces if we see chinese characters?
| ) | ||
| if contains_chinese(cleaned): | ||
| pattern = re.compile( | ||
| r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]", |
There was a problem hiding this comment.
is there any reason we used sub in the first iteration and compile in the second iteration?
| """ | ||
| out = text or "" | ||
| for keyword in _OCR_HALLUCINATION_KEYWORDS: | ||
| out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "") |
| """ | ||
| inputs = metric_data_processor(x, gt, outputs, self.call_type) | ||
| images = _process_images(inputs[0]) | ||
| auxiliaries = inputs[1] if len(inputs) > 1 and isinstance(inputs[1], (list, tuple)) else [{}] * len(images) |
There was a problem hiding this comment.
for the call type assigned, i think the len(inputs) check would always be longer than 1!
| from collections import Counter | ||
| from typing import Literal | ||
|
|
||
| _OCR_HALLUCINATION_KEYWORDS = ("addCriterion", "No text recognized.", "No text recognized") |
There was a problem hiding this comment.
I understand this hallucination keywords are coming from the metric itself, but what's the reason for addCriterion?
|
|
||
|
|
||
| @MetricRegistry.register("ocr_levenshtein") | ||
| @MetricRegistry.register("text_score") |
There was a problem hiding this comment.
a bit of a stupid question but why do we need this metric? It implements the edit score, which already exists in OneIG-text score?
| def _compute_result_value(self) -> float: | ||
| """Return the scalar reported as ``MetricResult.result``.""" | ||
|
|
||
| def update(self, x: list[Any] | torch.Tensor, gt: list[str], outputs: torch.Tensor) -> None: |
There was a problem hiding this comment.
I really like that call to VLM for text metrics now live in a shared update function!
| "", | ||
| raw, | ||
| ) | ||
| if contains_chinese(cleaned): |
There was a problem hiding this comment.
so sorry to drop a 52nd comment on this tiny function but i noticed that the codes for seaching chinese characters are different in contains_chinese and this function. contains_chinese has \u4e00-\u9fff and this function has \u4e00-\u9fa5. Idk about chinese characters so maybe this is how it should be done
| "handle complex multi-clause descriptions and maintain coherence across long instructions." | ||
| ), | ||
| metrics=[], # Paper uses text_score/TIT-Score; not in Pruna | ||
| metrics=["text_score"], |
There was a problem hiding this comment.
Ah I see now where we are using this metric, I am not so sure if the Long Text Bench is using an edit distance based metric though, if I am not wrong 🥹 perhaps we should remove this?
There was a problem hiding this comment.
https://github.com/X-Omni-Team/X-Omni/blob/main/textbench/summary_scores.py They are using word accuracy metric rather than a character accuracy one (like the edit distance)


Summary
Splits the
text_scorepair into a dedicated stacked PR, adds OCR-based text metrics and shared utilities, and wires Long Text Bench + OneIG Text Rendering.Stack Position
feat/vlm-pr-3b-oneig-alignment)feat/vlm-pr-3d-oneig-reasoning)feat/vlm-pr-5-e2e-tests)feat/metrics-vlm-support)Files
src/pruna/evaluation/metrics/metric_text_score.pysrc/pruna/evaluation/metrics/metric_text_score_utils.pysrc/pruna/evaluation/benchmarks.pyTest Plan
Review Focus
Review Flow (Order)
Review the stack in this exact order:
This PR in the flow (5/10)