feat(text-metrics): split text_score pair by davidberenstein1957 · Pull Request #647 · PrunaAI/pruna

davidberenstein1957 · 2026-04-28T13:04:01Z

Summary

Splits the text_score pair into a dedicated stacked PR, adds OCR-based text metrics and shared utilities, and wires Long Text Bench + OneIG Text Rendering.

Stack Position

Base: PR feat(text-metrics): split oneig_alignment #646 (feat/vlm-pr-3b-oneig-alignment)
Next: PR feat(text-metrics): split oneig_reasoning #648 (feat/vlm-pr-3d-oneig-reasoning)
Final integration: PR feat(e2e-tests): stacked e2e after split metrics #641 (feat/vlm-pr-5-e2e-tests)
Canonical umbrella reference: PR feat(evaluation): add VLMMetrics #545 (feat/metrics-vlm-support)

Files

src/pruna/evaluation/metrics/metric_text_score.py
src/pruna/evaluation/metrics/metric_text_score_utils.py
src/pruna/evaluation/benchmarks.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k text_score

Review Focus

OCR scoring behavior
Long Text Bench and OneIG Text Rendering wiring

Review Flow (Order)

Review the stack in this exact order:

1. feat(vendor): add LLM2Vec embedding model #637 (vendor)
2. feat(infrastructure): add VLM base classes and utilities #638 (infrastructure)
3. feat(text-metrics): split qa_accuracy #645 (qa_accuracy)
4. feat(text-metrics): split oneig_alignment #646 (oneig_alignment)
5. feat(text-metrics): split text_score pair #647 (text_score pair)
6. feat(text-metrics): split oneig_reasoning #648 (oneig_reasoning)
7. feat(vision-metrics): split vqa #649 (vqa)
8. feat(vision-metrics): split vie_score #650 (vie_score)
9. feat(vision-metrics): split img_edit_score #651 (img_edit_score)
10. feat(e2e-tests): stacked e2e after split metrics #641 (e2e tests)

This PR in the flow (5/10)

Review after PR feat(text-metrics): split oneig_alignment #646.
Next PR to review: feat(text-metrics): split oneig_reasoning #648.
Confirm this PR's tests and scope before continuing.

Adds text_score and oneig_text_score metrics together with shared OCR text utilities and benchmark wiring for Long Text Bench and OneIG Text Rendering. Made-with: Cursor

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 946b068.

cursor · 2026-05-08T09:08:23Z

+    """
+    out = text or ""
+    for keyword in _OCR_HALLUCINATION_KEYWORDS:
+        out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "")


Hallucination keyword replacement order creates dead code

Low Severity

In clean_oneig_ocr_hallucinations, the chained .replace() calls are in the wrong order. The first call, out.replace(keyword, ""), removes every occurrence of keyword, so the subsequent .replace(f"\n{keyword}", "") and .replace(f"{keyword}\n", "") find nothing to match on ordinary input and are effectively dead code. The intent was presumably to remove adjacent newlines together with the keyword, but the current order leaves orphan newlines behind instead. The practical impact is mitigated by downstream whitespace normalization in preprocess_string_oneig, but the logic is inverted relative to that intent.
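A minimal sketch of the reordering this comment points at, assuming the function body is just the loop shown in the excerpt. The names clean_current and clean_reordered are hypothetical and only serve to compare the two orderings; this is not the code in the PR.

```python
_OCR_HALLUCINATION_KEYWORDS = ("addCriterion", "No text recognized.", "No text recognized")


def clean_current(text):
    """Ordering from the excerpt: the bare keyword is removed first."""
    out = text or ""
    for keyword in _OCR_HALLUCINATION_KEYWORDS:
        out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "")
    return out


def clean_reordered(text):
    """Hypothetical fix: newline-adjacent forms first, bare keyword last."""
    out = text or ""
    for keyword in _OCR_HALLUCINATION_KEYWORDS:
        out = out.replace(f"\n{keyword}", "").replace(f"{keyword}\n", "").replace(keyword, "")
    return out


sample = "HELLO\naddCriterion\nWORLD"
print(repr(clean_current(sample)))    # keeps both newlines around the removed keyword
print(repr(clean_reordered(sample)))  # removes one newline together with the keyword
```

With the reordered chain, a keyword that sits on its own line is removed along with one adjacent newline instead of leaving an empty line behind.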


begumcig

I really like this implementation David! Perhaps we do not need one of the metrics in this PR, and I asked some questions to understand the preprocessing a little better, but almost ready to merge!

begumcig · 2026-05-15T15:12:25Z

+        Preprocessed string (ground truth or OCR).
+    """
+    raw = s or ""
+    cleaned = re.sub(


can't we use normalize_text_simple here?

begumcig · 2026-05-15T15:27:04Z

+    )
+    if contains_chinese(cleaned):
+        pattern = re.compile(
+            r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]",


Are we doing this to remove the spaces when we see Chinese characters?
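The excerpt does not show what happens after the pattern is compiled, so the following is only a sketch of one plausible reading: the first pass keeps allowed characters plus whitespace, and the Chinese branch then keeps only the allowed characters, which drops every space. The names preprocess_sketch and _ALLOWED, the findall-and-join step, and the non-Chinese whitespace collapse are all assumptions, and contains_chinese here is a stand-in rather than the PR's helper.

```python
import re

# Character set copied from the excerpt; whitespace is deliberately not part of it.
_ALLOWED = "\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ"


def contains_chinese(s):
    """Stand-in for the PR's helper (assumed to test the \\u4e00-\\u9fff range)."""
    return re.search(r"[\u4e00-\u9fff]", s) is not None


def preprocess_sketch(s):
    raw = s or ""
    # First pass: drop everything except allowed characters and whitespace.
    cleaned = re.sub(rf"[^{_ALLOWED}\s]", "", raw)
    if contains_chinese(cleaned):
        # Assumed second pass: keep only allowed characters, so spaces disappear.
        pattern = re.compile(rf"[{_ALLOWED}]")
        return "".join(pattern.findall(cleaned))
    # Assumed non-Chinese path: collapse runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()


print(preprocess_sketch("你好 世界!"))       # Chinese branch, space removed
print(preprocess_sketch("Hello,   World!"))  # non-Chinese branch, spaces collapsed
```

Under that reading the answer to the question would be yes: the second class omits `\s`, so whitespace is stripped only when Chinese characters are present.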

begumcig · 2026-05-15T15:27:59Z

+    )
+    if contains_chinese(cleaned):
+        pattern = re.compile(
+            r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]",


Is there any reason we used re.sub for the first pass and re.compile for the second?
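For context on this question, a small sketch showing that the module-level function and a compiled pattern give the same result. The pattern and sample text are made up for illustration and are not taken from the PR.

```python
import re

PATTERN = r"[^a-zA-Z0-9\s]"  # illustrative pattern, not the one in the PR
text = "Hello, World!"

# Module-level call: the pattern string is compiled (and cached) internally.
via_module = re.sub(PATTERN, "", text)

# Explicit compile: same pattern, reusable object.
compiled = re.compile(PATTERN)
via_compiled = compiled.sub("", text)

print(via_module == via_compiled)
```

Since the re module caches compiled patterns, the choice is mostly stylistic here; using one form in both passes would simply read more consistently.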

begumcig · 2026-05-15T15:37:30Z

+    """
+    out = text or ""
+    for keyword in _OCR_HALLUCINATION_KEYWORDS:
+        out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "")


begumcig · 2026-05-15T15:41:27Z

+        """
+        inputs = metric_data_processor(x, gt, outputs, self.call_type)
+        images = _process_images(inputs[0])
+        auxiliaries = inputs[1] if len(inputs) > 1 and isinstance(inputs[1], (list, tuple)) else [{}] * len(images)


For the call type assigned, I think len(inputs) would always be greater than 1, so this check looks redundant!
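A sketch of how the fallback could look if that observation holds. It assumes inputs always has at least two entries for this call type, which the thread does not confirm; resolve_auxiliaries is a hypothetical helper, not part of Pruna.

```python
from typing import Any


def resolve_auxiliaries(inputs: list[Any], n_images: int) -> list[Any]:
    """Hypothetical helper: pick per-image auxiliaries or fall back to empty dicts.

    Assumes len(inputs) >= 2, so only the type check remains.
    """
    aux = inputs[1]
    if isinstance(aux, (list, tuple)):
        return list(aux)
    # One fresh dict per image, rather than the same dict repeated.
    return [{} for _ in range(n_images)]


print(resolve_auxiliaries([["img"], [{"text": "SALE"}]], 1))
print(resolve_auxiliaries([["img1", "img2"], None], 2))
```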

begumcig · 2026-05-15T16:38:59Z

+from collections import Counter
+from typing import Literal
+
+_OCR_HALLUCINATION_KEYWORDS = ("addCriterion", "No text recognized.", "No text recognized")


I understand these hallucination keywords come from the metric itself, but what's the reason for addCriterion?

begumcig · 2026-05-15T17:00:28Z

+
+
+@MetricRegistry.register("ocr_levenshtein")
+@MetricRegistry.register("text_score")


A bit of a stupid question, but why do we need this metric? It implements the edit score, which already exists in the OneIG text score.

begumcig · 2026-05-15T17:01:18Z

+    def _compute_result_value(self) -> float:
+        """Return the scalar reported as ``MetricResult.result``."""
+
+    def update(self, x: list[Any] | torch.Tensor, gt: list[str], outputs: torch.Tensor) -> None:


I really like that the VLM call for text metrics now lives in a shared update function!

begumcig · 2026-05-15T17:05:03Z

+        "",
+        raw,
+    )
+    if contains_chinese(cleaned):


So sorry to drop a 52nd comment on this tiny function, but I noticed that the code-point ranges for matching Chinese characters differ between contains_chinese and this function: contains_chinese uses \u4e00-\u9fff and this function uses \u4e00-\u9fa5. I don't know much about Chinese characters, so maybe this is how it should be done.
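To make the difference concrete: \u4e00-\u9fff spans the whole CJK Unified Ideographs block, while \u9fa5 is an earlier end point inside that block, so code points U+9FA6 through U+9FFF match the first class but not the second. The sketch below uses stand-in patterns (WIDE and NARROW are hypothetical names) rather than the PR's functions.

```python
import re

WIDE = re.compile(r"[\u4e00-\u9fff]")    # range reported for contains_chinese
NARROW = re.compile(r"[\u4e00-\u9fa5]")  # range used in the preprocessing regex

common = "\u4e2d"  # U+4E2D, inside both ranges
late = "\u9fd5"    # U+9FD5, inside the wide range only

for label, ch in (("common", common), ("late", late)):
    print(label, bool(WIDE.search(ch)), bool(NARROW.search(ch)))
```

In practice this means a string containing only such late code points would be flagged as Chinese but then have those characters stripped by the narrower class, so aligning the two ranges seems the safer choice unless the mismatch is intentional.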

begumcig · 2026-05-15T17:08:18Z

            "handle complex multi-clause descriptions and maintain coherence across long instructions."
        ),
-        metrics=[],  # Paper uses text_score/TIT-Score; not in Pruna
+        metrics=["text_score"],


Ah, I see now where we are using this metric. I am not so sure Long Text Bench uses an edit-distance-based metric though, if I am not wrong 🥹 Perhaps we should remove this?

https://github.com/X-Omni-Team/X-Omni/blob/main/textbench/summary_scores.py They use a word-accuracy metric rather than a character-accuracy one (like edit distance).
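To illustrate why the distinction matters, here is a rough sketch comparing a character-level edit-distance similarity with a simple word-level accuracy. Both functions are illustrative assumptions: word_accuracy is not the X-Omni implementation from the linked script, and char_similarity is not the text_score code in this PR.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def char_similarity(gt: str, ocr: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    return 1.0 - levenshtein(gt, ocr) / max(len(gt), len(ocr), 1)


def word_accuracy(gt: str, ocr: str) -> float:
    """Share of ground-truth words that also appear in the OCR output (multiset match)."""
    gt_words = Counter(gt.lower().split())
    ocr_words = Counter(ocr.lower().split())
    total = sum(gt_words.values())
    if total == 0:
        return 1.0
    return sum((gt_words & ocr_words).values()) / total


gt, ocr = "hello world", "hallo world"
print(round(char_similarity(gt, ocr), 3))  # one wrong character out of eleven
print(word_accuracy(gt, ocr))              # one of two words is wrong
```

A single wrong character barely moves the character-level score but costs a whole word in the word-level one, so the two metrics can rank the same outputs quite differently.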

This was referenced Apr 28, 2026

feat(text-metrics): add text-based VLM judge metrics #639

Closed

feat(vision-metrics): add vision-based VLM judge metrics #640

Closed

feat(text-metrics): split text_score pair into dedicated branch

946b068


davidberenstein1957 force-pushed the feat/vlm-pr-3b-oneig-alignment branch from 2627d78 to c51653e Compare May 8, 2026 09:01

davidberenstein1957 force-pushed the feat/vlm-pr-3c-text-score-pair branch from 3cdc2bb to 946b068 Compare May 8, 2026 09:01

cursor Bot reviewed May 8, 2026


begumcig self-requested a review May 15, 2026 15:12

begumcig requested changes May 15, 2026




