
feat(text-metrics): split text_score pair #647

Open
davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-3b-oneig-alignment from feat/vlm-pr-3c-text-score-pair

Conversation

Member

@davidberenstein1957 davidberenstein1957 commented Apr 28, 2026

Summary

Splits the text_score pair into a dedicated stacked PR, adds OCR-based text metrics and shared utilities, and wires Long Text Bench + OneIG Text Rendering.

Stack Position

Files

  • src/pruna/evaluation/metrics/metric_text_score.py
  • src/pruna/evaluation/metrics/metric_text_score_utils.py
  • src/pruna/evaluation/benchmarks.py

Test Plan

uv run pytest tests/evaluation/test_text_metrics.py -k text_score

Review Focus

  • OCR scoring behavior
  • Long Text Bench and OneIG Text Rendering wiring

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (5/10)

Adds text_score and oneig_text_score metrics together with shared OCR text utilities and benchmark wiring for Long Text Bench and OneIG Text Rendering.

Made-with: Cursor
davidberenstein1957 force-pushed the feat/vlm-pr-3b-oneig-alignment branch from 2627d78 to c51653e on May 8, 2026 09:01
davidberenstein1957 force-pushed the feat/vlm-pr-3c-text-score-pair branch from 3cdc2bb to 946b068 on May 8, 2026 09:01
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 946b068.

    """
    out = text or ""
    for keyword in _OCR_HALLUCINATION_KEYWORDS:
        out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "")

Hallucination keyword replacement order creates dead code

Low Severity

In clean_oneig_ocr_hallucinations, the chained .replace() calls have the wrong order. The first call out.replace(keyword, "") removes all occurrences of keyword, so the subsequent .replace(f"\n{keyword}", "") and .replace(f"{keyword}\n", "") can never find a match — they are dead code. The intent was to cleanly remove adjacent newlines together with the keyword, but the current order leaves orphan newlines behind instead. The practical impact is mitigated by downstream whitespace normalization in preprocess_string_oneig, but the logic is inverted from its clear intent.
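
A minimal sketch of the reordering described above. Only the keyword tuple and the replace chain appear in the diff; the function signature and the return statement are assumptions made for illustration. Stripping the newline-adjacent forms first lets them match before the bare keyword is removed:

```python
_OCR_HALLUCINATION_KEYWORDS = ("addCriterion", "No text recognized.", "No text recognized")


def clean_oneig_ocr_hallucinations(text):
    """Remove known OCR hallucination keywords (signature assumed, not from the PR)."""
    out = text or ""
    for keyword in _OCR_HALLUCINATION_KEYWORDS:
        # Most specific patterns first, so the keyword takes one adjacent newline with it.
        out = (
            out.replace(f"\n{keyword}", "")
            .replace(f"{keyword}\n", "")
            .replace(keyword, "")
        )
    return out
```

With this order, a keyword on its own trailing line is removed together with the newline that precedes it, whereas the order in the diff would leave that newline behind.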


Member

+1

@begumcig begumcig self-requested a review May 15, 2026 15:12
Member

@begumcig begumcig left a comment

I really like this implementation David! Perhaps we do not need one of the metrics in this PR, and I asked some questions to understand the preprocessing a little better, but almost ready to merge!

Preprocessed string (ground truth or OCR).
"""
raw = s or ""
cleaned = re.sub(
Member

Can't we use normalize_text_simple here?

)
if contains_chinese(cleaned):
pattern = re.compile(
r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]",
Member

Are we doing this to remove the spaces when we see Chinese characters?

)
if contains_chinese(cleaned):
pattern = re.compile(
r"[\u4e00-\u9fa5a-zA-Z0-9àâäéèêëîïôöùûüçÀÂÄÉÈÊËÎÏÔÖÙÛÜÇ]",
Member

Is there any reason we used sub in the first iteration and compile in the second?
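
For reference, a small sketch showing that the two spellings behave the same in Python's re module: the module-level re.sub compiles the pattern internally, while re.compile returns a reusable pattern object with its own sub method. The sample pattern and text are illustrative and are not taken from the PR.

```python
import re

text = "Hello, 世界!"

# Module-level helper: the pattern string is compiled internally.
via_sub = re.sub(r"[^\w\s]", "", text)

# Explicitly compiled pattern object, reusable across calls.
punctuation = re.compile(r"[^\w\s]")
via_compile = punctuation.sub("", text)
```

Both calls strip the same punctuation, so the choice between them is mostly about readability and reuse.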

    """
    out = text or ""
    for keyword in _OCR_HALLUCINATION_KEYWORDS:
        out = out.replace(keyword, "").replace(f"\n{keyword}", "").replace(f"{keyword}\n", "")
Member

+1

        """
        inputs = metric_data_processor(x, gt, outputs, self.call_type)
        images = _process_images(inputs[0])
        auxiliaries = inputs[1] if len(inputs) > 1 and isinstance(inputs[1], (list, tuple)) else [{}] * len(images)
Member

For the assigned call type, I think len(inputs) will always be greater than 1, so the length check looks redundant!

from collections import Counter
from typing import Literal

_OCR_HALLUCINATION_KEYWORDS = ("addCriterion", "No text recognized.", "No text recognized")
Member

I understand these hallucination keywords come from the metric itself, but what's the reason for addCriterion?



@MetricRegistry.register("ocr_levenshtein")
@MetricRegistry.register("text_score")
Member

A bit of a stupid question, but why do we need this metric? It implements the edit score, which already exists in the OneIG text score?

def _compute_result_value(self) -> float:
"""Return the scalar reported as ``MetricResult.result``."""

def update(self, x: list[Any] | torch.Tensor, gt: list[str], outputs: torch.Tensor) -> None:
Member

I really like that the VLM calls for text metrics now live in a shared update function!

"",
raw,
)
if contains_chinese(cleaned):
Member

So sorry to drop a 52nd comment on this tiny function, but I noticed that the code point ranges used to search for Chinese characters differ between contains_chinese and this function: contains_chinese has \u4e00-\u9fff and this function has \u4e00-\u9fa5. I don't know much about Chinese characters, so maybe this is how it should be done.
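
A quick check of how the two ranges differ, as a sketch. The CJK Unified Ideographs block spans U+4E00 to U+9FFF, and \u9fa5 is the end of the originally assigned set, so code points from U+9FA6 upward match only the wider range. The pattern names below are hypothetical and are not the PR's.

```python
import re

# Hypothetical names for the two ranges quoted in the comment above.
NARROW_CJK = re.compile(r"[\u4e00-\u9fa5]")
WIDE_CJK = re.compile(r"[\u4e00-\u9fff]")

common_char = "\u4e2d"   # inside both ranges
late_char = "\u9fa6"     # just past \u9fa5, still inside the CJK block

matches_both = bool(NARROW_CJK.search(common_char)) and bool(WIDE_CJK.search(common_char))
only_wide = bool(WIDE_CJK.search(late_char)) and not NARROW_CJK.search(late_char)
```

If both helpers are meant to agree on what counts as Chinese text, sharing one range constant would avoid this kind of drift.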

    "handle complex multi-clause descriptions and maintain coherence across long instructions."
),
-   metrics=[],  # Paper uses text_score/TIT-Score; not in Pruna
+   metrics=["text_score"],
Member

Ah, I see now where we are using this metric. I am not so sure that Long Text Bench uses an edit-distance-based metric though, if I am not wrong 🥹 Perhaps we should remove this?

Member

https://github.com/X-Omni-Team/X-Omni/blob/main/textbench/summary_scores.py They are using a word accuracy metric rather than a character accuracy one (like the edit distance).
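
To illustrate the distinction, here is a rough sketch contrasting a character-level edit distance with a word-level accuracy. Both functions are simplified stand-ins written for this comparison: they are not the X-Omni scoring script and not the text_score implementation in this PR, and the whitespace tokenization is an assumption.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_accuracy(gt: str, ocr: str) -> float:
    """Fraction of ground-truth words recovered exactly (multiset match)."""
    gt_words = Counter(gt.split())
    ocr_words = Counter(ocr.split())
    total = sum(gt_words.values())
    if total == 0:
        return 0.0
    matched = sum((gt_words & ocr_words).values())
    return matched / total


distance = levenshtein("hello world", "hallo world")
accuracy = word_accuracy("hello world", "hallo world")
```

A single wrong character costs one edit out of eleven characters at the character level, but it invalidates half of the words at the word level, so the two metrics can rank the same outputs quite differently.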


2 participants