Fix _handle_sentence merging adjacent BIO entities #570

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/bio-parser-adjacent-entities
@Chessing234

Summary

_handle_sentence in data_util.py parses BIO-tagged NER data into spaCy entity spans. When two entities are adjacent (no O token between them), a B- tag is silently treated as a continuation of the current entity instead of starting a new one.

Root cause: The branch if in_entity: pass does not distinguish I- (continuation) from B- (new entity start). Any non-O tag seen while inside an entity is skipped over, so the open entity keeps extending across it.

Example:

brain   B-ORGAN
cancer  B-TUMOR
is      O
  • Before: One span (0, 12, "ORGAN") covering "brain cancer"
  • After: Two spans (0, 5, "ORGAN") and (6, 12, "TUMOR")

Fix: When a B- tag is encountered mid-entity, close the current entity and start a new one. I- tags continue to extend the current entity as before.
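
The rule described above can be sketched as a small standalone parser. This is an illustrative sketch only, not the code in data_util.py: the function name bio_to_spans, the single-space joining of words, and the handling of a stray I- tag are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Convert (word, BIO tag) pairs into (start, end, label) character spans.

    Hypothetical helper: assumes words are joined by single spaces.
    """
    spans: List[Tuple[int, int, str]] = []
    offset = 0                      # character offset of the current word
    start: Optional[int] = None     # start offset of the open entity, if any
    end = 0                         # end offset of the open entity
    label: Optional[str] = None     # label of the open entity

    for word, tag in tokens:
        word_start, word_end = offset, offset + len(word)
        if tag == "O":
            # An O tag closes any open entity.
            if start is not None:
                spans.append((start, end, label))
                start, label = None, None
        elif tag.startswith("B-"):
            # A B- tag always starts a new entity; close the open one first.
            if start is not None:
                spans.append((start, end, label))
            start, end, label = word_start, word_end, tag[2:]
        elif tag.startswith("I-"):
            # An I- tag extends the open entity. A stray I- with no open
            # entity is treated as a start here (an assumption of this sketch).
            if start is None:
                start, label = word_start, tag[2:]
            end = word_end
        offset = word_end + 1       # skip the single joining space

    # Flush an entity that runs to the end of the sentence.
    if start is not None:
        spans.append((start, end, label))
    return spans


sentence = [("brain", "B-ORGAN"), ("cancer", "B-TUMOR"), ("is", "O")]
print(bio_to_spans(sentence))  # [(0, 5, 'ORGAN'), (6, 12, 'TUMOR')]
```

With the same input but cancer tagged I-ORGAN, the sketch yields the single span (0, 12, "ORGAN"), so I- continuation is unchanged.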

Test plan

  • Existing NER data loading tests should continue to pass
  • Adjacent entities in MEDMENTIONS/CRAFT datasets are now correctly split
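
To get a rough sense of how many sentences in a dataset exercise this path, one could count B- tags that directly follow a non-O tag. The helper below is a hypothetical sketch that works on a plain list of tag strings; it does not use the repository's loaders or any dataset-specific format.

```python
from typing import Iterable


def count_adjacent_entity_starts(tags: Iterable[str]) -> int:
    """Count B- tags that immediately follow a B- or I- tag (no O in between)."""
    count = 0
    prev = "O"  # treat the sentence start as outside any entity
    for tag in tags:
        if tag.startswith("B-") and prev != "O":
            count += 1
        prev = tag
    return count


print(count_adjacent_entity_starts(["B-ORGAN", "B-TUMOR", "O"]))  # 1
```

A nonzero count for a sentence means the old behavior would have merged at least one pair of entities there.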

🤖 Generated with Claude Code

When two entities are adjacent (no O token between them), a B- tag
following an I- or B- tag was silently treated as a continuation of
the current entity. The fix closes the current entity before starting
a new one when a B- tag is encountered mid-entity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
