
segmenter: add OffsetInBytes and LengthInBytes to Grapheme, Line, and Word #241

Merged
benoitkugler merged 1 commit into go-text:main from hajimehoshi:bytes on Mar 9, 2026

Conversation

@hajimehoshi
Contributor

Track UTF-8 byte positions alongside rune positions in the attribute iterator, and expose them as OffsetInBytes and LengthInBytes fields on Grapheme, Line, and Word structs. This allows users to efficiently extract segments from byte slices or strings without O(n) conversion.
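
To illustrate the conversion this avoids, here is a minimal sketch using only the standard library. The `segment` struct and the helpers `byteOffsetOfRune`, `sliceByRunes`, and `sliceByBytes` are hypothetical stand-ins, not the library's types; only the field names OffsetInBytes and LengthInBytes come from this PR.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for Grapheme, Line, or Word.
// Only the OffsetInBytes and LengthInBytes names come from this PR.
type segment struct {
	Offset        int // position in runes
	Length        int // length in runes
	OffsetInBytes int // position in UTF-8 bytes
	LengthInBytes int // length in UTF-8 bytes
}

// byteOffsetOfRune walks s to convert a rune index to a byte index.
// This is the O(n) step a caller needs when only rune offsets are known.
func byteOffsetOfRune(s string, runeIndex int) int {
	n := 0
	for i := range s {
		if n == runeIndex {
			return i
		}
		n++
	}
	return len(s)
}

// sliceByRunes extracts the segment text using rune offsets only.
func sliceByRunes(s string, seg segment) string {
	start := byteOffsetOfRune(s, seg.Offset)
	end := byteOffsetOfRune(s, seg.Offset+seg.Length)
	return s[start:end]
}

// sliceByBytes extracts the same text directly from the byte fields,
// without walking the string.
func sliceByBytes(s string, seg segment) string {
	return s[seg.OffsetInBytes : seg.OffsetInBytes+seg.LengthInBytes]
}

func main() {
	s := "héllo wörld"
	// The second word starts at rune 6 (5 runes) and at byte 7 (6 bytes).
	word := segment{Offset: 6, Length: 5, OffsetInBytes: 7, LengthInBytes: 6}
	fmt.Println(sliceByRunes(s, word))
	fmt.Println(sliceByBytes(s, word))
}
```

Both helpers return the same substring; the difference is that the byte-based one does not need to scan the input.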

Fixes #240

Copilot AI review requested due to automatic review settings March 4, 2026 16:31
Contributor

Copilot AI left a comment


Pull request overview

This PR extends the segmenter package API to expose UTF-8 byte offsets/lengths for each produced segment, enabling callers to slice the original string/[]byte input without converting to []rune first (Fixes #240).

Changes:

  • Track UTF-8 byte position alongside rune position inside the internal attributeIterator.
  • Expose OffsetInBytes and LengthInBytes on Grapheme, Line, and Word.
  • Add a test validating that byte slicing via the new fields matches the segment Text for several UTF-8 inputs across all init modes.
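
The bookkeeping and the round-trip check listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in the PR: `span`, `splitAfterSpaces`, and `roundTrips` are hypothetical names, and splitting after spaces is a naive stand-in for real Unicode segmentation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// span is a hypothetical segment record; it is not the library's type.
type span struct {
	Text          []rune
	OffsetInBytes int
	LengthInBytes int
}

// splitAfterSpaces is a naive stand-in for real segmentation: it ends a
// segment after each space. While walking the runes it advances a byte
// counter by utf8.RuneLen, so byte positions are tracked alongside rune
// positions.
func splitAfterSpaces(text []rune) []span {
	var out []span
	runeStart, byteStart, bytePos := 0, 0, 0
	for i, r := range text {
		bytePos += utf8.RuneLen(r) // assumes r is a valid rune
		if r == ' ' || i == len(text)-1 {
			out = append(out, span{
				Text:          text[runeStart : i+1],
				OffsetInBytes: byteStart,
				LengthInBytes: bytePos - byteStart,
			})
			runeStart, byteStart = i+1, bytePos
		}
	}
	return out
}

// roundTrips reports whether slicing s by the byte fields reproduces
// Text for every span. It assumes s is valid UTF-8.
func roundTrips(s string) bool {
	for _, sp := range splitAfterSpaces([]rune(s)) {
		got := s[sp.OffsetInBytes : sp.OffsetInBytes+sp.LengthInBytes]
		if got != string(sp.Text) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"hello world", "日本語 テキスト", "naïve café"} {
		fmt.Println(s, roundTrips(s))
	}
}
```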

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| segmenter/segmenter.go | Adds internal byte-position tracking and exposes byte offset/length on segment structs. |
| segmenter/segmenter_test.go | Adds coverage ensuring byte offsets/lengths reproduce the expected substring for graphemes/lines/words. |


Comment thread segmenter/segmenter.go Outdated
Comment thread segmenter/segmenter.go
@hajimehoshi hajimehoshi force-pushed the bytes branch 3 times, most recently from 89b991a to 5229631 Compare March 4, 2026 17:53
@hajimehoshi
Contributor Author

I think it's ready. Please take a look, thanks

@benoitkugler
Contributor

benoitkugler commented Mar 5, 2026

Thank you for this contribution.

I'm not a big fan however of the additional complexity incurred by invalid utf8 handling.

Do you have a concrete use case you must support? Could we instead require that valid UTF-8 be passed to the segmenter? That would align more closely with the scope of the library (implicitly defined by the use of rune slices).

@hajimehoshi
Contributor Author

hajimehoshi commented Mar 5, 2026

Do you have a concrete use case you must support?

In Guigui (a GUI framework made with Ebitengine), I need to implement text auto-wrapping, and this requires a segmenter. I am currently using uniseg, but it is relatively inactive.

Could we instead require that valid UTF-8 be passed to the segmenter?

Yeah, I think this makes sense.

I think there are some options, but the simplest way is to let it panic:

  1. Panic at InitWithString and InitWithBytes when the source includes an invalid sequence. (BTW, if the source includes U+FFFD as a valid sequence intentionally, this should be treated correctly, so we should distinguish them...)
  2. Panic at (*attributeIterator).next if the rune is an invalid rune (e.g. U+FFFF, which never exists)

This would let us remove some checks like invalidUTF8Indices from the current change.

Another option is to just silently ignore such runes, but this sounds a little risky. What do you think?
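
On distinguishing an intentional U+FFFD from an invalid sequence, the standard library already makes this possible: `utf8.DecodeRuneInString` returns `utf8.RuneError` with size 1 for an invalid byte, but size 3 for a literal U+FFFD. The sketch below relies on that; `firstInvalidByte` is a hypothetical helper name, not part of the segmenter package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte returns the byte index of the first invalid UTF-8
// sequence in s, or -1 if s is valid. A literal U+FFFD in the input is
// valid: it decodes as RuneError with size 3, whereas an invalid
// sequence decodes as RuneError with size 1.
func firstInvalidByte(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidByte("ok \uFFFD ok"))  // intentional U+FFFD
	fmt.Println(firstInvalidByte("bad \xff byte")) // stray 0xFF byte
}
```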

@benoitkugler
Contributor

What about returning an error from InitWithString and InitWithBytes? And then panicking in (*attributeIterator).next.

@hajimehoshi
Contributor Author

hajimehoshi commented Mar 5, 2026

What about returning an error from InitWithString and InitWithBytes?

I slightly prefer panics since the condition for panicking is very clear and a user can check before invoking InitWithString/Bytes. Also, returning an error would break compatibility with v0.3.4. But if you want to change the signature, I'm fine with doing so.
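
The caller-side check mentioned above could look roughly like this. It is only a sketch: `checkedInit` and `errInvalidUTF8` are hypothetical names, and the `initFn` callback stands in for whichever initializer the caller uses.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical error value for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// checkedInit validates s and only then calls initFn, which stands in
// for an initializer that would panic on invalid input.
func checkedInit(s string, initFn func(string)) error {
	if !utf8.ValidString(s) {
		return errInvalidUTF8
	}
	initFn(s)
	return nil
}

func main() {
	initFn := func(s string) { fmt.Printf("segmenting %q\n", s) }
	fmt.Println(checkedInit("héllo", initFn))
	fmt.Println(checkedInit("h\xffllo", initFn))
}
```

The trade-off discussed in this thread is whether that check belongs in the caller, as here, or inside the library behind an error return.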

@hajimehoshi
Contributor Author

Also, if InitWithString and InitWithBytes can return an error, should Init also return an error?

@benoitkugler
Contributor

benoitkugler commented Mar 5, 2026

Ah, my mistake, I've overlooked the fact that InitWithString and InitWithBytes are already defined in v0.3.4.

I'm in favor of the panic then.

What do @andydotxyz and @whereswaldon think about this question?

@andydotxyz
Contributor

Please don't (ever) panic in a library. Unless it is truly an unrecoverable fatal flaw in the input.
Displaying unicode incorrectly is not one of those things.

We could log it aggressively if returning an error is not an option.

@hajimehoshi
Contributor Author

So if we don't panic, don't return errors, and don't handle invalid sequences in the current way, the only way is to just ignore such sequences or log them, and the result of *InBytes would be undefined for such sequences or runes.

@benoitkugler
Contributor

benoitkugler commented Mar 5, 2026

Perhaps this case justifies an API change?
We only added the new InitXXX functions very recently; @hajimehoshi is probably the only consumer so far.

@benoitkugler
Contributor

Also, if InitWithString and InitWithBytes can return an error, should Init also return an error?

Only InitWithString and InitWithBytes would return an error, Init would not.

@hajimehoshi
Contributor Author

Perhaps this case justifies an API change?
We only added the new InitXXX functions very recently; @hajimehoshi is probably the only consumer so far.

True. Almost nobody uses the API. I haven't used these yet either.

Only InitWithString and InitWithBytes would return an error, Init would not.

It's ok, but I feel like this is inconsistent. An invalid rune like U+FFFF can come to Init, right?

@benoitkugler
Contributor

benoitkugler commented Mar 5, 2026

It's ok, but I feel like this is inconsistent. An invalid rune like U+FFFF can come to Init, right?

Yes, you're right. So we would have an undefined behavior in case someone uses Input with invalid runes and String/Bytes iterators. This is not the proper way to use the Segmenter, so I'm fine with keeping this edge case (for simplicity and performance in the common case).

@hajimehoshi
Contributor Author

Updated this PR.

@hajimehoshi
Contributor Author

hajimehoshi commented Mar 5, 2026

An invalid rune like U+FFFF can come to Init, right?

This was my understanding, but U+FFFF is treated as a valid rune in Go. An example of an invalid rune is U+7FFFFFFF. https://go.dev/play/p/-Axax5XcldD
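
The distinction can be checked with the standard library. The sketch below is written independently of the linked playground snippet, and `describe` is a hypothetical helper name; it reports `utf8.ValidRune` and `utf8.RuneLen` for a few rune values.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether r can be encoded as UTF-8 and how many bytes
// the encoding takes. utf8.RuneLen returns -1 for runes that cannot be
// encoded.
func describe(r rune) (valid bool, encodedLen int) {
	return utf8.ValidRune(r), utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{0xFFFF, 0xFFFD, 0xD800, 0x7FFFFFFF} {
		valid, n := describe(r)
		// Converting an invalid rune to a string yields U+FFFD.
		fmt.Printf("U+%X valid=%v len=%d string=%q\n", r, valid, n, string([]rune{r}))
	}
}
```

An invalid rune has no byte length of its own, and converting it to a string substitutes the three-byte U+FFFD. This fits the edge case discussed above, where byte positions are left undefined for invalid runes.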

Contributor

@benoitkugler benoitkugler left a comment


Thank you for the simplification, I'm really happy with this change!

Comment thread segmenter/segmenter.go
… Word

Track UTF-8 byte positions alongside rune positions in the attribute
iterator, and expose them as OffsetInBytes and LengthInBytes fields on
Grapheme, Line, and Word structs. This allows users to efficiently
extract segments from byte slices or strings without O(n) conversion.

Fixes go-text#240
Member

@whereswaldon whereswaldon left a comment


Thank you for implementing this. Looks reasonable to me!

@benoitkugler benoitkugler merged commit 94fe510 into go-text:main Mar 9, 2026
7 checks passed
@hajimehoshi hajimehoshi deleted the bytes branch March 9, 2026 14:43
3ace pushed a commit to unidoc/typesetting that referenced this pull request Apr 2, 2026
… Word (go-text#241)



Development

Successfully merging this pull request may close these issues.

segmenter: proposal: add members for byte positions to Grapheme, Line, and Word

5 participants