segmenter: add OffsetInBytes and LengthInBytes to Grapheme, Line, and Word by hajimehoshi · Pull Request #241 · go-text/typesetting

hajimehoshi · 2026-03-04T16:31:48Z

Track UTF-8 byte positions alongside rune positions in the attribute iterator, and expose them as OffsetInBytes and LengthInBytes fields on Grapheme, Line, and Word structs. This allows users to efficiently extract segments from byte slices or strings without O(n) conversion.
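To illustrate the idea, here is a minimal sketch (not the code in this PR) of a single pass that advances a rune counter and a byte counter together, so that every segment carries both kinds of offsets. The `span` type and the `splitAfterSpaces` function are hypothetical names used only for this example, and the toy rule "cut after each space" stands in for the real Unicode segmentation rules.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// span is a hypothetical stand-in for a segment. It records the start and
// length of the segment in both runes and bytes.
type span struct {
	OffsetInRunes, LengthInRunes int
	OffsetInBytes, LengthInBytes int
}

// splitAfterSpaces is a toy segmenter (an assumption for illustration only):
// it cuts after every space and at the end of the input, advancing the rune
// and byte positions together in one pass.
func splitAfterSpaces(s string) []span {
	var spans []span
	var cur span
	runePos, bytePos := 0, 0
	for bytePos < len(s) {
		r, size := utf8.DecodeRuneInString(s[bytePos:])
		runePos++
		bytePos += size
		cur.LengthInRunes++
		cur.LengthInBytes += size
		if r == ' ' || bytePos == len(s) {
			spans = append(spans, cur)
			// The next span starts where this one ended.
			cur = span{OffsetInRunes: runePos, OffsetInBytes: bytePos}
		}
	}
	return spans
}

func main() {
	s := "héllo wörld 日本語"
	for _, sp := range splitAfterSpaces(s) {
		// Slicing by byte offset needs no []rune conversion.
		fmt.Printf("%q\n", s[sp.OffsetInBytes:sp.OffsetInBytes+sp.LengthInBytes])
	}
}
```

With only rune offsets, a caller holding a string would first have to convert it to `[]rune` or walk it again to find the matching byte index; with byte offsets the slice expression above is enough.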

Fixes #240

Copilot

Pull request overview

This PR extends the segmenter package API to expose UTF-8 byte offsets/lengths for each produced segment, enabling callers to slice the original string/[]byte input without converting to []rune first (Fixes #240).

Changes:

- Track UTF-8 byte position alongside rune position inside the internal attributeIterator.
- Expose OffsetInBytes and LengthInBytes on Grapheme, Line, and Word.
- Add a test validating that byte slicing via the new fields matches the segment Text for several UTF-8 inputs across all init modes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| segmenter/segmenter.go | Adds internal byte-position tracking and exposes byte offset/length on segment structs. |
| segmenter/segmenter_test.go | Adds coverage ensuring byte offsets/lengths reproduce the expected substring for graphemes/lines/words. |

hajimehoshi · 2026-03-04T17:59:25Z

I think it's ready. Please take a look, thanks

benoitkugler · 2026-03-05T06:57:59Z

Thank you for this contribution.

However, I'm not a big fan of the additional complexity incurred by the invalid UTF-8 handling.

Do you have a concrete use case you must support? Could we instead require that valid UTF-8 is passed to the segmenter? That would align more closely with the scope of the library (implicitly defined by the use of rune slices).

hajimehoshi · 2026-03-05T07:23:11Z

> Do you have a concrete use case you must support?

In Guigui (a GUI framework made with Ebitengine), I need to implement text auto-wrapping, which requires a segmenter. I'm currently using uniseg, but it is relatively inactive.

> Could we instead require that valid UTF-8 is passed to the segmenter?

Yeah, I think this makes sense.

I think there are some options, but the simplest way is to let it panic:

- Panic at InitWithString and InitWithBytes when the source includes an invalid sequence. (BTW, if the source intentionally includes U+FFFD as a valid sequence, this should be treated correctly, so we should distinguish the two cases...)
- Panic at (*attributeIterator).next if the rune is an invalid rune (e.g. U+FFFF, which never exists)
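The distinction between an intentional U+FFFD and an invalid sequence can be made with the standard library alone. The sketch below is an illustration, not the check used in this PR, and `firstInvalidByte` is a hypothetical helper name. It relies on the documented behavior of `utf8.DecodeRuneInString`: an invalid encoding decodes to `utf8.RuneError` with a size of 1, whereas a correctly encoded U+FFFD decodes to the same rune with a size of 3.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte returns the byte index of the first invalid UTF-8
// sequence in s, or -1 if s is entirely valid. A literal U+FFFD is not
// reported, because it decodes with size 3 rather than size 1.
func firstInvalidByte(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidByte("ok \uFFFD ok"))  // -1: U+FFFD was written on purpose
	fmt.Println(firstInvalidByte("bad \xff byte")) // 4: stray 0xFF byte
}
```

When the position of the problem is not needed, `utf8.ValidString` and `utf8.Valid` give the same yes/no answer for strings and byte slices respectively.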

This would let us remove some checks, such as invalidUTF8Indices, from the current change.

Another option is to silently ignore such runes, but this sounds a little risky. What do you think?

benoitkugler · 2026-03-05T08:24:41Z

What about returning an error from InitWithString and InitWithBytes, and then panicking in (*attributeIterator).next?

hajimehoshi · 2026-03-05T08:31:15Z

> What about returning an error from InitWithString and InitWithBytes?

I slightly prefer panics, since the condition for panicking is very clear and a user can check for it before invoking InitWithString/InitWithBytes. Also, returning an error would break compatibility with v0.3.4. But if you want to change the signature, I'm fine with doing so.

hajimehoshi · 2026-03-05T08:34:40Z

Also, if InitWithString and InitWithBytes can return an error, should Init also return an error?

benoitkugler · 2026-03-05T08:51:45Z

Ah, my mistake, I overlooked the fact that InitWithString and InitWithBytes are already defined in v0.3.4.

I'm in favor of the panic then.

What do @andydotxyz and @whereswaldon think about this question?

andydotxyz · 2026-03-05T09:20:34Z

Please don't (ever) panic in a library, unless it is truly an unrecoverable fatal flaw in the input.
Displaying Unicode incorrectly is not one of those things.

We could log it aggressively if returning an error is not an option.

hajimehoshi · 2026-03-05T09:42:44Z

So if we don't panic, don't return errors, and don't handle invalid sequences the way the PR currently does, the only remaining options are to ignore such sequences or to log them, and the *InBytes results would be undefined for such sequences or runes.
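A small experiment shows why byte offsets become unreliable once an invalid sequence is involved. This is a sketch of the general problem, not of the segmenter's internals, and `bytesFromRunes` is a hypothetical helper: it recomputes a byte length from the decoded runes, which is what any rune-based bookkeeping effectively does.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bytesFromRunes converts s to runes and sums the encoded length of each
// rune. For valid UTF-8 this equals len(s).
func bytesFromRunes(s string) int {
	n := 0
	for _, r := range []rune(s) {
		n += utf8.RuneLen(r)
	}
	return n
}

func main() {
	valid := "a\u00e9b" // 'a', 'é' (2 bytes), 'b'
	invalid := "a\xffb" // 'a', stray 0xFF, 'b'

	fmt.Println(len(valid), bytesFromRunes(valid))     // 4 4
	fmt.Println(len(invalid), bytesFromRunes(invalid)) // 3 5
}
```

The stray byte occupies one byte in the input but becomes U+FFFD after conversion, which encodes as three bytes, so every offset after it is shifted by two.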

benoitkugler · 2026-03-05T10:03:11Z

Perhaps this case justifies an API change?
We only added the new InitXXX functions very recently, and @hajimehoshi is probably the only consumer so far.

benoitkugler · 2026-03-05T10:05:54Z

> Also, if InitWithString and InitWithBytes can return an error, should Init also return an error?

Only InitWithString and InitWithBytes would return an error, Init would not.

hajimehoshi · 2026-03-05T10:58:57Z

> Perhaps this case justifies an API change?
> We only added the new InitXXX functions very recently, and @hajimehoshi is probably the only consumer so far.

True. Almost nobody uses the API. I haven't used these yet either.

> Only InitWithString and InitWithBytes would return an error, Init would not.

That's OK, but I feel like this is inconsistent. An invalid rune like U+FFFF can be passed to Init, right?

benoitkugler · 2026-03-05T11:12:38Z

> That's OK, but I feel like this is inconsistent. An invalid rune like U+FFFF can be passed to Init, right?

Yes, you're right. So we would have undefined behavior if someone uses Init with invalid runes together with the String/Bytes iterators. This is not the proper way to use the Segmenter, so I'm fine with keeping this edge case (for simplicity and performance in the common case).

hajimehoshi · 2026-03-05T14:52:52Z

Updated this PR.

hajimehoshi · 2026-03-05T14:59:45Z

> An invalid rune like U+FFFF can be passed to Init, right?

That was my understanding, but U+FFFF is treated as a valid rune in Go. An example of an invalid rune is U+7FFFFFFF. https://go.dev/play/p/-Axax5XcldD
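The same point can be checked locally with `utf8.ValidRune`. The snippet below is an independent sketch and is not claimed to match the linked playground; `inspect` is a hypothetical helper. It also shows that converting an out-of-range rune to a string yields U+FFFD, as the Go specification describes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// inspect reports whether r can be legally encoded as UTF-8 and what
// string(r) produces for it.
func inspect(r rune) (valid bool, encoded string) {
	return utf8.ValidRune(r), string(r)
}

func main() {
	for _, r := range []rune{0xFFFF, 0xD800, 0x7FFFFFFF} {
		valid, encoded := inspect(r)
		fmt.Printf("%#x valid=%v encoded=%+q\n", r, valid, encoded)
	}
}
```

U+FFFF is a Unicode noncharacter but still a valid scalar value, so Go encodes it normally. Surrogates such as U+D800 and values above U+10FFFF are the ones that `utf8.ValidRune` rejects.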

benoitkugler

Thank you for the simplification, I'm really happy with this change!

whereswaldon

Thank you for implementing this. Looks reasonable to me!

hajimehoshi requested review from andydotxyz, benoitkugler and whereswaldon as code owners March 4, 2026 16:31

Copilot AI review requested due to automatic review settings March 4, 2026 16:31

Copilot started reviewing on behalf of hajimehoshi March 4, 2026 16:32

Copilot AI reviewed Mar 4, 2026

hajimehoshi force-pushed the bytes branch 3 times, most recently from 89b991a to 5229631, on March 4, 2026 17:53

hajimehoshi force-pushed the bytes branch from 5229631 to 4e3ef2f on March 5, 2026 14:52

hajimehoshi force-pushed the bytes branch from 4e3ef2f to 5e1febc on March 5, 2026 14:53

benoitkugler approved these changes Mar 5, 2026

hajimehoshi force-pushed the bytes branch from 5e1febc to 4867aaa on March 5, 2026 18:18

whereswaldon approved these changes Mar 9, 2026

benoitkugler merged commit 94fe510 into go-text:main Mar 9, 2026
7 checks passed

hajimehoshi deleted the bytes branch March 9, 2026 14:43
