
benchmarks: harden longmemeval runner for Windows encoding#1204

Open
mschultheiss83 wants to merge 1 commit into MemPalace:develop from mschultheiss83:fix-windows-benchmark-utf8-1203

Conversation

@mschultheiss83 mschultheiss83 commented Apr 25, 2026

Summary

Narrow Windows fix for the LongMemEval benchmark runner.

This PR addresses two reproducible failures in benchmarks/longmemeval_bench.py on native Windows:

  • local benchmark files were opened with the platform default encoding instead of explicit UTF-8
  • benchmark console output used non-ASCII separator characters that could raise UnicodeEncodeError on a default cp1252 console

Changes

  • make LongMemEval dataset / split / cache / result file I/O explicit UTF-8
  • replace non-ASCII console separator/output characters in the LongMemEval runner with ASCII-safe equivalents
  • add a focused regression test for Windows default-encoding behavior in tests/benchmarks/test_longmemeval_bench.py
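The two fixes above follow a standard pattern. As a minimal sketch (the helper names here are illustrative, not the runner's actual function names): pass `encoding="utf-8"` to every text-mode `open()` so the platform default (cp1252 on Windows) is never used, and keep console separators ASCII-only.

```python
import json
import tempfile
from pathlib import Path

# ASCII-safe separator, instead of e.g. a box-drawing character like "─".
SEPARATOR = "=" * 60

def load_dataset(path):
    # Explicit encoding: never rely on the platform default (cp1252 on Windows).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def write_results(path, results):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

# Round-trip a record containing non-ASCII content to show the pattern holds.
record = {"question": "naïve café — ünïcode"}
with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "results.json"
    write_results(out, record)
    assert load_dataset(out) == record
print(SEPARATOR)
```

With explicit encodings, the same bytes are read and written regardless of the host locale, so the runner behaves identically on Linux, macOS, and native Windows.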

Reproduction

Before this change on Windows:

python benchmarks/longmemeval_bench.py /path/to/longmemeval_s_cleaned.json --limit 20

could fail with a UnicodeDecodeError while reading the dataset, and after fixing file I/O the same runner could still fail with UnicodeEncodeError when printing separators to a non-UTF-8 console.

After this change, the runner completes on the same machine without forcing UTF-8 mode.
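Both failure layers can be reproduced deterministically without a Windows console, using bytes and characters that cp1252 cannot handle. This is a constructed demonstration of the failure shape, not the runner's code:

```python
# Layer 1: reading UTF-8 data under a cp1252 default can fail outright.
utf8_bytes = "ā".encode("utf-8")      # b'\xc4\x81'
try:
    utf8_bytes.decode("cp1252")       # 0x81 has no mapping in cp1252
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Layer 2: encoding a box-drawing separator for a cp1252 console fails too.
try:
    ("─" * 10).encode("cp1252")       # U+2500 is not in cp1252
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(decode_failed, encode_failed)   # True True, on any platform
```

Note that many UTF-8 byte sequences decode "successfully" under cp1252 as mojibake rather than raising, so the absence of an exception on a given dataset does not mean the I/O is correct; explicit encodings are still required.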

Validation

Ran:

python -m pytest tests/benchmarks/test_longmemeval_bench.py -q --basetemp .pytest-tmp

Result:

2 passed

Ran:

python benchmarks/longmemeval_bench.py benchmarks-data/longmemeval_s_cleaned.json --limit 1

Result:

  • completed successfully on native Windows
  • wrote a normal benchmark results file

Scope / Notes

@mschultheiss83 mschultheiss83 changed the title fix: harden longmemeval benchmark for Windows encoding benchmarks: harden longmemeval runner for Windows encoding Apr 25, 2026
@mschultheiss83 mschultheiss83 marked this pull request as draft April 25, 2026 21:54
@mvalentsev
Contributor

Nice fix on both layers. The same dual pattern (bare open(...) plus non-ASCII chars in print / docstrings) shows up in the other three benchmark runners on develop:

  • benchmarks/locomo_bench.py: 4 file-IO sites (lines 639, 664, 706, 980)
  • benchmarks/membench_bench.py: 2 file-IO sites (207, 422)
  • benchmarks/convomem_bench.py: 5 file-IO sites (73, 79, 92, 103, 305)

The issue framing ("audit benchmark runners", plural) puts them in scope, FWIW.

@jphein
Collaborator

jphein commented Apr 26, 2026

Audited the other benchmark runners for the same open(...) pattern in case it's useful for scoping:

benchmarks/membench_bench.py:207   open(fpath)              # read
benchmarks/membench_bench.py:422   open(out_file, "w")      # write
benchmarks/convomem_bench.py:73    open(cache_path)         # read
benchmarks/convomem_bench.py:79    open(cache_path)         # read
benchmarks/convomem_bench.py:92    open(cache_path)         # read
benchmarks/convomem_bench.py:103   open(cache_path, "w")    # write
benchmarks/convomem_bench.py:305   open(out_file, "w")      # write

Same failure shape on any non-ASCII content. Up to you whether to expand scope here or leave them for a follow-up — both are reasonable, and the LongMemEval fix is a clean unit on its own. Thanks for the explicit-UTF-8 + ASCII-console-separator hardening; that's the right pattern.

@mschultheiss83
Author

Thanks, this is helpful.

I intentionally kept this PR narrow to the reproducible LongMemEval Windows failure from #1203 so the change stayed easy to review and validate.

I agree the same pattern exists in locomo_bench.py, membench_bench.py, and convomem_bench.py. Would you prefer that I expand this PR to cover the other benchmark runners as well, or keep this one scoped to LongMemEval and send a follow-up PR for the rest?

@mschultheiss83 mschultheiss83 marked this pull request as ready for review April 26, 2026 08:38
@jphein
Collaborator

jphein commented Apr 26, 2026

Follow-up PR is the right call — keeps this one focused on the reproducible failure you opened it for, and the other three runners can land separately without complicating review here.

If it's useful, I can take a swing at the follow-up since the audit was already done — but happy to leave it to you if you'd rather keep ownership of the benchmark hardening thread. Either works.

@mschultheiss83
Author

Thanks, @jphein — agree a follow-up PR is the right call.

If you’re up for it, please take a swing at the follow-up and own the changes for benchmarks/locomo_bench.py, benchmarks/membench_bench.py, and benchmarks/convomem_bench.py (same explicit encoding="utf-8" pattern + any non-ASCII-safe console output tweaks as needed). I’ll keep this PR focused on the reproducible LongMemEval Windows failure from #1203.


3 participants