
benchmarks: harden longmemeval runner for Windows encoding#1204

Open
mschultheiss83 wants to merge 1 commit into MemPalace:develop from mschultheiss83:fix-windows-benchmark-utf8-1203

Conversation

@mschultheiss83 mschultheiss83 commented Apr 25, 2026

Summary

Narrow Windows fix for the LongMemEval benchmark runner.

This PR addresses two reproducible failures in benchmarks/longmemeval_bench.py on native Windows:

  • local benchmark files were opened with the platform default encoding instead of explicit UTF-8
  • benchmark console output used non-ASCII separator characters that could raise UnicodeEncodeError on a default cp1252 console

Changes

  • make LongMemEval dataset / split / cache / result file I/O explicit UTF-8
  • replace non-ASCII console separator/output characters in the LongMemEval runner with ASCII-safe equivalents
  • add a focused regression test for Windows default-encoding behavior in tests/benchmarks/test_longmemeval_bench.py
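The two fixes above follow a standard pattern. As a minimal sketch (the helper names here are illustrative, not the runner's actual function names): pass `encoding="utf-8"` to every text-mode `open()` so the platform default (cp1252 on Windows) is never used, and keep console separators ASCII-only.

```python
import json
import tempfile
from pathlib import Path

# ASCII-safe separator, instead of e.g. a box-drawing character like "─".
SEPARATOR = "=" * 60

def load_dataset(path):
    # Explicit encoding: never rely on the platform default (cp1252 on Windows).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def write_results(path, results):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

# Round-trip a record containing non-ASCII content to show the pattern holds.
record = {"question": "naïve café — ünïcode"}
with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "results.json"
    write_results(out, record)
    assert load_dataset(out) == record
print(SEPARATOR)
```

With explicit encodings, the same bytes are read and written regardless of the host locale, so the runner behaves identically on Linux, macOS, and native Windows.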

Reproduction

Before this change on Windows:

python benchmarks/longmemeval_bench.py /path/to/longmemeval_s_cleaned.json --limit 20

could fail with a UnicodeDecodeError while reading the dataset, and after fixing file I/O the same runner could still fail with UnicodeEncodeError when printing separators to a non-UTF-8 console.

After this change, the runner completes on the same machine without forcing UTF-8 mode.
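Both failure layers can be reproduced deterministically without a Windows console, using bytes and characters that cp1252 cannot handle. This is a constructed demonstration of the failure shape, not the runner's code:

```python
# Layer 1: reading UTF-8 data under a cp1252 default can fail outright.
utf8_bytes = "ā".encode("utf-8")      # b'\xc4\x81'
try:
    utf8_bytes.decode("cp1252")       # 0x81 has no mapping in cp1252
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Layer 2: encoding a box-drawing separator for a cp1252 console fails too.
try:
    ("─" * 10).encode("cp1252")       # U+2500 is not in cp1252
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(decode_failed, encode_failed)   # True True, on any platform
```

Note that many UTF-8 byte sequences decode "successfully" under cp1252 as mojibake rather than raising, so the absence of an exception on a given dataset does not mean the I/O is correct; explicit encodings are still required.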

Validation

Ran:

python -m pytest tests/benchmarks/test_longmemeval_bench.py -q --basetemp .pytest-tmp

Result:

2 passed

Ran:

python benchmarks/longmemeval_bench.py benchmarks-data/longmemeval_s_cleaned.json --limit 1

Result:

  • completed successfully on native Windows
  • wrote a normal benchmark results file

Scope / Notes

@mschultheiss83 mschultheiss83 changed the title fix: harden longmemeval benchmark for Windows encoding benchmarks: harden longmemeval runner for Windows encoding Apr 25, 2026
@mschultheiss83 mschultheiss83 marked this pull request as draft April 25, 2026 21:54
@mvalentsev
Contributor

Nice fix on both layers. The same dual pattern (bare open(...) plus non-ASCII chars in print / docstrings) shows up in the other three benchmark runners on develop:

  • benchmarks/locomo_bench.py: 4 file-IO sites (lines 639, 664, 706, 980)
  • benchmarks/membench_bench.py: 2 file-IO sites (207, 422)
  • benchmarks/convomem_bench.py: 5 file-IO sites (73, 79, 92, 103, 305)

The issue framing ("audit benchmark runners", plural) puts them in scope, FWIW.

@jphein
Collaborator

jphein commented Apr 26, 2026

Audited the other benchmark runners for the same open(...) pattern in case it's useful for scoping:

benchmarks/membench_bench.py:207   open(fpath)              # read
benchmarks/membench_bench.py:422   open(out_file, "w")      # write
benchmarks/convomem_bench.py:73    open(cache_path)         # read
benchmarks/convomem_bench.py:79    open(cache_path)         # read
benchmarks/convomem_bench.py:92    open(cache_path)         # read
benchmarks/convomem_bench.py:103   open(cache_path, "w")    # write
benchmarks/convomem_bench.py:305   open(out_file, "w")      # write

Same failure shape on any non-ASCII content. Up to you whether to expand scope here or leave them for a follow-up — both are reasonable, and the LongMemEval fix is a clean unit on its own. Thanks for the explicit-UTF-8 + ASCII-console-separator hardening; that's the right pattern.

@mschultheiss83
Author

Thanks, this is helpful.

I intentionally kept this PR narrow to the reproducible LongMemEval Windows failure from #1203 so the change stayed easy to review and validate.

I agree the same pattern exists in locomo_bench.py, membench_bench.py, and convomem_bench.py. Would you prefer that I expand this PR to cover the other benchmark runners as well, or keep this one scoped to LongMemEval and send a follow-up PR for the rest?

@mschultheiss83 mschultheiss83 marked this pull request as ready for review April 26, 2026 08:38
@jphein
Collaborator

jphein commented Apr 26, 2026

Follow-up PR is the right call — keeps this one focused on the reproducible failure you opened it for, and the other three runners can land separately without complicating review here.

If it's useful, I can take a swing at the follow-up since the audit was already done — but happy to leave it to you if you'd rather keep ownership of the benchmark hardening thread. Either works.

@mschultheiss83
Author

Thanks, @jphein — agree a follow-up PR is the right call.

If you’re up for it, please take a swing at the follow-up and own the changes for benchmarks/locomo_bench.py, benchmarks/membench_bench.py, and benchmarks/convomem_bench.py (same explicit encoding="utf-8" pattern + any non-ASCII-safe console output tweaks as needed). I’ll keep this PR focused on the reproducible LongMemEval Windows failure from #1203.


3 participants