Skip to content

v2: arm64 codegen perf + transformer/cleanc micro-opts#27258

Open
medvednikov wants to merge 113 commits into
masterfrom
v2-arm64-perf-may26
Open

v2: arm64 codegen perf + transformer/cleanc micro-opts#27258
medvednikov wants to merge 113 commits into
masterfrom
v2-arm64-perf-may26

Conversation

@medvednikov
Copy link
Copy Markdown
Member

Summary

Speeds up the v2 ARM64 backend by ~10x on self-compile and trims hot loops in the transformer and cleanc backend.

ARM64 codegen

  • O(N²) → O(N) pending-label scan (e0e5c7a): the branch-fixup loop in gen_func scanned the full pending_label_blks list once per block. Replaced with a per-block linked list (pending_head[blk] + pending_next[i]), backed by a sparse []int (avoids map ops and a v2-self-compile hang seen with the map variant). ARM64 Gen on cmd/v2/v2.v: cleanc-built v2 ~25s → ~3.3s; arm64-built v3 ~36s → ~6.3s. Chain still converges (v4 = v5 except filename byte).
  • Inner-loop micro-opts in arm64.v (selected_opcode, reg_map single lookup, etc.) — ~6x on top of the original baseline.
  • Earlier pre-pass restructuring that took the hottest path from ~17s → ~1s.

Transformer

  • Cache is_native_be on the Transformer struct so hot loops (transform_stmts, IfGuardExpr handling, OrExpr expansion) skip per-call t.pref nil/enum checks.
  • Lazy smartcast snapshot in transform_stmts: most blocks enter with an empty smartcast stack, so skip the .clone() of stack + counts map when there is nothing to restore.

cleanc backend

  • gen_direct_fn_pointer_call: when a fnptr expects a pointer param and the argument is already a pointer (or a mutable param), don't re-& it.
  • Per-stage timing stats (mark_cgen_step) inside gen_passes_1_to_4 for -stats: surfaces setup.generic_struct_bindings, setup.discover_generic_specs, setup.fn_signatures, etc. — no runtime cost when -stats is off.

Test plan

  • ARM64 self-host chain still converges: v -o cmd/v2/v2 cmd/v2/v2.v && cmd/v2/v2 --no-parallel -backend arm64 -o v3 cmd/v2/v2.v && ./v3 --no-parallel -gc none -backend arm64 -o v4 cmd/v2/v2.v — v4 = v5 modulo filename byte.
  • ARM64 test suites pass (string 109/109, array 131/131, map 54/54, math, int, hex, sorting, etc.).
  • CI passes on master.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4078333289

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vlib/v2/gen/arm64/arm64.v
fn (mut g Gen) record_pending_label(blk int) {
off := g.macho.text_data.len - g.curr_offset
new_idx := g.pending_label_offs.len
prev_head := g.pending_head[blk]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate block index before queuing pending labels

record_pending_label now dereferences g.pending_head[blk] directly, but several branch lowering paths can pass through unresolved block IDs (e.g. values that map to -1 in val_to_block) into this helper. Before this change those cases were just queued as unresolved; now they can trigger an out-of-bounds panic and hard-crash codegen. Please guard blk bounds (or keep a safe fallback path) before indexing pending_head.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c776b4816d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vlib/v2/transformer/transformer.v Outdated
Comment on lines +296 to +297
is_native_be: p != unsafe { nil }
&& (p.backend == .arm64 || p.backend == .x64)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recompute native-backend state when preferences change

Do not treat is_native_be as immutable state: pref.backend is mutable after transformer creation (for example, tests set t.pref.backend = .arm64 on an existing transformer), but this cache is only initialized once here and then used in many lowering paths. When the backend is changed later, native-only rewrites are skipped (or C-specific rewrites are applied), which can produce incorrect transformed AST/code for that backend. Compute this from t.pref.backend at use sites or update the cache whenever pref.backend changes.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 749669eeaa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vlib/v2/gen/arm64/arm64.v
t2 := time.now()
g.gen_post_pass()
post_ms := f64(time.since(t2)) / f64(time.millisecond)
eprintln('ARM64 gen sub: pre=${pre_ms:.1}ms funcs=${funcs_ms:.1}ms post=${post_ms:.1}ms')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate ARM64 timing logs behind an opt-in flag

gen() now unconditionally writes ARM64 gen sub: ... to stderr on every successful ARM64 compilation. This changes normal compiler output in non-debug builds and can break callers/tests that treat any stderr text as a failure signal or rely on stable diagnostic streams; the timing print should be behind -stats or an env flag (like the detailed timer line below it) instead of always-on.

Useful? React with 👍 / 👎.

Linear-time pending_label scan via per-block linked list,
selected_opcode and reg_map micro-opts, assorted arm64 codegen
speedups. Squashed from a series of wip commits.
Replaces `saved := t.pending_stmts.clone(); t.pending_stmts.clear()` with
`saved := t.pending_stmts; t.pending_stmts = []ast.Stmt{}` across 19 sites
in expr.v, for.v, and transformer.v. The original pattern allocated a deep
copy on every save and again on every restore; the swap pattern transfers
ownership without copying buffer contents.
Skipping mem2reg + phi elimination by default cuts peak memory significantly
and makes compile times shorter. -prod restores the previous optimize-by-default
behavior alongside -O3 -flto for the C compiler.
Set V2_MEM=1 to print resident memory at parse / type check / transform /
mark used / codegen boundaries. Used to pinpoint phase-level allocation
hotspots when chasing v1 memory parity.
Initial flat AST representation: contiguous nodes/edges/strings
arenas, FlatBuilder for streaming legacy AST in, FlatReader for
rehydration, basic tests.
walk_stmt/walk_expr/walk_type were adding sizeof(Stmt|Expr|Type) at the
top of each function while also charging the same 16-byte sumtype slot
via add_array_storage at the parent, and never counted the heap-allocated
variant payload (FnDecl, CallExpr, ArrayType, ...). Result: legacy memory
estimate was undercounted by ~5x and meaningless to compare against flat.

Introduce add_variant_payload(size) helper that bumps node_bytes and a
new allocs counter together, call it from each variant arm with the real
variant struct's sizeof, and strip the top-level sizeof from plain-struct
walkers (their storage is already counted at parent/array level).

Add LegacyAstStats.allocs and surface it in print_flat_ast_summary: full
v2 source now reports 1.15M legacy heap allocations collapsing to 4 flat
arenas - the dominant memory win is fewer GC-tracked objects, not raw
bytes (still ~2.8% reduction).

Round-trip tests (11/11) still pass.
The parser emits ast.empty_expr / ast.empty_stmt (always EmptyExpr(0) /
EmptyStmt(0)) anywhere an Expr/Stmt slot is structurally required but
absent (for-loop init/cond/post, optional clause arms, ...). The flat
converter was minting a fresh 32-byte node for every occurrence: 65,897
expr_empty + 3,032 stmt_empty cells on a full v2 self-build.

Cache one singleton per kind in FlatBuilder (same pattern as empty_list_id)
and reuse it whenever the source value is the canonical zero. Non-zero u8
payloads still emit fresh nodes for losslessness (none exist in current
parser output, but the converter must not silently drop them).

Result on cmd/v2/v2.v clean build:
  - nodes:  854,818 → 785,889  (-68,929, -8.1%)
  - memory: 31.6 MB → 29.4 MB  (-2.2 MB, 9.65% vs legacy AST)
  - allocs: still 4 arenas

Also add FlatAst.count_nodes_by_kind() and a V2_FLAT_HIST=1-gated
per-kind histogram in print_flat_ast_summary for future schema work
(used to find these singletons in the first place).

Round-trip tests (11/11) still pass — signature is unchanged because
the singleton has the same extra=0/pos=0 as every original instance.
…lder

Set up the public API surface the type checker, transformer, and SSA
builder will need to consume FlatAst directly. The recursive AST stays
in place for now — downstream consumers haven't been migrated yet — but
this commit lands two things that gate every follow-up step:

  1. ast.FlatBuilder is now public, with `new_flat_builder()` and a new
     `append_file(File) FlatNodeId` shim. The legacy converters
     (add_expr/add_stmt/...) stay internal; front-ends must go through
     `append_file` during the transition.

  2. Parser.parse_files_to_flat(files, fs) ast.FlatAst is the streaming
     entry point: parse one file, convert, drop the legacy struct,
     loop. Peak resident AST is one legacy file plus the cumulative
     flat, instead of the full legacy forest plus the flat copy that
     `ast.flatten_files(p.parse_files(...))` produces today.

flat_streaming_test.v asserts byte-for-byte signature equality between
the streamed FlatAst and the two-step `flatten_files(parse_files())`
output across a synthetic single-file source, a two-file source (to
exercise per-file intern reuse and node-id ordering), and the real
vlib/v2/ast/ast.v. All 11 existing round-trip tests still pass.

Builder still calls parse_files() — wiring it through to the streaming
path comes after the type checker can read FlatAst.
…TRIP)

Adds end-to-end validation of the streaming flat AST + lossless roundtrip
on the full v2 self-host. With V2_FLAT_ROUNDTRIP=1 the builder routes
every parser.parse_files() through parse_files_to_flat + FlatAst.to_files(),
producing byte-identical C output (281,705 lines) vs the direct path.

- ast.arena_caps_for_bytes(): shared pre-sizing heuristic (~150 nodes/KB,
  ~170 edges/KB, ~5 strings/KB) used by both flatten_files and the
  streaming parse path. Eliminates ~10 geometric reallocs on self-host.
- parse_files_to_flat: uses the shared helper.
- Builder.parse_batch(): wrapper routing parser calls through direct or
  streaming+roundtrip; controlled by flat_roundtrip_enabled (env-gated).
- Warn when V2_FLAT_ROUNDTRIP=1 is set alongside parallel parse (the
  parallel path doesn't honor the switch yet).

Phase 2 invariant validated: streaming parse + lossless conversion. Next
step (consumer migration) will let us drop the legacy AST entirely.
Forward declaration of the type checker's FlatAst-aware API surface. The
current implementation rehydrates via FlatAst.to_files() then calls
check_files; future PRs progressively replace the body with direct flat
reads (selector_names first, then file metadata, then statements) so the
legacy AST need not be materialized.

Includes a regression test that builds both an env-via-files and an
env-via-flat for the same source and verifies registered functions and
methods match. Guards the eventual direct-FlatAst implementation against
behavioral drift.
First real consumer-side replacement: Checker.check_flat now reads the
selector_names map directly off FlatFile entries instead of letting the
rehydrated legacy ast.File supply them. The rest of check_files's body
is inlined into check_flat (rehydrating once into a local) so each
migrated step can be pulled out of the to_files() path incrementally.

Builder gets a V2_CHECK_FLAT=1 env switch that routes type_check_files
through Checker.check_flat. End-to-end validation: V2_CHECK_FLAT=1 (and
V2_FLAT_ROUNDTRIP=1 + V2_CHECK_FLAT=1 combined) produce byte-identical
C output on the v2 self-host.
Adds public FlatAst.read_file_imports / read_file_stmts / file_mod
helpers and migrates register_imported_symbols to a _from_flat variant
that drives off them, removing one more dependency on the rehydrated
[]ast.File path in check_flat. C output remains byte-identical with
V2_CHECK_FLAT=1 vs baseline.
Adds preregister_all_scopes_from_flat and preregister_scopes_from_flat
which use FlatAst.file_mod + active_file_imports_from_flat directly,
removing one more legacy []ast.File hop in check_flat. C output remains
byte-identical with V2_CHECK_FLAT=1 vs baseline.
Adds preregister_all_types_from_flat and preregister_all_fn_signatures_from_flat
that drive their per-file loops off FlatAst.file_mod + read_file_stmts.
check_flat now only rehydrates []ast.File for the final per-file
check_file pass; everything before that runs straight off the flat
representation. C output remains byte-identical with V2_CHECK_FLAT=1.
Adds check_file_from_flat, check_struct_field_defaults_from_flat, and
check_enum_field_values_from_flat. check_flat now drives every step
directly off FlatAst; the only remaining to_files() call is gated
behind \$if ownership ? for the lifetime/escape validators that still
walk recursive ast.File. C output remains byte-identical with
V2_CHECK_FLAT=1 vs baseline.
Adds b.flat populated once after parse when V2_CHECK_FLAT=1, and switches
type_check_files to consume it directly instead of re-running
flatten_files on every checker entry. Sets up the persistent FlatAst
surface the transformer and SSA builders will read from next. C output
remains byte-identical with V2_CHECK_FLAT=1 vs baseline.
When V2_CHECK_FLAT=1 the builder now owns a persistent ast.FlatBuilder
that every parse_batch streams into via the new
Parser.parse_files_into_flat helper. Per batch we rehydrate just the
freshly-appended FlatFiles through the new FlatAst.to_files_range
helper, so legacy downstream consumers still see []ast.File while
b.flat ends up populated as a side effect of parsing. The post-parse
flatten_files() pass is replaced with a direct b.flat_builder.flat
handoff; the parallel parse path keeps the legacy fallback since it
bypasses parse_batch. C output remains byte-identical with
V2_CHECK_FLAT=1 vs baseline.
`check_flat` no longer rehydrates the file slice for the `$if ownership ?`
gate. Adds `lifetime_validate_files_from_flat` and `escape_validate_files_from_flat`
(plus a flat passthrough prescan) so lifetime and escape validation walks
`flat.read_file_stmts(ff)` directly. Drops the last `flat.to_files()` call
out of the checker entry point.
`parse_files` now resolves the core + expanded-user path list upfront and
hands it to `init_flat_builder_for_paths`, which stats the union and
calls `arena_caps_for_bytes(total * 2)` to seed the flat arenas before
the first `parse_batch`. The 2x scale factor reserves room for the
import walk, which is the only batch we cannot enumerate ahead of time.

Closes most of the realloc churn the one-shot `flatten_files` was
side-stepping. The lazy `ensure_flat_builder_inited` fallback stays in
place for entry points that bypass `parse_files` (e.g. tests).
Adds a FlatAst-driven entry point to the sequential transformer:

  - pre_pass_from_flat(flat) mirrors pre_pass; the elided_fns walk,
    collect_generic_fn_value_refs, and collect_runtime_const_inits all
    grow _from_flat variants that iterate flat.files and pull stmts
    via flat.read_file_stmts(ff).
  - transform_files_from_flat(flat, files) wraps that pre-pass with
    the existing per-file transform_file loop and post_pass — the
    rehydrated file slice is still consumed by transform_file itself,
    so this is purely a source-of-truth change for the accumulators.
  - Builder routes the serial transform through the new entry when
    flat_check_enabled; parallel transform is unchanged.

Validated byte-identical C output (V2_CHECK_FLAT=1, --no-parallel)
against the legacy path; the existing transformer/flat tests still
pass.
Routes the parallel transformer's pre-pass through pre_pass_from_flat
when flat_check_enabled. The per-file workers and the post-pass still
walk rehydrated ast.File values; only the accumulator pass changes
its source of truth here.

Both parallel and serial transform paths now share the same FlatAst-
seeded pre-pass, matching the checker's wiring.
…rial

Two pairs of files defined duplicate symbols under `-d parallel`:
- builder/type_check_parallel.v vs type_check_d_parallel.v
- util/worker_pool.v vs worker_pool_d_parallel.v

V's file filter only excludes the always-included variant when it ends in
`_notd_<flag>.v`. Renaming unblocks the `-d parallel` build (compile-time
duplicate-method/struct errors are gone).

Also make the real parallel type-check honor V2_CHECK_FLAT=1 by delegating
to the serial flat path; a parallel flat pipeline is future work.
The flat-aware transform entry point used to take both `&FlatAst` and
`[]ast.File` and only used the latter. Now it materializes the per-file
input via `flat.to_files()` itself, so the transformer no longer reads
`b.files` on this path.

A precondition for eventually skipping `b.files` between parse and
transform when V2_CHECK_FLAT=1 is on.
When V2_CHECK_FLAT=1, materialize the per-file input via flat.to_files()
once before chunking, so the parallel worker slices no longer alias
b.files. Both transform paths (serial and parallel) now read their input
from flat in flat mode — b.files is only fed in on the legacy path.
The `files []ast.File` argument was already ignored (`_ = files`) in both
constructors. Removing it makes the API match what the transformer
actually consumes — files come in via transform_files / transform_files_from_flat
or are pulled per-worker through new_worker_clone, not the constructor.

Builder and test call sites updated accordingly.
update_parse_summary_counts and count_parsed_v_lines walked b.files
purely for file.name. When V2_CHECK_FLAT=1, pull the same names from
b.flat.files via flat.file_name(ff) — no rehydration needed.

print_flat_ast_summary also stops re-flattening b.files when b.flat
is already populated.
The LambdaExpr arm in transform_expr_to_flat now direct-emits via a
recursive transform_expr_to_flat for the body expression, a leaf
out.emit_expr(ast.Expr(arg)) per arg Ident, and the new
emit_lambda_expr_by_ids builder helper. Mirrors add_expr(LambdaExpr)
encoding: edge[0] = body, edge[1..] = args.

The transform_expr arm for LambdaExpr is a pure wrapper — args ([]Ident)
are copied verbatim while only the body is transformed. Direct-emit
mirrors that exactly: each arg goes through the leaf emit path (Idents
are identity), the body goes through recursive transform-and-emit.

Adds fixture_lambda_expr (apply(f fn (int) int, x int) higher-order fn +
use_lambda() caller passing |y| y + 1) registered across all 5 harness
invariants — 61 → 66 transformer-diff tests.

First port pattern where the legacy arm has partially-transformed
children — same shape as future FnLiteral (captured_vars verbatim, stmts
transformed) and MatchExpr.branches ports.
The FnLiteral arm in transform_expr_to_flat now direct-emits via three
leaf/transform child paths plus the new emit_fn_literal_by_ids builder
helper:
- typ (FnType): out.emit_type(ast.Type(expr.typ)) — verbatim leaf
- captured_vars ([]Expr): out.emit_expr(cv) per element — verbatim leaf
- stmts ([]Stmt): transform_stmts(expr.stmts) then out.emit_stmt(...)
  per result — still allocates a []Stmt slice; the transform_stmts_to_flat
  seam is the next big port target.

Mirrors add_expr(FnLiteral) encoding exactly: edge[0]=type,
edge[1..1+captured.len]=captured_vars, edge[1+captured.len..]=stmts;
captured_vars.len in extra so the boundary is recoverable on decode.

Skips the ast.FnLiteral wrapper struct allocation per occurrence.

Adds fixture_fn_literal (make_doubler() returning fn(int) int +
use_fn_literal() that binds a fn(int) int to a local and calls it)
registered across all 5 harness invariants — 66 → 71 transformer-diff
tests.

First expression port where the legacy arm transforms a stmt list — the
"call existing transform_stmts then emit per-result via out.emit_stmt"
pattern lets us land per-arm wins without touching the body driver.
The PostfixExpr arm in transform_expr_to_flat now direct-emits via a
recursive transform_expr_to_flat for the inner expression plus the new
emit_postfix_expr_by_id builder helper. Skips the ast.PostfixExpr struct
allocation for plain `++` / `--` (.inc / .dec).

Legacy fallback guard: expr.op == .not || expr.op == .question.
- native backends keep the postfix node intact
- non-native backends rewrite expr! / expr? into a CastExpr over the
  inner Result/Option base type
- string-range check renames substr → substr_checked when inner is a
  string[range] expression
- multiple fallbacks may produce a CastExpr, an Ident, or an unwrapped
  inner — none of which are bit-equal to a structural PostfixExpr rebuild

Adds fixture_postfix_expr (mut-bound local + a++ a++ a-- + return)
registered across all 5 harness invariants — 71 → 76 transformer-diff
tests.

With ParenExpr/PrefixExpr/PostfixExpr/ModifierExpr ported, all four
unary-op wrapper variants are now direct-emit.
The CastExpr arm in transform_expr_to_flat now direct-emits via the
leaf out.emit_expr(expr.typ) (type is NOT transformed — copied
verbatim, mirroring transform_expr's CastExpr arm) plus a recursive
transform_expr_to_flat for the inner expression plus the new
emit_cast_expr_by_ids builder helper. Skips the ast.CastExpr struct
allocation for the primitive-cast common case (int(x), f64(y),
u8(z), etc.).

Legacy fallback guard: when expr.typ resolves to a sum type via
t.type_expr_name_full(expr.typ) + t.is_sum_type(name).
transform_cast_expr rewrites sumtype casts into SumType{type_tag,
payload_ptr} struct literal initializers (variant-tag lookup,
payload allocation) — not a bit-equal structural rebuild. Falls
back to out.emit_expr(t.transform_expr(expr)).

Adds fixture_cast_expr (primitive-cast fn use_cast() exercising
f64(3.5), int(a), u8(b + 1), int(c) + b) registered across all 5
invariants — 76 → 81 transformer-diff tests. Existing
fixture_sumtype_is_as already exercises the sumtype-fallback path.

First port to use read-only Transformer helpers
(type_expr_name_full, is_sum_type) inside the arm to gate fast path
vs legacy fallback — the shape future rewrite-aware ports
(InfixExpr operator overload, CallExpr method-to-function,
AsCastExpr smartcast snapshot) will follow.
The FieldInit arm in transform_expr_to_flat now direct-emits via a
recursive transform_expr_to_flat for the inner value plus the
existing emit_field_init_by_id builder helper. Skips the
ast.FieldInit struct allocation per occurrence — transform_expr's
FieldInit arm just transforms value and copies name verbatim.

FieldInit reaches the expression dispatch when it appears as a
standalone call arg in struct-shorthand / named-arg syntax
(foo(name: value)). The same emit_field_init_by_id helper that
ConstDecl uses for its aux_field_init child node covers the
expression position too — add_expr(FieldInit) and
add_field_init(field) both route through add_field_init so the
encodings are bit-equal.

Adds fixture_field_init (struct Cfg + a make(name: 'hello', n: 5)
call) registered across all 5 invariants — 81 → 86 transformer-diff
tests.

First port where the helper was already available — pattern for
future variants that share an encoding with an aux node (e.g.
StructInit fields list, function-call FieldInit args).
The AsCastExpr arm in transform_expr_to_flat now direct-emits via a
recursive transform_expr_to_flat for the inner expression plus the
leaf out.emit_expr(expr.typ) (type is NOT transformed — copied
verbatim) plus the new emit_as_cast_expr_by_ids builder helper.
Skips the ast.AsCastExpr struct allocation per occurrence.

First port that mutates Transformer state inside the arm. To keep
the `as` cast intact (a same-expression smartcast would otherwise
rewrite the operand into a direct sum payload access and hide the
original storage type from the backend — wrong for nested sum
types), the arm snapshots smartcast_stack + smartcast_expr_counts,
drains matching smartcast entries via a bounded
remove_smartcast_for_expr loop (limit smartcast_search_limit=128),
recurses into the inner expression, then restores the snapshots.
The legacy arm and direct-emit share the prologue/drain/restore
verbatim — only the inner recursion target changes (flat vs legacy).

Adds fixture_as_cast_expr (sum type Shape with Circle + Square
variants, pick(s Shape, k int) f64 returning (s as Circle).r or
(s as Square).side) registered across all 5 invariants — 86 → 91
transformer-diff tests.

The shape for future state-mutating ports (Ident smartcast lookup,
IfGuardExpr stmt synthesis, OrExpr prefix_stmts collection,
UnsafeExpr stmt-list transform).
The UnsafeExpr arm in transform_expr_to_flat now direct-emits via
two paths matching transform_expr's arm:

  1. unsafe { nil } normalises to a plain nil Ident (detected via
     t.is_unsafe_nil_expr(expr)) — emitted as a leaf Ident through
     out.emit_expr(...). Identity in transform_expr's else case.

  2. otherwise transform expr.stmts via transform_stmts and emit
     each result via out.emit_stmt(...), then assemble via the new
     emit_unsafe_expr_by_ids builder helper. Mirrors
     add_expr(UnsafeExpr) encoding exactly (edges = body stmts in
     order). Skips the ast.UnsafeExpr struct allocation per non-nil
     occurrence.

The body stmt-list is still materialised because transform_stmts
returns one — the stmt-list expansion seam is the next big port
target.

Adds fixture_unsafe_expr (use_unsafe() binding p := unsafe { nil },
q := unsafe { a := 1; a + 2 }, and comparing p == unsafe { nil })
registered across all 5 invariants — 91 → 96 transformer-diff tests.

Second port (after FnLiteral session 9) that runs transform_stmts
inside the arm. Both will share the transform_stmts_to_flat seam
refactor when it lands.
The LockExpr arm in transform_expr_to_flat now rewrites
`lock x { body }` / `rlock x { body }` into a sequence of mutex
lock/unlock calls wrapped in an UnsafeExpr (compound expression).
expand_lock_expr(expr) returns [lock_calls..., body...,
unlock_calls...]; the legacy arm duplicates the last body stmt
after the unlock calls so GCC compound-expression value semantics
return the body's tail value instead of void (since the last stmt
determines the compound's value).

Direct-emit mirrors the rewrite exactly — same expand_lock_expr
call, same duplicate-last fix-up, then emits each stmt via
out.emit_stmt(...) and assembles via the existing
emit_unsafe_expr_by_ids helper (the legacy LockExpr arm produces
an UnsafeExpr so session-14's helper covers this directly). Skips
the ast.UnsafeExpr wrapper struct allocation per occurrence.

CRITICAL encoding detail: the legacy ast.UnsafeExpr{stmts: stmts}
is constructed with no explicit pos (defaults to zero), so direct-
emit must also pass token.Pos{} — passing expr.pos (the LockExpr's
pos) would diverge from the legacy encoding. First port to import
v2.token in flat_write.v for the zero-pos literal.

Adds fixture_lock_expr (Counter{mut value int} struct + use_lock()
with shared c and value-form x := lock c { c.value = 5; c.value })
registered across all 5 invariants — 96 → 101 transformer-diff
tests.

First port where the rewrite produces a different expression shape
than the input (LockExpr → UnsafeExpr) and reuses an existing emit
helper for the target shape rather than the input shape.
The OrExpr arm in transform_expr_to_flat now direct-emits the
expression-context or-block (statement-level sites go through
dedicated handlers and never reach this arm).
expand_single_or_expr returns the data-access expression that
stands in for the or-block value, and appends prefix stmts (typed
temp init, optional-state branch). When prefix_stmts is non-empty
the legacy arm wraps everything in an UnsafeExpr (GCC compound
expression) — direct-emit mirrors that via the existing
emit_unsafe_expr_by_ids helper (passing token.Pos{}). When empty,
the result expression goes through out.emit_expr(...) (leaf).

Skips the ast.UnsafeExpr wrapper allocation per occurrence in the
compound-expression case.

Adds fixture_or_expr (maybe(n) ?int + pair(a, b) + use_or_expr(n)
returning pair(maybe(n) or { -1 }, 2) — or-block nested in call
args) registered across all 5 invariants — 101 → 106
transformer-diff tests.

Third shape-changing port reusing emit_unsafe_expr_by_ids (after
LockExpr session 15). The "lower to UnsafeExpr-with-stmts" pattern
is firmly established.
The Ident arm in `transform_expr_to_flat` direct-emits via the leaf
`out.emit_expr(ast.Expr(expr))` for the default path — Ident is identity
in `transform_expr`'s arm (same shape returned, no allocation). Two
state-dependent rewrite branches fall back to legacy:
  1. `@VMODROOT` → `vmodroot_string_literal` produces a StringLiteral.
  2. `find_smartcast_for_expr(expr.name)` hit → `apply_smartcast_direct_ctx`
     lowers to a sum payload access.
Both gate checks are immutable lookups (`&Transformer` receiver), safe
to probe upfront.

First port where the saving is **dispatch-skip** rather than struct-
allocation skip: Ident reaches `transform_expr_to_flat` recursively from
every ported ancestor arm (ParenExpr, PrefixExpr, ModifierExpr, CastExpr,
AsCastExpr, FieldInit, LambdaExpr, IndexExpr, SelectorExpr, ComptimeExpr
default, InitExpr default fields, etc.) — porting skips the
`transform_expr` function call + match dispatch per Ident reached via
those ports.

Adds `fixture_ident` covering file-scope Idents inside already-ported
expr arms: `const neg_base = -base` (Ident inside PrefixExpr),
`const wrapped_base = (base)` (Ident inside ParenExpr), `const inv_base =
~base` (Ident inside PrefixExpr `~`), plus mut-param + multiple Ident
references in InfixExpr operands. 131 → 136 transformer-diff tests, all
green; harness runs in 984ms.
@medvednikov medvednikov force-pushed the v2-arm64-perf-may26 branch from 749669e to 13a8614 Compare May 26, 2026 20:53
@medvednikov
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13a86143cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vlib/v2/gen/arm64/arm64.v
t2 := time.now()
g.gen_post_pass()
post_ms := f64(time.since(t2)) / f64(time.millisecond)
eprintln('ARM64 gen sub: pre=${pre_ms:.1}ms funcs=${funcs_ms:.1}ms post=${post_ms:.1}ms')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate ARM64 timing logs behind an opt-in flag

This now prints ARM64 gen sub: ... on every arm64 compilation, even when no debug/profiling flag is enabled. Unconditional stderr output from the compiler backend is a behavior regression that can break scripts/tests expecting clean tool output and adds per-compile formatting overhead in normal runs. The summary line should be behind the same env-gated path as detailed timing output (or a stats/debug flag).

Useful? React with 👍 / 👎.

Comment thread vlib/v2/types/checker.v
Comment on lines +1268 to +1270
if sym_obj := import_scope.lookup_parent(symbol.name, 0) {
file_scope.insert(symbol.name, sym_obj)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve module-storage symbol filter in flat import binding

The flat-path importer inserts every looked-up symbol directly into file scope, but the legacy path explicitly skips Global symbols where is_module_storage() is true. With V2_CHECK_FLAT=1, import mod { name } can now bind module-storage globals that were intentionally excluded, changing name resolution semantics between flat and non-flat checking modes.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant