Add optimized AVX2 kernels for csetv and zsetv on AMD Zen architectures by harsdave · Pull Request #922 · flame/blis

harsdave · 2026-03-18T14:52:35Z

Description:
This commit introduces optimized intrinsic-based implementations of bli_csetv_zen_int and bli_zsetv_zen_int. These kernels use AVX2 SIMD instructions to initialize single-precision and double-precision complex vectors on AMD Zen, Zen 2, and Zen 3 microarchitectures.

Key highlights:
Vectorization: Uses _mm256_storeu_ps and _mm256_storeu_pd to perform 256-bit wide stores (four single-precision or two double-precision complex elements per store), improving throughput for unit-stride cases (incx == 1).

Loop Unrolling: Implements a multi-tiered unrolling strategy (64, 32, 16, 8, 4 for single-precision; 32, 16, 8, 4, 2 for double-precision) to maximize pipeline utilization and reduce loop overhead.

Conjugation Support: Correctly handles conjalpha by pre-processing the imaginary component of the alpha value before broadcasting.

Fringe Handling: Uses bit-masking logic (n & ~0x3F, etc.) to efficiently process remainders in decreasing powers of two.

Performance Safety: Includes _mm256_zeroupper() in the zsetv kernel to mitigate AVX-SSE transition penalties.

Stride Support: Maintains a scalar fallback for non-unit stride cases to ensure functional correctness across all inputs.
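
The highlights above can be illustrated with a minimal sketch. This is not the code from this PR: the type cfloat_t and the function csetv_sketch are hypothetical names, the unrolling is cut down to two tiers (16 and 4 elements) instead of the five listed above, and the AVX2 path is only compiled when the compiler defines __AVX2__ (otherwise the scalar loop handles everything).

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical stand-in for a single-precision complex type (not BLIS's scomplex). */
typedef struct { float real; float imag; } cfloat_t;

/* Sketch of a csetv-style kernel: sets n elements of x (stride incx) to alpha,
 * or to conj(alpha) when conj_alpha is nonzero. Assumes incx > 0. */
static void csetv_sketch(int conj_alpha, size_t n, cfloat_t alpha,
                         cfloat_t *x, ptrdiff_t incx)
{
    /* Conjugation: negate the imaginary part once, before any broadcast. */
    if (conj_alpha) alpha.imag = -alpha.imag;

    if (incx != 1) {
        /* Scalar fallback for non-unit stride. */
        cfloat_t *xp = x;
        for (size_t i = 0; i < n; ++i) { *xp = alpha; xp += incx; }
        return;
    }

    size_t i = 0;
#if defined(__AVX2__)
    /* One 256-bit register holds 4 complex values as (re, im) pairs. */
    const __m256 av = _mm256_setr_ps(alpha.real, alpha.imag, alpha.real, alpha.imag,
                                     alpha.real, alpha.imag, alpha.real, alpha.imag);
    float *p = (float *)x;

    /* Tier 1: 16 elements (4 stores) per iteration.
     * n & ~0xF rounds n down to a multiple of 16. */
    for (; i < (n & ~(size_t)0xF); i += 16) {
        _mm256_storeu_ps(p + 2 * i,      av);
        _mm256_storeu_ps(p + 2 * i + 8,  av);
        _mm256_storeu_ps(p + 2 * i + 16, av);
        _mm256_storeu_ps(p + 2 * i + 24, av);
    }
    /* Tier 2: 4 elements (1 store) per iteration.
     * n & ~0x3 rounds n down to a multiple of 4. */
    for (; i < (n & ~(size_t)0x3); i += 4) {
        _mm256_storeu_ps(p + 2 * i, av);
    }
#endif
    /* Scalar cleanup for the last few elements (or all of them without AVX2). */
    for (; i < n; ++i) x[i] = alpha;
}
```

Each tier picks up where the previous one stopped, so the masks only decide how far each loop may run; with n = 23, for example, the first loop covers elements 0 to 15, the second covers 16 to 19, and the scalar loop finishes 20 to 22.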

harsdave wants to merge 1 commit into flame:master from harsdave:amd-optimized-csetv_and_zsetv_kernels
