Add optimized AVX2 kernels for csetv and zsetv on AMD Zen architectures #922

Open
harsdave wants to merge 1 commit into flame:master from harsdave:amd-optimized-csetv_and_zsetv_kernels

Conversation

@harsdave
Contributor

Description:
This commit introduces optimized intrinsic-based implementations for bli_csetv_zen_int and bli_zsetv_zen_int. These kernels use AVX2 SIMD instructions to initialize single-precision and double-precision complex vectors on AMD Zen, Zen 2, and Zen 3 microarchitectures.

Key highlights:
Vectorization: Uses _mm256_storeu_ps and _mm256_storeu_pd to perform 256-bit wide stores (four single-precision or two double-precision complex elements per store), improving throughput for unit-stride cases (incx == 1).
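For reviewers, a minimal sketch of the unit-stride store pattern described above. This is not the kernel code in this PR: `scomplex_t` and `csetv_unit_sketch` are hypothetical names, and the intrinsic path is guarded so the same source falls back to a scalar loop when AVX is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical stand-in for a single-precision complex element:
   interleaved (real, imag) floats, 8 bytes per element. */
typedef struct { float real, imag; } scomplex_t;

/* Fill x[0..n) with alpha, unit stride only. */
void csetv_unit_sketch(size_t n, scomplex_t alpha, scomplex_t *x)
{
    assert(x != NULL || n == 0);
    size_t i = 0;
#if defined(__AVX__)
    /* One 256-bit register holds four (real, imag) pairs.
       _mm256_setr_ps lists lanes from the lowest address upward. */
    const __m256 v = _mm256_setr_ps(alpha.real, alpha.imag,
                                    alpha.real, alpha.imag,
                                    alpha.real, alpha.imag,
                                    alpha.real, alpha.imag);
    for (; i + 4 <= n; i += 4)
        _mm256_storeu_ps((float *)(x + i), v);  /* unaligned 32-byte store */
#endif
    /* Scalar tail (or the whole vector when AVX is unavailable). */
    for (; i < n; ++i)
        x[i] = alpha;
}
```

The double-precision case would follow the same shape with `_mm256_storeu_pd` and two elements per store.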

Loop Unrolling: Implements a multi-tiered unrolling strategy (64, 32, 16, 8, 4 for single-precision; 32, 16, 8, 4, 2 for double-precision) to maximize pipeline utilization and reduce loop overhead.

Conjugation Support: Handles conjalpha by negating the imaginary component of alpha once, before broadcasting.
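As an illustration of the conjugation step, the following sketch computes the value that would be broadcast. It is an assumption-laden simplification rather than the PR's code: a plain `bool` stands in for the conjalpha argument, and `dcomplex_t` and `effective_alpha_sketch` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a double-precision complex element. */
typedef struct { double real, imag; } dcomplex_t;

/* Return alpha, or conj(alpha) when `conjugate` is true.
   Doing this once up front keeps the store loops free of any
   per-element conjugation work. */
dcomplex_t effective_alpha_sketch(bool conjugate, dcomplex_t alpha)
{
    if (conjugate)
        alpha.imag = -alpha.imag;
    return alpha;
}
```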

Fringe Handling: Uses bit-masking logic (n & ~0x3F, etc.) to efficiently process remainders in decreasing powers of two.
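One way to read the tiered unrolling and the bit-mask fringe logic together is sketched below for the single-precision tiers (64, 32, 16, 8, 4). The structure is an assumption based on the description, not a copy of the kernel: `fill_block` is a hypothetical helper that stands in for an unrolled run of vector stores, and `csetv_tiers_sketch` is likewise a made-up name.

```c
#include <stddef.h>

typedef struct { float real, imag; } scomplex_t;

/* Stand-in for an unrolled group of 256-bit stores covering `count` elements. */
static void fill_block(scomplex_t *x, size_t count, scomplex_t alpha)
{
    for (size_t k = 0; k < count; ++k)
        x[k] = alpha;
}

void csetv_tiers_sketch(size_t n, scomplex_t alpha, scomplex_t *x)
{
    size_t i = 0;

    /* n & ~0x3F clears the low six bits: the largest multiple of 64 <= n. */
    const size_t main_end = n & ~(size_t)0x3F;
    for (; i < main_end; i += 64)
        fill_block(x + i, 64, alpha);

    /* Fewer than 64 elements remain, so each smaller tier runs at most
       once, exactly when the matching bit of n is set. */
    for (size_t tier = 32; tier >= 4; tier >>= 1) {
        if (n & tier) {
            fill_block(x + i, tier, alpha);
            i += tier;
        }
    }

    /* At most 3 elements are left for a scalar tail. */
    for (; i < n; ++i)
        x[i] = alpha;
}
```

For example, n = 100 runs one 64-element iteration, then the 32-element and 4-element tiers, with no scalar tail.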

Performance Safety: Includes _mm256_zeroupper() in the zsetv kernel to mitigate AVX-SSE transition penalties.

Stride Support: Maintains a scalar fallback for non-unit stride cases to ensure functional correctness across all inputs.
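The non-unit-stride fallback can be pictured as a plain scalar loop. The sketch below assumes a positive element stride and uses hypothetical names (`dcomplex_t`, `zsetv_strided_sketch`); it is meant only to show the access pattern, not the PR's implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double real, imag; } dcomplex_t;

/* Write alpha to x[0], x[incx], x[2*incx], ... for n elements.
   Assumes incx > 0 in this sketch. */
void zsetv_strided_sketch(size_t n, dcomplex_t alpha, dcomplex_t *x, size_t incx)
{
    assert(incx > 0);
    for (size_t i = 0; i < n; ++i)
        x[i * incx] = alpha;  /* elements between strides are left untouched */
}
```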
