Support for saturating doubling multiply add #2050

Ryo-not-rio · 2024-04-03T13:37:20Z

There is a saturating doubling multiply add instructions in NEON and SVE - vqdmla, svqdmla which are equivalent to hn::SaturatedAdd(hn::Mul(2, hn::Mul(a_real, b_imag)),hn::Mul(2, hn::Mul(a_imag, b_real)));. In testing, we have seen a significant performance decrease using highway vs NEON and highway vs SVE. I was wondering if it would be possible to add a similar instruction to highway so we are able to leverage these qdmla instructions.

The text was updated successfully, but these errors were encountered:

jan-wassenberg · 2024-04-03T14:10:39Z

It sounds reasonable to want access to this instruction. I'm curious about the purpose of the extra 2x mul in their definition?

Because NEON differs from SVE in that it takes the upper/lower half, vs even/odd lanes, I think we'd want to define the op like ReorderWidenMulAccumulate, which is allowed to return any lane order. This is fine for dot products but I wonder if it's still OK for your use case, or whether we have some other idea how to bridge the gap between NEON and SVE?

Ryo-not-rio · 2024-04-03T15:17:32Z

The use case we've tested out is a vector multiplication so the order needs to be preserved. You raise a good point on the difference between NEON and SVE, we could potentially return two vectors - the lower and upper half of the result - this sounds like a good compromise since I'd imagine you'd need the whole vector for most use cases anyway

jan-wassenberg · 2024-04-03T15:43:38Z

Makes sense. ReorderWidenMulAccumulate also returns a second value using an output param, so there is precedent.
So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

I'm still curious why the 2x factor in the instruction, is it also for Q1.x fixed-point? (similar to MulFixedPoint15)

Ryo-not-rio · 2024-04-03T15:50:36Z

So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

yes that is correct, I'm still trying to find out why there is a 2x factor myself, I'll give an update once I have an answer

Ryo-not-rio · 2024-04-03T16:14:45Z

So it looks like the doubling is just a side effect of the way the multiplication is done. The multiplication is done in Q0.15 format which is saturated to Q0.31 format and this is equivalent to doubling saturating multiply-add of two 16 bit integers

jan-wassenberg · 2024-04-03T19:03:34Z

Got it, thanks. In that case adding FixedPoint to the op name may be helpful.

johnplatts · 2024-04-03T20:02:51Z

AVX3_DL is also capable of carrying out the saturated doubling multiply add using the _mm*_dpwssds_epi32 intrinsics.

Here is how the vqdmla op can be implemented on AVX3_DL for I32 vectors that have 1 to 4 lanes:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b, Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b), ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c), ResizeBitCast(dt_i16, c));

  return VI32{_mm_dpwssds_epi32(a.raw, dup_b.raw, dup_c.raw)};
}

The vqdmla op can be carried out on PPC8/PPC9/PPC10 using vec_msums. Here is how the vqdmla op can be implemented on PPC8/PPC9/PPC10:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b, Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b), ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c), ResizeBitCast(dt_i16, c));

  return VI32{vec_msums(dup_b.raw, dup_c.raw, a.raw)};
}

jan-wassenberg · 2024-04-04T08:35:34Z

Very nice!
@Ryo-not-rio FYI John's solution defines the op in terms of NEON/x86, so for SVE we have two extra Zip. Does that work for you?

Ryo-not-rio · 2024-04-05T08:49:01Z

@jan-wassenberg yes, that is perfect! I've added a comment on the PR requesting support for 8 bit and 16 bit elements as well

johnplatts mentioned this issue Apr 4, 2024

Added SatWidenMulPairwiseAccumulate and SatWidenMulAccumFixedPoint ops #2055

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for saturating doubling multiply add #2050

Support for saturating doubling multiply add #2050

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

Ryo-not-rio commented Apr 3, 2024 •

edited

Loading

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

johnplatts commented Apr 3, 2024

jan-wassenberg commented Apr 4, 2024

Ryo-not-rio commented Apr 5, 2024

Support for saturating doubling multiply add #2050

Support for saturating doubling multiply add #2050

Comments

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

Ryo-not-rio commented Apr 3, 2024 • edited Loading

Ryo-not-rio commented Apr 3, 2024

jan-wassenberg commented Apr 3, 2024

johnplatts commented Apr 3, 2024

jan-wassenberg commented Apr 4, 2024

Ryo-not-rio commented Apr 5, 2024

Ryo-not-rio commented Apr 3, 2024 •

edited

Loading