Unorthodox
There’re instructions for min/max just like in SSE; in NEON they are vmin[q]_f32 and vmax[q]_f32.
There’s an instruction for absolute difference, vabd[q]_f32.
There’re instructions to compute approximations of 1/x and 1/√x, they are vrecpe[q]_f32 and
vrsqrte[q]_f32 respectively.
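The estimates are coarse (roughly 8 bits of precision), so they are usually refined before use. Here's a minimal sketch of that, under a few assumptions: it uses vrecpsq_f32, an intrinsic not mentioned above which computes 2 − a·b, the helper name reciprocal4 is hypothetical, and the scalar branch is only there so the sample builds with compilers that don't target NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: writes approximations of 1/x[i] for 4 floats
void reciprocal4( const float* x, float* out )
{
    assert( x != nullptr && out != nullptr );
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t v = vld1q_f32( x );
    float32x4_t r = vrecpeq_f32( v );          // coarse initial estimate
    // vrecpsq_f32( v, r ) returns 2 - v*r, so each line below is one Newton-Raphson step
    r = vmulq_f32( r, vrecpsq_f32( v, r ) );
    r = vmulq_f32( r, vrecpsq_f32( v, r ) );
    vst1q_f32( out, r );
#else
    // Portable fallback for builds without NEON
    for( int i = 0; i < 4; i++ )
        out[ i ] = 1.0f / x[ i ];
#endif
}
```

Each refinement step roughly doubles the number of correct bits, so two steps are about as far as single precision goes. The same pattern should work for 1/√x with vrsqrteq_f32 and its companion step intrinsic.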
Comparisons
There’re all-lanes versions of equality comparison and the rest of them, <, >, ≤, ≥. The only missing one is
inequality, ≠. Moreover, there’re instructions to compare absolute values of two vectors. These intrinsics
return a uint32x4_t vector where each lane is either 0 or all ones, which is UINT_MAX.
There’s no equivalent of _mm_movemask_ps or _mm_testz_ps. If you need results in a scalar register to
be able to write if( something ) in C++, you’ll have to implement that manually with integer NEON
intrinsics.
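Here's one possible way to get both the missing ≠ comparison and a scalar result, sketched under assumptions. The helper name anyNotEqual4 is hypothetical. It inverts the equality mask with vmvnq_u32, then folds the four lanes into one with vorr_u32 and vpmax_u32. The scalar branch only exists so the sample compiles without NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: true when a[i] != b[i] for at least one of the 4 lanes
bool anyNotEqual4( const float* a, const float* b )
{
    assert( a != nullptr && b != nullptr );
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Lanes are all ones where equal, 0 otherwise
    const uint32x4_t eq = vceqq_f32( vld1q_f32( a ), vld1q_f32( b ) );
    // Bitwise NOT turns the equality mask into an inequality mask
    const uint32x4_t ne = vmvnq_u32( eq );
    // Fold 4 lanes into 2 with OR, then 2 lanes into 1 with pairwise maximum
    uint32x2_t folded = vorr_u32( vget_low_u32( ne ), vget_high_u32( ne ) );
    folded = vpmax_u32( folded, folded );
    // Move the result into a scalar register
    return vget_lane_u32( folded, 0 ) != 0;
#else
    // Portable fallback for builds without NEON
    for( int i = 0; i < 4; i++ )
        if( a[ i ] != b[ i ] )
            return true;
    return false;
#endif
}
```

With that in place, if( anyNotEqual4( p1, p2 ) ) works as an ordinary C++ condition. An "all lanes" test can be built the same way, folding with AND and pairwise minimum instead.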
Horizontal Operations
There’s the vpadd_f32 intrinsic that takes two vectors [ a, b ] and [ c, d ], and returns [ a+b, c+d ]. On 32-bit
ARM no 16-byte version is available, only the 8-byte version; AArch64 adds vpaddq_f32. A useful
application is the final reduction step of a dot product.
Same with horizontal pairwise minimum and maximum, the instructions are vpmin_f32 and vpmax_f32,
the documentation calls them “folding minimum / maximum”.
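The dot product reduction mentioned above can be sketched as follows. The helper name dot4 is hypothetical, and the scalar branch is only there so the sample builds without NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two 4-element float vectors
float dot4( const float* a, const float* b )
{
    assert( a != nullptr && b != nullptr );
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Per-lane products [ p0, p1, p2, p3 ]
    const float32x4_t p = vmulq_f32( vld1q_f32( a ), vld1q_f32( b ) );
    // Add the two halves: [ p0+p2, p1+p3 ]
    float32x2_t s = vadd_f32( vget_low_f32( p ), vget_high_f32( p ) );
    // Pairwise add, both lanes now hold p0+p1+p2+p3
    s = vpadd_f32( s, s );
    return vget_lane_f32( s, 0 );
#else
    // Portable fallback for builds without NEON
    float sum = 0.0f;
    for( int i = 0; i < 4; i++ )
        sum += a[ i ] * b[ i ];
    return sum;
#endif
}
```

For longer arrays you would normally accumulate products in a 16-byte register inside the loop, and only run the two reduction steps once at the end.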
Shuffles
Unlike SSE, there are no generic shuffles. Whenever you need one, you’ll have to do something else instead.
Here’s what NEON can do for you.
• Split then combine back. This way you can shuffle [a, b, c, d] into e.g. [c, d, c, d] or [c, d, a, b].
• Concatenate two vectors and cut a piece of the merged sequence, vextq_f32 does that. When the 2
input vectors are [a, b, c, d] and [e, f, g, h], it can return [b, c, d, e], [c, d, e, f] or [d, e, f, g].
• Broadcast a single lane into all lanes of the output with vdupq_lane_f32. But note that some arithmetic
intrinsics like vmul_laneq_f32 use a scalar operand from an arbitrary lane of a vector, so you don’t need
a broadcast for that.
• Interleave elements. vzipq_f32 takes 2 vectors [a, b, c, d] and [e, f, g, h], and returns 2 other vectors
with [a, e, b, f] and [c, g, d, h].
• De-interleave elements. vuzpq_f32 takes 2 vectors [a, b, c, d] and [e, f, g, h], and returns 2 other vectors
with [a, c, e, g] and [b, d, f, h].
• Swap pairs of elements: vrev64q_f32 takes [a, b, c, d] and returns [b, a, d, c].
• Transpose elements.
This instruction views first vector as the top rows of a sequence of 2x2 matrices, and second vector
as bottom rows of corresponding matrices. It transposes all matrices stored this way, and returns 2
new vectors with transposed 2x2 matrices.
Specifically for floats, vtrnq_f32 takes 2 vectors [a, b, c, d] and [e, f, g, h], and returns 2 other vectors
with [a, e, c, g] and [b, f, d, h].
• You can move individual lanes with vsetq_lane_f32 and similar.
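A couple of these building blocks combined, as a sketch under assumptions. The helper names reverse4 and transpose4x4 are hypothetical, the matrix is assumed to be 16 floats in row-major order, and the scalar branches are only there so the sample builds without NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: [ a, b, c, d ] => [ d, c, b, a ]
void reverse4( const float* src, float* dst )
{
    assert( src != nullptr && dst != nullptr && src != dst );
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Swap pairs: [ b, a, d, c ]
    const float32x4_t r = vrev64q_f32( vld1q_f32( src ) );
    // Split, then combine the halves in the opposite order: [ d, c, b, a ]
    vst1q_f32( dst, vcombine_f32( vget_high_f32( r ), vget_low_f32( r ) ) );
#else
    for( int i = 0; i < 4; i++ )
        dst[ i ] = src[ 3 - i ];
#endif
}

// Hypothetical helper: transposes a row-major 4x4 matrix, dst must not overlap src
void transpose4x4( const float* src, float* dst )
{
    assert( src != nullptr && dst != nullptr && src != dst );
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Transpose the 2x2 blocks of rows 0-1 and of rows 2-3
    const float32x4x2_t t01 = vtrnq_f32( vld1q_f32( src ), vld1q_f32( src + 4 ) );
    const float32x4x2_t t23 = vtrnq_f32( vld1q_f32( src + 8 ), vld1q_f32( src + 12 ) );
    // t01.val[0] = [ m00, m10, m02, m12 ], t01.val[1] = [ m01, m11, m03, m13 ]
    // t23.val[0] = [ m20, m30, m22, m32 ], t23.val[1] = [ m21, m31, m23, m33 ]
    // Combining matching halves produces the columns of the source matrix
    vst1q_f32( dst, vcombine_f32( vget_low_f32( t01.val[ 0 ] ), vget_low_f32( t23.val[ 0 ] ) ) );
    vst1q_f32( dst + 4, vcombine_f32( vget_low_f32( t01.val[ 1 ] ), vget_low_f32( t23.val[ 1 ] ) ) );
    vst1q_f32( dst + 8, vcombine_f32( vget_high_f32( t01.val[ 0 ] ), vget_high_f32( t23.val[ 0 ] ) ) );
    vst1q_f32( dst + 12, vcombine_f32( vget_high_f32( t01.val[ 1 ] ), vget_high_f32( t23.val[ 1 ] ) ) );
#else
    for( int r = 0; r < 4; r++ )
        for( int c = 0; c < 4; c++ )
            dst[ c * 4 + r ] = src[ r * 4 + c ];
#endif
}
```

Neither function needs a generic shuffle. The reversal takes one vrev64q_f32 plus a split and combine, and the transpose takes two vtrnq_f32 plus four combines.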