Using your C compiler to exploit NEON™ Advanced SIMD
NEON Architecture
NEON is integrated into the ARM core as a coprocessor. Since NEON shares registers with the vector
floating point unit (VFP), the latter will always be present in cores with Advanced SIMD extensions.
NEON stores operands and results in a set of 16 quad-word registers (q0-q15) that can alternatively
be used as 32 double-word registers (d0-d31). Values can be moved between NEON registers
and core registers (r0-r15), as well as between memory and NEON registers. The data types supported
by NEON are vectors of 8-, 16-, 32-, and 64-bit integers, single-precision (32-bit) floating-point
values, and polynomials.
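The overlap between the quad-word and double-word views can be pictured with a small C model. This is only an illustrative sketch in ordinary memory, not a description of the hardware: the type `neon_q_model` and the helper `q_to_low_d` are hypothetical names, and the sketch assumes the usual ARMv7 mapping in which register qN overlays d(2N) and d(2N+1).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one quad-word register: 128 bits that can be
   viewed as two double-word halves or as lanes of several element sizes. */
typedef union {
    uint64_t d[2];    /* the two double-word views, e.g. d0 and d1 for q0 */
    uint8_t  u8[16];  /* sixteen 8-bit lanes */
    int16_t  s16[8];  /* eight 16-bit lanes */
    int32_t  s32[4];  /* four 32-bit lanes */
    float    f32[4];  /* four single-precision lanes */
} neon_q_model;

/* Number of the low double-word register aliased by quad-word register qn
   (assumed mapping: qN covers d(2N) and d(2N+1)). */
static unsigned q_to_low_d(unsigned qn)
{
    assert(qn < 16);  /* only q0 to q15 exist in this model */
    return 2u * qn;
}
```

Writing through one member and reading through another shows that the views share the same 128 bits, which is the property the text describes for the q and d registers.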
The Advanced SIMD instruction set comprises arithmetic and logical instructions, including standard
operations (e.g. add, subtract, and multiply), shifts and complex bit operations, as well as more complex
operations such as reciprocal estimate and reciprocal square root estimate. These instructions accept modifiers
that determine the operand and result data types and optional rounding, saturation, and doubling/halving
behavior.
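To make the saturation and halving modifiers concrete, the following scalar C sketch models what a single lane might compute. It is a portable illustration rather than NEON code: the function names are hypothetical, and the halving variant assumes the truncating (round toward minus infinity) behavior of a non-rounding halving add.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a saturating signed 16-bit add: clamp instead of wrapping. */
static int16_t sat_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* One lane of a halving add: floor((a + b) / 2), computed in a wider type
   so that the intermediate sum never overflows. */
static int16_t halving_add_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    int32_t half = (sum >= 0) ? sum / 2 : -((-sum + 1) / 2);
    return (int16_t)half;
}

/* Apply the saturating add to four lanes, as one double-word vector would. */
static void sat_add_s16x4(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    assert(a && b && out);
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = sat_add_s16(a[lane], b[lane]);
}
```

A plain 16-bit add of 30000 and 10000 would wrap around; the saturating version clamps the result to the largest representable value instead.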
For memory access, various instructions support simple streaming of scalar values between
consecutive memory addresses and vectors, as well as data structure streaming, in which
corresponding elements of adjacent data structures are treated as vectors.
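Data structure streaming can be illustrated with a scalar C model of a three-element structure load, the kind of operation an instruction such as VLD3 performs in hardware. The function name `deinterleave3_u8` and the red/green/blue interpretation are assumptions made for this sketch; it only shows how corresponding elements of adjacent structures end up in separate vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a three-element structure load:
   split n interleaved {r, g, b} structures into three separate vectors. */
static void deinterleave3_u8(const uint8_t *src, size_t n,
                             uint8_t *r, uint8_t *g, uint8_t *b)
{
    assert(src && r && g && b);
    for (size_t i = 0; i < n; ++i) {
        r[i] = src[3 * i + 0];  /* first element of structure i */
        g[i] = src[3 * i + 1];  /* second element of structure i */
        b[i] = src[3 * i + 2];  /* third element of structure i */
    }
}
```

After the split, each output array holds one component of every structure, so a later vector operation can process, say, all red values at once.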
Hand-coding NEON instructions enables an expert to write the fastest or smallest possible
implementation of a given algorithm, but this comes at a price. Even well-written assembler code is
harder to comprehend than well-written code in a higher-level language. It is often not possible to tell
whether a side effect of an instruction is part of the algorithm or not. Accordingly, maintenance and
development costs are quite high.
In modern superscalar CPU cores, the processing pipelines are normally much too complex to be
fully understood by even better-than-average developers. Working out the exact scheduling
implications of one instruction with respect to the others has become extremely difficult
in such cores. The additional complexity of NEON, with its feedback into the core pipeline, does not
help. A quote from the Cortex-A8 Technical Reference Manual makes this quite clear:
“The complexity of the processor makes it impossible to guarantee precise timing information with
hand calculations. The timing of an instruction is often affected by other concurrent instructions,
memory system activity, and additional events outside the instruction flow. Describing all possible
instruction interactions and all possible events taking place in the processor is beyond the scope of
this document. Only a cycle-accurate model of the processor can produce precise timings for a
particular instruction sequence.”
NEON instruction examples
VADD.I16 D0,D1,D2 ; Four 16-bit integer additions
VSUBL.S32 Q8,D1,D5 ; Two 32-bit integer subtractions, 64-bit results
VMUL.I16 D1,D7,D4[2] ; Four 16-bit integer by 16-bit scalar multiplications
VQADD.S16 D0,D1,D2 ; Four 16-bit integer saturating additions
VLD1.32 {D0,D1},[R1]!; load 4 32-bit elements into D0 and D1,
; and update R1
VST1.32 {D0,D1},[R0]!; store 4 32-bit elements from D0 and D1,
; and update R0
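For comparison, a load/add/store sequence like the ones listed above might be expressed in C with the NEON intrinsics from `arm_neon.h`. This is a minimal sketch under stated assumptions: the helper name `add_s32` is hypothetical, the NEON path is compiled only when the compiler defines `__ARM_NEON`, and a scalar loop serves as the fallback and handles any leftover elements. Which instructions and addressing modes the compiler actually emits is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 32-bit integers. */
static void add_s32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    assert(dst && a && b);
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);        /* load four 32-bit elements */
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(dst + i, vaddq_s32(va, vb));  /* four additions, then store */
    }
#endif
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];                   /* scalar tail or fallback */
}
```

Register allocation and scheduling are left to the compiler here, which is the trade-off against hand-written assembler discussed earlier in this section.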