In some compilers (I'm told clang 10, in particular), the NEON
intrinsic vaddq_p128 is missing, even though its input type poly128_t
is provided.
vaddq_p128 is just an XOR of two vector registers, so that's easy to
work around by casting to a more mundane type and back. Added a
configure-time test for that intrinsic, and a workaround to be used in
its absence.