The basic square-and-multiply strategy requires a square and a
multiply per exponent bit (i.e. two modular multiplications per bit
in total). Instead, we reduce that to a square per exponent bit and
an extra multiply only every 5 bits, because the value we're
multiplying in is derived from 5 of the exponent bits at once via a
table lookup.
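As a rough illustration of the windowing idea (not the code in this
commit), here is a toy version on 64-bit integers. The names
modpow_window and mulmod are made up for this sketch, and it assumes
a modulus below 2^32 so that products fit in 64 bits. It also indexes
the table directly, which is exactly the leak the real code has to
avoid.

```c
#include <stdint.h>

#define WINDOW_BITS 5
#define TABLE_SIZE (1u << WINDOW_BITS)

/* Toy modular multiply: assumes 0 < m < 2^32, so the product of two
 * reduced values cannot overflow 64 bits. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (a % m) * (b % m) % m;
}

/* Hypothetical helper: base^exp mod m with a fixed 5-bit window. */
uint64_t modpow_window(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t table[TABLE_SIZE];
    uint64_t result = 1 % m;
    unsigned i;
    int shift, j;

    /* table[i] = base^i mod m, for every possible 5-bit window value */
    table[0] = 1 % m;
    for (i = 1; i < TABLE_SIZE; i++)
        table[i] = mulmod(table[i - 1], base, m);

    /* 64 exponent bits make 13 windows; the top one (shift 60) only
     * has 4 real bits in it, which the mask handles naturally. */
    for (shift = 60; shift >= 0; shift -= WINDOW_BITS) {
        /* One squaring per exponent bit... */
        for (j = 0; j < WINDOW_BITS; j++)
            result = mulmod(result, result, m);

        /* ...and one multiply per window, even when the window is 0.
         * NOTE: this direct index is NOT side-channel safe. */
        i = (unsigned)(exp >> shift) & (TABLE_SIZE - 1);
        result = mulmod(result, table[i], m);
    }
    return result;
}
```

For example, modpow_window(2, 10, 1000) gives 24. Each window costs
5 squarings plus 1 multiply, against 5 squarings plus 5 multiplies
for plain square-and-multiply, which is where the saving of four
multiplications per window comes from.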

To avoid the obvious side-channel leakage of a literal table lookup,
we read the whole table every time, mp_selecting the right value into
the multiplication input. This isn't as slow as it sounds when the
alternative is four entire modular multiplications! In my testing,
this commit speeds up large modpows by a factor of just over 1.5, and
it still gets a clean pass from 'testsc'.
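
The "read the whole table every time" approach can be sketched on
plain 64-bit words as follows. This is only an assumption about the
general shape, not the mp_int code itself: table_select and eq_mask
are hypothetical names, and a real implementation would also need to
check that the compiler hasn't turned the masking back into a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* All-ones if a == b, zero otherwise, computed without a branch. */
static uint64_t eq_mask(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    /* (x | -x) has its top bit set exactly when x is nonzero */
    uint64_t nonzero = (x | (0 - x)) >> 63;
    return nonzero - 1;
}

/* Hypothetical helper: return table[want], but touch every entry so
 * the memory access pattern doesn't depend on 'want'. Returns 0 if
 * 'want' is out of range. */
uint64_t table_select(const uint64_t *table, size_t n, size_t want)
{
    uint64_t out = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Mask is all-ones only for the wanted entry */
        uint64_t mask = eq_mask((uint64_t)i, (uint64_t)want);
        out |= table[i] & mask;
    }
    return out;
}
```

With a 32-entry table this does 32 masked reads per window, which is
cheap next to even one modular multiplication on large numbers.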