Would you happen to remember what the optimization was, mathematically?
https://stackoverflow.com/questions/20036698/subdivide-a-modulo-function-16-bit-but-can-only-do-8-bits-at-a-time#20036828 seems to say that it's "impossible afaik", and I can't seem to optimize it myself (though this kind of math isn't my forte)
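Ahh, that makes sense. Powers of two are real convenient. Your math is a little off though: X != (X & 0xFF) + (X >> 8), but X == (X & 0xFF) + ((X >> 8) << 8). Mind the extra parentheses, since in C-like languages + binds tighter than <<. The right-hand term can be dropped entirely if you're doing modulo 16: after the shift its low 8 bits (and so its low 4 bits) are always 0, which makes it a multiple of 16. That leaves (X & 0xFF) & 15, so it simply becomes X & 15! Much cleaner for sure.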
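If it helps, here's a quick brute-force sanity check of that reasoning, as a rough sketch rather than anything definitive. It assumes X is an unsigned 16-bit value, and the function names (split_bytes, mod16_from_low_byte, check_all_16_bit_values) are made up for illustration:

```python
def split_bytes(x):
    """Split an unsigned 16-bit value into (low byte, high byte)."""
    return x & 0xFF, (x >> 8) & 0xFF


def mod16_from_low_byte(x):
    """Compute x mod 16 using only the low byte (hypothetical helper)."""
    low, _high = split_bytes(x)
    # The high byte contributes high << 8, a multiple of 256 and thus of 16,
    # so it cannot affect the remainder.
    return low & 15


def check_all_16_bit_values():
    """Exhaustively verify the identities for every unsigned 16-bit value."""
    for x in range(1 << 16):
        low, high = split_bytes(x)
        # Recombining needs the shift applied to the high byte only.
        if x != low + (high << 8):
            return False
        # Modulo 16 depends only on the low 4 bits.
        if (x % 16) != mod16_from_low_byte(x):
            return False
        if (x % 16) != (x & 15):
            return False
    return True
```

Running check_all_16_bit_values() walks all 65536 values, so it only covers the power-of-two case discussed here (modulo 16), not arbitrary moduli.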