even with such a small amount of code, things can get incredibly complicated. I'm not saying we should all be experts in the hardware we code for, but we should at least be aware of such issues. Don't take the first measured value as the final one. Collect profiles and check that you didn't run into some architectural performance penalty.
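As a rough illustration of not trusting a single measurement, here is a minimal sketch that times a piece of code several times and reports the spread. The names `TimingSummary`, `summarize`, and `measure` are hypothetical helpers made up for this example; they are not from any library or from the benchmark discussed in this post. It assumes wall-clock timing with `std::chrono::steady_clock` is good enough for a first sanity check; it does not replace collecting a real profile.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type: summary of repeated timings, in nanoseconds.
struct TimingSummary {
    double min_ns;
    double median_ns;
    double max_ns;
};

// Reduce a non-empty set of samples to min / median / max.
inline TimingSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
    return {samples.front(), median, samples.back()};
}

// Time `fn` `runs` times (runs >= 1) and summarize the results,
// instead of relying on one number from one run.
template <typename Fn>
TimingSummary measure(Fn&& fn, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(stop - start).count());
    }
    return summarize(std::move(samples));
}
```

A large gap between `min_ns` and `max_ns`, or a median that shifts between otherwise identical builds, is a hint that something other than the code you changed is affecting the result, and that it is time to look at a profile.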
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues