Also optimized Blake2b SSE4.1 code size to avoid code cache pollution.
+0.15% on `rx/0`, +0.3% on `rx/wow`