CombinedShannonEntropy takes 30% for lossless compression.
This implementation speeds up the overall process by 2 to 3 %.
Change-Id: I04a71743284c38814fd0726034d51a02b1b6ba8f
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.
Removed the SSE4.1 implementation which is now slower than SSE2.
Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning
Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147