Vincent Rabaud bf2b4f114f Regroup common SSE code + optimization.
The transpose refactoring will help remove a transpose in a
later CL.

The horizontal add function helps remove a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19

Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
2016-02-16 18:34:34 +01:00