Commit Graph

16 Commits

Author SHA1 Message Date
Pascal Massimino
fc6c75a2a2 SSE2: 53% faster TransformColor[Inverse]
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.

Removed the SSE4.1 implementation which is now slower than SSE2.

Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
2015-06-23 14:52:01 -07:00
Pascal Massimino
49073da6d6 SSE2: 46% speed-up of TransformColor[Inverse]
Change-Id: If3bf26dc8ed32a7c03cb438e5d5fc996e2e96b5e
2015-06-23 20:09:04 +02:00
James Zern
b44eda3f60 dsp: add DSP_INIT_STUB
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning

Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147
2015-04-02 23:55:35 -07:00
James Zern
553051f741 dsp/lossless: split enc/dec functions
adds lossless_enc*.c; reduces the size of the decode-only so: ~78K
w/gcc-4.8.2 on x86_64.

Change-Id: If5e4610b67d05eba5896bc64bab79e9df92b2092
2015-03-23 22:57:50 -07:00
James Zern
1d93ddec19 dsp/lossless*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: If8b4459556e6bfaa36ef046f66520558b9444fc2
2015-03-20 19:19:46 -07:00
James Zern
b969f5dfac dsp: normalize WEBP_TSAN_IGNORE_FUNCTION usage
the attribute is only necessary in one location; remove it from the
prototypes.

Change-Id: I3820a3c34fbb029fd7ac69a1b0a9b76091bdbde2
2015-02-13 15:23:40 -08:00
James Zern
1829c42c58 cosmetics: lossless_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: I2405b18c6bb829b76c3a9814057ccbe6e14220d9
2015-02-05 23:51:44 -08:00
James Zern
a4c3a31b8f WEBP_TSAN_IGNORE_FUNCTION: fix gcc compat warning
move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition

Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
2014-10-16 18:06:43 +02:00
Pascal Massimino
80247291c6 mark some init function as being safe for thread_sanitizer.
introduces the macro WEBP_TSAN_IGNORE_FUNCTION

Change-Id: I3de2b6c1a2076fba4da7ae50322551e026b2082b
2014-10-16 16:34:07 +02:00
Vikas Arora
6d6865f0db Added SSE2 variants for Average2/3/4
The predictors based on Average2 are tad slower.

Following is the performance data for these predictors normalized to
number of instruction cycles (as per valgrind) per operation:
- Predictor6 & Predictor7 now takes 15 instruction cycles compared to 11
  instruction cycles for the C version.
- Predictor8 & Predictor9 now takes 15 instruction cycles compared to 12
  instruction cycles for the C version.

The predictors based on Average4 is faster and Average3 is tad slower:
- Predictor10 (Average4) now takes 23 instruction cycles compared to 25
  instruction cycles for the C version.
- Predictor5 (Average3) now takes 20 instruction cycles compared to 18
  instruction cycles for the C version.

Maybe SSE2 version of Average2 can be improved further. Otherwise, we can
remove the SSE2 version and always fallback to the C version.

Change-Id: I388b2871919985bc28faaad37c1d4beeb20ba029
2014-04-28 14:47:30 -07:00
Pascal Massimino
b3a616b356 make HistogramAdd() a pointer in dsp
* merged the two HistogramAdd/AddEval() into a single call
  (with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
  of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster

Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
2014-04-28 10:09:34 -07:00
Urvang Joshi
c90a902eff Add SSE2 version of forward cross-color transform
Encoding speed is roughly the same.

Change-Id: I6b976d0eb24e1847714e719762cb8403768da66c
2014-04-02 12:21:20 -07:00
Urvang Joshi
d4813f0cb2 Add SSE2 function for Inverse Cross-color Transform
Lossless decoding is now ~3% faster.

Change-Id: Idafb5c73e5cfb272cc3661d841f79971f9da0743
2014-04-01 15:52:25 -07:00
Urvang Joshi
4fd7c82e6a SSE2 variants of Subtract-Green: Rectify loop condition
When 4 pixels are left, they should be processed with SSE2.

Decoding is marginally faster (~0.4%).
Encoding speed: No observable difference.

Change-Id: I3cf21c07145a560ff795451e65e64faf148d5c3e
2014-03-31 10:51:45 -07:00
James Zern
51f406a5d7 lossless_sse2: relocate VP8LDspInitSSE2 proto
this is in line with the other dsp files and silences a build warning.

Change-Id: I03ee3032c11d4c731cc10bfa0a2dcb6866756ba2
2014-03-27 15:07:43 -07:00
skal
0f4f721b12 separate SSE2 lossless functions into its own file
expose the predictor array as function pointers instead
of each individual sub-function

+ merged Average2() into ClampedAddSubtractHalf directly
+ unified the signature as "VP8LProcessBlueAndRedFunc"

no speed diff observed

Change-Id: Ic3c45dff11884a8330a9ad38c2c8e82491c6e044
2014-03-27 21:43:55 +01:00