The luminance needs to be pre- and post- multiplied by
the alpha value in case of rescaling, for proper averaging.
Also:
- removed util/alpha_processing and moved it to dsp/
- removed WebPInitPremultiply() which was mostly useless
and merged it with the new function WebPInitAlphaProcessing()
Change-Id: If089cefd4ec53f6880a791c476fb1c7f7c5a8e60
VP8EncDspInitAVX2 is included in sse2 builds for now, later a configure
flag should be added to avoid the stub when avx2 is unavailable/disabled
Change-Id: I6127b687c273f46f41652aaf8e3b86ae3cfb8108
* remove some sign-bit flipping
* turn some macro into inline functions
* fix some 'const' in signatures
* clarify the int8/uint8 usage
Change-Id: Ib04459ac34cb280c57579c5d79a5efd2f8d5e99d
The predictors based on Average2 are tad slower.
Following is the performance data for these predictors normalized to
number of instruction cycles (as per valgrind) per operation:
- Predictor6 & Predictor7 now takes 15 instruction cycles compared to 11
instruction cycles for the C version.
- Predictor8 & Predictor9 now takes 15 instruction cycles compared to 12
instruction cycles for the C version.
The predictors based on Average4 is faster and Average3 is tad slower:
- Predictor10 (Average4) now takes 23 instruction cycles compared to 25
instruction cycles for the C version.
- Predictor5 (Average3) now takes 20 instruction cycles compared to 18
instruction cycles for the C version.
Maybe SSE2 version of Average2 can be improved further. Otherwise, we can
remove the SSE2 version and always fallback to the C version.
Change-Id: I388b2871919985bc28faaad37c1d4beeb20ba029
* merged the two HistogramAdd/AddEval() into a single call
(with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster
Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
move simple loop filter defines closer to their use and LOAD* to a
location common with the intrinsics
Change-Id: Iaec506d27bbc9a01be20936e30b68a4b0e690ee3
the complex loop filter has no inline equivalent; the simple loop filter
remains conditional on USE_INTRINSICS: it's left undefined for now.
Change-Id: I4f258e10458df53a7a1819707c8f46b450e9d9d2
CollectHistogram / SSE* / QuantizeBlock have no inline equivalents,
enable them where possible and use USE_INTRINSICS to control borderline
cases: it's left undefined for now.
Change-Id: I62235bc4ddb8aa0769d1ce18a90e0d7da1e18155
using this in Load4x16 was slightly slower and didn't help mitigate any
of the remaining build issues with 4.6.x.
Change-Id: Idabfe1b528842a514d14a85f4cefeb90abe08e51
HuffmanCost and HuffmanCostCombined optimized and added
'const' to some variables from ExtraCost functions.
Change-Id: I28b2b357a06766bee78bdab294b5fc8c05ac120d
Some versions of compiler in debug build can't find
a register in class 'GR_REGS' while reloading 'asm'
Number of used registers is decreased in this fix.
Change-Id: I7d7b8172b8f37f1de4db3d8534a346d7a72c5065
This is to help further optimizations.
(like in https://gerrit.chromium.org/gerrit/#/c/69787/)
There's a small slowdown (~0.5% at -z 9 quality) due to
function pointer usage. Note that, for speed, it's important
to return VP8LStreaks by value, and not pass a pointer.
Change-Id: Id4167366765fb7fc5dff89c1fd75dee456737000
avoids:
src/dsp/enc_mips32.c: In function 'ITransformOne':
src/dsp/enc_mips32.c:123:3: can't find a register in class 'GR_REGS' while reloading 'asm'
src/dsp/enc_mips32.c:123:3: 'asm' operand has impossible constraints
Change-Id: Ic469667ee572f25e502c9873c913643cf7bbe89d
apparently faster, but we might save some load/store to/from memory
once we settle for the intrinsics-based FTransform()
(also: fixed some #ifdef USE_INTRINSICS problems)
Change-Id: I426dea299cea0c64eb21c4d81a04a960e0c263c7