we compare the current VP8GetCPUInfo pointer to the last used.
This is less code overall and each implementation is still
testable separately (by just changing VP8GetCPUInfo, but not
a separate threads!)
Change-Id: Ia13fa8ffc4561a884508f6ab71ed0d1b9f1ce59b
inline function MakeARGB32 calls changed to call
via pointers to functions which make (a)rgb for
entire row
Change-Id: Ia4bd4be171a46c1e1821e408b073ff5791c587a9
and apply Paeth predictor (predictor#11) for the low effort (m=0) mode.
For 1000 image PNG corpus (m=0), this change yields speedup of 25% at lower quality
range and about 10% for higher quality range.
Change-Id: I0f036b8ffe45c241e63a067cbf01527b13d8de93
check for __apple_build_version__ to distinguish the two; a version
check could work as Apple bumped Xcode's to 5.x/6.x, but it's unclear
how upstream will deal with their versioning as they go 3.6+, so avoid
it for now.
Change-Id: I67cda67c4f68e262a92d805a63cc1496374be063
we don't need to store the whole distribution in order to compute the alpha
Later, we can incorporate the max_value / last_non_zero bookkeeping
in SSE2 directly.
Change-Id: I748ccea4ac17965d7afcab91845ef01be3aa3e15
this is a first step to unifying encoding/decoding cache stride
and possibly sharing the prediction functions in dsp/
With this layout, there's a little (~7%) space lost with unused samples.
But no speed change was observed.
Change-Id: I016df8cad41bde5088df3579e6ad65d884ee711e
~68% faster
reuses TM4() adding support for the additional rows, the columns were
already being done.
Change-Id: I6eac17e58cd1c636082bf7281f70f884ec399a6b
Move all the Entropy evaluation methods to lossless.c (from histogram.c).
There's slight difference in the way entropy is computed for evaluating
entropy in prediction methods and histogram (literal) for huffman trees.
Plan (later) to merge few (static) methods and reduce the code size.
This change has no impact on the compression speed/density.
Change-Id: Ife3d96a3c4a8d78a91723d9e0a8d1b78c0256a15
set WEBP_EXTERN to visibility=default
+ explicitly mark VP8GetCPUInfo as it's referenced within the examples
Change-Id: Ie3d2b15088e888f0b55203b205993eba75899d99
move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition
Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
avoids an ICE with NDK r10b + NDK_TOOLCHAIN_VERSION=4.9
In function 'SSE16x16':
enc_mips32.c (684) internal compiler error: Segmentation fault
Change-Id: I1a3d33c0a9534c97633ab93bcdf9bf59d3a7e473
got rid of the |a-b|^|b-a| method and went back
to just (a-b)^2 instead.
quality | size(bytes) after/before | time (ms) after/before
Change-Id: Ia3e0e6507b3f903deb1e182f78dad6df07380fd0
SSE2 version is 2.1x faster
This is used to transfer the alpha plane to green channel before lossless compression.
Change-Id: I01d9df0051c183b1ff5d6eb69961d4f43e33141a
* We were re-doing most of the work in plain-C as 'left-over'.
* we were always returning has_alpha = true because of a bad mask all_0xff
These bugs were conservative and silent, in the sense that we were 'just' doing
more work than necessary.
Now, the SSE2 version is really 2x faster than the C version.
Change-Id: I6c8132a267fe3c7a3d1fa70e7a5fcd10719543fa
Split initialization of YUV444Converters[] out of Upsamplers init.
update test for NULL function pointers
Change-Id: I9603f54250f90c85a12ffbecfd6c59e9b06c47e0
vtbl4_u8 is available everywhere except iOS arm64: use vtbl2q_u8 there
with a corresponding change in the load.
Change-Id: Ib84212dda3c7875348282726c29e3b79b78b0eac
_M_IX86 will be defined in mingw builds after including windows.h. as
the gcc inline asm is first, this missing check would only have caused
an error if the code was reorganized.
Change-Id: I395679bcfc43e94d308d1ceb0c0fbf932b2c378c
this defines WORDS_BIGENDIAN, replacing uses of
__BIG_ENDIAN__/__BYTE_ORDER__ with it
+ fixes lossless BGRA output with big-endian toolchains
that do not define __BIG_ENDIAN__ (codesourcery mips gcc)
Change-Id: Ieaccd623292d235343b5e34b7a720fc251c432d7
this will remove a warning about the shift amount not being
an immediate (=constant).
Change-Id: Ie9a00fefdb9a07ec8994fb113f24234518bc878a
Also: fix the NULL sharpen argument mismatch.
Another store to load forward block was detected coming from the function
FTransform.
FTransform save the output data 4 times 8 bytes each. when this data is
later being loaded by the QuantizeBlock function in one chunk of 16 bytes
that caused a store to load forward block.
The fix was done in the FTransform function where each two consecutive 8 bytes
were merged into one 16 bytes register and saved into the memory.
This fix gives ~21% function level gain and 1.6% user level gain.
Change-Id: Idc27c307d5083f3ebe206d3ca19059e5bd465992
only 1 of <lib>_CPPFLAGS and AM_CPPFLAGS is used, with the former
getting precedence when it's defined. configure's DEFAULT_INCLUDES is
covering what's necessary given the include paths are all source
relative.
Change-Id: I7d14076acd266b28a88a3d92bcc3d7165284d5f3
this change has the side-effect of using directory names in the
include, silencing a lint warning.
Change-Id: Ib91cf63a90534e32fadfa5c2372bfdb29f854d02
if res->first = 1, coeffs[0]=0 because of quant.c:749 and line
added at quant.c:744
So, no need for the extra case.
Going forward, TrellisQuantizeBlock() should also be calling
a variant of VP8SetResidualCoeffs() to set the 'last' field.
also: fixes a warning for win64
+ slight speed-up
Change-Id: Ib24b611f7396d24aeb5b56dc74d5c39160f048f0
+ add a WEBP_HAVE_SSE2 to dsp.h
not all 32-bit toolchain configurations will have sse2 enabled by
default
Change-Id: I7c675e511581f93cf55c79f960fa7efa2df4987e
this is used to set WEBP_USE_AVX2 in files where the build flag won't be
used, i.e., dsp/enc.c, which enables VP8EncDspInitAVX2() to be called
Change-Id: I362f4ba39ca40d3e07a081292d5f743c649d9d7f
Extract loop invariant and avoid storing/loading samples
if they can be re-used. This is particularly interesting when
a transpose is involved (HFilter16i).
Change-Id: I93274620f6da220a35025ff8708ff0c9ee8c4139
The luminance needs to be pre- and post- multiplied by
the alpha value in case of rescaling, for proper averaging.
Also:
- removed util/alpha_processing and moved it to dsp/
- removed WebPInitPremultiply() which was mostly useless
and merged it with the new function WebPInitAlphaProcessing()
Change-Id: If089cefd4ec53f6880a791c476fb1c7f7c5a8e60
VP8EncDspInitAVX2 is included in sse2 builds for now, later a configure
flag should be added to avoid the stub when avx2 is unavailable/disabled
Change-Id: I6127b687c273f46f41652aaf8e3b86ae3cfb8108
* remove some sign-bit flipping
* turn some macro into inline functions
* fix some 'const' in signatures
* clarify the int8/uint8 usage
Change-Id: Ib04459ac34cb280c57579c5d79a5efd2f8d5e99d
The predictors based on Average2 are tad slower.
Following is the performance data for these predictors normalized to
number of instruction cycles (as per valgrind) per operation:
- Predictor6 & Predictor7 now takes 15 instruction cycles compared to 11
instruction cycles for the C version.
- Predictor8 & Predictor9 now takes 15 instruction cycles compared to 12
instruction cycles for the C version.
The predictors based on Average4 is faster and Average3 is tad slower:
- Predictor10 (Average4) now takes 23 instruction cycles compared to 25
instruction cycles for the C version.
- Predictor5 (Average3) now takes 20 instruction cycles compared to 18
instruction cycles for the C version.
Maybe SSE2 version of Average2 can be improved further. Otherwise, we can
remove the SSE2 version and always fallback to the C version.
Change-Id: I388b2871919985bc28faaad37c1d4beeb20ba029
* merged the two HistogramAdd/AddEval() into a single call
(with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster
Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
move simple loop filter defines closer to their use and LOAD* to a
location common with the intrinsics
Change-Id: Iaec506d27bbc9a01be20936e30b68a4b0e690ee3
the complex loop filter has no inline equivalent; the simple loop filter
remains conditional on USE_INTRINSICS: it's left undefined for now.
Change-Id: I4f258e10458df53a7a1819707c8f46b450e9d9d2
CollectHistogram / SSE* / QuantizeBlock have no inline equivalents,
enable them where possible and use USE_INTRINSICS to control borderline
cases: it's left undefined for now.
Change-Id: I62235bc4ddb8aa0769d1ce18a90e0d7da1e18155
using this in Load4x16 was slightly slower and didn't help mitigate any
of the remaining build issues with 4.6.x.
Change-Id: Idabfe1b528842a514d14a85f4cefeb90abe08e51
HuffmanCost and HuffmanCostCombined optimized and added
'const' to some variables from ExtraCost functions.
Change-Id: I28b2b357a06766bee78bdab294b5fc8c05ac120d
Some versions of compiler in debug build can't find
a register in class 'GR_REGS' while reloading 'asm'
Number of used registers is decreased in this fix.
Change-Id: I7d7b8172b8f37f1de4db3d8534a346d7a72c5065
This is to help further optimizations.
(like in https://gerrit.chromium.org/gerrit/#/c/69787/)
There's a small slowdown (~0.5% at -z 9 quality) due to
function pointer usage. Note that, for speed, it's important
to return VP8LStreaks by value, and not pass a pointer.
Change-Id: Id4167366765fb7fc5dff89c1fd75dee456737000
avoids:
src/dsp/enc_mips32.c: In function 'ITransformOne':
src/dsp/enc_mips32.c:123:3: can't find a register in class 'GR_REGS' while reloading 'asm'
src/dsp/enc_mips32.c:123:3: 'asm' operand has impossible constraints
Change-Id: Ic469667ee572f25e502c9873c913643cf7bbe89d
apparently faster, but we might save some load/store to/from memory
once we settle for the intrinsics-based FTransform()
(also: fixed some #ifdef USE_INTRINSICS problems)
Change-Id: I426dea299cea0c64eb21c4d81a04a960e0c263c7
Functions VP8LFastLog2Slow and VP8LFastSLog2Slow
also: replaced some "% y" by "& (y-1)" in the C-version
(since y is a power-of-two)
Change-Id: I875170384e3c333812ca42d6ce7278aecabd60f0
Verified OK, but right now they don't seem faster.
So they are disabled behind a USE_INTRINSICS flag (off for now)
Change-Id: I72a1c4fa3798f98c1e034f7ca781914c36d3392c
+ reorganize the cost-evaluation code by moving some functions
to cost.h/cost.c and exposing VP8Residual
Change-Id: Id976299b5d4484e65da8bed31b3d2eb9cb4c1f7d
slightly faster than the inline asm
in practice not much faster than the C-code in a full NEON build, but
still better overall in an Android-like one that only enables NEON for
certain files.
Change-Id: I69534016186064fd92476d5eabc0f53462d53146
* inverse transform is actually slower with intrinsics + gcc-4.6,
so is left disabled for now.
With gcc-4.8, it's a bit faster than inlined assembly.
* Sum of Square error function provide a 2-3% speed up
There's enabled by default (since there's no inlined-asm equivalent)
Change-Id: I361b3f0497bc935da4cf5b35e330e379e71f498a
+ misc cosmetics
* seems 4% slower than inlined-asm with gcc-4.6
* is a tad faster (<1%) with gcc-4.8
(disabled for now)
Change-Id: Iea6cd00053a2e9c1b1ccfdad1378be26584f1095
The nice trick is to pack 8 u + 8 v samples into a single uint8x16x_t
register, and re-use the previous (luma) functions
Change-Id: Idf50ed2d6b7137ea080d603062bc9e0c66d79f38
+ added some work-around gcc-4.6 to make it compile (except one function).
+ lots of revamping
All variants tested ok.
Speed-up is ~5-7%
Change-Id: I5ceda2ee5debfada090907fe3696889eb66269c3
vertical only currently, 2.5-3% faster
placed under USE_INTRINSICS as this change depends on the simple
loopfilter
improves the simple loopfilter slightly thanks to some reorganization
Change-Id: I6611441fa54228549b21ea74c013cb78d53c7155
When 4 pixels are left, they should be processed with SSE2.
Decoding is marginally faster (~0.4%).
Encoding speed: No observable difference.
Change-Id: I3cf21c07145a560ff795451e65e64faf148d5c3e
new file: lossless_neon.c
speedup is ~5%
gcc 4.6.3 seems to be doing some sub-optimal things here,
storing register on stack using 'vstmia' and such.
Looks similar to gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
I've tried adding -fno-split-wide-types and it does help
the generated assembly. But the overall speed gets worse with
this flag. We should only compile lossless_neon.c with it -> urk.
Change-Id: I2ccc0929f5ef9dfb0105960e65c0b79b5f18d3b0
It's disable for now, because it crashes gcc-4.6.3 during compilation
with -O2 or -O3. It's been tested OK with -O1.
Code is still globally disabled with USE_INTRINSICS, though.
Change-Id: I3ca6cf83f3b9545ad8909556f700758b3cefa61c
disabled for now (but tested OK), thanks to the USE_INTRINSICS #define
We'll activate the code when we're on par with non-intrinsics
Change-Id: Idbfb9cb01f4c7c9f5131b270f8c11b70d0d485ff
expose the predictor array as function pointers instead
of each individual sub-function
+ merged Average2() into ClampedAddSubtractHalf directly
+ unified the signature as "VP8LProcessBlueAndRedFunc"
no speed diff observed
Change-Id: Ic3c45dff11884a8330a9ad38c2c8e82491c6e044
Get back some of the compression gains by extending the search space for
GetBestGreenRedToBlue. Also removed the SkipRepeatedPixels call, as it was not
helping much in yielding better compression density.
Before:
1000 files, 63530337 pixels, 1 loops => 45.0s (45.0 ms/file/iterations)
Compression (output/input): 2.463/3.268 bpp, Encode rate (raw data): 1.347 MP/s
After:
1000 files, 63530337 pixels, 1 loops => 45.9s (45.9 ms/file/iterations)
Compression (output/input): 2.461/3.268 bpp, Encode rate (raw data): 1.321 MP/s
Change-Id: I044ba9d3f5bec088305e94a7c40c053ca237fd9d
Restructure PredictorInverseTransform & ColorSpaceInverseTransform to remove
one if condition inside the main/critial loop. Also separated TransformColor &
TransformColorInverse into separate functions and avoid one 'if condition'
inside this critical method.
This change speeds up lossless decoding for Lenna image about 5% and 1000 image
corpus by 3-4%.
Change-Id: I4bd390ffa4d3bcf70ca37ef2ff2e81bedbba197d
Speedup lossless encoder by 20-25% by optimizing:
- GetBestColorTransformForTile: Use techniques like binary search and
local minima search to reduce the search space.
- VP8LFastSLog2Slow & VP8LFastLog2Slow: Adding the correction factor for
log(1 + x) and increase the threshold for calling the approximate
version of log_2 (compared to costly call to log()).
Change-Id: Ia2444c914521ac298492aafa458e617028fc2f9d
converts 2 s16 vectors to 2 u8 and store to uint8_t destination;
TransformAC3 can reuse this after a rework
Change-Id: Ia9370283ee3d9bfbc8c008fa883412100ff483d0
-> remove the 'color_transform' multiplier, use more constants, etc.
This function is particularly critical, mostly because of
GetBestColorTransformForTile().
Loop is a bit faster (maybe ~1%)
Change-Id: I90c96a3437cafb184773acef55c77e40c224388f
The WEBP_SWAP_16BIT_CSP flag needs to be honored while filling the Alpha (4 bits)
data in the destination buffer and while pre-multiplying the alpha to RGB colors.
Change-Id: I3b07307d60963db8d09c3b078888a839cefb35ba
The registers and instructions are quite different to 32bit
and the assembly code needs a rewrite.
more info: http://people.linaro.org/~rikuvoipio/aarch64-talk/
Change-Id: Id75dbc1b7bf47f43a426ba2831f25bb8fa252c4f
add TransformDC special case, and make the switch function inlined.
Recovers a few of the CPU lost during the addition of TransformAC3
(only on ARM)
Change-Id: I21c1f0c6a9cb9d1dfc1e307b4f473a2791273bd6