configure gets 2 new options:
--enable-neon / --enable-neon-rtcd
the NEON modules are split to their own convenience lib and built with
auto-detected flags if none are given via CFLAGS.
the /proc/cpuinfo check will only be used for armv7 targets whose
toolchain does not enable NEON by default or didn't have NEON forced by
the CFLAGS from the environment.
Change-Id: I2755bc1d065d5d6ee6143b44978c2082f8bef1c5
This will allow to work in-place on cropped area later.
Also sped up the inverse gradient filtering in SSE2 (~4%)
Change-Id: I463149eee95d36984328f163a1e17f8cabd87441
This is in preparation for some SSE2 code.
And generally speaking, the whole SSIM code needs some
revamp: we're not averaging the SSIM value at each pixels
but just computing the overall SSIM value once, for the whole
plane. The former might be better than the latter.
Change-Id: I935784a917f84a18ef08dc5ec9a7b528abea46a5
based on the sse2 change in:
9960c31 Remove an unnecessary transposition in TTransform.
~9-10.5% faster at the function-level, < 1% overall
Change-Id: I44413369b230b250fb0dbc51ff2f17cfeda609b7
SSE4.1 is slower than the SSE2 implementation and this seems to
be due to a slow _mm_loadl_epi64 implementation by gcc
(hence a bug with my gcc 4.8) and a very slow _mm_hadd_epi32. Both
got confirmed by IACA and experiments.
Change-Id: I05607f66b7ccd8f4f42e000693aea583ffd5768f
The transpose refactoring will help removing a transpose in a
later CL.
The horizontal add function helps removing a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19
Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
'implicit conversion from 'int' to 'short' changes value from 33050 to
-32486'
original patch:
https://codereview.chromium.org/1657313003/
Make libwebp build with -Wconstant-conversion from newer clangs.
After http://llvm.org/viewvc/llvm-project?rev=259271&view=rev, clang
points out that _mm_set1_epi16(33050) causes an overflow in the short
argument to _mm_set1_epi16(). Since there's no version that takes an
unsigned short, add an explicit cast to tell the compiler that this is
intentional.
No behavior change.
Change-Id: I6b4e3401b15cfbcc895f9e81b5c2dc59d43ffb9b
The code and logic is unified when computing bit entropy + Huffman cost.
Speed-wise, we gain 8% for lossless encoding.
Logic-wise, the beginning/end of the distributions are handled properly
and the compression ratio does not change much.
Change-Id: Ifa91d7d3e667c9a9a421faec4e845ecb6479a633
The functions containing magic constants are moved out of ./dsp .
VP8LPopulationCost got put back in ./enc
VP8LGetCombinedEntropy is now unrefined (refinement happening in ./enc)
VP8LBitsEntropy is now unrefined (refinement happening in ./enc)
VP8LHistogramEstimateBits got put back in ./enc
VP8LHistogramEstimateBitsBulk got deleted.
Change-Id: I09c4101eebbc6f174403157026fe4a23a5316beb
This implementation brings:
- an SSE implementation of packing / unpacking
- bigger buffers processed at the same time
The speedup is of 4% on lossy decoding (YUV to RGB), 0.5% on
lossy encoding (RGB to YUV was already optimized).
Change-Id: Iec677ee17f91c08614d1adab67c6df551925767f
INFO: From Compiling src/dsp/cpu.c:
src/dsp/cpu.c: In function 'x86CPUInfo':
src/dsp/cpu.c:36:3: inconsistent operand constraints in an 'asm'
With PIC and mcmodel=medium, the %rbx register must be saved and
restored which causes this problem. This was also solved in GCC-4.9 with
this patch:
https://gcc.gnu.org/ml/gcc-patches/2012-12/msg01484.html
Tested:
Builds fine with this change.
Change-Id: Icca8eea7bf5af3ef9f17f6ae2886e3430143febf
CombinedShannonEntropy takes 30% for lossless compression.
This implementation speeds up the overall process by 2 to 3 %.
Change-Id: I04a71743284c38814fd0726034d51a02b1b6ba8f
* Precision is slightly different
* also implemented in SSE2 the missing WebPUpsamplers for MODE_ARGB, MODE_Argb, MODE_RGB565, etc.
* removing yuv_tables_sse2.h saved ~8k of binary size
* the mips32/mips_dsp_r2 code is disabled for now, since it has drifted away
* the NEON code is somewhat tricky
Change-Id: Icf205faa62cf46c2825d79f3af6725dc1ec7f052
Gives 0.9% smaller (2.4% compared to before alpha cleanup) size on the 1000 PNGs dataset:
Alpha cleanup before: 18856614
Alpha cleanup after: 18685802
For reference, with no alpha cleanup: 19159992
Note: WebPCleanupTransparentArea is still also called in WebPEncode. This cleanup still helps
preprocessing in the encoder, and the cases when the prediction transform is not used.
Change-Id: I63e69f48af6ddeb9804e2e603c59dde2718c6c28
for more speed.
This gives a roughly a 1% speedup for low_effort. But actually this is a
preparation for the upcoming CL that changes RGB values of transparent pixels
based on prediction, which should not be done for low_effort because that would
slightly hurt its performance.
On 1000 PNGs, with quality 0, method 0:
Before:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.034 MP/s
After:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.428 MP/s
Change-Id: I5ed9f599bbf908a917723f3c780551ceb7fd724d
The same computation was done for both values: go over two buffers,
sum them up, and take a decision on the sum at each iteration.
MIPS32 code has been disabled for now, pending a code update.
Change-Id: I997984326f7092b3dbb8cfa1e524bd8132b2ab9d
instead of per block. This prepares for a next CL that can make the
predictors alter RGB value behind transparent pixels for denser
encoding. Some predictors depend on the top-right pixel, and it must
have been already processed to know its new RGB value, so requires per
scanline instead of per block.
Running the encode speed test on 1000 PNGs 10 times with default
settings:
Before:
Compression (output/input): 2.3745/3.2667 bpp, Encode rate (raw data): 1.497 MP/s
After:
Compression (output/input): 2.3745/3.2667 bpp, Encode rate (raw data): 1.501 MP/s
Same but with quality 0, method 0 and 30 iterations:
Before:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.379 MP/s
After:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.462 MP/s
No effect on compressed size, this produces exactly same files. No
significant measured effect on speed. Expected faster speed from better
memory layout with scanline processing but slower speed due to needing
to get predictor mode per pixel, may compensate each other.
Change-Id: I40f766f1c1c19f87b62c1e2a1c4cd7627a2c3334
the problem was the incorporation of the extra constant 1<<16 in the kC1
constant, to emulate the addition. It's now removed and the addition is
performed explicitly.
No real speed difference observed.
cf. issue #278
Change-Id: I2c6499031571d98afff392fb5ebe21a5fa60722d