The old implementation in enc/near_lossless.c performing a separate
preprocessing step is used only when a prediction filter is not used,
otherwise a new implementation integrated into lossless_enc.c is used.
It retains the same logic for converting near lossless quality into max
number of bits dropped, and for adjusting the number of bits based on
the smoothness of the image at a given pixel. As before, borders are not
changed.
Then, instead of quantizing raw component values, the residual after
subtract green and after prediction is quantized according to the
resulting number of bits, taking care to not cross the boundary between
255 and 0 after decoding. Ties are resolved by moving closer to the
prediction instead of by bankers’ rounding.
This results in about 15% size decrease for the same quality.
Change-Id: If3e9c388158c2e3e75ef88876703f40b932f671f
copy and paste error in the previous commit, change
no_sanitize("unsigned-integer-overflow") from WEBP_UBSAN_IGNORE_UNDEF ->
WEBP_UBSAN_IGNORE_UNSIGNED_OVERFLOW
Change-Id: Id178ee14df1f2c4923a91ce423241e26b60b5d32
include utils.h directly where needed to allow utils.h to rely on
defines from dsp.h in a follow-up.
Change-Id: I32e26aaeb0b04ba60b3332f685f9a2be5a0a8d3d
configure gets 2 new options:
--enable-neon / --enable-neon-rtcd
the NEON modules are split to their own convenience lib and built with
auto-detected flags if none are given via CFLAGS.
the /proc/cpuinfo check will only be used for armv7 targets whose
toolchain does not enable NEON by default or didn't have NEON forced by
the CFLAGS from the environment.
Change-Id: I2755bc1d065d5d6ee6143b44978c2082f8bef1c5
This will allow to work in-place on cropped area later.
Also sped up the inverse gradient filtering in SSE2 (~4%)
Change-Id: I463149eee95d36984328f163a1e17f8cabd87441
This is in preparation for some SSE2 code.
And generally speaking, the whole SSIM code needs some
revamp: we're not averaging the SSIM value at each pixels
but just computing the overall SSIM value once, for the whole
plane. The former might be better than the latter.
Change-Id: I935784a917f84a18ef08dc5ec9a7b528abea46a5
based on the sse2 change in:
9960c31 Remove an unnecessary transposition in TTransform.
~9-10.5% faster at the function-level, < 1% overall
Change-Id: I44413369b230b250fb0dbc51ff2f17cfeda609b7
SSE4.1 is slower than the SSE2 implementation and this seems to
be due to a slow _mm_loadl_epi64 implementation by gcc
(hence a bug with my gcc 4.8) and a very slow _mm_hadd_epi32. Both
got confirmed by IACA and experiments.
Change-Id: I05607f66b7ccd8f4f42e000693aea583ffd5768f
The transpose refactoring will help removing a transpose in a
later CL.
The horizontal add function helps removing a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19
Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
'implicit conversion from 'int' to 'short' changes value from 33050 to
-32486'
original patch:
https://codereview.chromium.org/1657313003/
Make libwebp build with -Wconstant-conversion from newer clangs.
After http://llvm.org/viewvc/llvm-project?rev=259271&view=rev, clang
points out that _mm_set1_epi16(33050) causes an overflow in the short
argument to _mm_set1_epi16(). Since there's no version that takes an
unsigned short, add an explicit cast to tell the compiler that this is
intentional.
No behavior change.
Change-Id: I6b4e3401b15cfbcc895f9e81b5c2dc59d43ffb9b
The code and logic is unified when computing bit entropy + Huffman cost.
Speed-wise, we gain 8% for lossless encoding.
Logic-wise, the beginning/end of the distributions are handled properly
and the compression ratio does not change much.
Change-Id: Ifa91d7d3e667c9a9a421faec4e845ecb6479a633
The functions containing magic constants are moved out of ./dsp .
VP8LPopulationCost got put back in ./enc
VP8LGetCombinedEntropy is now unrefined (refinement happening in ./enc)
VP8LBitsEntropy is now unrefined (refinement happening in ./enc)
VP8LHistogramEstimateBits got put back in ./enc
VP8LHistogramEstimateBitsBulk got deleted.
Change-Id: I09c4101eebbc6f174403157026fe4a23a5316beb
This implementation brings:
- an SSE implementation of packing / unpacking
- bigger buffers processed at the same time
The speedup is of 4% on lossy decoding (YUV to RGB), 0.5% on
lossy encoding (RGB to YUV was already optimized).
Change-Id: Iec677ee17f91c08614d1adab67c6df551925767f