Commit Graph

863 Commits

Author SHA1 Message Date
Parag Salasakar
1ebf193c2c Added MSA optimized chroma edge filtering functions
1. VFilter8
2. HFilter8
3. VFilter8i
4. HFilter8i

Change-Id: Iea5f0107178809dc31f3d9ba817e2474bd73fc0a
2016-06-22 13:51:29 +00:00
Parag Salasakar
607510967f Added MSA optimized edge filtering functions
1. VFilter16
2. HFilter16
3. VFilter16i
4. HFilter16i


Change-Id: I6a302c5ab40329c9e9bd1501a611d7267a983d81
2016-06-22 09:35:49 +00:00
Vincent Rabaud
9e8e1b7b2a Inline GetResidual for speed.
Change-Id: Ib4228e87dc448866229c0795ca68dabe777ef31c
2016-06-21 16:04:53 +02:00
Parag Salasakar
5e60c42a76 Added MSA optimized transform functions
1. TransformWHT
2. TransformTwo
3. TransformDC
4. TransformAC3

Change-Id: Ia3624cb4aed215bcaffce542b28794e643207039
2016-06-16 09:04:27 +00:00
James Zern
4c59aac0f9 Merge "mips msa webp configuration" 2016-06-09 05:39:53 +00:00
Parag Salasakar
e11da081f9 mips msa webp configuration
Change-Id: I886164d6d3d560b1249603d47391fddf20b5a3d4
2016-06-07 23:49:41 -07:00
Jovan Zelincevic
50a486656d Sync mips32 and dsp_r2 YUV->RGB code with C verison
Change-Id: Ibe12f5ef596b8922225b95c36b67955a3f8b9ae4
2016-06-03 10:42:11 +02:00
Pascal Massimino
ca8d951980 remove some obsolete TODOs
Change-Id: Ied77b2dd7e3e5bb65524c0ac7b9a3fb6585cac57
2016-06-01 16:23:16 +02:00
Pascal Massimino
77f21c9c39 Move DitherCombine8x8 to dsp/dec.c
To be later optimized in SSE2

Change-Id: I0de9c89eb5166f3319bb4b0500150de271ecac05
2016-05-24 23:14:41 -07:00
Marcin Kowalczyk
f2e1efbeb7 Improve near lossless compression when a prediction filter is used.
The old implementation in enc/near_lossless.c performing a separate
preprocessing step is used only when a prediction filter is not used,
otherwise a new implementation integrated into lossless_enc.c is used.

It retains the same logic for converting near lossless quality into max
number of bits dropped, and for adjusting the number of bits based on
the smoothness of the image at a given pixel. As before, borders are not
changed.

Then, instead of quantizing raw component values, the residual after
subtract green and after prediction is quantized according to the
resulting number of bits, taking care to not cross the boundary between
255 and 0 after decoding. Ties are resolved by moving closer to the
prediction instead of by bankers’ rounding.

This results in about 15% size decrease for the same quality.

Change-Id: If3e9c388158c2e3e75ef88876703f40b932f671f
2016-05-18 20:59:02 +00:00
James Zern
e15afbce5d dsp.h: fix ubsan macro name
copy and paste error in the previous commit, change
no_sanitize("unsigned-integer-overflow") from WEBP_UBSAN_IGNORE_UNDEF ->
WEBP_UBSAN_IGNORE_UNSIGNED_OVERFLOW

Change-Id: Id178ee14df1f2c4923a91ce423241e26b60b5d32
2016-05-13 11:09:57 -07:00
James Zern
e53c9ccb24 dsp.h: add WEBP_UBSAN_IGNORE_UNSIGNED_OVERFLOW
for suppressing expected failures with -fsanitize=integer

Change-Id: I954cba45f0c96478b770ed7a6ac7491359cae075
2016-05-12 23:51:23 -07:00
James Zern
ea0be354a0 dsp.h: remove utils.h include
include utils.h directly where needed to allow utils.h to rely on
defines from dsp.h in a follow-up.

Change-Id: I32e26aaeb0b04ba60b3332f685f9a2be5a0a8d3d
2016-05-11 23:17:21 -07:00
James Zern
ea24e026aa Merge "dsp.h: add WEBP_UBSAN_IGNORE_UNDEF" 2016-05-11 06:21:45 +00:00
James Zern
369e264e2e dsp.h: add WEBP_UBSAN_IGNORE_UNDEF
only defined when WEBP_FORCE_ALIGNED isn't. use it to quiet alignment
warnings VP8LoadNewBytes().

Change-Id: I710a74bb9375285974e97022540551a3f4eda414
2016-05-10 22:45:13 -07:00
James Zern
0d020a7892 Merge "add runtime NEON detection" 2016-05-11 05:42:13 +00:00
James Zern
5ee2136a71 Merge "add VP8LAddPixels() to lossless.h" 2016-05-10 22:39:29 +00:00
Pascal Massimino
47435a6162 add VP8LAddPixels() to lossless.h
Change-Id: I67f9118f875affa32c47adfedf9df28b0ac9957b
2016-05-10 20:30:30 +00:00
Pascal Massimino
8fa6ac68f0 remove two ubsan warnings
(regarding uint overflow)

Change-Id: I1a76e4b1268370b6b7d6a1aa93b99e57f55fd02e
2016-05-10 18:40:18 +00:00
James Zern
74fb56fb5d add runtime NEON detection
configure gets 2 new options:
--enable-neon / --enable-neon-rtcd

the NEON modules are split to their own convenience lib and built with
auto-detected flags if none are given via CFLAGS.

the /proc/cpuinfo check will only be used for armv7 targets whose
toolchain does not enable NEON by default or didn't have NEON forced by
the CFLAGS from the environment.

Change-Id: I2755bc1d065d5d6ee6143b44978c2082f8bef1c5
2016-05-06 15:32:48 -07:00
Jovan Zelincevic
4154a8395d MIPS update to new Unfilter API
Change-Id: I2b5960812954dfcabc84663382b9e032fd1eeb43
2016-05-05 15:50:34 +02:00
Pascal Massimino
2102ccd091 update the Unfilter API in dsp to process one row independently
This will allow to work in-place on cropped area later.

Also sped up the inverse gradient filtering in SSE2 (~4%)

Change-Id: I463149eee95d36984328f163a1e17f8cabd87441
2016-04-21 08:10:45 +00:00
James Zern
875aec7044 enc_neon,cosmetics: break long comment
Change-Id: I88dff0271fef1cc6dd5888572bfe0f09f467b028
2016-03-08 23:33:21 -08:00
Pascal Massimino
a90edffb7e fix missing 'extern' for SSIM function in dsp/
Change-Id: Id8143120f01065dc088f4e90bd930f8ea7c3ae5a
2016-03-08 10:27:46 -08:00
Pascal Massimino
423ecaf484 move some SSIM-accumulation function for dsp/
This is in preparation for some SSE2 code.

And generally speaking, the whole SSIM code needs some
revamp: we're not averaging the SSIM value at each pixels
but just computing the overall SSIM value once, for the whole
plane. The former might be better than the latter.

Change-Id: I935784a917f84a18ef08dc5ec9a7b528abea46a5
2016-03-08 07:50:09 +01:00
James Zern
0d40cc5ea3 enc_neon,Disto4x4: remove an unnecessary transpose
based on the sse2 change in:
9960c31 Remove an unnecessary transposition in TTransform.

~9-10.5% faster at the function-level, < 1% overall

Change-Id: I44413369b230b250fb0dbc51ff2f17cfeda609b7
2016-03-03 16:18:59 -08:00
Pascal Massimino
6753f35cac Merge "FTransformWHT optimization." 2016-02-19 09:38:04 +00:00
Vincent Rabaud
6583bb1a42 Improve SSE4.1 implementation of TTransform.
SSE4.1 is slower than the SSE2 implementation and this seems to
be due to a slow _mm_loadl_epi64 implementation by gcc
(hence a bug with my gcc 4.8) and a very slow _mm_hadd_epi32. Both
got confirmed by IACA and experiments.

Change-Id: I05607f66b7ccd8f4f42e000693aea583ffd5768f
2016-02-19 09:11:53 +01:00
Vincent Rabaud
7561d0c338 FTransformWHT optimization.
Data is packed sooner in the functions.

Change-Id: I018cfeca43f015ac755c7f209f9a97984cc0517b
2016-02-18 17:44:05 +01:00
Vincent Rabaud
8aa352b256 Merge "Remove an unnecessary transposition in TTransform." 2016-02-18 08:15:10 +00:00
Vincent Rabaud
9960c31685 Remove an unnecessary transposition in TTransform.
Change-Id: Ib715c2d5ba659cb2db9c6832875ba508cc2fca3e
2016-02-17 21:41:28 +01:00
Vincent Rabaud
6e36b51188 Small speedup in FTransform.
It removes two _mm_unpacklo_epi32 and two _mm_sub_epi16.

Change-Id: Icdf86259f796ba855d1cda5e9c0e99cb396cb351
2016-02-17 21:26:36 +01:00
Vincent Rabaud
bf2b4f114f Regroup common SSE code + optimization.
The transpose refactoring will help removing a transpose in a
later CL.

The horizontal add function helps removing a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19

Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
2016-02-16 18:34:34 +01:00
Nico Weber
3ef1ce98b9 yuv_sse2: fix -Wconstant-conversion warning
'implicit conversion from 'int' to 'short' changes value from 33050 to
-32486'

original patch:

https://codereview.chromium.org/1657313003/

Make libwebp build with -Wconstant-conversion from newer clangs.

After http://llvm.org/viewvc/llvm-project?rev=259271&view=rev, clang
points out that _mm_set1_epi16(33050) causes an overflow in the short
argument to _mm_set1_epi16().  Since there's no version that takes an
unsigned short, add an explicit cast to tell the compiler that this is
intentional.

No behavior change.

Change-Id: I6b4e3401b15cfbcc895f9e81b5c2dc59d43ffb9b
2016-02-02 14:52:11 -08:00
Pascal Massimino
6c1d763119 avoid Yoda style for comparison
Change-Id: I8ff9f96951e5e8a619f7132455dd281cbf91aa4d
2016-01-15 23:52:29 -08:00
Vincent Rabaud
8ce975ac82 SSE optimization for vector mismatch.
Change-Id: I564b822033b59d86635230f29ed6197e306a2c4f
2016-01-07 18:23:45 +01:00
Pascal Massimino
7e7b6ccc7f faster rgb565/rgb4444/argb output
SSE2 and NEON implementation.

Change-Id: I342a1c3d84937b8497f0aaecb7ce9bdb7f50296b
2015-12-17 23:38:58 -08:00
James Zern
99a01f4f8b Merge "Unify some entropy functions." 2015-12-17 22:35:29 +00:00
James Zern
4b025f10f7 Merge "configure: disable asserts by default" 2015-12-17 22:28:37 +00:00
Vincent Rabaud
ca509a3362 Unify some entropy functions.
The code and logic is unified when computing bit entropy + Huffman cost.

Speed-wise, we gain 8% for lossless encoding.
Logic-wise, the beginning/end of the distributions are handled properly
and the compression ratio does not change much.

Change-Id: Ifa91d7d3e667c9a9a421faec4e845ecb6479a633
2015-12-17 17:00:08 +01:00
Pascal Massimino
b0547ff0b4 move back common constants for lossless_enc*.c into the .h
Change-Id: I11bc979db691f6518d85e2e1c3ac7f05d69681b0
2015-12-17 15:11:56 +01:00
Vincent Rabaud
47ddd5a4cc Move some codec logic out of ./dsp .
The functions containing magic constants are moved out of ./dsp .
VP8LPopulationCost got put back in ./enc
VP8LGetCombinedEntropy is now unrefined (refinement happening in ./enc)
VP8LBitsEntropy is now unrefined (refinement happening in ./enc)
VP8LHistogramEstimateBits got put back in ./enc
VP8LHistogramEstimateBitsBulk got deleted.

Change-Id: I09c4101eebbc6f174403157026fe4a23a5316beb
2015-12-17 07:03:25 +00:00
James Zern
357f455dec yuv_sse2: fix 32-bit visual studio build
src\dsp\yuv_sse2.c : C2719: 'in': formal parameter with
  __declspec(align('16')) won't be aligned
src\dsp\yuv_sse2.c : C2719: 'out': formal parameter with
  __declspec(align('16')) won't be aligned

Change-Id: Ifd79e33b35c70748faff19cd64eba4a8ffce5a5a
2015-12-16 15:04:36 -08:00
James Zern
b9d80fa4e8 configure: disable asserts by default
--enable-asserts can be used to avoid defining NDEBUG

Change-Id: I6216668e3f79f69bd8c453f0b36cecb3b585688e
2015-12-16 13:15:53 -08:00
Vincent Rabaud
80ce27d34e Speed up 24-bit packing / unpacking in YUV / RGB conversions.
This implementation brings:
- an SSE implementation of packing / unpacking
- bigger buffers processed at the same time
The speedup is of 4% on lossy decoding (YUV to RGB), 0.5% on
lossy encoding (RGB to YUV was already optimized).

Change-Id: Iec677ee17f91c08614d1adab67c6df551925767f
2015-12-16 11:06:42 +01:00
Pascal Massimino
2dee2966df remove few obsolete TODO about aligned loads in SSE2
Change-Id: I3628602942ea2ce34dbcb85975d15afc1041f76c
2015-12-15 23:00:41 -08:00
James Zern
b105921c7d yuv_sse2, cosmetics: fix indent
+ remove unneeded header

Change-Id: I3247378fd3315d95bb3345625d3575aa9e05c1b8
2015-12-15 17:29:04 -08:00
Sriraman Tallam
b275e598b5 fix optimized build with -mcmodel=medium
INFO: From Compiling src/dsp/cpu.c:
src/dsp/cpu.c: In function 'x86CPUInfo':
src/dsp/cpu.c:36:3: inconsistent operand constraints in an 'asm'

With PIC and mcmodel=medium, the %rbx register must be saved and
restored which causes this problem.  This was also solved in GCC-4.9 with
this patch:
https://gcc.gnu.org/ml/gcc-patches/2012-12/msg01484.html

Tested:
Builds fine with this change.

Change-Id: Icca8eea7bf5af3ef9f17f6ae2886e3430143febf
2015-12-11 16:49:10 -08:00
Vincent Rabaud
2835089d6a Provide an SSE2 implementation of CombinedShannonEntropy.
CombinedShannonEntropy takes 30% for lossless compression.
This implementation speeds up the overall process by 2 to 3 %.

Change-Id: I04a71743284c38814fd0726034d51a02b1b6ba8f
2015-12-11 15:12:19 +01:00
Pascal Massimino
202a710b26 fix undefined behaviour during shift, using a cast
Change-Id: Ibca261d01092cecf8b37c54e9fcc920c9527c0a9
2015-12-10 08:09:23 +01:00
Pascal Massimino
cb1ce9969c Merge "10% faster table-less SSE2/NEON version of YUV->RGB conversion" 2015-12-09 10:41:24 +00:00
Pascal Massimino
ac761a3738 10% faster table-less SSE2/NEON version of YUV->RGB conversion
* Precision is slightly different
* also implemented in SSE2 the missing WebPUpsamplers for MODE_ARGB, MODE_Argb, MODE_RGB565, etc.
* removing yuv_tables_sse2.h saved ~8k of binary size
* the mips32/mips_dsp_r2 code is disabled for now, since it has drifted away
* the NEON code is somewhat tricky

Change-Id: Icf205faa62cf46c2825d79f3af6725dc1ec7f052
2015-12-08 20:05:56 -08:00
Lode Vandevenne
6938111357 Improved alpha cleanup for the webp encoder when prediction transform is used.
Gives 0.9% smaller (2.4% compared to before alpha cleanup) size on the 1000 PNGs dataset:
Alpha cleanup before: 18856614
Alpha cleanup after: 18685802
For reference, with no alpha cleanup: 19159992

Note: WebPCleanupTransparentArea is still also called in WebPEncode. This cleanup still helps
preprocessing in the encoder, and the cases when the prediction transform is not used.

Change-Id: I63e69f48af6ddeb9804e2e603c59dde2718c6c28
2015-12-04 13:50:56 +00:00
Pascal Massimino
2c08aac81a introduce WebPMemToUint32 and WebPUint32ToMem for memory access
it uses memcpy() when unaligned memory write is tricky

Change-Id: I5d966ca9d19e9b43ac90140fa487824116982874
2015-12-04 13:43:01 +00:00
James Zern
0837512964 Merge "Make a separate case for low_effort in CopyImageWithPrediction" 2015-12-03 08:46:31 +00:00
James Zern
aa2eb2d4a1 Merge "cosmetics: fix indent" 2015-12-03 08:44:54 +00:00
James Zern
b7551e90e1 cosmetics: fix indent
Change-Id: I67e5a0308a964bc37b2314d96f3691fc0550e9bc
2015-12-03 00:34:15 -08:00
Lode Vandevenne
5bda52d4e8 Make a separate case for low_effort in CopyImageWithPrediction
for more speed.

This gives a roughly a 1% speedup for low_effort. But actually this is a
preparation for the upcoming CL that changes RGB values of transparent pixels
based on prediction, which should not be done for low_effort because that would
slightly hurt its performance.

On 1000 PNGs, with quality 0, method 0:
Before:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.034 MP/s
After:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.428 MP/s

Change-Id: I5ed9f599bbf908a917723f3c780551ceb7fd724d
2015-12-03 00:22:50 -08:00
Pascal Massimino
363babe255 Merge "fix some warning about unaligned 32b reads" 2015-12-02 10:29:40 +00:00
Vincent Rabaud
829bd14145 Combine Huffman cost and bit entropy into one loop
The same computation was done for both values: go over two buffers,
sum them up, and take a decision on the sum at each iteration.

MIPS32 code has been disabled for now, pending a code update.

Change-Id: I997984326f7092b3dbb8cfa1e524bd8132b2ab9d
2015-11-30 13:57:25 +01:00
James Zern
a7a954c851 Merge "lossless: make prediction in encoder work per scanline" 2015-11-25 20:40:44 +00:00
Lode Vandevenne
239421c5ef lossless: make prediction in encoder work per scanline
instead of per block. This prepares for a next CL that can make the
predictors alter RGB value behind transparent pixels for denser
encoding. Some predictors depend on the top-right pixel, and it must
have been already processed to know its new RGB value, so requires per
scanline instead of per block.

Running the encode speed test on 1000 PNGs 10 times with default
settings:
Before:
Compression (output/input): 2.3745/3.2667 bpp, Encode rate (raw data): 1.497 MP/s
After:
Compression (output/input): 2.3745/3.2667 bpp, Encode rate (raw data): 1.501 MP/s

Same but with quality 0, method 0 and 30 iterations:
Before:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.379 MP/s
After:
Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.462 MP/s

No effect on compressed size, this produces exactly same files. No
significant measured effect on speed. Expected faster speed from better
memory layout with scanline processing but slower speed due to needing
to get predictor mode per pixel, may compensate each other.

Change-Id: I40f766f1c1c19f87b62c1e2a1c4cd7627a2c3334
2015-11-25 00:38:27 -08:00
Pascal Massimino
f5ca40e05f fix of undefined multiply (int32 overflow)
the problem was the incorporation of the extra constant 1<<16 in the kC1
constant, to emulate the addition. It's now removed and the addition is
performed explicitly.

No real speed difference observed.

cf. issue #278

Change-Id: I2c6499031571d98afff392fb5ebe21a5fa60722d
2015-11-24 23:18:31 -08:00
Lode Vandevenne
b8c44f1aa4 3% speed improvement for lossless webp encoder for low effort mode:
prevent updating unused histogram.

Benchmark on 1000 PNGs, 30 iterations, lossless, quality 0, method 0:
before: Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 34.578 MP/s
after: Compression (output/input): 2.9120/3.2667 bpp, Encode rate (raw data): 36.980 MP/s

Change-Id: Id62759d4d111a6ba41c85c611a15d4f6ffc9f935
2015-11-22 09:12:54 +01:00
Pascal Massimino
bfd3fc02df ~2x faster SSE2 RGB24toY, BGR24toY, ARGBToY|UV
global effect is ~2% faster encoding from JPG source
and ~8% faster lossless-webp source decoding to PGM (e.g.)

Also revamped the YUVA case to first accumulate R/G/B value into 16b
temporary buffer, and then doing the UV conversion.
-> New function: WebPConvertRGBA32ToUV

Change-Id: I1d7d0c4003aa02966ad33490ce0fcdc7925cf9f5
2015-11-06 15:02:01 -08:00
Pascal Massimino
52fdbdfe66 extract some RGB24 to Luma conversion function from enc/ to dsp/
Just for RGB24/BGR24 for now, which are the hard-to-optimize ones.
SSE2 implementation coming next.

ConvertRowToY() should go into dsp/ too, at some point.

Change-Id: Ibc705ede5cbf674deefd0d9332cd82f618bc2425
2015-10-30 00:28:11 -07:00
Pascal Massimino
ab8c2300b6 add missing \n
Change-Id: I0c9236bbeef5868629d4dc02e3fae6e79ca55949
2015-10-30 00:02:27 -07:00
Pascal Massimino
8f1fcc15af Merge "Move ARGB->YUV functions from dec/vp8l.c to dsp/yuv.c" 2015-10-29 06:38:52 +00:00
Pascal Massimino
25bf2ce5cc fix some warning about unaligned 32b reads
on x86 + gcc, the assembly code is the same.

Change-Id: Ib0d23772ccf928f8d9ebcb0e157c0573d1f6a786
2015-10-28 15:51:55 -07:00
Pascal Massimino
fa8927efe4 Move ARGB->YUV functions from dec/vp8l.c to dsp/yuv.c
also switch to using ExtractAlpha() instead of hard-coding the loop.

The ARGBToY/UV functions are rather easy to port to SSE2 / NEON.

Change-Id: I8f1346a9ca427a36ce2d6c848369ca7964d8b3c7
2015-10-28 01:45:08 -07:00
Pascal Massimino
14e4043b67 remove unnecessary #include "yuv.h"
Change-Id: I8b277433663e063e7a182f66818afec1654a39bd
2015-10-27 01:27:36 -07:00
Pascal Massimino
5aa8d61f75 Merge "MIPS: rescaler code synced with C implementation" 2015-10-17 07:52:36 +00:00
Djordje Pesut
e7fb267df7 MIPS: rescaler code synced with C implementation
Change-Id: I4cec115d3fe6f3f825084d7388249694c500256a
2015-10-17 00:16:27 -07:00
James Zern
65726cd3a7 dsp/lossless: Average2, make a constant unsigned
use 'u' rather than the unnecessary 'l' as a suffix. this prevents a
conversion warning with some toolchains

Change-Id: I21c33ce08819b3c839c75e03a8f7f3a6041d0695
2015-10-16 18:39:42 -07:00
Johann
d26d9def80 Use __has_builtin to check clang support
Older versions of Xcode with clang reporting versions 4.[012] and 5.0
did not include support for __builtin_bswap16. Checking in this manner
avoids using brittle version checks.

Matches a change to libvpx:
https://chromium-review.googlesource.com/305573
to fix:
https://code.google.com/p/webm/issues/detail?id=1082

Change-Id: I23ea466ee1b53b12cd3fb45f65a2186c8dda95a1
2015-10-14 17:48:08 -07:00
Pascal Massimino
67c547fdcd rescaler: ~20% faster SSE2 implementation for lossless ImportRowExpand
lossy (1-channel) speed-up is more on the 5% side.

Change-Id: Id19d97b9e9a34804b59604a5b48f94a37fdafd62
2015-10-14 07:32:12 +02:00
Pascal Massimino
99e3f8128a Merge "large re-organization of the delta-palettization code" 2015-10-14 05:11:47 +00:00
Pascal Massimino
95509f9914 large re-organization of the delta-palettization code
same functionality, but better code layout.

What changed:
  * don't trash the palette_[] in EncodePalette(), so it can be re-used
  * split generation of image from bit-stream coding
  * move all the delta-palette code to delta_palettization.c, and only have 1 entry point there WebPSearchOptimalDeltaPalette()
  * minimize the number of "#ifdef WEBP_EXPERIMENTAL_FEATURES" in vp8l.c
  * clarify the TransformBuffer stuff. more clean-up to come here...

This should make experimenting with delta-palettization easier and more compartimentalized.

Change-Id: Iadaa90e6c5b9dabc7791aec2530e18c973a94610
2015-10-14 00:25:42 +02:00
Pascal Massimino
74fb458bbc fix for weird msvc warning message
" warning C4098: 'RescalerImportRowShrinkSSE2' : 'void' function returning a value"

Change-Id: Ifa893502e3e4b394910e142d954393dda9d59d1a
2015-10-10 22:35:59 -07:00
Pascal Massimino
932fd4df61 SSE2 implementation of ImportRowShrink
some limitations: only for RGBA output,
and if reduction factor is not too small (dst_width > src_width / 128)

20-25% faster, ~4-6% global improvement total decoding.

Change-Id: I95366ddaa4a38e0a96bed754dfe790126f7bb84a
2015-10-09 13:04:54 -07:00
skal
b4e731cd93 neon-implementation for rescaler code
It's better to stay with a 32b fixed-point precision overall, otherwise
the C-version on ARM gets *slower*.
Actually, gcc ARM compiler optimizes some instructions pretty
well when WEBP_RESCALER_FIX is exactly 32, even in C.

Change-Id: I0eea97f7db5947470f5af355dee098eca81e178d
2015-10-07 21:18:39 -07:00
Mislav Bradac
48f66b6687 Add delta_palettization feature to WebP
Change-Id: Ibaf4e49aa67d63d0eb11848cca4fd0c60815864a
2015-10-02 14:29:54 -07:00
James Zern
5a84460d6d rescaler_mips_dsp_r2: cosmetics, fix indent
Change-Id: I59a432a66a658a74f383bd81b6f9abb5e5bb409e
2015-09-25 18:35:16 -07:00
James Zern
acde0aae5a rescaler: cosmetics, join two lines
Change-Id: Ic231dd048c82a934122ce4884180a2339f7ce2f8
2015-09-25 18:34:45 -07:00
Pascal Massimino
306ce4fde1 rescaler: move the 1x1 or 2x1 handling one level up
=> no need to handle it in the sub-functions.

Change-Id: I4b0211ecfafbc9c80a73bf2206809a13c94e7911
2015-09-25 14:35:35 -07:00
Pascal Massimino
cced974bb2 remove _mm_set_epi64x(), which is too specific
Change-Id: I4b1035f9c548b804f31c68a00b0a1aa8e13550bb
2015-09-25 14:35:33 -07:00
Pascal Massimino
56668c9fc5 fix warnings about uint64_t -> uint32_t conversion
Change-Id: Iee027979b404d4b7edda506b844d354aa1026dae
2015-09-25 17:36:11 +02:00
Pascal Massimino
76a7dc39e5 rescaler: add some SSE2 code
The rounding and arithmetic is not the same as previously, to prevent overflow cases for large upscale factors.

We still rely on 32b x 32b -> 64b multiplies. Raised the fixed-point precision to 32b
so that we have some nice shifts from epi64 to epi32.
Changed rescaler_t type to 'uint32_t' in order to squeeze in all the precision required.

The MIPS code has been disabled because it's now out-of-sync. Will be fixed in
a subsequent CL when the dust settles.
~30-35% faster

Change-Id: I32e4ddc00933f1b1aa3463403086199fd5dad07b
2015-09-25 15:07:13 +02:00
James Zern
1df1d0eedb rescaler: harmonize function protos
Change-Id: I13b5f9add83c1225c82a650f3ef717582b057247
2015-09-19 22:57:25 -07:00
Pascal Massimino
9ba1894b9b rescaler: simplify ImportRow logic
incorporates the loop over 'channel' and removes one parameter

Change-Id: I4e3b33c111ca825fe96461583420413b17326409
2015-09-19 10:07:26 -07:00
Pascal Massimino
5ff0079ece fix rescaler vertical interpolation
* vertical expansion now uses bilinear interpolation
  * heavily assumes that the alpha plane is decoded in full, not row-by-row
  * split the RescalerExportRow and RescalerImportRow methods into Shrink
    and Expand variants.
  * MIPS implementation of ExportRowExpand is missing.

There's room for extra speed optim and code re-org, but let's keep that for later patches.

addresses https://code.google.com/p/webp/issues/detail?id=254

Change-Id: I8f12b855342bf07dd467fe85e4fde5fd814effdb
2015-09-18 17:32:11 -07:00
James Zern
d623a8706f dec_neon: add whitespace around stringizing operator
prevents unintentional side-effects (though unlikely in this case) with
future compilers, cf:
eebaf97 dsp/mips: add whitespace around stringizing operator

Change-Id: I0537091fcc97b4f54d0a156c3c83a28c51456b17
2015-09-03 23:13:56 -07:00
James Zern
29377d55b6 dsp/mips: cosmetics: add whitespace around XSTR macro
normalizes formatting after:
eebaf97 dsp/mips: add whitespace around stringizing operator

Change-Id: I1e3986b6d08195d79072747eb99d7e0549aece72
2015-09-03 23:09:13 -07:00
James Zern
eebaf97f5a dsp/mips: add whitespace around stringizing operator
fixes compile with gcc 5.1
BUG=259

Change-Id: Ideb39c6290ab8569b1b6cc835bea11c822d0286c
2015-09-02 23:21:13 -07:00
James Zern
14efabbf1c Android: limit use of cpufeatures
cpufeatures is only used with armeabi-v7a.*

Change-Id: I80284061d71d9defa50d139c7f1bda67c00f567e
2015-08-19 18:44:33 -07:00
skal
bd55604d1b SSE2: add yuv444 converters, re-using yuv_sse2.c
Change-Id: I4d5c9df8a4c8e8cb8b5daa537af07382894503a8
2015-08-17 21:15:37 -07:00
James Zern
155c1b222b Merge changes I76f4d6fe,I45434639
* changes:
  lossless_enc_neon: add VP8LTransformColor
  lossless_neon: add VP8LTransformColorInverse
2015-08-06 23:00:03 +00:00
Djordje Pesut
717e4d5a7c mips32/mipsDSPr2: function ImportRow rebased
Change-Id: Id58d266040fdb5fe1e507cd0f6370ea625156e4d
2015-08-06 17:09:10 +02:00
Pascal Massimino
7df93893dc fix rescaling bug (uninitialized read, see bug #254).
the x_add/x_sub increments were wrong for u/v in the upscaling case.
They shouldn't be left to the caller's discretion, but set up by
WebPRescalerInit to their exact necessary values.

-> Cleaned-up WebPRescalerInit() param list.
-> added safety asserts
-> removed the mips32/mips_r2 variant of "ImportRow" which were buggy prior

Change-Id: I347c75804d835811e7025de92a0758d7929dfc09
2015-08-05 23:00:00 -07:00
James Zern
5cdcd561e2 lossless_enc_neon: add VP8LTransformColor
based on SSE2, ~32% faster

Change-Id: I76f4d6fe456baceba46ffebf2f699e98691eefdf
2015-08-05 00:15:13 -07:00
James Zern
a53c336919 lossless_neon: add VP8LTransformColorInverse
based on SSE2, only ~11% faster

Change-Id: I45434639d81e153f01f77c1f5d2da510b542170e
2015-08-04 23:22:36 -07:00
James Zern
99131e7f8c Merge changes I9fb25a89,Ibc648e9e
* changes:
  lossless_neon: remove predictors 5-13
  ll_enc_neon: enable VP8LSubtractGreenFromBlueAndRed
2015-08-04 02:24:15 +00:00
Pascal Massimino
c455676680 simplify the main loop for downscaling
(part of bug #254 investigation)

no speed change observed.

Change-Id: Ie21b33171def367f37643fef6a0bd378e49468c7
2015-08-03 16:57:35 +02:00
James Zern
2a010f992a lossless_neon: remove predictors 5-13
operating on single uint32's isn't helped by NEON.
this improves aarch64 performance by ~4%

Change-Id: I9fb25a8962de7b80e893e756ee7c76393cfd40c7
2015-07-28 19:44:58 -07:00
James Zern
ca221bbc48 ll_enc_neon: enable VP8LSubtractGreenFromBlueAndRed
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~54% faster at the
function level and mildly faster overall

Change-Id: Ibc648e9ee35021d48901e05aa596aa01067796a2
2015-07-28 19:44:45 -07:00
Jyrki Alakuijala
85b44d8a69 lossless: encoding, don't compute unnecessary histo
share the computation between different modes

3-5 % speedup for lossless alpha
1 % for lossy alpha

no change in compression density

Change-Id: I5e31413b3efcd4319121587da8320ac4f14550b2
2015-07-07 20:24:26 -07:00
Pascal Massimino
0ae2c2e4b2 SSE2/SSE41: optimize SSE_16xN loops
After several trials at re-organizing the main loop and accumulation scheme,
this is apparently the faster variant.

removed the SSE41 version, which is no longer faster now.
For some reason, the AVX variant seems to benefit most for the change.

Change-Id: Ib11ee18dbb69596cee1a3a289af8e2b4253de7b5
2015-07-02 20:55:04 +02:00
James Zern
39216e59d9 cosmetics: fix indent after 32462a07
Change-Id: If9a5d91c25e981bc4cd81adb476244e63fc7c3c8
2015-07-01 23:49:20 -07:00
James Zern
559e54ca60 Merge "SSE2: slightly faster FTransformWHT" 2015-07-02 06:36:33 +00:00
Pascal Massimino
8ef9a63b45 SSE2: slightly faster FTransformWHT
goes from 0.3% to 0.1% overall CPU time, but...

Change-Id: I4c9a92b1e1d6b58ed57c6b890366f1dbeaf84f84
2015-07-01 23:03:17 -07:00
James Zern
f27f773576 lossless_neon: enable VP8LAddGreenToBlueAndRed
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~70% faster at the
function level and 1-2% faster overall

Change-Id: I59fb4918ec86b1ac3a47cbd5d05ce62f007461cb
2015-07-01 22:50:54 -07:00
Pascal Massimino
36e9c4bc50 SSE2: minor cosmetrics on in-loop filter code
Change-Id: Ic0e6502081d7063bb2841df74e05c450d708aaf2
2015-06-28 11:59:22 +02:00
James Zern
4741fac42e dsp/lossless_*sse2: remove some unnecessary inlines
TransformColor / TransformColorInverse are the top-level function
pointer calls

Change-Id: Ieabdb4005ff3e4f9bb3ebcb140ccb6bef5d28f8b
2015-06-25 21:02:01 -07:00
Pascal Massimino
1819965e0a fix warning ("left shift of negative value") using a cast
Change-Id: Ie99e8ff87924a1d15e2c5d83bd9adf07dab04e94
2015-06-24 23:46:09 -07:00
Pascal Massimino
7017001462 SSE2: speed-up some lossless-encoding functions
optimized: CollectColorRedTransforms, CollectColorBlueTransforms, SubtractGreenFromBlueAndRed

overall effect is sub-1% speed-up, though.

Change-Id: I9cb49af5c56e4c03db417929b0a2cf575d60a5c6
2015-06-24 20:09:13 -07:00
Pascal Massimino
abcb012841 Merge "SSE2: slightly faster (~5%) AddGreenToBlueAndRed()" 2015-06-24 09:37:46 +00:00
Pascal Massimino
2df5bd30a6 Merge "Speedup to HuffmanCostCombinedCount" 2015-06-24 07:42:26 +00:00
Pascal Massimino
9e356d6b25 SSE2: slightly faster (~5%) AddGreenToBlueAndRed()
Change-Id: Ie147010b66544c4e959f26966ad588394302d418
2015-06-24 09:36:44 +02:00
Pascal Massimino
fc6c75a2a2 SSE2: 53% faster TransformColor[Inverse]
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.

Removed the SSE4.1 implementation which is now slower than SSE2.

Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
2015-06-23 14:52:01 -07:00
Pascal Massimino
49073da6d6 SSE2: 46% speed-up of TransformColor[Inverse]
Change-Id: If3bf26dc8ed32a7c03cb438e5d5fc996e2e96b5e
2015-06-23 20:09:04 +02:00
Pascal Massimino
32462a072c Speedup to HuffmanCostCombinedCount
~3% speedup for lossless encoding
Improves compression ratio by ~0.03%

Change-Id: Ic6d05fb0b1099b5ca56689b92b1c6515d54a5d6b
2015-06-23 16:41:03 +02:00
Pascal Massimino
f3d687e3fa SSE4.1 implementation of some lossless encoding functions
New implementations: SubtractGreenFromBlueAndRed and TransformColor

around 1-2% faster lossless encoding.

Change-Id: I1668e36fdc316ba55b3b798b91b4a3e36ce62861
2015-06-23 08:46:57 +02:00
Pascal Massimino
bfc300c7ff SSE4.1 implementation of some alpha-processing functions
DispatchAlpha* functions are hard to speed up, compared to SSE2.
ExtractAlpha sees a ~15% speed-up though.

Change-Id: I8715c2defecbc832f469eed7e6ffd012146b52de
2015-06-19 14:17:39 -07:00
Pascal Massimino
7f9c98f21d Merge "sse2 in-loop: simplify SignedShift8b() a bit" 2015-06-12 07:37:32 +00:00
James Zern
ef314a5d6c dec_sse2/GetNotHEV: micro optimization
trade 2 subtractions + logical or for 1 max + 1 subtraction

Change-Id: I7d1f25f7cda2a89bc8247f3d3d5417f6b0e3d96c
2015-06-11 22:46:24 -07:00
Pascal Massimino
a729cff987 sse2 in-loop: simplify SignedShift8b() a bit
Change-Id: Ida3e096bb41451194d03dc7a97753a222ff0135c
2015-06-11 15:26:31 -07:00
Pascal Massimino
422ec9fb62 simplify Load8x4() a bit
Change-Id: I68cf09c432f48e34bbe1d47dd091417cfd40cf4e
2015-06-10 12:35:50 -07:00
James Zern
8df238ec8a Merge "remove some duplicate FlipSign()" 2015-06-06 05:25:04 +00:00
Pascal Massimino
751506c484 remove some duplicate FlipSign()
ApplyFilter2NoFlip is the new variant of ApplyFilter2 without the sign-flip

Change-Id: I2af54bd1499118c8321183e42251d265ba76219c
2015-06-05 17:20:29 +02:00
James Zern
65ef5afc27 Merge "lossless: 0.13% compression density gain" 2015-06-03 03:02:09 +00:00
Jyrki Alakuijala
2beef2f245 lossless: 0.13% compression density gain
over a 1000 image corpus

Single photograph benchmark:
Before:
Q=20: 2.560 MP/s
Q=40: 2.593 MP/s
Q=60: 1.795 MP/s
Q=80: 1.603 MP/s
Q=99: 1.122 MP/s

After:
Q=20: 3.334 MP/s
Q=40: 2.464 MP/s
Q=60: 2.009 MP/s
Q=80: 1.871 MP/s
Q=99: 1.163 MP/s

This CL allows for some further improvements that would not be possible
otherwise.

Change-Id: I61ba154beca2266cb96469281cf96e84a4412586
2015-06-02 17:27:36 -07:00
Pascal Massimino
3033f24c26 lossless: 0.06 % compression density improvement
Change-Id: Ib662e6aec53b40d6bc736d3ecfd6475bb005c790
2015-06-02 14:51:51 +02:00
James Zern
64960da9e1 dec_neon: add VE8uv / VE16
VE8uv/VE16: ~25%/~33% faster over 20M pixels

Change-Id: Ifac1114091527a05ed10edfcc43852edff012d14
2015-05-30 13:40:00 -07:00
James Zern
14dbd87bed dec_neon: add HE8uv / HE16
HE8uv/HE16: ~91%/~83% faster over 20M pixels

Change-Id: Ib0a776f7c193593ea0993e92cfa6e6be000fb810
2015-05-30 13:39:24 -07:00
skal
ac76801159 introduce FTransform2 to perform two transforms at a time.
FTransform goes from ~12.0% to 11.5% total CPU time.

Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
aa6065aedd dec_neon: use vld1_dup(mem) rather than vdup(mem[0])
should result in slightly less general purpose register use

Change-Id: I6069f49541392e56c8db2c28c8d1fdf88c1a1726
2015-05-16 11:24:32 -07:00
Pascal Massimino
8b63ac78e0 Merge "dec_neon: add TM16" 2015-05-16 10:56:07 +00:00
Pascal Massimino
f51be09e1f Merge "dec_neon/TrueMotion: simply left border load" 2015-05-16 10:54:05 +00:00
James Zern
dc48196bd9 dec_neon: add TM16
over 20M pixels ~78% faster

Change-Id: I420d5d590f275f19e08f86df1d1caa6b82fffbde
2015-05-15 12:50:11 -07:00
James Zern
ea95b305ca dec_neon/TrueMotion: simply left border load
use vld1_dup_u8() rather than a separate ld+dup after the values were
zero extended; mildly faster at the function level

Change-Id: I1b3666a6aeb465722a1214dbc6d71c27689a7f89
2015-05-15 12:48:13 -07:00
Pascal Massimino
f262d6120e speed-up SetResidualSSE2
(was unnecessarily complicated)

Before:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 475 ms.

Change-Id: Ia54bef86c45f9f474622ff16e594bf1da4f67ebd
After:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 404 ms.
2015-05-14 21:24:24 -07:00
James Zern
bf46d0acff fix mips2 build target
tested with mips1 and mips2; this should cover 3/4 as well.
fixes an ftbfs reported on the debian issue tracker:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785000

Change-Id: I2458487c92bd638589fdfec5adb4f22102a5960c
2015-05-13 10:36:22 -07:00
James Zern
929a0fdccd enc_sse2/TTransform: simplify abs calculation
max(b, 0 - b) works as well as (b ^ sign) - b

Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
James Zern
17dbd05819 enc_sse2/CollectHistogram: simplify abs calculation
max(out, 0 - out) works as well as (out ^ sign) - out

Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
James Zern
a6c1593645 dec_neon: add DC16 intra predictors
improvement over 20M pixels:
DC16: ~77%
DC16NoTop: ~78%
DC16NoLeft: ~83%
DC16NoTopLeft: ~83%

Change-Id: I4c4ee16a8fa0eb466eee45dfa6f6bbce5ce64b99
2015-05-08 00:12:48 -07:00
James Zern
f274a96ce9 dsp/enc_sse2: add luma4 intra predictors
VP8EncPredLuma4 improvement over ~20M pixels: ~39%

Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6 dsp/enc_sse2: add chroma intra predictors
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1 dsp/enc_sse2: add luma16 intra predictors
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
4c9af02326 dec_neon: add DC8uvNoTopLeft
~93% faster

Change-Id: Icf0fd5f85ac53c306a1b69d84275023e5b24a602
2015-05-01 20:03:57 -07:00
Pascal Massimino
9287761d95 Merge "GetResidualCostSSE2: simplify abs calculation" 2015-04-30 06:30:58 +00:00
James Zern
0e009366f8 dsp/cpu.c(x86): check maximum supported cpuid feature
structured extended feature flags require eax = 7; avoids incorrectly
detecting avx2 on some older processors that support avx.
for completeness also check for value=1 support used by the other
checks.

from [1]:
INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor
Information and the Vendor Identification String

[1]
http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html

Change-Id: I60b20d661a978d551614dbf7acdc25db19cb6046
2015-04-29 23:22:53 -07:00
James Zern
b243a4bc30 GetResidualCostSSE2: simplify abs calculation
max(coeff, 0 - coeff) works as well as min/max/sub or
(coeff ^ sign) - coeff

Change-Id: I9b11715372e49cd83820677bf4beba4a1c04931c
2015-04-21 20:29:12 -07:00
James Zern
0768b252fa dsp/enc.c: cosmetics: move DST() def closer to use
Change-Id: Iccbcf046412426c2893b71eced517f611d2ffc3f
2015-04-15 20:03:39 -07:00
James Zern
9904e365a8 dsp/dec_sse2: DC8uv / DC8uvNoLeft speedup
use psadbw to perform top row summation; left remains in C as repacking
it into a vector to apply the same operation is too costly.

DC8uv: ~19% faster
DC8uvNoLeft: ~12% faster

Change-Id: I707c4f6177a65b5d1f2d3deeca87d2bb740185e2
2015-04-08 23:12:53 -07:00
James Zern
7df2049785 dsp/dec_sse2: DC16 / DC16NoLeft speedup
use psadbw to perform top row summation; left remains in C as repacking
it into a vector to apply the same operation is too costly.

DC16: ~20% faster
DC16NoLeft: ~14% faster

Change-Id: I7ec3f8a6e5923f88a530f79fceb88d5001bef691
2015-04-08 23:10:39 -07:00
James Zern
b44eda3f60 dsp: add DSP_INIT_STUB
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning

Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147
2015-04-02 23:55:35 -07:00
James Zern
1a338fb306 enc_sse41: add Disto4x4 / Disto16x16
direct translation from sse2; minor gain, fewer instructions

Change-Id: I60288a842fac1a686b82b5cab637931789fe29f2
2015-03-25 23:28:46 -07:00
Pascal Massimino
94055503e3 encoding SSE4.1 stub for StoreHistogram + Quantize + SSE_16xN
Visible speed-up, thanks to pshufb and pabsw and psignw use.

had to tweak configure.ac to make "smmintri.h" presence correctly
detected (we need to set the CPPFLAGS instead of the CFLAGS!)

Change-Id: I2ab99e16a27a64fdf1f09b2b4e30a5e74ccca080
2015-03-25 20:23:51 -07:00
Pascal Massimino
c64659e1b4 remove duplicate variables after the lossless{_enc}.c split
clang was giving "duplicate symbols" error messages at link time.

Change-Id: I2b77b55222fe033cc1d4636567902e80d814aab6
2015-03-25 11:10:21 +01:00
James Zern
67ba7c7acc enc_sse2: call local FTransform in CollectHistogram
allows the former to be inlined; negligible speed-up in most cases,
however this is structure is consistent with the rest of the optimized
modules

Change-Id: Ib080240b06f7a995b47f1906627850c355b82901
2015-03-24 20:22:24 -07:00
James Zern
182497993b dsp: s/VP8LSetHistogramData/VP8SetHistogramData/
this function is for lossy encoding; the VP8L prefix is used by lossless

Change-Id: I147590a91477a77af51ed79cc640546dfe53abdb
2015-03-24 18:27:41 -07:00
James Zern
ede5e1584c cosmetics: dsp/lossless.h: reorder prototypes
group decoding / encoding functions together, followed by their
respective Init() function.

Change-Id: Ib4d22f8ec2369efec752faf733ecf53acc67b1ca
2015-03-24 17:52:42 -07:00
James Zern
553051f741 dsp/lossless: split enc/dec functions
adds lossless_enc*.c; reduces the size of the decode-only so: ~78K
w/gcc-4.8.2 on x86_64.

Change-Id: If5e4610b67d05eba5896bc64bab79e9df92b2092
2015-03-23 22:57:50 -07:00
James Zern
cecf509662 dsp/yuv*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I42e621481be7305bb7c426b4d0b279619195611e
2015-03-20 19:19:46 -07:00
James Zern
6584d398eb dsp/upsampling*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I3c753915eefe900987c9720733efb720ebe6bfa7
2015-03-20 19:19:46 -07:00
James Zern
808094228c dsp/rescaler*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: Ife9c7cd363b3692b64a7ade1960cfce3a76c3ba2
2015-03-20 19:19:46 -07:00
James Zern
1d93ddec19 dsp/lossless*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: If8b4459556e6bfaa36ef046f66520558b9444fc2
2015-03-20 19:19:46 -07:00
James Zern
73805ff270 dsp/filters*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: Idf08ffeb2aef1392a6d69596d897a59deebb64cf
2015-03-20 19:19:46 -07:00
James Zern
fbdcef2401 dsp/enc*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I0cf40b500f9b3eed55a3211213db180c7c0dd43b
2015-03-20 19:19:46 -07:00
James Zern
66de69c1fe dsp/dec*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I319bc7714f36b8a3d8b35f6474e5592a439aaf24
2015-03-20 19:19:37 -07:00
James Zern
48e4ffd15e dsp/cost*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: Ie9bee5eaf9daebe0909ab1dda1cf1aa4ee1ef03e
2015-03-20 19:18:50 -07:00
James Zern
29fd6f90c0 dsp/argb*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I46b89909a0279172d37dbda70f731c7b9f052dad
2015-03-20 19:18:50 -07:00
James Zern
80ff38130e dsp/alpha*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I9e7f187daffe1a3b1bc92953dce980c38d1a6269
2015-03-20 19:18:41 -07:00
Pascal Massimino
e9570dd987 stub for SSE4.1 support.
Change-Id: I0c845a98d2871cc8907ff7b914bab7747a92c7ed
2015-03-20 00:26:35 -07:00
James Zern
cabf4bd2bc dsp: add sse4.1 detection
bit 19 in ecx
no targets or code

https://software.intel.com/en-us/articles/using-cpuid-to-detect-the-presence-of-sse-41-and-sse-42-instruction-sets

Change-Id: Ie61b004dd5b6a3639b30bd9d2a09e6d7359b8040
2015-03-18 19:16:47 -07:00
Sam Clegg
ac4f5784a0 Disable NEON code on Native Client
The NEON assember in libwebp has not yet been ported
to Native Client. This changes disables it.
Related issue:
https://code.google.com/p/nativeclient/issues/detail?id=3205

Change-Id: I200291db7aa79d40c1f10cff7622c9b8599e6886
2015-03-10 16:17:25 -07:00
Djordje Pesut
241bb5d9d9 MIPS: dspr2: added optimization for TrueMotion
affected functions:
      TM4 - TrueMotion4
      TM8uv - TrueMotion8
      TM16 - TrueMotion16

Change-Id: Iff4377c4b0ae94716789c03fe1cd5bfd91f79188
2015-02-26 10:22:19 +01:00
Djordje Pesut
b5e79422d5 MIPS: dspr2: Added optimization for some convert functions
affected functions:
      VP8LConvertBGRAToRGBA4444_C
      VP8LConvertBGRAToRGB565_C
      VP8LConvertBGRAToBGR_C

Change-Id: I81513d242d33ebb9fef397ee6a2ca75d17f66e97
2015-02-24 10:51:34 +01:00
Djordje Pesut
0f595db60c MIPS: dspr2: Added optimization for some convert functions
affected functions:
  VP8LConvertBGRAToRGB_C
  VP8LConvertBGRAToRGBA_C

Change-Id: I5f25795c385688f2432d0710296e589f3793cb2b
2015-02-23 17:44:06 +01:00
Djordje Pesut
8a218b4a96 MIPS: [mips32|dspr2]: GetResidualCost rebased
Change-Id: Ie15524c773f7a8c79e002097881a508187ca7cc6
2015-02-23 10:43:42 +01:00
James Zern
602a00f93f fix iOS arm64 build with Xcode 6.3
the standard vtbl functions are available there [1][2].
based on a patch from: aaroncrespo
fixes issue #243.

[1]
http://adcdownload.apple.com//Developer_Tools/Xcode_6.3_beta/Xcode_6.3_beta_Release_Notes.pdf
[2] Apple LLVM Compiler Version 6.1
- Xcode 6.3 updates the Apple LLVM compiler to version 6.1.0.
[...]
Support for the arm64 architecture has been significantly revised to
align with ARM's implementation, where the most visible impact is that a
few of the vector intrinsics have changed to match ARM's specifications.

Change-Id: I79a0016f44b9dbe36d0373f7f00a50ab3c2ca447
2015-02-19 12:16:58 -08:00
Pascal Massimino
2382050748 1-2% faster encoding by removing an indirection in GetResidualCost()
The MIPS code for cost is not updated yet, that's why i keep Residual::*cost
around for now. Should be removed in favor of *costs later.

Change-Id: Id1d09a8c37ea8c5b34ad5eb8811d6a3ec6c4d89f
2015-02-19 08:44:35 +01:00
Djordje Pesut
eddb7e70be MIPS: dspr2: added otpimization for DC8uv, DC8uvNoTop and DC8uvNoLeft
added macros for load/store

Change-Id: I151d4d49bf1fab87fc3a82cb8e8e0835fe10b690
2015-02-18 18:24:10 +01:00
Djordje Pesut
73ba29158f MIPS: dspr2: added optimization for functions RD4 and LD4
Change-Id: I71216c1300f4eb254de4ae940ea9dcdba50aa080
2015-02-18 15:11:34 +01:00
Pascal Massimino
c7129da5b6 Merge "4-5% faster encoding using SSE2 for GetResidualCost" 2015-02-18 04:46:53 -08:00
Djordje Pesut
94380d00d9 MIPS: dspr2: added optimizaton for functions VE4 and DC4
Change-Id: I118adc6d3872742d8b1f9dbac438cba6fc90b7a9
2015-02-18 11:25:08 +01:00
Pascal Massimino
2a407092ab 4-5% faster encoding using SSE2 for GetResidualCost
new file: cost_sse2.c

Change-Id: I4896c07f5ff2443ef743f4435fe2758d95a672ed
2015-02-18 09:41:02 +01:00
James Zern
17e1986214 Merge "MIPS: dspr2: added optimization for simple filtering functions" 2015-02-17 14:57:05 -08:00
pascal massimino
3ec404c47a Merge "dsp: normalize WEBP_TSAN_IGNORE_FUNCTION usage" 2015-02-14 01:57:08 -08:00
James Zern
b969f5dfac dsp: normalize WEBP_TSAN_IGNORE_FUNCTION usage
the attribute is only necessary in one location; remove it from the
prototypes.

Change-Id: I3820a3c34fbb029fd7ac69a1b0a9b76091bdbde2
2015-02-13 15:23:40 -08:00
Djordje Pesut
d7b8e71126 MIPS: dspr2: added optimization for simple filtering functions
affected functions: SimpleVFilter16, SimpleHFilter16,
                    SimpleVFilter16i and SimpleHFilter16i

noticed bug in FilterLoop26 (fix included in this patch)

Change-Id: I72d9c1e45cbac6393eba52bb549b04924d463e30
2015-02-13 11:18:43 +01:00
Djordje Pesut
42a8a6280c MIPS: dspr2: Added optimization for function VP8LTransformColorInverse_C
Change-Id: I8b60e22c9f6c0badab6267a33751dfc28750f457
2015-02-13 08:57:20 +01:00
James Zern
3030f11525 Merge "dsp/mips: add some missing TSan annotations" 2015-02-12 14:55:32 -08:00
pascal massimino
dfcf4593fe Merge "MIPS: dspr2: Added optimization for function VP8LAddGreenToBlueAndRed_C" 2015-02-12 14:48:17 -08:00
James Zern
55c75a25f0 dsp/mips: add some missing TSan annotations
Change-Id: I3c832aefdeac26c6c75c35b19b45c1a2f67493c5
2015-02-12 14:36:33 -08:00
Djordje Pesut
2cb879f0c6 MIPS: dspr2: Added optimization for function VP8LAddGreenToBlueAndRed_C
Change-Id: If897c6c2f1c4b8405789298e135d6a1e4bf13012
2015-02-12 09:06:49 +01:00
James Zern
e15560107c move some cost tables from enc/ to dsp/
removes circular dependency between dsp and enc.

since:
a987fae MIPS: dspr2: added optimization for function GetResidualCost

Change-Id: Ifeb8fc02de89e2ba982ed7ffacd925d649bfec3c
2015-02-11 16:10:06 -08:00
pascal massimino
39537d7cfe Merge "VP8LDspInitMIPSdspR2: add missing TSan annotation" 2015-02-10 00:02:41 -08:00
James Zern
43fd3543df VP8LDspInitMIPSdspR2: add missing TSan annotation
Change-Id: Ic0d84e95daf063976b40fb5ba1e94d3547e2afba
2015-02-09 23:55:30 -08:00
pascal massimino
c7233dfcdc Merge "VP8LDspInit: remove memcpy" 2015-02-09 23:48:44 -08:00
James Zern
35579a4902 VP8LDspInit: remove memcpy
without this change the TSan annotation is useless

Change-Id: Ief511379f3aad75889815d4fe8362aed5c1abac7
2015-02-09 23:41:24 -08:00
James Zern
97f6aff874 VP8YUVInit: add missing TSan annotation
Change-Id: I7f8868de425e1aac3721b3e328844725104d14db
2015-02-09 22:50:31 -08:00
James Zern
f9016d6662 dsp/enc::InitTables: add missing TSan annotation
Change-Id: I262b9071417a0ec502c7c0380f27da6413cc74e4
2015-02-09 22:40:45 -08:00
James Zern
e3d9771aa1 VP8EncDspCostInit*: add missing TSan annotations
Change-Id: I4cdb84bc8c9a8c6aa34b5773c8fb69e5810a9809
2015-02-09 22:39:14 -08:00
Djordje Pesut
309b790867 MIPS: mips32: Added optimization for function SetResidualCoeffs
Change-Id: If67c10285df71ba7dd1aff6c24c2145c280dd2bf
2015-02-09 13:17:49 +01:00
Pascal Massimino
a987faedfa MIPS: dspr2: added optimization for function GetResidualCost
set/get residual C functions moved to new file in src/dsp
mips32 version of GetResidualCost moved to new file

Change-Id: I7cebb7933a89820ff28c187249a9181f281081d2
2015-02-07 02:13:26 -08:00
James Zern
c24d8f144f cosmetics: upsampling_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: Ia6df6690c85f580b20f19ce85cc6ec7b52620aee
2015-02-05 23:51:57 -08:00
James Zern
1829c42c58 cosmetics: lossless_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: I2405b18c6bb829b76c3a9814057ccbe6e14220d9
2015-02-05 23:51:44 -08:00
James Zern
183168f332 cosmetics: enc_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: Ib85d63abbb9fc33096f893c2524d3ce8ae3ebd03
2015-02-05 23:51:29 -08:00
James Zern
860badcacc cosmetics: dec_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: I77decb55f1382bea4b646a11b77dfa40bf1ef94d
2015-02-05 23:51:16 -08:00
James Zern
0254db9793 cosmetics: argb_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: I87fa2de11dafcc77767aab64e13b8c5585ebf5cd
2015-02-05 23:51:07 -08:00
James Zern
1aadf856c9 cosmetics: alpha_processing_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: I00ba15af2f43d125ceb2620e82fd43d420fbb9d3
2015-02-05 23:50:39 -08:00
Pascal Massimino
19f0ba0eb9 Implement true-motion prediction in SSE2
(along with DC/HE/VE for chroma/luma16)

Overall effect is ~1% faster decoding.

Change-Id: I90917e050d61874cbc8da0e88f26b5dd6131c265
2015-02-04 17:02:22 +01:00
Pascal Massimino
774d4cb758 make VP8PredLuma16[] array non-const
Change-Id: I0ce7e4e847f9fffefb6544db9636068442a2d264
2015-02-04 17:00:22 +01:00
Djordje Pesut
6ce296da12 MIPS: dspr2: Added optimization for function CollectHistogram
Change-Id: Id6b87ea1c9d21fee9494ad6c53ffc84ef60d5974
2015-02-03 14:11:20 +01:00
James Zern
b7de794622 Merge "lossless_neon: enable subtract green for aarch64" 2015-02-02 16:01:40 -08:00
Pascal Massimino
77724f70e9 SSE2 version of GradientUnfilter
somewhat 1-2% faster decoder for lossy+alpha

Change-Id: Ib317e26e9fcb8d37af02668ffbfccc4664e659fe
2015-01-31 23:18:00 +01:00
James Zern
416e1cea9b lossless_neon: enable subtract green for aarch64
similar to:
1ba61b0 enable NEON intrinsics in aarch64 builds

vtbl1_u8 is available everywhere but Xcode-based iOS arm64 builds, use
vtbl1q_u8 there.

performance varies based on the input, 1-3% on encode was observed

Change-Id: Ifec35b37eb856acfcf69ed7f16fa078cd40b7034
2015-01-31 11:32:05 -08:00
Pascal Massimino
022d2f886c add SSE2 variants for alpha filtering functions
The 'inverse' variants are harder to parallelize, since
the result of filtering is used for prediction.
The 'direct' way is relatively easier.

The heavy bottleneck left for optimization is still GradientUnfilter()

Change-Id: I358008f492a887e8fff6600cb27857b18dee86e9
2015-01-29 08:46:22 +01:00
Pascal Massimino
7afdaf8496 Alpha coding: reorganize the filter/unfiltering code
Move the filtering code to their own dsp/ spot
New function: VP8FiltersInit()

Change-Id: I0b2041eab42346c59b972f2575b05509e6a8f7b1
2015-01-28 08:02:41 +01:00
Djordje Pesut
da0912126b MIPS: dspr2: Added optimization for function FTransformWHT
Change-Id: I918366cd1908304068c66da9965efb0aa63320cd
2015-01-19 10:15:13 +01:00
pascal massimino
daeb276a2b Merge "MIPS: dspr2: Added optimization for MultARGBRow function" 2015-01-17 08:56:01 -08:00
James Zern
0de5f33e31 dsp/cpu: (msvs) add include for __cpuidex
and only use it on x86 / x64 where it's available.
has the side-effect of quieting a msvs /analyze warning:
C6001: Using uninitialized memory 'cpu_info'.

Change-Id: Iae51be3b22b2ee949cfc473eeea9fd9fb6b3c2cb
2015-01-16 18:16:10 -08:00
Djordje Pesut
7d850f7b9a MIPS: dspr2: Added optimization for MultARGBRow function
Change-Id: Ide549ae0d80413bea8c19fe091d97bffe8b17985
2015-01-16 15:56:34 +01:00
Djordje Pesut
5487529368 MIPS: dspr2: added optimization for function QuantizeBlock
Change-Id: Id217116890b7408d23464216608ce67ae545688a
2015-01-16 12:51:13 +01:00
James Zern
4fbe9cf202 dsp/cpu: (msvs) avoid immintrin.h on _M_ARM
_xgetgv() isn't relevant there anyway

broken since:
279e661 Merge "dsp/cpu: add include for _xgetbv() w/MSVS"

Change-Id: Iaa7bc0c5be9c06bfffab39e194c64c09bf5b5a27
2015-01-15 23:04:08 -08:00
Pascal Massimino
3fd59039bd simplify/reorganize arguments for CollectColorBlueTransforms
and other various call sites too.

Change-Id: Icb8f828dfe25672662de18d0e48e7d3144b1f38d
2015-01-15 18:12:12 -08:00
Djordje Pesut
a7e7caa486 MIPS: dspr2: added optimization for function TransformColorRed
added new function CollectColorRedTransforms to C, which calls
TransformColorRed and it is realized via pointer to function

Change-Id: Ia68d73bfcf1ca2cb443dc2825910946221f87835
2015-01-15 09:32:09 +01:00
pascal massimino
2cb39180cc Merge "MIPS: dspr2: added optimization for function TransformColorBlue" 2015-01-15 00:06:01 -08:00
pascal massimino
279e66138d Merge "dsp/cpu: add include for _xgetbv() w/MSVS" 2015-01-15 00:05:35 -08:00
James Zern
b6c0428e8c dsp/cpu: add include for _xgetbv() w/MSVS
explicitly add immintrin.h instead of transitively picking it up via
windows.h presumably. makes the code easier to move around.

Change-Id: If70d5143ac94fc331da763ce034358858e460e06
2015-01-14 23:31:35 -08:00
Djordje Pesut
7b16197361 MIPS: dspr2: added optimization for function TransformColorBlue
added new function CollectColorBlueTransforms to C, which calls
TransformColorBlue and it is realized via pointer to function

Change-Id: Ia488b7a7a689223b5d33aae9724afab89b97fced
2015-01-13 10:39:38 +01:00
James Zern
d7c4b02a57 cpu: fix AVX2 detection for gcc/clang targets
ecx needs to be set to 0; the visual studio builds were already doing
this.

https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family

Change-Id: I95efb115b4d50bbdb6b14fca2aa63d0a24974e55
2015-01-12 17:58:57 -08:00
Pascal Massimino
d581ba40ba follow-up: clean up WebPRescalerXXX dsp function
by removing redundant RFIX macros and using a plain-C fallback.

Change-Id: I52436c672bf20780b6fe3bcf43fe73e1abac10ff
2015-01-12 15:26:55 -08:00
James Zern
f8740f0d6c dsp: s/USE_INTRINSICS/WEBP_USE_INTRINSICS/
for consistency with other defines shared across modules

Change-Id: I30cdb9f892e9ea48265883f560500ffb1d6799ee
2015-01-12 14:27:36 -08:00
Pascal Massimino
ab66becaae introduce a separate WebPRescalerDspInit to initialize pointers
so that we keep the details of WebPRescaler in utils/rescaler.c
when possible.

Change-Id: Ib6c1029a09b84cbc7a7d2f70dafa4d4d9132cecc
2015-01-12 13:58:30 -08:00
pascal massimino
cbcdd5ffaf Merge "move rescaler functions to rescaler* files in src/dsp/" 2015-01-10 05:41:45 -08:00
pascal massimino
bf586e8844 Merge changes I230b3532,Idf3057a7
* changes:
  enable NEON for Windows ARM builds
  Makefile.vc: add rudimentary Windows ARM support
2015-01-10 02:14:48 -08:00
James Zern
4f43d38ca8 enable NEON for Windows ARM builds
Change-Id: I230b353214ce44ab29ffd2df6ccd14345d6578e8
2015-01-09 19:11:55 -08:00
James Zern
e7c5954c10 dec_neon: remove returns from void functions
Change-Id: I3c66a5dfe3de2bb3653cbbf1b92b0328aba62881
2015-01-09 18:08:05 -08:00
Djordje Pesut
cbcbedd0de move rescaler functions to rescaler* files in src/dsp/
Change-Id: I906add1b1010a59ebfcc2dd81e15745433cc206b
2015-01-09 16:47:09 +01:00
Djordje Pesut
a28c4b363d MIPS: move WORK_AROUND_GCC define to appropriate place
Change-Id: I3055eca57dc4e9d39533a5b8170bbf7af9cd818f
2015-01-08 15:55:41 +01:00
Djordje Pesut
012d2c60fa MIPS: dspr2: added optimization for functions SSEAxB
list of optimized functions: SSE16x16, SSE8x8, SSE16x8, SSE4x4

Change-Id: Ie99e7cdd73b0d4ff855977315a5d0db9ffaa5f04
2015-01-08 13:49:17 +01:00
Djordje Pesut
9241ecf45d MIPS: dspr2: added optimization for function Average
Change-Id: I7ca316bc3f5fbdaf8dcaf9a2d2227a5134bf4f63
2015-01-08 11:46:15 +01:00
pascal massimino
c6d3292738 argb_sse2: cosmetics
clarify some variable names in PackARGB() + add some comments

Change-Id: I2bb91d6c52dcbcdebe0f92d5f2136c2d7d11af2a
2015-01-08 00:18:54 -08:00
James Zern
67f601cd46 make the 'last_cpuinfo_used' variable names unique
allows the sources to be #include'd in some hackish builds (don't do
that!)

Change-Id: I0c7a43acbebd0e2d5068845e6daa8ce47361cd91
2015-01-07 23:38:53 -08:00
Pascal Massimino
9592053859 Merge "multi-thread fix: lock each entry points with a static var" 2015-01-07 00:03:51 -08:00
Pascal Massimino
4c1b300ada Merge "SSE2 implementation of VP8PackARGB" 2015-01-06 23:53:50 -08:00
James Zern
04c20e75ea Merge "MIPS: dspr2: added optimization for function Intra4Preds" 2015-01-06 16:15:10 -08:00
Pascal Massimino
a437694a17 multi-thread fix: lock each entry points with a static var
we compare the current VP8GetCPUInfo pointer to the last used.
This is less code overall and each implementation is still
testable separately (by just changing VP8GetCPUInfo, but not
a separate threads!)

Change-Id: Ia13fa8ffc4561a884508f6ab71ed0d1b9f1ce59b
2015-01-05 07:48:49 -08:00
Pascal Massimino
ca7f60db5f SSE2 implementation of VP8PackARGB
Change-Id: I40c0e26a6a2701216e4ddebcf793aa535677f437
2015-01-05 05:17:51 -08:00
Pascal Massimino
72d573f693 simplify the PackARGB signature
Change-Id: I51570e362126b2681f93211a4f59a3fedb5fd4b5
2015-01-05 02:10:04 -08:00
James Zern
f8abb112f2 Merge changes I109ec4d9,I73fe7743
* changes:
  dec_neon: add DC8uvNoTop / DC8uvNoLeft
  dec_neon: add DC8uv
2014-12-23 09:11:22 -08:00
Djordje Pesut
ae2188a435 MIPS: dspr2: added optimization for function Intra4Preds
Change-Id: Ie2a23c356a8715817b020fbee2b40e878e2946de
2014-12-23 17:32:27 +01:00
James Zern
14108d7878 dec_neon: add DC8uvNoTop / DC8uvNoLeft
adds do_top/do_left flags to DC8uv; ~88% / ~92% faster respectively
no change in DC8uv speed.

Change-Id: I109ec4d9ad13c9db64516e98ed4693a21a3e9b54
2014-12-22 15:47:38 -05:00
James Zern
d8340da756 dec_neon: add DC8uv
~87% faster.

Change-Id: I73fe77437792f1361ce8ab0b411132c6ec0fa021
2014-12-22 14:36:45 -05:00
Djordje Pesut
7ce8788b06 MIPS: dspr2: added optimization for function MakeARGB32
inline function MakeARGB32 calls changed to call
via pointers to functions which make (a)rgb for
entire row

Change-Id: Ia4bd4be171a46c1e1821e408b073ff5791c587a9
2014-12-22 12:31:36 +01:00
Pascal Massimino
87c3d53180 method=0: Don't evaluate any predictor
and apply Paeth predictor (predictor#11) for the low effort (m=0) mode.

For 1000 image PNG corpus (m=0), this change yields speedup of 25% at lower quality
range and about 10% for higher quality range.

Change-Id: I0f036b8ffe45c241e63a067cbf01527b13d8de93
2014-12-17 18:41:08 +01:00
Pascal Massimino
31a9cf6417 Speedup WebP lossless compression for low effort (m=0) mode with following:
- Disable Cross-Color transform.
- Evaluate predictors #11 (paeth), #12 and #13 only.

Change-Id: I857264c85c61c3957d4fb45ae32d261d947c8bed
2014-12-17 11:52:11 +01:00
Djordje Pesut
9275d91c79 MIPS: dspr2: added optimization for function TrueMotion
Change-Id: Id006d9591c0c922e28f7f4c01e4006f0f07bdd56
2014-12-12 14:38:55 +01:00
James Zern
a3946b8956 enc_neon: fix building with non-Xcode clang (iOS)
check for __apple_build_version__ to distinguish the two; a version
check could work as Apple bumped Xcode's to 5.x/6.x, but it's unclear
how upstream will deal with their versioning as they go 3.6+, so avoid
it for now.

Change-Id: I67cda67c4f68e262a92d805a63cc1496374be063
2014-12-10 15:50:26 -08:00
Pascal Massimino
8ed9c00d5e Merge "simplify the Histogram struct, to only store max_value and last_nz" 2014-12-10 02:02:05 -08:00
Pascal Massimino
bad775715a simplify the Histogram struct, to only store max_value and last_nz
we don't need to store the whole distribution in order to compute the alpha

Later, we can incorporate the max_value / last_non_zero bookkeeping
in SSE2 directly.

Change-Id: I748ccea4ac17965d7afcab91845ef01be3aa3e15
2014-12-10 10:44:57 +01:00
Djordje Pesut
3cca0dc7f0 MIPS: dspr2: Added optimization for DCMode function
Change-Id: I8ea31907c1ea1259ec4db8cee1a479bd13a025a1
2014-12-09 13:58:39 +01:00
Djordje Pesut
37e395fd1c MIPS: fix functions to use generic BPS istead of hardcoded value
Change-Id: I2d68abef886eff7f8df230f155b758dccd7d04fd
2014-12-05 15:55:47 +01:00
Pascal Massimino
4a279a680e cosmetics: add some missing != NULL comparisons
Change-Id: I55f8da527e5e8ee4b49c7e7aa0d61ea4a6c80904
2014-12-04 14:54:11 +01:00
Pascal Massimino
66ad372500 factorize BPS definition in dsp.h and add VP8Copy16x8
Change-Id: Id73a1e968c96455808755df4d131d74e3e2e135d
2014-12-04 13:45:14 +01:00
Pascal Massimino
57606047ec encoder: switch BPS to 32 instead of 16
this is a first step to unifying encoding/decoding cache stride
and possibly sharing the prediction functions in dsp/

With this layout, there's a little (~7%) space lost with unused samples.
But no speed change was observed.

Change-Id: I016df8cad41bde5088df3579e6ad65d884ee711e
2014-12-04 09:17:18 +01:00
Djordje Pesut
1b66bbe998 MIPS: dspr2: added optimization for function TransformColor_C
Change-Id: Idbf5cecf6775340585b0fd7e6ddcb29c2fcbea36
2014-12-01 15:46:06 +01:00
James Zern
9de9074c92 dec_neon: add TM8uv
~68% faster

reuses TM4() adding support for the additional rows, the columns were
already being done.

Change-Id: I6eac17e58cd1c636082bf7281f70f884ec399a6b
2014-11-25 14:40:17 -08:00
James Zern
e18571393d dsp: initialize VP8PredChroma8 in VP8DspInit()
the table becomes non-const to allow for platform-specific optimizations

Change-Id: I32d2b51480020dc653ecfafd20b6b0f096af349f
2014-11-24 22:12:42 -08:00
Vikas Arora
e0c809ad23 Move Entropy methods to lossless.c
Move all the Entropy evaluation methods to lossless.c (from histogram.c).
There's slight difference in the way entropy is computed for evaluating
entropy in prediction methods and histogram (literal) for huffman trees.
Plan (later) to merge few (static) methods and reduce the code size.

This change has no impact on the compression speed/density.

Change-Id: Ife3d96a3c4a8d78a91723d9e0a8d1b78c0256a15
2014-11-20 13:48:05 -08:00
Djordje Pesut
2f0e2ba826 MIPS: dspr2: added optimization for function Select
Change-Id: I22470d8b9ab8c5e90c5330ff12c9852676da1a3d
2014-11-07 09:44:16 +01:00
Djordje Pesut
54f2c14cce MIPS: dspr2: added optimization for function FTransform
Change-Id: Ib5850edbc2a586ec9781f494b2337f024e22af78
2014-11-06 14:21:33 +01:00
Djordje Pesut
aa42f4231f MIPS: dspr2: Added optimization for function VP8LSubtractGreenFromBlueAndRed
Change-Id: I683c73cceee4a40ca810deba15e54fbf7dbe8918
2014-11-06 10:56:18 +01:00
Djordje Pesut
95ca44a718 MIPS: dspr2: added optimization for Disto4x4
enc/dec common macros moved to mips_macro.h

Change-Id: I38d491e772554ac663dd5eb4d15485c0343f23b1
2014-11-05 12:06:15 +01:00
Djordje Pesut
5798eee6be MIPS: dspr2: unfilters bugfix (Ie7b7387478a6b5c3f08691628ae00f059cf6d899)
Change-Id: I78d97960efbd1ec1af51a5426e38dc01bdb48140
2014-11-03 15:39:00 +01:00
James Zern
572022a350 filters_mips_dsp_r2.c: disable unfilters
the output does not match the C-code.

Change-Id: Ie7b7387478a6b5c3f08691628ae00f059cf6d899
2014-10-30 11:10:11 +01:00
Djordje Pesut
a28e21b141 MIPS: dspr2: Added optimization for function ClampedAddSubtractFull
Change-Id: Iee98eaf007158f44a299dd5ba8d972d0d4108380
2014-10-29 13:08:06 +01:00
Djordje Pesut
18d5a1efa8 MIPS: dspr2: added optimization for function ClampedAddSubtractHalf
Change-Id: Iec22e897a4f56e79c18ec00f8caa9cefac67f186
2014-10-29 11:08:37 +01:00
Djordje Pesut
829a8c19a0 MIPS: dspr2: added optimization for ITransform
Change-Id: I3534fca143535c53d18a3749b3a1b0c8a7563463
2014-10-28 14:28:14 +01:00
James Zern
22881c999e dec_neon: add RD4 intra predictor
based on the SSE2 version; a bit rough around the loads, but still ~38%
faster.

Change-Id: I22426d939a7354cbc9a85ca8c68235d6081b882f
2014-10-24 21:22:07 +02:00
James Zern
1304eb3418 Merge "dec_neon: DC4: use pair-wise adds for top row" 2014-10-23 08:08:34 -07:00
James Zern
0db9031c79 dsp/dec_{neon,sse2}: VE4: normalize variable names
use '0' rather than '_' when dealing with variables that result from a
shift

Change-Id: I29280c0dead645ce39dc4bb42c3e19929b302fd4
2014-10-23 16:04:13 +02:00
James Zern
b5bc15305b dec_neon: DC4: use pair-wise adds for top row
reduces load count, slightly faster

Change-Id: I880340ef8ef75ce4ce321c330f56f86b758bda08
2014-10-23 15:48:49 +02:00
James Zern
eba6ce06c3 dec_neon: add DC4 intra predictor
~70% faster

Change-Id: I2e06907b8d69be71a8c5581832c931923c24bab0
2014-10-23 14:21:08 +02:00
James Zern
79abfbd9df dec_neon: add TM4 intra predictor
~21% faster

Change-Id: Ia9ed4ca650f9d544821fa1faf3173611806a272a
2014-10-23 14:21:08 +02:00
James Zern
fe395f0e4d dec_neon: add LD4 intra predictor
based on SSE2 version, ~55% faster

Change-Id: I782282ffc31dcf238890b3ba0decccf1d793dad0
2014-10-23 14:20:47 +02:00
James Zern
32de385eca dec_neon: add VE4 intra predictor
based on SSE2 version, ~59% faster

Change-Id: Iaa2181eb51bd975de0e9fe5c7b66ed18188f0e3b
2014-10-23 11:46:08 +02:00
Pascal Massimino
b7a33d7e91 implement VE4/HE4/RD4/... in SSE2
(30% faster prediction functions, but overall speed-up is ~1% only)

Change-Id: I2c6e7074aa26a2359c9198a9015e5cbe143c2765
2014-10-22 18:25:36 +02:00
Pascal Massimino
97c76f1f30 make VP8PredLuma4[] non-const and initialize array in VP8DspInit()
also convert 'type *dst' to 'type* dst'

Change-Id: I41ab66ad15b548cc45d1cb8b10bbca4fe1528cae
2014-10-22 18:14:20 +02:00
James Zern
f85ec712b0 PrintReg: output to stderr
allows use of '-o -' while testing

Change-Id: Ibc02d7cede2df4eb8be0a28c0ca4bf5e91864191
2014-10-22 17:28:19 +02:00
James Zern
d1c359ef29 fix shared object build with -fvisibility=hidden
set WEBP_EXTERN to visibility=default
+ explicitly mark VP8GetCPUInfo as it's referenced within the examples

Change-Id: Ie3d2b15088e888f0b55203b205993eba75899d99
2014-10-17 11:50:52 +02:00
James Zern
a4c3a31b8f WEBP_TSAN_IGNORE_FUNCTION: fix gcc compat warning
move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition

Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
2014-10-16 18:06:43 +02:00
Pascal Massimino
80247291c6 mark some init function as being safe for thread_sanitizer.
introduces the macro WEBP_TSAN_IGNORE_FUNCTION

Change-Id: I3de2b6c1a2076fba4da7ae50322551e026b2082b
2014-10-16 16:34:07 +02:00
James Zern
0ce27e715e enc_mips32: workaround gcc-4.9 bug
avoids an ICE with NDK r10b + NDK_TOOLCHAIN_VERSION=4.9

In function 'SSE16x16':
enc_mips32.c (684) internal compiler error: Segmentation fault

Change-Id: I1a3d33c0a9534c97633ab93bcdf9bf59d3a7e473
2014-10-15 19:14:04 +02:00
pascal massimino
32f67e309f Merge "enc_neon: initialize vectors w/vdup_n_u32" 2014-10-09 12:23:18 -07:00
Pascal Massimino
fabc65da32 1-3% faster encoding optimizing SSE_NxN functions
got rid of the |a-b|^|b-a| method and went back
to just (a-b)^2 instead.

quality | size(bytes) after/before | time (ms) after/before

Change-Id: Ia3e0e6507b3f903deb1e182f78dad6df07380fd0
2014-10-09 07:20:00 -07:00
James Zern
7534d71640 enc_neon: initialize vectors w/vdup_n_u32
replaces {} initialization gnu-ism

Change-Id: I5a7b2d4246f0205e4bfb7f4b77d720c47d8674ec
2014-10-09 12:35:41 +02:00
Pascal Massimino
2d9b0a4472 add WebPDispatchAlphaToGreen() to dsp
SSE2 version is 2.1x faster

This is used to transfer the alpha plane to green channel before lossless compression.

Change-Id: I01d9df0051c183b1ff5d6eb69961d4f43e33141a
2014-10-06 23:15:44 +02:00