Commit Graph

2353 Commits

Author SHA1 Message Date
Vincent Rabaud
fa63a96603 Fix the random generator in HistogramCombineStochastic.
It was a bad implementation of a Lehmer random number generator
(the saturation was done wrong and mostly & was used instead of % .....).

That lead to "for" loop stuck with the same values given a specific seed,
hence wasted "for" loops (e.g. seed getting at 374988608 and modulo of 64
later leads to 0 even when updating the seed with the old formula).

As the "for" loops now always return a proper pair of histograms, their
number can greatly be reduced, hence a speedup.

Change-Id: I9f5b44d66cc96fd4824189d92276c3756c8ead5b
2017-02-19 10:49:16 +01:00
Pascal Massimino
4105d565d3 disable WEBP_USE_XXX optimisations when EMSCRIPTEN is defined
Currently, none are available. If WEBP_HAVE_SSE2 eventually works,
we'll have to refine this conditionals.

BUG=webp:261

Change-Id: Ibc63ee1c013f2a4169eeb85cc8b6317b6420c2ad
2017-02-08 15:44:20 +00:00
Vincent Rabaud
868aa6901f Perform greedy histogram merge in a unified way.
Previously, the stochastic method for histogram
combination could finish in a greedy way
if the number of iterations to perform so was smaller.

Except that another greedy combination was performed
afterwards ... hence wasted CPU in some cases.

Change-Id: Ic0f26873e6dc746679486b91cb35d73efee91931
2017-02-07 11:57:20 +01:00
Pascal Massimino
b494fdec45 optimize the ARGB->ARGB Import to use memcpy
(instead of the generic VP8PackARGB call)

Change-Id: I86edeb5934e7c062593f0248de7607cca5f1027c
2017-02-03 16:54:52 +01:00
Vincent Rabaud
5cfd4ebc5e LZ77 interval speedups. Faster, smaller, simpler.
The initial re-writing of this part of the code with intervals
had to be done with a complex logic (mostly intervals with a
lower and upper bound, not a constant value like now) to properly
deal with the inefficiencies of the then LZ77 algorithm.
The improvements made to LZ77 since, now allow for a simpler logic.

There were also small errors in the interval insertion logic
that lead to small inefficiencies (hence a slightly better
compression rate).

Change-Id: If079a0cafaae7be8e3f253485d9015a7177cf973
2017-02-02 11:51:30 +01:00
Pascal Massimino
be73378684 Merge "Add clang build fix for MSA" 2017-02-01 12:43:09 +00:00
Parag Salasakar
aa893914fc Add clang build fix for MSA
Change-Id: If139f4ecbdce756c69ba4ae032a70f81179683f8
2017-02-01 17:45:17 +05:30
Jehan
32ed856f60 Fix "all|no frames are keyframes" settings.
Documentation says: "if kmin == 0, then key-frame insertion is disabled;
and if kmax == 0, then all frames will be key-frames."
Reading this, you'd expect that if kmax == 0, then with any kmin <= 0
all frames will be key-frames. But actually the kmin <= 0 test is caught
first and you get the opposite (no keyframes but the first). You'd have
instead to set kmax == 0 and any value kmin > 0, which is absolutely
counter-intuitive (reversing order).
Moreover kmax == 1 has no valid kmin (kmin == 1 conflicts with the
`kmax > kmin` rule and kmin == 0 conflicts with `kmin >= kmax / 2 + 1`).
So it should be considered an exception too.

Instead I propose this new logic:
- kmax == 1 means that all frames are keyframes (you are explicitly
  requesting a keyframe every 1 frame at most, i.e. all frames).
- kmax == 0 means no keyframes (you ask for a keyframe every 0 frames,
  i.e. never).
This is more "logical" language-wise, and also does not involve any
conflicts about what if both kmax and kmin are 0, since now a single
property value is meaningful for the 2 exceptional cases.

Change-Id: Ia90fb963bc26904ff078d2e4ef9f74b22b13a0fd
(cherry picked from commit 2dc0bdcaee)
2017-01-26 22:31:16 -08:00
James Zern
1c3190b6ed Merge "Fix "all|no frames are keyframes" settings." 2017-01-27 00:02:04 +00:00
Pascal Massimino
f4dc56fd77 disable GradientUnfilter_NEON
Compile with XCode, it appears quite slower than the C-version,
especially for arm64.

Change-Id: Ic46dba184a36be454fef674129d2f909003788fc
(cherry picked from commit 4f3e3bbd44)
2017-01-25 20:30:15 -08:00
Pascal Massimino
4f3e3bbd44 disable GradientUnfilter_NEON
Compile with XCode, it appears quite slower than the C-version,
especially for arm64.

Change-Id: Ic46dba184a36be454fef674129d2f909003788fc
2017-01-25 16:33:26 -08:00
Jehan
2dc0bdcaee Fix "all|no frames are keyframes" settings.
Documentation says: "if kmin == 0, then key-frame insertion is disabled;
and if kmax == 0, then all frames will be key-frames."
Reading this, you'd expect that if kmax == 0, then with any kmin <= 0
all frames will be key-frames. But actually the kmin <= 0 test is caught
first and you get the opposite (no keyframes but the first). You'd have
instead to set kmax == 0 and any value kmin > 0, which is absolutely
counter-intuitive (reversing order).
Moreover kmax == 1 has no valid kmin (kmin == 1 conflicts with the
`kmax > kmin` rule and kmin == 0 conflicts with `kmin >= kmax / 2 + 1`).
So it should be considered an exception too.

Instead I propose this new logic:
- kmax == 1 means that all frames are keyframes (you are explicitly
  requesting a keyframe every 1 frame at most, i.e. all frames).
- kmax == 0 means no keyframes (you ask for a keyframe every 0 frames,
  i.e. never).
This is more "logical" language-wise, and also does not involve any
conflicts about what if both kmax and kmin are 0, since now a single
property value is meaningful for the 2 exceptional cases.

Change-Id: Ia90fb963bc26904ff078d2e4ef9f74b22b13a0fd
2017-01-25 13:12:52 -08:00
James Zern
36c42ea415 bump version to 0.6.0
libwebp{,decoder} - 0.6.0
libwebp libtool - 7.0.0
libwebpdecoder libtool - 3.0.0

mux - 0.4.0
libtool - 3.0.0

demux - 0.3.2
libtool - 2.2.0

Change-Id: Ie46dc70df1e283df0ccef6eb07c5694feb4d4a2b
2017-01-23 18:07:00 -08:00
James Zern
919f9e2fd6 Merge "add .rc files for windows dll versioning" 2017-01-20 19:29:50 +00:00
Pascal Massimino
4689ce1635 cwebp: add a -sharp_yuv option for 'sharp' RGB->YUV conversion
Change-Id: I6edd5b44d693da50f702fa8218f14872874d91ba
2017-01-20 16:54:54 +01:00
Pascal Massimino
79bf46f120 rename the pretentious SmartYUV into SharpYUV
Change-Id: Ifeeb9cb85896c5f3ba0cc1c2c821f8d00295f69e
2017-01-20 14:36:21 +01:00
Pascal Massimino
eb1dc89a5f silently expose use_delta_palette in the WebPConfig API
is just a placeholder for now, unless WEBP_USE_EXPERIMENTAL_FEATURES
is defined.

Change-Id: I087cb49781560bc1a7fbb01b136d36115c97ef72
2017-01-20 10:25:19 +01:00
James Zern
43d3f01a2f add .rc files for windows dll versioning
BUG=webp:323

Change-Id: Id415a32b63618d39af2e599cec0d40f64c35bbce
2017-01-20 00:35:15 -08:00
James Zern
668e1dd44f src/{dec,enc,utils}: give filenames a unique suffix
this avoids duplicates between these trees and dsp/, e.g., enc/tree.c,
dec/tree.c, making pulling the whole library source tree into one target
possible

BUG=webp:279

Change-Id: I060a614833c7c24ddd37bf641702ae6a5eef1775
2017-01-19 19:09:48 -08:00
Pascal Massimino
71c53f1aeb NEON: speed-up strong filtering
The sub-expression trick removes two constants and
two vmlal_s8 instructions.

Change-Id: I200022573b4880871b528b13a11a8f3d95def113
2017-01-19 20:46:48 +00:00
Pascal Massimino
a345068aba ARM: speed up bitreader by avoiding tables
(and using BitsLog2Floor() from utils.h instead)

9-10% speed-up, apparently

Change-Id: I9acae4a4dceb1ddcc99306f99b722079bb06f6f8
2017-01-17 23:52:37 -08:00
Pascal Massimino
1dc82a6bba Merge "introduce a generic GetCoeffs() function pointer" 2017-01-18 07:44:36 +00:00
Pascal Massimino
8074b89eb3 introduce a generic GetCoeffs() function pointer
We can switch at run-time between the standard GetCoeffs() critical
function, that uses a fast variant of VP8GetBit().
However, some platforms have slow instructions that make standard
VP8GetBit() slow. GetCoeffs() is the right level of branching to
switch to GetCoeffsAlt() that avoids these slow instructions in some
not-frequent cases.

Next patch will upgrade VP8GetBit() to use clz, after this one
is proved to be neutral speed-wise.

Change-Id: Ia6cef5de9de6131574d2202bbc0bea8559c9b693
2017-01-17 16:24:00 +01:00
Pascal Massimino
749a45a520 Merge "NEON: implement alpha-filters (horizontal/vertical/gradient)" 2017-01-17 15:13:08 +00:00
Pascal Massimino
74c053b57d Merge "NEON: fix overflow in SSE NxN calculation" 2017-01-17 15:10:54 +00:00
Pascal Massimino
0a3aeff75b Merge "dsp: WebPExtractGreen function for alpha decompression" 2017-01-17 15:08:20 +00:00
Pascal Massimino
1de931c669 NEON: implement alpha-filters (horizontal/vertical/gradient)
gradient-filter code is not much faster, but maybe improvable in the future.

Change-Id: Ia16070e409fe8703b02276166f19526917df6b35
2017-01-17 15:44:46 +01:00
Pascal Massimino
9b3aca404d NEON: fix overflow in SSE NxN calculation
vmlal_u8() is prone to overflow during the accumulation.
There was a mismatch happening at low q mostly. Because in this
case the distortion is important and the accumulated sum was
later than 16bit-unsigned.

Change-Id: I1a08a2f744bcdf0b26647e61b9ee92a0c2e28fe8
2017-01-17 11:47:36 +01:00
Pascal Massimino
1c07a3c639 dsp: WebPExtractGreen function for alpha decompression
+ NEON implementation

Change-Id: I67204f99d6e4c5974718bdf21dad30381978f72c
2017-01-17 09:33:25 +00:00
Pascal Massimino
9ed5e3e5dd use pointers for WebPRescaler's in WebPDecParams
This makes the structure more generic, without the hard-coded
internal structure.

This is a borderline incompatible ABI change, even if WebPIDecoder structure
is opaque.

Change-Id: I518765c3f76fc17a136cef045a5a8aa70ed70e85
2017-01-16 22:30:29 -08:00
James Zern
db013a8d5c Merge "ARM: don't use USE_GENERIC_TREE" 2017-01-13 22:15:04 +00:00
Pascal Massimino
fcd4784dcd use a 8b table for C-version for clz()
30% faster on x86, 5% faster on N5.

New generic function: WebPLog2FloorC()
This function is called as fallback for BitsLog2Floor() when there's
no clz() available.

Change-Id: Ica15c6092112e514c0e200fab89c434de48d4b19
2017-01-13 15:36:26 +01:00
Pascal Massimino
fbb5c473b4 ARM: don't use USE_GENERIC_TREE
It's 1-2% faster to use hard-coded tree on ARM

Change-Id: I54403a70f6c692e50148c33f36833588957c20ee
2017-01-13 10:05:21 +01:00
Pascal Massimino
8fda56126e Merge "add a kSlowSSSE3 feature for CPUInfo" 2017-01-13 07:01:48 +00:00
Pascal Massimino
86bbd24552 add a kSlowSSSE3 feature for CPUInfo
This is meant to be used for run-time detection of slow platforms
regarding instructions like pshufb and bsr.

Adapted from libvpx patch: https://chromium-review.googlesource.com/#/c/367731

Change-Id: I2c22fbb9aae699d87a041393ba1ad5f1f21ff640
2017-01-13 06:19:27 +00:00
Vincent Rabaud
7c2779e95a Get code to fully compile in C++.
Change-Id: I6d8490c8c9b955d90dcc89ee8a9cf29ca0f93b08
2017-01-12 18:03:55 +01:00
Vincent Rabaud
250c358662 Merge "When compiling as C++, avoid narrowing warnings." 2017-01-12 13:00:56 +00:00
Vincent Rabaud
c0648ac2ae When compiling as C++, avoid narrowing warnings.
The gcc compilation warning was: narrowing conversion from ‘int’ to ‘int8_t’

Change-Id: I4803dd60ad04060cdb5d61a1aa98b25215b9d4eb
2017-01-12 13:39:22 +01:00
Pascal Massimino
0d55f60c91 40% faster ApplyAlphaMultiply_SSE2
process four pixels at a time

Change-Id: I1dee7f70772be4915654fc6638ef4729a1a239d4
2017-01-12 02:33:09 -08:00
Pascal Massimino
49d0280df1 NEON: implement several alpha-processing functions
- ApplyAlphaMultiply
 - DispatchAlpha
 - DispatchAlphaToGreen
 - ExtractAlpha

Decoding to Argb / rgbA / ... is 10-15% faster (measured on N4)

new file: alpha_processing_neon.c

Change-Id: I40f1a809e9885d1031ff0bc886d8d001efa66bca
2017-01-11 17:39:29 +01:00
Pascal Massimino
48b1e85fbe SSE2: 15% faster alpha-processing functions
ApplyAlphaMultiply / MultARGBRow / MultRow

we use now: x/255 = (x * 0x8081) >> (16 + 7)
and x/255 + .5 = ((x + 128) * 0x0101) >> 16

Change-Id: I8931091316ffc8bbf65aa3402f2e7d2b800e1971
2017-01-11 15:35:16 +01:00
Pascal Massimino
e3b8abbc9b fix warning from static analysis.
"-1 cannot be represented in type 'unsigned int'"

Change-Id: I05abcb44af68f702ead5a7f24dc14aab31a2e4d9
2017-01-10 22:59:47 -08:00
Pascal Massimino
28fe054e73 SSE2: 30% faster ApplyAlphaMultiply()
and 15% faster MultARGBRow()

by switching to formulae:
    X / 255 = (X + 1 + (X >> 8)) >> 8 for any 16bit value X.
   (X / 255 + .5) = (XX + (XX >> 8)) >> 8, with XX = X + 128

Change-Id: Ia4a7408aee74d7f61b58f5dff304d05546c04e81
2017-01-10 23:34:22 +01:00
Vincent Rabaud
f44acd253b Merge "Properly compute the optimal color cache size." 2017-01-10 21:14:16 +00:00
Vincent Rabaud
527844fee0 Properly compute the optimal color cache size.
The previous optimization was performing dichotomy on a function that
is anything in practice, hence a bit of randomness.
Also, two magic constants were used, one for an extra constant cost,
one for an extra linear cost. Both values/models were empirical.

A brute force search for the best cache size is now performed.

To have less CPU impact, a speed optimization is also made by not
inserting a value again and again.
This makes sense but it's also the most common case of when LZ77 is
useful hence an overall improvement sometimes.

Change-Id: I57de5750ad2313b2feecbcd15cd6e4feeb98e5c8
2017-01-10 21:44:53 +01:00
Pascal Massimino
be0ef6395f fix a comment typo
Change-Id: I0fabd08cd8abd3cea7ddfd2e498507adb0d3c67e
2017-01-10 21:17:13 +01:00
Vincent Rabaud
8874b16275 Fix a non-deterministic color cache size computation.
In case of impossible allocation, some value was returned while
computation should be stopped.

Change-Id: I5f85e264575be825e4261ab6fa63840c157cf5c2
2017-01-10 18:53:19 +01:00
Vincent Rabaud
d712e20de0 Do not allow a color cache size bigger than the number of colors.
This is purely for speed optimization.

Change-Id: Ie4b4380df8a5afa90574012bacdb1ddad03f320e
2017-01-10 09:25:02 +01:00
Vincent Rabaud
ecff04f625 re-introduce some comments in Huffman Cost.
Change-Id: I2396bbc58628dd12a2d36068f7193e2a6eb4d166
2017-01-06 13:17:14 +01:00
Pascal Massimino
00b08c88c0 Merge "NEON: 5% faster conversion to RGB565 and RGBA4444" 2016-12-22 08:39:01 +00:00