The 32-bit buffers are actually rarely 64-bit aligned.
The new solution uses memcmp and is alignment agnostic.
It is also slightly faster.
Change-Id: I863003e9ee4ee8a3eed25b7b2478cb82a0ddbb20
Arrays were compared 32 bits at a time, it is now done 64 bits at a time.
Overall encoding speed-up is only of 0.2% on @skal's small PNG corpus.
It is of 3% on my initial 1.3 Mp desktop screenshot image.
Change-Id: I1acb32b437397a7bf3dcffbecbcd4b06d29c05e1
This makes the chains more efficient and a larger variety of data is tested.
0.02 % compression gain at q 100, 0.05 % at default quality. 0.8 % speedup by
callgrind.
0.16 % compression gain for lossy alpha ?!
Change-Id: I888120133352799eb14f5f602c7f40ab404bd665
a total impact of 1 % on encoding speed
This allows for performance neutral removal of the binary search
in cache bits selection. This will give a small improvement in
compression density.
Change-Id: If5d4d59460fa1924ce71af977320834a47c2054a
0.21 % compression density improvement for 1000 png corpus in
lossless mode
0.50 % compression density improvement for 1000 png corpus in
lossy mode
Change-Id: I14ee8c427ae5d3e116b0ee6695fcdea3321a319d
do not do length 2 matches far away
speedup for non compressible data by inserting two literals at a time
when no matches are found
Change-Id: Ia8e033071f4186bb8148bb2bf13ca37586734aa3
Speed-wise equivalent on x86 and ARM (maybe a tad faster, hard to tell).
Note that the two 32-bit multiples are not strictly equivalent
to the 64-bit one, since we're missing one carry propagation.
In practice, no observable difference was seen because of this
slightly different hashing result.
Change-Id: I8f2381175eae1cb20dabf149e6b27e1768fba6ab
Simplify and speedup backward references for low-effort settings by evaluating
LZ77 references only. This change speeds up compression by 10-25% at lower
(q <= 25) quality range with a slight drop (0.2%) in the compression density.
Change-Id: Ibd6f03b1a062d8ab9191786c2a425e9132e4779f
Disable costly TraceBackwards heuristic for computing the backward references
for low_effort (method=0) compression.
The TraceBackwards heuristic is already disabled for lower (q < 25) quality
range. Following is the compression data for 1000 image corpus for q >= 25.
This speeds up compression (q >= 25) by a factor of 2.5-3X with slight loss of
compression density (0.7% for lower quality range and 1.2% for higher qualities).
Change-Id: I256c9e2137c7de4083f423ea32ee12d3b0f46253
- Lower the threshold parameters for HashChainFindCopy.
For 1000 image PNG corpus (m=0), this change yields speedup of 15-20% at
lower quality range (0.25% drop in compression density) and about 10%
for higher quality range without any drop in the compression density.
Following is the compression stats (before/after) for method = 0:
Before After
bpp/MPs bpp/MPs
q=0 2.8615/18.000 2.8651/18.631
q=5 2.8615/18.216 2.8650/20.517
q=10 2.8572/18.070 2.8650/21.992
q=15 2.8519/18.371 2.8584/21.747
q=20 2.8454/18.975 2.8515/20.448
q=25 2.8230/8.531 2.8253/9.585
// Compression density remains same for q-range [30-100]
q=30 2.7310/7.706 2.7310/8.028
q=35 2.7253/6.855 2.7253/7.184
q=40 2.7231/6.364 2.7231/6.604
q=45 2.7216/5.844 2.7216/6.223
q=50 2.7196/5.210 2.7196/5.731
q=55 2.7208/4.766 2.7208/4.970
q=60 2.7195/4.495 2.7195/4.602
q=65 2.7185/4.024 2.7185/4.236
q=70 2.7174/3.699 2.7174/3.861
q=75 2.7164/3.449 2.7164/3.605
q=80 2.7161/3.222 2.7161/3.038
q=85 2.7153/2.919 2.7153/2.946
q=90 2.7145/2.766 2.7145/2.771
q=95 2.7124/2.548 2.7124/2.575
q=100 2.6873/2.253 2.6873/2.335
Change-Id: I0e17581fb71f6094032ad06c6203350bd502f9a1
Update BackwardRefsWithLocalCache to do in-place update of backward
references w.r.t local color cache index.
No impact on the compression density or compression speed.
Change-Id: Ie066251464c3928c044e037b43df3af28b48ca30
Use the refs_lz77 computed (with cache_bits=0) in the method 'CalculateBestCacheSize'
to regenerate the LZ77 references corresponding to the optimum cache_bits and avoid
calling costly 'BackwardReferencesLz77' one extra time.
This change leaves the compression density unchanged and speeds up compression
by 10-15%.
Change-Id: I5a92e11788d3c3f656aa7e1fba54fb5d96ee0027
- The optimal cache bits is evaluated inside the method 'VP8LGetBackwardReferences'.
- The input cache_bits to 'VP8LGetBackwardReferences' sets the maximum cache
bits to use (passing 0 implies disabling the local color cache).
- The local color cache is disabled for lowerf (<= 25) quality levels (as before).
- Enabled local color cache for palette images as well. This saves additional
0.017% bytes with a slight (2-3%) improvement in the compression speed.
- Removed 'use_2d_locality' parameter from method VP8LGetBackwardReferences, as
this option is not an option now (after we freeze the lossless bit-stream).
Change-Id: I33430401e465474fa1be899f330387cd2b466280
Updated BackwardReferencesRle method by utilizing the local color cache.
Also changed the name of method BackwardReferencesHashChain to
BackwardReferencesLz77 to reflect the LZ77 coding.
For the 1000 image corpus, this change saves 0.2% bytes
(at default settings) and is 2-5% faster to encode.
Change-Id: Ic3f288253b3bbb101a69945a80994c3fd0917f8b
Optimize backwardreferences (about 0.1% byte savings) with almost same
compression speed (3% faster on defaut compression settings).
1.) Simplified iteration logic for HashChainFindCopy.
- Remapped the iter_max constant.
2.) Simplified main for loop for BackwardReferencesHashChain
- Removed 'if' conditions for corner cases in the main loop.
- Refactored the method(AddSingleLiteral) for adding one pixel.
Change-Id: I1bc44832fd81f11e714868a13e606c8f83157e64
Speed up BackwardReferencesHashChainDistanceOnly method by:
1.) Remove for loop for shortmax code path.
2.) Execute the shortmax code path after regular call to
HashChainFindCopy, only if HashChainFindCopy() returns length > 2 (MIN_LENGTH).
3.) Also for shortmax, call method HashChainFindOffset (for length = 2),
instead of expensive method HashChainFindCopy().
4.) Handling first pixel (i==0) outside main loop and removing one if
condition (i > 0) per pixel.
5.) Handle the last pixel outside the main 'for' loop.
Overall compression speedup observed is around 5% (+/- noise).
Change-Id: Ifa30c4035f8d26e6e43e3c4881244d777961c22b
The method VP8LCalculateEstimateForCacheSize is not evaluating the all possible
range for cache_bits.
Also added a small penality for choosing the larger cache-size. This is done to
strike a balance between additional memory/CPU cost (with larger cache-size) and
byte savings from smaller WebP lossless files.
This change saves about 0.07% bytes and speeds up compression by 8% (default
settings). There's small speedup at Q=50 along with byte savings as well.
Compression at Quality=25 is not effected by this change.
Change-Id: Id8f87dee6b5bccb2baa6dbdee479ee9cda8f4f77
Instead of calling HashChainFindMethod, call a new (subset) method
HashChainFindOffset to get the offset/distance for a given length.
The encoding is tad faster at default compression
Before After
bpp/rate bpp/rate
442 Palette 0.2720/5.270 MP/s 0.2720/5.790 MP/s
558 non-palette 3.7607/0.797 MP/s 3.7607/0.816 MP/s
Change-Id: If4041a9c18f7e972f49fcbab8c3e2f013d8bf1cf
Updated VP8LGetBackwardReferences and HashChainFindCopy method with following:
- Remove the recursive CostModelBuild.
- Reuse the lz77 backward refs in CostModelBuild, instead of evaluating it
again (as it was done for recursion_level=0).
- Consolidated the Match-length logic inside FindMatchLength method.
- Removed the logic for altering best_length/val based on the 2D distance.
The additional 162 value (+= 9 * 9 + 9 * 9 - y * y - x * x) can't change the
best_val eval computation to choose a different curr_length, as best_val was
set to 'curr_length << 16'.
Following is the impact on the compression speed/density at default & max
quality, overall this speeds up compression by 5-15% (q=100 -> 75) with a tad
drop (0.02-0.03%) in compression density for the non-palette images.
Before After
bpp/Rate(MP/s) bpp/Rate(MP/s)
q=75 (def)
All 1000 2.4492/1.049 MP/s 2.4498/1.230 MP/s
Palette 0.2719/5.060 MP/s 0.2719/6.110 MP/s
non-Palette 3.7597/0.732 MP/s 3.7607/0.840 MP/s
q=100
All 1000 2.4134/0.125 MP/s 2.4142/0.131 MP/s
Palette 0.2692/2.585 MP/s 0.2692/2.885 MP/s
non-Palette 3.7040/0.079 MP/s 3.7053/0.083 MP/s
Change-Id: I27a5eff3356d876c3e949fd32262244b25678b7a
(typecast uint32 pointer to uint64).
The proposed change is little (0.05%) slower but avoids uint32 to uint64
pointer conversion.
Change-Id: I6b8828077ea1324fabd04bfa7e7439e324776250
Sometimes, the error-code was not set correctly.
We now return OUT_OF_MEMORY everytimes it's appropriate
(tested using MALLOC_FAIL_AT mechanism)
Took the opportunity to clean-up the code and dust the error
code returned (some were erroneously set to INVALID_CONFIGURATION)
Change-Id: I56f7331e2447557b3dd038e245daace4fc82214c
Non-photo source produce far less literal reference and their
buffer is usually much smaller than the picture size if its compresses
well. Hence, use a block-base allocation (and recycling) to avoid
pre-allocating a buffer with maximal size.
This can reduce memory consumption up to 50% for non-photographic
content. Encode speed is also a little better (1-2%)
Change-Id: Icbc229e1e5a08976348e600c8906beaa26954a11
the unique instance of VP8LHashChain (1MB size corresponding to hash_to_first_index_)
is now wholy part of VP8LEncoder, instead of maintaining the pointer to VP8LHashChain
in the encoder.
Change-Id: Ib6fe52019fdd211fbbc78dc0ba731a4af0728677
We use automatic int->uint64_t promotion where applicable.
(uint64_t should be kept only for overflow checking and memory alloc).
Change-Id: I1f41b0f73e2e6380e7d65cc15c1f730696862125
* merged the two HistogramAdd/AddEval() into a single call
(with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster
Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
Reduce calls to Malloc (WebPSafeMalloc/WebPSafeCalloc) for:
- Building HashChain data-structure used in creating the backward references.
- Creating Backward references for LZ77 or RLE coding.
- Creating Huffman tree for encoding the image.
For the above mentioned code-paths, allocate memory once and re-use it
subsequently.
Reduce the foorprint of VP8LHistogram struct by changing the Struct
field 'literal_' from an array of constant size to dynamically allocated
buffer based on the input parameter cache_bits.
Initialize BitWriter buffer corresponding to 16bpp (2*W*H).
There are some hard-files that are compressed at 12 bpp or more. The
realloc is costly and can be avoided for most of the WebP lossless
images by allocating some extra memory at the encoder initializaiton.
Change-Id: I1ea8cf60df727b8eb41547901f376c9a585e6095
there's still some malloc/free in the external example
This is an encoder API change because of the introduction
of WebPMemoryWriterClear() for symmetry reasons.
The MemoryWriter object should probably go in examples/ instead
of being in the main lib, though.
mux_types.h stil contain some inlined free()/malloc() that are
harder to remove (we need to put them in the libwebputils lib
and make sure link is ok). Left as a TODO for now.
Also: WebPDecodeRGB*() function are still returning a pointer
that needs to be free()'d. We should call WebPSafeFree() on
these, but it means exposing the whole mechanism. TODO(later).
Change-Id: Iad2c9060f7fa6040e3ba489c8b07f4caadfab77b
Refactor code for HistogramCombine and optimize the code by calculating
the combined entropy and avoid un-necessary Histogram merges.
This speeds up lossless encoding by 1-2% and almost no impact on compression
density.
Change-Id: Iedfcf4c1f3e88077bc77fc7b8c780c4cd5d6362b
Optimize 'VP8LCalculateEstimateForCacheSize' for lower quality ranges (Q < 50).
The entropy is generally lower for higher cache_bits, so start searching from
higher cache_bits and settle for a local minima, instead of evaluating all
values.
This speeds up the lossless encoding at lower qualities by 10-15%.
Change-Id: I33c1e958515a2549f2e6f64b1aab3f128660dcec
Also created variant VP8LPrefixEncodeBits that returns the
code & extra_bits only.
There's no impact on compression density and compression speed.
Change-Id: I2cafdd3438ac9270cd72ad9d57b383cdddfdfa4c
Speed up HashChainFindCopy by optimizing on number of calls to
FindMatchLength method.
This change speeds up the lossless & lossy (Alpha) encoding by 20%
without loss of compression density.
At method=3, lossy (Alpha) compression speed (and density) remains
unchanged, as at that settings, the costly Backward Refs method is not
called
Change-Id: Ia1797148e9e4ee2787011837fa248afbae2242cb