Increase the initial buffer size for VP8L Bit Writer from 4bpp to 8bpp.
The resize buffer is expensive (requires realloc and copy) and this additional
memory (0.5 * W * H) doesn't add much overhead on the lossless encoder.
Change-Id: Ic1fe55cd7bc3d1afadc799e4c2c8786ec848ee66
Optimize 'VP8LCalculateEstimateForCacheSize' for lower quality ranges (Q < 50).
The entropy is generally lower for higher cache_bits, so start searching from
higher cache_bits and settle for a local minimum, instead of evaluating all
values.
This speeds up the lossless encoding at lower qualities by 10-15%.
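The search order described above can be sketched as follows. This is a minimal illustration, not the actual VP8LCalculateEstimateForCacheSize code: the EntropyFn callback, ArrayEntropy helper and the upper bound of 10 cache bits are all hypothetical.

```c
/* Hypothetical upper bound on color-cache bits for this sketch. */
#define SKETCH_MAX_CACHE_BITS 10

/* Cost callback: estimated entropy for a given cache_bits value. */
typedef double (*EntropyFn)(int cache_bits, void* user_data);

/* Walks cache_bits downward from the maximum and returns the first local
 * minimum, instead of evaluating every value in [0, max]. */
static int FindCacheBitsLocalMin(EntropyFn estimate, void* user_data) {
  int best_bits = SKETCH_MAX_CACHE_BITS;
  double best_cost = estimate(best_bits, user_data);
  for (int bits = SKETCH_MAX_CACHE_BITS - 1; bits >= 0; --bits) {
    const double cost = estimate(bits, user_data);
    if (cost >= best_cost) break;  /* entropy stopped dropping: settle here */
    best_cost = cost;
    best_bits = bits;
  }
  return best_bits;
}

/* Example callback reading precomputed costs from an array of
 * SKETCH_MAX_CACHE_BITS + 1 doubles (hypothetical, for illustration). */
static double ArrayEntropy(int cache_bits, void* user_data) {
  return ((const double*)user_data)[cache_bits];
}
```

With costs {0, 9, 8, 7, 6, 5, 6, 7, 8, 9, 10} indexed by cache_bits, the search stops at 5 even though 0 is cheaper: that is the "settle for a local minimum" trade-off, paid for by evaluating far fewer candidates.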
Change-Id: I33c1e958515a2549f2e6f64b1aab3f128660dcec
* simplify the endian logic
* remove the need for memset()
* write 16 or 32 bits at a time (likely aligned)
Makes the code a bit faster on ARM (~1%)
Change-Id: I650bc5654e8d0b0454318b7a78206b301c5f6c2c
-> remove the 'color_transform' multiplier, use more constants, etc.
This function is particularly critical, mostly because of
GetBestColorTransformForTile().
Loop is a bit faster (maybe ~1%)
Change-Id: I90c96a3437cafb184773acef55c77e40c224388f
The WEBP_SWAP_16BIT_CSP flag needs to be honored while filling the Alpha (4 bits)
data in the destination buffer and while pre-multiplying the alpha to RGB colors.
Change-Id: I3b07307d60963db8d09c3b078888a839cefb35ba
(instead of per-macroblock)
speed unchanged.
simplified the context-saving for incremental decoding
Change-Id: I301be581bab581ff68de14c4ffe5bc0ec63f34be
VP8GetThreadMethod() may be called with a NULL headers param; correct an
assert.
broken since:
8a2fa09 Add a second multi-thread method
Change-Id: If7b6d1b8f4ec874d343a806cee5f5e6bb6438620
This makes the segmentation overall less prone to
local optima or boundary effects.
(and overall, encoding is a little faster)
Change-Id: I35688098b0f43c28b5cb81c4a92e1575bb0eddb9
The registers and instructions are quite different from 32-bit
and the assembly code needs a rewrite.
more info: http://people.linaro.org/~rikuvoipio/aarch64-talk/
Change-Id: Id75dbc1b7bf47f43a426ba2831f25bb8fa252c4f
the -alpha_cleanup flag was ineffective since we switched cwebp
to using ARGB input always.
Original idea by David Eckel (dvdckl at gmail dot com)
Change-Id: I0917a8b91ce15a43199728ff4ee2a163be443bab
New API options: WebPDecoderOptions.flip and 'dwebp -flip ...'
It uses the negative-stride trick.
Also changed the decoder code to support user-supplied
buffers with negative stride, independently of the
WebPDecoderOptions.flip value.
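The negative-stride trick can be illustrated with a small sketch: point at the last row of a top-down buffer and negate the stride, and any code that writes rows in order through that view produces a vertically flipped image. The PlaneView type and both functions are hypothetical, not libwebp's WebPDecBuffer handling.

```c
#include <stddef.h>
#include <stdint.h>

/* A view on a 2-D 8-bit plane; 'stride' may be negative. */
typedef struct {
  uint8_t* rows;     /* address of row 0 *as seen through the view* */
  ptrdiff_t stride;  /* byte offset from one row to the next */
} PlaneView;

/* Returns a vertically flipped view of a top-down buffer: row 0 of the
 * view is the last row in memory, and walking the view moves backwards. */
static PlaneView MakeFlippedView(uint8_t* base, ptrdiff_t stride, int height) {
  PlaneView v;
  v.rows = base + (ptrdiff_t)(height - 1) * stride;
  v.stride = -stride;
  return v;
}

/* A writer that only knows about views (e.g. an output stage emitting rows
 * in decoding order). Here it fills row y with the value y. */
static void FillRows(PlaneView v, int width, int height) {
  for (int y = 0; y < height; ++y) {
    uint8_t* const row = v.rows + (ptrdiff_t)y * v.stride;
    for (int x = 0; x < width; ++x) row[x] = (uint8_t)y;
  }
}
```

The writer needs no flip-specific logic, which is why supporting user-supplied buffers with negative stride gives the flip option almost for free.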
Change-Id: I4dc0d06f0c87e51a3f3428be4fee2d6b5ad76053
this partially reverts
f626fe2 Detect canvas and image size mismatch in decoder.
the original change would cause calls to e.g., WebPGetInfo to fail until
a portion of the image chunk was available. With lossy+alpha this meant
waiting for the entire ALPH chunk to be received.
this change restores the original behavior -- reporting the values from
VP8X if available -- while retaining some of the added canvas/image size
checks if the image data is available.
Change-Id: I6295b00a2e2d0d4d8847371756af347e4a80bc0e
add TransformDC special case, and make the switch function inlined.
Recovers some of the CPU time lost when TransformAC3 was added
(only on ARM)
Change-Id: I21c1f0c6a9cb9d1dfc1e307b4f473a2791273bd6
the *quantized* level should be clipped to 2047, not the
original coeff.
(similar problem was fixed in the regular quantize function
quite some time ago)
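A minimal sketch of where the clip belongs, assuming a reciprocal-based quantizer with a hypothetical QFIX of 17 (QuantizeCoeff is an illustrative name, not the actual libwebp function):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_LEVEL 2047  /* largest level the coefficient coder accepts */
#define QFIX 17         /* assumed fixed-point precision of 'iq' */

/* Quantizes one coefficient using the reciprocal 'iq' of the step and a
 * rounding 'bias', both in QFIX fixed point. The clip is applied to the
 * quantized level: clipping the input coefficient instead would shrink
 * large coefficients before the division. */
static int QuantizeCoeff(int coeff, uint32_t iq, uint32_t bias) {
  const int sign = (coeff < 0);
  const uint64_t mag = (uint64_t)abs(coeff);
  uint64_t level = (mag * iq + bias) >> QFIX;
  if (level > MAX_LEVEL) level = MAX_LEVEL;  /* clip the level, not coeff */
  return sign ? -(int)level : (int)level;
}
```

With a step of 8 (iq = 1 << 14), a coefficient of 16000 quantizes to 2000; clipping the coefficient to 2047 first would have produced 255.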
Change-Id: I2fd2f8d94561ff0204e60535321ab41a565e8f85
WHT is somewhat a special case: no sharpen[] bias, etc.
Will be useful in a later CL when precision of input is changed.
Change-Id: I851b06deb94abdfc1ef00acafb8aa731801b4299
This is in preparation for a future change where input will
be 16bit instead of 12bit
No speed diff observed.
Note that the NEON implementation was using 32bit calc already.
Change-Id: If06935db5c56a77fc9cefcb2dec617483f5f62b4
* remove the sharpening for non luma-AC coeffs
* adjust the bias a little bit to compensate for this
Using the multiply-by-reciprocal doesn't always give the same result
as the exact divide, given the QFIX fixed-point precision we use.
-> removed a few now-unneeded SSE2 instructions (and checked for
bit-exactness using -noasm)
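The mismatch between multiply-by-reciprocal and exact division is easy to reproduce. This sketch assumes a truncated reciprocal and QFIX = 17 with no rounding bias; it is illustrative only and not the encoder's actual quantization code.

```c
#define QFIX 17  /* assumed fixed-point precision */

/* Truncated fixed-point reciprocal of the quantizer step q. */
static unsigned int Reciprocal(unsigned int q) { return (1u << QFIX) / q; }

/* Division by multiply-and-shift (inputs assumed small enough that
 * v * Reciprocal(q) fits in 32 bits). */
static unsigned int DivByReciprocal(unsigned int v, unsigned int q) {
  return (v * Reciprocal(q)) >> QFIX;
}

/* Reference: exact integer division. */
static unsigned int DivExact(unsigned int v, unsigned int q) { return v / q; }

/* Counts inputs in [0, max_v] where the two methods disagree. */
static int CountMismatches(unsigned int q, unsigned int max_v) {
  int n = 0;
  for (unsigned int v = 0; v <= max_v; ++v) {
    if (DivByReciprocal(v, q) != DivExact(v, q)) ++n;
  }
  return n;
}
```

For q = 3 the reciprocal truncates to 43690, so (3 * 43690) >> 17 gives 0 while 3 / 3 gives 1. Power-of-two steps such as q = 8 have an exact reciprocal and never disagree.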
Change-Id: Ib68057cbdd69c4e589af56a01a8e7085db762c24
Even at high quality setting, the U/V quantizer step is limited
to 4 which can lead to banding on gradient.
This option selectively applies some randomness to
potentially flattened-out U/V blocks to attenuate the banding.
This option is off by default in 'dwebp', but set to -dither 50
by default in 'vwebp'.
Note: depending on the number of blocks selectively dithered,
decoding can apparently be up to 10% slower.
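As a rough illustration of "applying some randomness" to a chroma block, the sketch below adds bounded noise to each sample and clamps the result. The LCG noise source, the DitherBlock name and the 'amplitude' parameter are all assumptions; how the real decoder selects blocks and maps -dither strength to noise amplitude is not shown.

```c
#include <stdint.h>

/* Tiny deterministic LCG so the sketch is reproducible (hypothetical). */
static uint32_t NextRandom(uint32_t* state) {
  *state = *state * 1664525u + 1013904223u;
  return *state >> 16;  /* 16 pseudo-random bits */
}

/* Adds noise in [-amplitude, +amplitude] to each sample of a chroma block,
 * clamping to [0, 255]. Deriving 'amplitude' from the dithering strength
 * is left out of this sketch. */
static void DitherBlock(uint8_t* samples, int count, int amplitude,
                        uint32_t* rng_state) {
  for (int i = 0; i < count; ++i) {
    const int noise = (amplitude > 0)
        ? (int)(NextRandom(rng_state) % (uint32_t)(2 * amplitude + 1)) - amplitude
        : 0;
    int v = samples[i] + noise;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    samples[i] = (uint8_t)v;
  }
}
```

Because the extra work is per dithered sample, the cost grows with the number of blocks selected, which is consistent with the variable slow-down noted above.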
Change-Id: Icc2446007f33ddacb60b3a80a9e63f2d5ad162de
The RGBToU/V calls expect two extra precision bits; they were only
given one by the SUM2H and SUM2V macros.
For rounding coherency, also changed the SUM1 macro.
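The precision contract can be sketched with helper functions that always return 4x the mean (two extra bits), whether they sum 4, 2 or 1 samples. These bodies are assumptions modeled on the macro names, not the actual libwebp macros.

```c
#include <stdint.h>

/* Each helper returns a sum scaled to 4x the mean, i.e. with two extra
 * precision bits, whatever the number of samples averaged. */
static int Sum4(const uint8_t* p, int stride) {  /* 2x2 block */
  return p[0] + p[1] + p[stride] + p[stride + 1];
}
static int Sum2H(const uint8_t* p) {  /* 2 horizontal samples */
  return 2 * (p[0] + p[1]);
}
static int Sum2V(const uint8_t* p, int stride) {  /* 2 vertical samples */
  return 2 * (p[0] + p[stride]);
}
static int Sum1(const uint8_t* p) {  /* single sample */
  return 4 * p[0];
}
```

On a flat area all four helpers agree; a two-sample sum without the factor 2 would carry only one extra bit and come out at half the expected scale.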
Change-Id: I05f96a46f5d4f17b830d0420eaf79b066cdf78d4
otherwise make sure that all frames are marked as a fragment. there's
still some work to do with validation if fragments are expected to cover
the entire canvas.
Change-Id: Id59e95ac01b9340ba8c6039b0c3b65484b91c42f
this avoids local minima that look bad, even if the distortion
looks low (e.g. gradients, sky,...). Mostly visible in the q=50-80 range.
Output size is mostly unchanged.
Change-Id: I425b600ec45420db409911367cda375870bc2c63
Earlier we were only testing for bit_pos == LBITS. But this is not
sufficient, as bit_pos can jump from < LBITS to > LBITS.
This was resulting in some bit-stream truncation errors not being
caught.
Note: Not a security bug though, as br->pos wasn't incremented in such
cases and so we weren't reading beyond the buffer.
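A toy model of the end-of-stream check shows why equality is not enough. The SketchBitReader struct and LBITS = 64 are assumptions for illustration, not libwebp's VP8LBitReader.

```c
#define LBITS 64  /* assumed size of the bit window, in bits */

typedef struct {
  int bit_pos;  /* bits consumed from the last window of the stream */
  int eos;      /* end-of-stream flag */
} SketchBitReader;

/* Old check: fires only if bit_pos lands exactly on LBITS. */
static void AdvanceOld(SketchBitReader* br, int n_bits) {
  br->bit_pos += n_bits;
  if (br->bit_pos == LBITS) br->eos = 1;
}

/* Fixed check: a multi-bit read can step from below LBITS to above it,
 * so any position at or past LBITS must be flagged. */
static void AdvanceFixed(SketchBitReader* br, int n_bits) {
  br->bit_pos += n_bits;
  if (br->bit_pos >= LBITS) br->eos = 1;
}
```

Reading 5 bits at bit_pos = 60 moves to 65: the old check never sees 64 and misses the truncation, while the fixed one flags it.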
Change-Id: Idadcdcbc6a5713f8fac3470f907fa37a63074836
* raise U/V quantization bias to more neutral values
* also raise the non-zero AC bias for Y1/Y2 matrices
(we need all the precision we can get for U/V levels, which are often empty)
This will increase quality mostly in the higher range (q >= 90).
File size is expected to rise a little (5-7%), and SSIM accordingly, of course.
Change-Id: I8a9ffdb6d8fb6dadb959e3fd392e66dc5aaed64e
kLevelsFromDelta[sharpness][delta] is an inverse look-up table
that tells the minimum filtering strength needed to trigger the
filtering of a step with amplitude 'delta'. We use this table
in various situations:
a) when computing the initial (/global) filtering
strength for each segment. We look at the quantization
step and deduce the proper filtering strength needed
to smooth out this quantization noise (taking the -f option
into account).
b) during intra16 calculation, when a block ends up
very empty (only DC coeffs are non-zero, all ACs have
vanished). We'll rely on the in-loop filtering to
restore the smoothness (if the source was smooth and
gradient-like; that's why we also look at the distortion before
triggering the filtering).
Step b) goes _in addition_ to a), potentially raising
the filtering strength if blockiness is likely.
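An inverse look-up table of this kind can be built by scanning, for each delta, for the smallest strength whose threshold reaches it. The sketch below uses a made-up FilterThreshold() mapping (not VP8's real filter limits) and an assumed delta range; only the inversion technique is the point.

```c
#define SKETCH_MAX_FILTER_LEVEL 63  /* VP8 filter levels fit in 6 bits */
#define SKETCH_MAX_DELTA 64         /* assumed range of step amplitudes */

/* Hypothetical forward mapping: largest step amplitude that filtering
 * strength 'level' smooths at the given sharpness. It only needs to be
 * non-decreasing in 'level' for the inverse table to be meaningful. */
static int FilterThreshold(int sharpness, int level) {
  return 2 * level + 1 - sharpness;
}

/* table[delta] = smallest level whose threshold reaches 'delta', or
 * SKETCH_MAX_FILTER_LEVEL if none does. */
static void BuildLevelsFromDelta(int sharpness,
                                 unsigned char table[SKETCH_MAX_DELTA]) {
  for (int delta = 0; delta < SKETCH_MAX_DELTA; ++delta) {
    int level = 0;
    while (level < SKETCH_MAX_FILTER_LEVEL &&
           FilterThreshold(sharpness, level) < delta) {
      ++level;
    }
    table[delta] = (unsigned char)level;
  }
}
```

Both uses a) and b) then reduce to a single lookup: estimate the step amplitude to hide, and read the minimum strength from the table for the current sharpness.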
Change-Id: Icaeca93ef21da195b079e6587a44d9edfc8e9efa
Earlier "f = f->next_" was executed in both the inner and outer loops, thus
skipping validation of some frames.
Change-Id: Ice5cdb4ff5da78384aa0573addd3a5e5efa0b10c
-> helps debanding (sky, gradients, etc.)
This dithering can only be triggered when using -preset photo
or -pre 2 (as a preprocessing). Everything is unchanged otherwise.
Note that this change is likely to make the measured PSNR/SSIM drop,
since we're altering the input internally.
Change-Id: Id8d4326245d9b828141de162c94ba381b1fa5813
method 1 grouping: [parse + reconstruction] // [filtering + output]
method 2 grouping: [parse] // [reconstruction+filtering + output]
Depending on some heuristics (see VP8ThreadMethod()), we
can pick one or the other when the -mt flag (or option.use_threads)
is selected.
Conservatively, we always use method #2 for now until the heuristic
is refined (so timing should be the same as before this patch).
+ replace 'use_threads' by 'mt_method'
+ define MIN_WIDTH_FOR_THREADS constant
+ fix comment alignment
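The selection logic can be sketched as below. The function name and the value 512 for MIN_WIDTH_FOR_THREADS are assumptions, and the "too narrow, no threads" rule is an illustrative guess at how that constant is used; only the conservative "always method #2" choice comes from this change.

```c
#define MIN_WIDTH_FOR_THREADS 512  /* assumed value for this sketch */

/* Returns 0 (no threading),
 *         1 ([parse + reconstruction] // [filtering + output]) or
 *         2 ([parse] // [reconstruction + filtering + output]). */
static int PickThreadMethod(int use_threads, int width) {
  if (!use_threads) return 0;
  if (width < MIN_WIDTH_FOR_THREADS) return 0;  /* assumed: too narrow */
  /* Conservative choice until the heuristic is refined: always method 2. */
  return 2;
}
```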
Change-Id: I11a756dea9070d6e21b1a9481d357a1e8aa0663e
Mostly visible for large images.
Reconstruction+filtering is now done in parallel to bitstream-parsing.
Change-Id: I4cc4483d803b255f4d97a2fcd9158b1c291dd900
Needs more memory but allows for future parallelization.
Noticeably faster on ARM, slightly faster on x86
also: remove dec->filter_row_ unnecessary field
Change-Id: I044a808839b4e000c838a477e3e8688820436d9a
happens surprisingly often at low quality, so we might
as well hard-code a simplified TransformWHT() directly.
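When only the DC input is non-zero, a 4x4 inverse Walsh-Hadamard transform collapses to a single value repeated 16 times. The sketch below pairs a full inverse WHT written in the usual VP8 style (columns, then rows with (x + 3) >> 3 rounding) with that shortcut; both function names are illustrative, and the output layout is simplified to a plain 16-entry array.

```c
#include <stdint.h>

/* Full 4x4 inverse WHT: columns first, then rows with (x + 3) >> 3 rounding. */
static void InverseWHT(const int16_t in[16], int16_t out[16]) {
  int tmp[16];
  for (int i = 0; i < 4; ++i) {
    const int a0 = in[0 + i] + in[12 + i];
    const int a1 = in[4 + i] + in[8 + i];
    const int a2 = in[4 + i] - in[8 + i];
    const int a3 = in[0 + i] - in[12 + i];
    tmp[0 + i] = a0 + a1;
    tmp[8 + i] = a0 - a1;
    tmp[4 + i] = a3 + a2;
    tmp[12 + i] = a3 - a2;
  }
  for (int i = 0; i < 4; ++i) {
    const int dc = tmp[0 + i * 4] + 3;  /* rounding term */
    const int a0 = dc + tmp[3 + i * 4];
    const int a1 = tmp[1 + i * 4] + tmp[2 + i * 4];
    const int a2 = tmp[1 + i * 4] - tmp[2 + i * 4];
    const int a3 = dc - tmp[3 + i * 4];
    out[0 + i * 4] = (int16_t)((a0 + a1) >> 3);
    out[1 + i * 4] = (int16_t)((a3 + a2) >> 3);
    out[2 + i * 4] = (int16_t)((a0 - a1) >> 3);
    out[3 + i * 4] = (int16_t)((a3 - a2) >> 3);
  }
}

/* Shortcut when only in[0] is non-zero: all 16 outputs are equal. */
static void InverseWHT_DCOnly(int16_t dc, int16_t out[16]) {
  const int16_t v = (int16_t)((dc + 3) >> 3);
  for (int i = 0; i < 16; ++i) out[i] = v;
}
```

With a DC-only input the first pass yields dc in tmp[0], tmp[4], tmp[8] and tmp[12] and zeros elsewhere, so every row of the second pass produces (dc + 3) >> 3. The shortcut is therefore bit-exact with the full transform.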
Change-Id: Ib7a858ef74e8f334bd59d6512bf5bd3e455c5459
happens when decoding is partial (past Partition0), without error and
interrupted by calling WebPIDelete()
WebPIDelete() needs to call VP8ExitCritical() to free in-flight resources
Change-Id: Id4faef1b92f7edd8c17d642c58860e70dd570506
"src\enc\frame.c(88) : warning C4244: '=' : conversion from 'const double' to 'float', possible loss of data"
Change-Id: I143cb0bb6b69e1b8befe9b4f24b71adbc28095c2
The convergence algo is noticeably faster and more accurate.
Try it with: 'cwebp -size xxxxx -pass 8 ...' or 'cwebp -psnr 39 -pass 8 ...'
for instance
Allow full-looping with TokenBuffer case, and make the non-TokenBuffer
case match too.
In case Partition0 is likely to overflow, retry encoding with harder
limits on max_i4_header_bits_.
This CL should make the -partition_limit option somewhat useless,
since the fix is now made automatically (albeit not optimally yet).
Change-Id: I46fde3564188b13b89d4cb69f847a5f24b8c735b
* fix VP8FixedCostsI4[] table
(the constant cost '211' was erroneously included)
* use the rd-score for '211' correctly (calling SetRDScore() for good)
* count partition0 bits separately during rd-opt
No meaningful difference in rd-curve.
Change-Id: I6c49a150cf28928d9a92c32fff097600d7145ca4
use of uint8_t type was causing errors like:
src/dsp/upsampling.c:223:1: internal compiler error: in vect_determine_vectorization_factor, at tree-vect-loop.c:349
with gcc 4.6.3
Change-Id: Ieb6189a1375c47fc4ff992e6c09b34a7f1f605da
When -mt is used, the analysis pass will be split in two
and each half is performed in parallel. This gives a 5%-9% speed-up.
This was a good occasion to revamp the iterator and analysis-loop
code. As a result, the default (non-mt) behaviour is a tad (~1%) faster.
Change-Id: Id0828c2ebe2e968db8ca227da80af591d6a4055f
-pass 2 can be useful sometimes. More passes usually don't help more.
This change is a step toward being able to re-code the whole picture
with varying parameters (when the token buffer is used).
Change-Id: Ia2538e2069a53c080e2ad248c18a1e04623a9304
* move yuv_in_/out_* scratch buffers to iterator
* add y_top_/uv_top_ shortcuts in iterator
That's ~3k of stack size instead of heap.
But it allows having several iterators work in parallel.
Change-Id: I6a437c0f2ef1e5d398c1d6a2fd4974fa0869f0c1
in_bits is const. Trying to apply bswap on it, one gets the error message:
error: read-only variable 'in_bits' used as 'asm' output
Change-Id: I0bef494b822c83d8ea87b1938b0e486d94de4742
The C-version gets ~7-8% slower in order to match the SSE2
output exactly. The old (now off-by-1) code is kept under
the WEBP_YUV_USE_TABLE flag for reference.
(note that calc rounding precision is slightly better ~= +0.02dB)
on ARM-neon, we somehow recover the ~4% speed that was lost by mimicking
the initial C-version (see https://gerrit.chromium.org/gerrit/#/c/41610)
Change-Id: Ia4363c5ed9b4c9edff5d932b002e57bb7814bf6f
If 'top' was meant to be NULL, then bottom and top can be
swapped. Logic is simpler.
+ fix compilation in non-FANCY_UPSAMPLING mode
Change-Id: I7c62bbb59454017f072c0945d1ff2d24d89286ff
Also created variant VP8LPrefixEncodeBits that returns the
code & extra_bits only.
There's no impact on compression density and compression speed.
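The code/extra-bits split follows the VP8L prefix coding used for LZ77 lengths and distances. The decoder half below mirrors the formula in the VP8L bitstream specification; the encoder half is a sketch of the inverse, and neither reproduces VP8LPrefixEncodeBits' actual signature.

```c
#include <stdint.h>

/* Splits a length/distance value (>= 1) into a prefix code, the number of
 * extra bits, and the extra-bits value. */
static void PrefixEncodeSketch(uint32_t value, int* code, int* extra_bits,
                               uint32_t* extra_value) {
  const uint32_t v = value - 1;
  if (v < 4) {  /* small values are coded directly, no extra bits */
    *code = (int)v;
    *extra_bits = 0;
    *extra_value = 0;
  } else {
    int highest_bit = 31;
    while (!(v >> highest_bit)) --highest_bit;  /* floor(log2(v)) */
    const int second_bit = (int)((v >> (highest_bit - 1)) & 1);
    *extra_bits = highest_bit - 1;
    *extra_value = v & ((1u << *extra_bits) - 1);
    *code = 2 * highest_bit + second_bit;
  }
}

/* Decoder side, following the VP8L spec; 'extra_value' stands for the
 * extra bits that would be read from the stream. */
static uint32_t PrefixDecodeSketch(int code, uint32_t extra_value) {
  if (code < 4) return (uint32_t)code + 1;
  const int extra_bits = (code - 2) >> 1;
  const uint32_t offset = (uint32_t)(2 + (code & 1)) << extra_bits;
  return offset + extra_value + 1;
}
```

For example, 5 encodes as code 4 with one extra bit set to 0, and 7 as code 5 with one extra bit set to 0. A variant returning only the code and extra bits (as described above) avoids work when the caller does not need the full triple.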
Change-Id: I2cafdd3438ac9270cd72ad9d57b383cdddfdfa4c
WebPDemuxPartial() returns NULL for both of the following cases:
- There was a parsing error.
- It doesn't have enough data to start parsing.
Now, one can differentiate between these two cases by checking the value
of 'state' returned by WebPDemuxPartial().
Change-Id: Ia2377f0c516b3fcfae475c0662c4932d2eddcd0b
Earlier, all lossless images were assumed to contain alpha.
Now, we use the 'alpha_is_used' bit from the VP8L bitstream to determine
the same.
Detecting an absence of alpha can sometimes lead to much more efficient
rendering, especially for animated images.
Related: refine mux code to read width/height/has_alpha information only
once per frame/fragment. This avoids frequent calls to VP8(L)GetInfo().
Change-Id: I4e0eef4db7d94425396c7dff6ca5599d5bca8297
Speed up HashChainFindCopy by optimizing on number of calls to
FindMatchLength method.
This change speeds up the lossless & lossy (Alpha) encoding by 20%
without loss of compression density.
At method=3, lossy (Alpha) compression speed (and density) remains
unchanged, as at that setting the costly Backward Refs method is not
called.
Change-Id: Ia1797148e9e4ee2787011837fa248afbae2242cb
Disable costly 'BackwardReferencesTraceBackwards' for encoding Alpha plane.
Increase the threshold for triggering 'BackwardReferencesTraceBackwards' to
quality 25 and above. Also lower the Alpha quality (at method 3) to be
below this threshold (25).
Change-Id: Ic29fb2e6943472c564223df9fe099b19ccda0f31
This speeds up WebP lossless decoding by 20%. In particular, the
photographic images get 35% speedup.
Change-Id: Idb94750342a140ec05df52c07e12be4bba335adc
speeds up those codes that are not part of the main lookup.
This gives a 10% speedup for a photographic image.
Change-Id: Ief54b0ad77db790a01314402ad351b40ac9a7be4
+ some revamp and cleanup of the alpha-filter trial loop
+ EncodeAlphaInternal() now just takes a FilterTrial param
Change-Id: Ief84385083b1cba02678bbcd3dbf707245ee962f
Specialize and simplify the alpha-decoding case, which is used when:
- no color cache is used
- all red/blue/alpha values are the same (and hence their Huffman trees have
only 1 symbol, so we don't need to consume any bits to read these).
+ revamped the loop to use size_t and offsets instead of pointers.
~2-3% faster on Unix (gcc) but up to 25% faster lossy+alpha decoding
on Mac (llvm) and ARM.
Change-Id: I43c9688d1e4811cab0ecf0108a5b8f45781083e6
* 0.3.0: (57 commits)
update ChangeLog
Regression fix for alpha channels using color cache:
wicdec: silence a format warning
muxedit: silence some uninitialized warnings
update ChangeLog
update NEWS
bump version to 0.3.1
Revert "add WebPBlendAlpha() function to blend colors against background"
Simplify forward-WHT + SSE2 version
probe input file and quick-check for WebP format.
configure: improve gl/glut library test
update copyright text
configure: remove use of AS_VAR_APPEND
fix EXIF parsing in PNG
add doc precision for WebPPictureCopy() and WebPPictureView()
remove datatype qualifier for vmnv
fix a memory leak in gif2webp
fix two minor memory leaks in webpmux
remove some cruft from swig/libwebp.jar
README: update swig notes
...
Conflicts:
NEWS
examples/gif2webp.c
src/dec/alpha.c
src/dec/idec.c
src/dec/vp8l.c
src/enc/alpha.c
src/enc/vp8l.c
Change-Id: Ib202fad7825a090c3b3a5169acd171369cface47
+ split AllocateInternalBuffers() into two 32b/8b variants instead of
trying to do everything in one function.
Change-Id: I35cac9fcd990a2194c95da4b2a4046ca3a514343
Considering the fact that insert to/lookup from the color cache is always 32
bit, use DecodeImageData() variant in that case.
Conflicts:
src/dec/vp8l.c
Change-Id: I6c665a6cfbd9bd10651c1e82fa54e687cbd54a2b
(cherry picked from commit a37eff47d6)
src/mux/muxedit.c:490: warning: 'x_offset' may be used uninitialized in this function
src/mux/muxedit.c:490: warning: 'y_offset' may be used uninitialized in this function
Change-Id: I4fd27f717e59a556354d0560b633d0edafe7a4d8
(cherry picked from commit 14cd5c6c40)
Considering the fact that insert to/lookup from the color cache is always 32
bit, use DecodeImageData() variant in that case.
Change-Id: I6c665a6cfbd9bd10651c1e82fa54e687cbd54a2b
src/mux/muxedit.c:490: warning: 'x_offset' may be used uninitialized in this function
src/mux/muxedit.c:490: warning: 'y_offset' may be used uninitialized in this function
Change-Id: I4fd27f717e59a556354d0560b633d0edafe7a4d8
src\dec\vp8l.c(816) : warning C4244: '=' : conversion from '__int64' to 'int', possible loss of data
src\dec\vp8l.c(817) : warning C4244: '=' : conversion from '__int64' to 'int', possible loss of data
Change-Id: I1d376d5dea909395bff8741aba16e8eed83a6e8f
no precision loss observed
speed is barely changed (0.5% at most), as forward-WHT isn't called often.
also: replaced an "int << 3" (undefined by the C spec for negative values) with an "int * 8"
( supersedes https://gerrit.chromium.org/gerrit/#/c/48739/ )
Change-Id: I2d980ec2f20f4ff6be5636105ff4f1c70ffde401
(cherry picked from commit 9c4ce971a8)
rather than symlink the webm/vpx terms, use the same header as libvpx to
reference in-tree files
based on the discussion in:
https://codereview.chromium.org/12771026/
Change-Id: Ia3067ecddefaa7ee01550136e00f7b3f086d4af4
(cherry picked from commit d640614d54)
output picture object is overwritten, not free'd or destroyed.
Change-Id: Ibb47ab444063e7ad90ff3d296260807ffe7ddbf9
(cherry picked from commit 23d28e216d)
The auto-infer logic of detecting the 'Alpha' use case
(via the check '(palette[i] & 0x00ff00ffu) != 0') is failing
for this corner-case image with all black pixels (rgb = 0)
and different Alpha values.
-> switch to generic use-LUT detection
Change-Id: I982a8b28c8bcc43e3dc68ac358f978a4bcc14c36
(cherry picked from commit afa3450c11)
Added a 1-pixel cache for palette colors for faster lookup.
This speeds up lossless compression of images that require ApplyPalette
by 6.5%.
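A 1-pixel cache remembers the last pixel and its palette index, so runs of identical pixels skip the palette search. A minimal sketch with a hypothetical MapToPalette() and a linear search (not the actual ApplyPalette code):

```c
#include <stdint.h>

/* Maps ARGB pixels to palette indices, remembering the last pixel seen:
 * runs of identical pixels skip the palette search entirely. */
static void MapToPalette(const uint32_t* argb, int num_pixels,
                         const uint32_t* palette, int palette_size,
                         uint8_t* indices) {
  uint32_t last_pix = 0;
  int last_idx = -1;  /* -1: cache empty */
  for (int i = 0; i < num_pixels; ++i) {
    const uint32_t pix = argb[i];
    if (last_idx < 0 || pix != last_pix) {
      int k;
      for (k = 0; k < palette_size; ++k) {
        if (palette[k] == pix) break;
      }
      /* assumes every pixel is present in the palette */
      last_pix = pix;
      last_idx = k;
    }
    indices[i] = (uint8_t)last_idx;
  }
}
```

Palettized images tend to contain long horizontal runs of the same color, which is where the single cached entry pays off.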
Change-Id: Id0c5174d797ffabdb09905c2ba76e60601b686f8
(cherry picked from commit 742110ccce)
* "declaration of ‘index’ shadows a global declaration [-Wshadow]"
* "signed and unsigned type in conditional expression [-Wsign-compare]"
Change-Id: I891182d919b18b6c84048486e0385027bd93b57d
(cherry picked from commit 87a4fca25f)
Earlier such images were using roughly 9 * width * height bytes for
decoding. Now, they take 6 * width * height bytes.
Change-Id: Ie4a681ca5074d96d64f30b2597fafdca648dd8f7
(cherry picked from commit 64c844863a)
Simply get rid of an intermediate buffer of size width x height, by
using the fact that stride == width in this case.
Change-Id: I92376a2561a3beb6e723e8bcf7340c7f348e02c2
(cherry picked from commit edccd19436)
doing so is not part of ISO C; removes some pedantic warnings
Change-Id: I739ad8c5cacc133e2546e9f45c0db9d92fb93d7e
(cherry picked from commit 96e948d7b0)
The output surface CAN be changed in between calls to
WebPIUpdate() or WebPIAppend(), but with precautions.
Change-Id: I899afbd95738a6a8e0e7000f8daef3e74c99ddd8
(cherry picked from commit ff885bfe1f)
This applies to images with optional chunks (e.g. images with ALPH chunk,
ICCP chunk etc). Before this, the incremental decoding used to work like
non-incremental decoding for such files, that is, no rows were decoded until
all data was available.
The change is in 2 parts:
- During optional chunk parsing, don't wait for the full VP8/VP8L chunk.
- Remap 'alpha_data' pointer whenever a new buffer is allocated/used in
WebPIAppend() and WebPIUpdate().
Change-Id: I6cfd6ca1f334b9c6610fcbf662cd85fa494f2a91
(cherry picked from commit ead4d47859)
Start VP8EncLoop/VP8EncTokenLoop only if VP8EncStartAlpha succeeded.
Change-Id: Id1faca3e6def88102329ae2b4974bd4d6d4c4a7a
(cherry picked from commit 67708d6701)
new option: -blend_alpha 0xrrggbb
also: don't force picture.use_argb value for lossless. Instead,
delay the YUVA<->ARGB conversion till WebPEncode() is called.
This makes the blending more accurate when the source is ARGB
and lossy compression is used (YUVA).
This has an effect on cropping/rescaling. E.g. for PNG, these
are now done in ARGB colorspace instead of YUV when lossy compression
is used.
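Blending against a 0xrrggbb background amounts to a per-channel weighted average by alpha. The sketch below assumes a rounded division by 255 and hypothetical function names; it is not WebPBlendAlpha()'s exact arithmetic.

```c
#include <stdint.h>

/* Blends one channel: (a * src + (255 - a) * bg) / 255, with rounding. */
static uint32_t BlendChannel(uint32_t src, uint32_t bg, uint32_t a) {
  return (src * a + bg * (255u - a) + 127u) / 255u;
}

/* Returns an opaque ARGB pixel obtained by blending 'argb' over the
 * 0xrrggbb background color. */
static uint32_t BlendPixel(uint32_t argb, uint32_t background_rgb) {
  const uint32_t a = argb >> 24;
  uint32_t out = 0xff000000u;
  for (int shift = 0; shift <= 16; shift += 8) {
    const uint32_t s = (argb >> shift) & 0xff;
    const uint32_t b = (background_rgb >> shift) & 0xff;
    out |= BlendChannel(s, b, a) << shift;
  }
  return out;
}
```

Doing this on ARGB samples, before any YUVA conversion, avoids blending already-subsampled chroma, which matches the accuracy argument above.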
Change-Id: I18571f1b1179881737a8dbd23ad0aa8cddae3c6b
(cherry picked from commit e7d9548c9b)
Tuned the cross_color transform parameter (step) for lower quality
levels. This change gives a speedup of 20% at lower qualities (25) and 10% at
moderate quality level (50) with a loss of 0.25% in compression density.
Also removed TODO for cross_color transform. Observed good correlation of
this with the predict transform.
Change-Id: I8a1044e9f24e6a5f84295c030fd444d0eec7d154
rather than symlink the webm/vpx terms, use the same header as libvpx to
reference in-tree files
based on the discussion in:
https://codereview.chromium.org/12771026/
Change-Id: Ia3067ecddefaa7ee01550136e00f7b3f086d4af4
The auto-infer logic of detecting the 'Alpha' use case
(via the check '(palette[i] & 0x00ff00ffu) != 0') is failing
for this corner-case image with all black pixels (rgb = 0)
and different Alpha values.
-> switch to generic use-LUT detection
Change-Id: I982a8b28c8bcc43e3dc68ac358f978a4bcc14c36
Added a 1-pixel cache for palette colors for faster lookup.
This speeds up lossless compression of images that require ApplyPalette
by 6.5%.
Change-Id: Id0c5174d797ffabdb09905c2ba76e60601b686f8
'mem' was being offset once by DO_ALIGN() and then shifted by 'nz_size', which
would end up accounting for more than ALIGN_CST and exceed the allocation.
broken since:
9bf3129 align VP8Encoder::nz_ allocation
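The safe pattern is to reserve ALIGN_CST bytes of slack once, align once, and carve the following regions from the aligned pointer. In this sketch the DO_ALIGN and ALIGN_CST definitions (16-byte alignment), the Regions struct and AllocRegions() are all assumptions, not the encoder's actual allocation code.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_CST 15  /* assumed: 16-byte alignment */
#define DO_ALIGN(PTR) ((uintptr_t)((PTR) + ALIGN_CST) & ~(uintptr_t)ALIGN_CST)

typedef struct {
  void* base;      /* pointer to free() */
  uint8_t* nz;     /* aligned first region */
  uint8_t* other;  /* second region, right after the first */
} Regions;

/* Allocates 'nz_size' + 'other_size' bytes with the first region aligned.
 * The ALIGN_CST slack is consumed at most once, by the single DO_ALIGN()
 * call, so the second region always ends inside the allocation. */
static int AllocRegions(size_t nz_size, size_t other_size, Regions* r) {
  uint8_t* const mem = (uint8_t*)malloc(nz_size + other_size + ALIGN_CST);
  if (mem == NULL) return 0;
  r->base = mem;
  r->nz = (uint8_t*)DO_ALIGN(mem);
  r->other = r->nz + nz_size;
  return 1;
}
```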
Change-Id: I04a4e0bbf80d909253ce057f8550ed98e0cf1054
* "declaration of ‘index’ shadows a global declaration [-Wshadow]"
* "signed and unsigned type in conditional expression [-Wsign-compare]"
Change-Id: I891182d919b18b6c84048486e0385027bd93b57d
Earlier such images were using roughly 9 * width * height bytes for
decoding. Now, they take 6 * width * height bytes.
Change-Id: Ie4a681ca5074d96d64f30b2597fafdca648dd8f7