use 'u' rather than the unnecessary 'l' as a suffix. this prevents a
conversion warning with some toolchains
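A minimal sketch of the pattern (the surrounding expression is illustrative,
not the exact code touched here):

    #define ALIGN_CST 15u   /* 'u' suffix: unsigned, matches uintptr_t math */
    uint8_t* const mem_aligned =
        (uint8_t*)(((uintptr_t)mem + ALIGN_CST) & ~(uintptr_t)ALIGN_CST);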
Change-Id: I21c33ce08819b3c839c75e03a8f7f3a6041d0695
Note that ALIGN_CST is still kept different in dec/frame.c for now,
because the value is 31 there, not 15. We might re-unite these two
later.
Change-Id: Ibbee607fac4eef02f175b56f0bb0ba359fda3b87
same functionality, but better code layout.
What changed:
* don't trash the palette_[] in EncodePalette(), so it can be re-used
* split generation of image from bit-stream coding
* move all the delta-palette code to delta_palettization.c, and only have one entry point there, WebPSearchOptimalDeltaPalette()
* minimize the number of "#ifdef WEBP_EXPERIMENTAL_FEATURES" in vp8l.c
* clarify the TransformBuffer stuff. more clean-up to come here...
This should make experimenting with delta-palettization easier and more compartmentalized.
Change-Id: Iadaa90e6c5b9dabc7791aec2530e18c973a94610
some limitations: only for RGBA output,
and only if the reduction factor is not too small (dst_width > src_width / 128).
20-25% faster, for a ~4-6% overall improvement in total decoding.
Change-Id: I95366ddaa4a38e0a96bed754dfe790126f7bb84a
It's better to stay with a 32b fixed-point precision overall, otherwise
the C-version on ARM gets *slower*.
Actually, the gcc ARM compiler optimizes some instructions pretty
well when WEBP_RESCALER_FIX is exactly 32, even in C.
Change-Id: I0eea97f7db5947470f5af355dee098eca81e178d
New palette compresses more than 20% better with minimal quality loss.
Tested on set of wikipedia images with command line:
cwebp -delta_palettization
Change-Id: I82ec7d513136599cd70386f607f634502eb9095d
Earlier, we stored a 1x1 frame for such frames. Now, we drop every such
frame and increase the duration of its previous frame instead.
Also, modify the anim_diff tool to handle animated images that are
equivalent, but have a different number of frames.
Change-Id: I2688b1771e1f5f9f6a78e48ec81b01c3cd495403
The rounding and arithmetic are not the same as previously, to prevent overflow cases for large upscale factors.
We still rely on 32b x 32b -> 64b multiplies. Raised the fixed-point precision to 32b
so that we have some nice shifts from epi64 to epi32.
Changed rescaler_t type to 'uint32_t' in order to squeeze in all the precision required.
The MIPS code has been disabled because it's now out-of-sync. Will be fixed in
a subsequent CL when the dust settles.
~30-35% faster
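A hedged sketch of the core fixed-point step described above (names and the
rounding constant are illustrative, not the library's exact code):

    #define WEBP_RESCALER_RFIX 32                      /* precision raised to 32b */
    #define ROUNDER (1ull << (WEBP_RESCALER_RFIX - 1))
    typedef uint32_t rescaler_t;                       /* squeezes in the precision */
    static uint32_t MultFix(rescaler_t x, rescaler_t y) {
      /* 32b x 32b -> 64b multiply, then shift the epi64 result back to epi32 */
      return (uint32_t)(((uint64_t)x * y + ROUNDER) >> WEBP_RESCALER_RFIX);
    }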
Change-Id: I32e4ddc00933f1b1aa3463403086199fd5dad07b
* vertical expansion now uses bilinear interpolation
* heavily assumes that the alpha plane is decoded in full, not row-by-row
* split the RescalerExportRow and RescalerImportRow methods into Shrink
and Expand variants.
* MIPS implementation of ExportRowExpand is missing.
There's room for extra speed optimization and code re-organization, but let's keep that for later patches.
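A minimal sketch of the vertical-expansion idea (two source rows blended with
an 8-bit fraction; the actual fixed-point layout differs):

    /* blend rows A (above) and B (below); 'frac' is in [0, 256) */
    static uint8_t InterpVert(uint8_t a, uint8_t b, uint32_t frac) {
      return (uint8_t)((a * (256 - frac) + b * frac + 128) >> 8);
    }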
addresses https://code.google.com/p/webp/issues/detail?id=254
Change-Id: I8f12b855342bf07dd467fe85e4fde5fd814effdb
This is designed for the simple use-case where one wants to decode all
frames one-by-one in order.
Also, use this API in anim_util library, which is in turn used by
anim_diff tool.
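A hedged usage sketch (assuming the WebPAnimDecoder interface from
webp/demux.h; exact signatures may differ):

    WebPAnimDecoder* const dec = WebPAnimDecoderNew(&webp_data, NULL);
    while (WebPAnimDecoderHasMoreFrames(dec)) {
      uint8_t* frame_rgba;
      int timestamp_ms;
      if (!WebPAnimDecoderGetNext(dec, &frame_rgba, &timestamp_ms)) break;
      /* 'frame_rgba' is owned by 'dec' and only valid until the next call */
    }
    WebPAnimDecoderDelete(dec);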
Change-Id: Ie8b653c04e867d40fd23321b3dd41b87689656c7
This makes the chains more efficient and a larger variety of data is tested.
0.02 % compression gain at q 100, 0.05 % at default quality. 0.8 % speedup
as measured by callgrind.
0.16 % compression gain for lossy alpha ?!
Change-Id: I888120133352799eb14f5f602c7f40ab404bd665
this allows scaling to a particular width/height while preserving the
source aspect ratio using WebPRescalerGetScaledDimensions().
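Usage sketch (assuming a 0 value means "derive this dimension from the source
aspect ratio"):

    int scaled_width = 400;   /* target width */
    int scaled_height = 0;    /* 0: computed to preserve the aspect ratio */
    WebPRescalerGetScaledDimensions(src_width, src_height,
                                    &scaled_width, &scaled_height);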
Change-Id: I77b11528753290c1e9bb942ac761c215ccfb8701
using a *tmp_plane buffer to split a/r/g/b planes up appeared to
be the easiest route, compared to copy-pasting the whole code and
making it x_stride aware...
Change-Id: I0898ef1df62bd3e1713b77187b31b5eeef3832fe
allows the values to be used in preproc checks, fixing a
-Wunreachable-code warning in 64-bit builds where VP8L_WRITER_BITS != 16
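Sketch of the kind of guard this enables (illustrative):

    #if (VP8L_WRITER_BITS == 16)
    /* 16-bit-specific path: now compiled out entirely on other builds
       instead of being flagged by -Wunreachable-code */
    #endif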
Change-Id: Ie98dff4e8ef896436557c64d5da2c5d70228a730
Slightly faster on -m 0 -q 0, particularly for small images (a 50 x 75
image was 0.1 % faster in callgrind measurements).
Increases compression density by 0.005 % for the 1000 images, but small
images can improve by as much as 0.5 % (about 4 bytes, depending on the
characteristics of the palette).
Change-Id: I94f568d396ac62a054a829abeeef3eb0af6b3f94
the x_add/x_sub increments were wrong for u/v in the upscaling case.
They shouldn't be left to the caller's discretion, but set up by
WebPRescalerInit to their exact necessary values.
-> Cleaned-up WebPRescalerInit() param list.
-> added safety asserts
-> removed the mips32/mips_r2 variants of "ImportRow", which were previously buggy
Change-Id: I347c75804d835811e7025de92a0758d7929dfc09
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~54% faster at the
function level and mildly faster overall
Change-Id: Ibc648e9ee35021d48901e05aa596aa01067796a2
a total impact of 1 % on encoding speed
This allows for performance-neutral removal of the binary search
in cache bits selection. This will give a small improvement in
compression density.
Change-Id: If5d4d59460fa1924ce71af977320834a47c2054a
0.21 % compression density improvement for 1000 png corpus in
lossless mode
0.50 % compression density improvement for 1000 png corpus in
lossy mode
Change-Id: I14ee8c427ae5d3e116b0ee6695fcdea3321a319d
valgrind --tool=callgrind shows a 9 % speedup: 1021201984 ticks before vs.
927917709 after
-q 0 -m 0 -lossless ~/alpi/1.png
22.040 MP/s before
24.796 MP/s after
Change-Id: Iaab928167b3e20fb0d9401c6f8317a26c5a610b4
* changes:
lossless: combine the Huffman code with extra bits
lossless: Inlining add literal
lossless: simplify HashChainFindCopy heuristics
lossless: 0.5 % compression density improvement
lossless: Add zeroes into the predicted histograms.
lossless: encoding, don't compute unnecessary histo
lossless: Remove about 25 % of the speed degradation
Faster alpha coding for webp
lossless: rle mode not to accept lengths smaller than 4.
lossless: Less code for the entropy selection
lossless: 0.37 % compression density improvement
_BitScanReverse() takes an unsigned long*
http://msdn.microsoft.com/en-us/library/fbxyd7zd.aspx
fixes:
C4057: 'function': 'unsigned long *' differs in indirection to slightly
different base types from 'uint32_t *'
fixes issue #253
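A minimal sketch of the fix (the helper name is illustrative):

    #include <intrin.h>
    static int BitsLog2Floor(uint32_t n) {
      unsigned long first_set_bit;   /* unsigned long, not uint32_t: no C4057 */
      _BitScanReverse(&first_set_bit, n);
      return (int)first_set_bit;
    }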
Change-Id: I0101ef7be18c7ed188b35e9b17e7f71290953786
do not do length-2 matches far away
speedup for non-compressible data by inserting two literals at a time
when no matches are found
Change-Id: Ia8e033071f4186bb8148bb2bf13ca37586734aa3
Increases compression density by 0.03 % for lossy.
Speeds up at least one of the lossy alpha images by 20 %.
Palette entropy 'kludge' seems to save 1-2 % on alpha images.
Change-Id: I2116b8d81593ac8173bfba54a7c833997fca0804
share the computation between different modes
3-5 % speedup for lossless alpha
1 % for lossy alpha
no change in compression density
Change-Id: I5e31413b3efcd4319121587da8320ac4f14550b2
introduced in:
"lossless: 0.37 % compression density improvement"
Uses the statistics of red and blue histograms to decide whether to run
cross color correction at all.
Improves compression density by 0.02 % or so.
Change-Id: I47429557e9cdbd9fa90c584696f241b17427d73f
No significant size degradation (+0.001 %) for 1000 image corpus
Fixes the 8 ms vs 2 ms degradation from:
"lossless: 0.37 % compression density improvement"
Change-Id: Id540169a305d9d5c6213a82b46c879761b3ca608
counting the entropy expectation for five different configurations:
palette
non-predicted
non-predicted with subtract green
predicted
predicted with subtract green
and choosing the strategy with the smallest expected entropy
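A rough sketch of the selection (names are illustrative):

    /* histo[0..4]: entropy estimates gathered for the five configurations */
    int best_config = 0;
    double best_bits = EstimateBits(histo[0]);   /* hypothetical helper */
    int i;
    for (i = 1; i < 5; ++i) {
      const double bits = EstimateBits(histo[i]);
      if (bits < best_bits) { best_bits = bits; best_config = i; }
    }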
Change-Id: Iaaf209c0d565660a54a4f9b3959067afb9951960
this should be used in preference to free() for releasing memory
returned from WebPDecode*() / WebPEncode*(). this simplifies memory
management when working through language bindings
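Usage sketch:

    int width, height;
    uint8_t* const rgba = WebPDecodeRGBA(data, data_size, &width, &height);
    if (rgba != NULL) {
      /* ... use the pixels ... */
      WebPFree(rgba);   /* not free(): matches the library's allocator */
    }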
Change-Id: I15eb538a45390efc552fda8e5c251a3fbdc13c29
After several trials at re-organizing the main loop and accumulation scheme,
this is apparently the faster variant.
removed the SSE4.1 version, which is no longer faster.
For some reason, the AVX variant seems to benefit most from the change.
Change-Id: Ib11ee18dbb69596cee1a3a289af8e2b4253de7b5
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~70% faster at the
function level and 1-2% faster overall
Change-Id: I59fb4918ec86b1ac3a47cbd5d05ce62f007461cb
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.
Removed the SSE4.1 implementation which is now slower than SSE2.
Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
New implementations of SubtractGreenFromBlueAndRed and TransformColor;
around 1-2% faster lossless encoding.
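A scalar sketch of the subtract-green transform these functions vectorize
(per 0xAARRGGBB pixel; illustrative, not the new SSE2 code itself):

    static uint32_t SubtractGreen(uint32_t argb) {
      const uint32_t green = (argb >> 8) & 0xff;
      const uint32_t new_r = (((argb >> 16) & 0xff) - green) & 0xff;  /* mod 256 */
      const uint32_t new_b = (((argb >> 0) & 0xff) - green) & 0xff;
      return (argb & 0xff00ff00u) | (new_r << 16) | new_b;
    }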
Change-Id: I1668e36fdc316ba55b3b798b91b4a3e36ce62861
DispatchAlpha* functions are hard to speed up, compared to SSE2.
ExtractAlpha sees a ~15% speed-up though.
Change-Id: I8715c2defecbc832f469eed7e6ffd012146b52de
over a 1000 image corpus
Single photograph benchmark:
Before:
Q=20: 2.560 MP/s
Q=40: 2.593 MP/s
Q=60: 1.795 MP/s
Q=80: 1.603 MP/s
Q=99: 1.122 MP/s
After:
Q=20: 3.334 MP/s
Q=40: 2.464 MP/s
Q=60: 2.009 MP/s
Q=80: 1.871 MP/s
Q=99: 1.163 MP/s
This CL allows for some further improvements that would not be possible
otherwise.
Change-Id: I61ba154beca2266cb96469281cf96e84a4412586
use vld1_dup_u8() rather than a separate ld+dup after the values were
zero extended; mildly faster at the function level
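Sketch (illustrative pointer):

    /* load *p and replicate it to all 8 lanes in one instruction,
       replacing a separate vld1 + vdup pair */
    const uint8x8_t v = vld1_dup_u8(p);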
Change-Id: I1b3666a6aeb465722a1214dbc6d71c27689a7f89
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%
based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary
Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%
based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary
Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
structured extended feature flags require eax = 7; avoids incorrectly
detecting avx2 on some older processors that support avx.
for completeness, also check for value=1 support, which is used by the other
checks.
from [1]:
INPUT EAX = 0: Returns CPUID's Highest Value for Basic Processor
Information and the Vendor Identification String
[1]
http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html
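A sketch of the check (GetCPUInfo is an illustrative wrapper around the cpuid
instruction):

    int regs[4];              /* eax, ebx, ecx, edx */
    GetCPUInfo(regs, 0);      /* eax = 0: highest supported basic leaf */
    if (regs[0] >= 7) {
      GetCPUInfo(regs, 7);    /* eax = 7: structured extended feature flags */
      /* AVX2 is reported in ebx bit 5 of leaf 7 */
    }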
Change-Id: I60b20d661a978d551614dbf7acdc25db19cb6046
* make VP8LPrefetchBits() safe wrt past-EOS reads
* set 'BitReader::bits_' to a safe shifting value upon EOS
no visible performance difference on x86
Change-Id: I0a4177928cfa81d5dfc9054b36a686eaa1bf8c65
Speed-wise equivalent on x86 and ARM (maybe a tad faster, hard to tell).
Note that the two 32-bit multiples are not strictly equivalent
to the 64-bit one, since we're missing one carry propagation.
In practice, no observable difference was seen because of this
slightly different hashing result.
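A hedged sketch of the two-multiply form (constants and names are
illustrative):

    #define HASH_BITS 18
    static uint32_t HashPixPair(const uint32_t* const argb) {
      uint32_t key = argb[1] * 0xc6a4a793u;   /* high-word contribution */
      key += argb[0] * 0x5bd1e996u;           /* low word; the carry between the
                                                 two 32b products is dropped */
      return key >> (32 - HASH_BITS);
    }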
Change-Id: I8f2381175eae1cb20dabf149e6b27e1768fba6ab
update makefile.unix to provide 'make extras' building instructions.
note: input ordering depends on WEBP_SWAP_16BIT_CSP for rgb565 and rgb4444
Change-Id: I6f22d32189d9ba2619146a9714cedabfe28e2ad0
When converting from video sources, the duration of current frame
is often unavailable until the next frame. So, we internally convert
timestamps to durations.
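In other words (sketch):

    /* the duration of frame i only becomes known once frame i+1 arrives */
    duration[i] = timestamp[i + 1] - timestamp[i];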
Change-Id: I20ad86361c22e014be7eb91f00d5d40108281351
use psadbw to perform top row summation; left remains in C as repacking
it into a vector to apply the same operation is too costly.
DC8uv: ~19% faster
DC8uvNoLeft: ~12% faster
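A sketch of the top-row summation (illustrative; 8 pixels for the chroma
case):

    const __m128i zero = _mm_setzero_si128();
    const __m128i top_row = _mm_loadl_epi64((const __m128i*)top);  /* 8 bytes */
    /* psadbw against zero yields the horizontal sum of the 8 bytes */
    const __m128i sum = _mm_sad_epu8(top_row, zero);
    const int dc = _mm_cvtsi128_si32(sum);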
Change-Id: I707c4f6177a65b5d1f2d3deeca87d2bb740185e2
use psadbw to perform top row summation; left remains in C as repacking
it into a vector to apply the same operation is too costly.
DC16: ~20% faster
DC16NoLeft: ~14% faster
Change-Id: I7ec3f8a6e5923f88a530f79fceb88d5001bef691
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning
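A sketch of the pattern (treat the macro as illustrative):

    /* emit an empty init function so the symbol exists even when the
       architecture-specific implementation is compiled out */
    #define WEBP_DSP_INIT_STUB(func) \
      extern void func(void);        \
      void func(void) {}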
Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147
had to rename a few structs.
-> we can now include both vp8i.h and vp8enci.h without naming
conflicts.
Change-Id: Ib41b498f1b57aab3d6b796361afc45210ec75174
Visible speed-up, thanks to the use of pshufb, pabsw and psignw.
had to tweak configure.ac to make "smmintrin.h" presence correctly
detected (we need to set the CPPFLAGS instead of the CFLAGS!)
Change-Id: I2ab99e16a27a64fdf1f09b2b4e30a5e74ccca080
allows the former to be inlined; negligible speed-up in most cases,
however this structure is consistent with the rest of the optimized
modules
Change-Id: Ib080240b06f7a995b47f1906627850c355b82901
we look at the average global improvement and stop when things are
moving slowly, or when we already had a quite good first iteration
(meaning: the picture is "not difficult")
Change-Id: I8ab7d100353039b5b32bb5fac3fe03c8440c78d5
Speed up StoreImageToBitMask by replacing the code that finds the histogram
index and Huffman tree codes at every iteration with more optimal code that
updates these only when the current pixel (to write) crosses the histogram
tile-row boundary.
This change speeds up the StoreImageToBitMask method by 5%.
Change-Id: If01a1ccd7820f9a3a3e5bc449d070defa51be14b
the standard vtbl functions are available there [1][2].
based on a patch from: aaroncrespo
fixes issue #243.
[1]
http://adcdownload.apple.com//Developer_Tools/Xcode_6.3_beta/Xcode_6.3_beta_Release_Notes.pdf
[2] Apple LLVM Compiler Version 6.1
- Xcode 6.3 updates the Apple LLVM compiler to version 6.1.0.
[...]
Support for the arm64 architecture has been significantly revised to
align with ARM's implementation, where the most visible impact is that a
few of the vector intrinsics have changed to match ARM's specifications.
Change-Id: I79a0016f44b9dbe36d0373f7f00a50ab3c2ca447
The MIPS code for cost is not updated yet, which is why I keep Residual::*cost
around for now. It should be removed in favor of *costs later.
Change-Id: Id1d09a8c37ea8c5b34ad5eb8811d6a3ec6c4d89f