* inverse transform is actually slower with intrinsics + gcc-4.6,
so is left disabled for now.
With gcc-4.8, it's a bit faster than inlined assembly.
* Sum of Square error function provide a 2-3% speed up
There's enabled by default (since there's no inlined-asm equivalent)
Change-Id: I361b3f0497bc935da4cf5b35e330e379e71f498a
+ misc cosmetics
* seems 4% slower than inlined-asm with gcc-4.6
* is a tad faster (<1%) with gcc-4.8
(disabled for now)
Change-Id: Iea6cd00053a2e9c1b1ccfdad1378be26584f1095
The nice trick is to pack 8 u + 8 v samples into a single uint8x16x_t
register, and re-use the previous (luma) functions
Change-Id: Idf50ed2d6b7137ea080d603062bc9e0c66d79f38
This change gains back 1% in compression density for method=3 and 0.5% for
method=4, at the expense of 10% slower compression speed.
Change-Id: I491aa1c726def934161d4a4377e009737fbeff82
+ added some work-around gcc-4.6 to make it compile (except one function).
+ lots of revamping
All variants tested ok.
Speed-up is ~5-7%
Change-Id: I5ceda2ee5debfada090907fe3696889eb66269c3
vertical only currently, 2.5-3% faster
placed under USE_INTRINSICS as this change depends on the simple
loopfilter
improves the simple loopfilter slightly thanks to some reorganization
Change-Id: I6611441fa54228549b21ea74c013cb78d53c7155
When 4 pixels are left, they should be processed with SSE2.
Decoding is marginally faster (~0.4%).
Encoding speed: No observable difference.
Change-Id: I3cf21c07145a560ff795451e65e64faf148d5c3e
new file: lossless_neon.c
speedup is ~5%
gcc 4.6.3 seems to be doing some sub-optimal things here,
storing register on stack using 'vstmia' and such.
Looks similar to gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
I've tried adding -fno-split-wide-types and it does help
the generated assembly. But the overall speed gets worse with
this flag. We should only compile lossless_neon.c with it -> urk.
Change-Id: I2ccc0929f5ef9dfb0105960e65c0b79b5f18d3b0
It's disable for now, because it crashes gcc-4.6.3 during compilation
with -O2 or -O3. It's been tested OK with -O1.
Code is still globally disabled with USE_INTRINSICS, though.
Change-Id: I3ca6cf83f3b9545ad8909556f700758b3cefa61c
disabled for now (but tested OK), thanks to the USE_INTRINSICS #define
We'll activate the code when we're on par with non-intrinsics
Change-Id: Idbfb9cb01f4c7c9f5131b270f8c11b70d0d485ff
Tune HistogramCombineBin for hard images that are larger than 1-2 Mega
pixel and represent photographic images.
This speeds up lossless encoding on 1000 image corpus by 10-12% and compression
penalty of 0.1-0.2%.
Change-Id: Ifd03b75c503b9e886098e5fe6f86be0391ca8e81
there's still some malloc/free in the external example
This is an encoder API change because of the introduction
of WebPMemoryWriterClear() for symmetry reasons.
The MemoryWriter object should probably go in examples/ instead
of being in the main lib, though.
mux_types.h stil contain some inlined free()/malloc() that are
harder to remove (we need to put them in the libwebputils lib
and make sure link is ok). Left as a TODO for now.
Also: WebPDecodeRGB*() function are still returning a pointer
that needs to be free()'d. We should call WebPSafeFree() on
these, but it means exposing the whole mechanism. TODO(later).
Change-Id: Iad2c9060f7fa6040e3ba489c8b07f4caadfab77b
expose the predictor array as function pointers instead
of each individual sub-function
+ merged Average2() into ClampedAddSubtractHalf directly
+ unified the signature as "VP8LProcessBlueAndRedFunc"
no speed diff observed
Change-Id: Ic3c45dff11884a8330a9ad38c2c8e82491c6e044
With -bypass_filter switched on, the lossless-compressed
data is decoded ahead of time (before being transformed and
display). Hence, the last row was called twice.
http://code.google.com/p/webp/issues/detail?id=193
Change-Id: I9e13f495f6bd6f75fa84c4a21911f14c402d4b10
(and ~2-3% on ARM)
We don't need to store cost/score for each node, but only for
the current and previous one -> simplify code and save some memory.
Also made the 'Node' structure tighter.
Change-Id: Ie3ad7d3b678992b396242f56e2ac387fe43852e6
all the functions involved return double and later these locals are used
in double calculations. fixes a vs build warning
Change-Id: Idb547104ef00b48c71c124a774ef6f2ec5f30f14
Get back some of the compression gains by extending the search space for
GetBestGreenRedToBlue. Also removed the SkipRepeatedPixels call, as it was not
helping much in yielding better compression density.
Before:
1000 files, 63530337 pixels, 1 loops => 45.0s (45.0 ms/file/iterations)
Compression (output/input): 2.463/3.268 bpp, Encode rate (raw data): 1.347 MP/s
After:
1000 files, 63530337 pixels, 1 loops => 45.9s (45.9 ms/file/iterations)
Compression (output/input): 2.461/3.268 bpp, Encode rate (raw data): 1.321 MP/s
Change-Id: I044ba9d3f5bec088305e94a7c40c053ca237fd9d
Optimize and re-structured VP8LGetHistoImageSymbols method, by using the bin-hash
for merging the Histograms more efficiently, instead of the randomized
heuristic of existing method HistogramCombine.
This change speeds up the Lossless encoding by 40-50% (for method=4 and Q > 50)
with 0.8% penalty in compression density. For lower method, the speed up is 25-30%,
with 0.4% penalty in the compression density.
Change-Id: If61adadb1a041b95def6405aa1fe3b83c3cb25ce
Restructure PredictorInverseTransform & ColorSpaceInverseTransform to remove
one if condition inside the main/critial loop. Also separated TransformColor &
TransformColorInverse into separate functions and avoid one 'if condition'
inside this critical method.
This change speeds up lossless decoding for Lenna image about 5% and 1000 image
corpus by 3-4%.
Change-Id: I4bd390ffa4d3bcf70ca37ef2ff2e81bedbba197d
These are presets for lossless coding, similar to zlib.
The shortcut for lossless coding is now, e.g.:
cwebp -z 5 in.png -o out_lossless.webp
There are 10 possible values for -z parameter:
0 (fastest, lowest compression)
to 9 (slowest, best compression)
A reasonable tradeoff is -z 6, e.g.
-z 9 can be quite slow, so use with care.
This -z option is just a shortcut for some pre-defined
'-lossless -m xx -q yy' combinations.
Change-Id: I6ae716456456aea065469c916c2d5ca4d6c6cf04
(We didn't need the exact value of the max_error properly.
We can work with relative values instead of absolute)
Output is bitwise the same as before.
Change-Id: I67aeaaea5f81bfd9ca8e1158387a5083a2b6c649
Refactor code for HistogramCombine and optimize the code by calculating
the combined entropy and avoid un-necessary Histogram merges.
This speeds up lossless encoding by 1-2% and almost no impact on compression
density.
Change-Id: Iedfcf4c1f3e88077bc77fc7b8c780c4cd5d6362b
mostly by:
- storing a single rd-score instead of cost / distortion separately
- evaluating terminal cost only once
- getting some invariants out of the loops
- more consts behind fewer variables
Change-Id: I79451f3fd1143d6537200fb8b90d0ba252809f8c
incorporate non-last cost in per-level cost table
also: correct trellis-quant cost evaluation at nodes
(output a little bit different now). Method 6 is ~4% faster.
Change-Id: Ic48bd6d33f9193838216e7dc3a9f9c5508a1fbe8
Speedup lossless encoder by 20-25% by optimizing:
- GetBestColorTransformForTile: Use techniques like binary search and
local minima search to reduce the search space.
- VP8LFastSLog2Slow & VP8LFastLog2Slow: Adding the correction factor for
log(1 + x) and increase the threshold for calling the approximate
version of log_2 (compared to costly call to log()).
Change-Id: Ia2444c914521ac298492aafa458e617028fc2f9d
converts 2 s16 vectors to 2 u8 and store to uint8_t destination;
TransformAC3 can reuse this after a rework
Change-Id: Ia9370283ee3d9bfbc8c008fa883412100ff483d0
Separate the C version from the MIPS32 version and have run-time
initialization during RescalerInit()
Change-Id: I93cfa5691c073a099fe62eda1333ad2bb749915b
Increase the initial buffer size for VP8L Bit Writer from 4bpp to 8bpp.
The resize buffer is expensive (requires realloc and copy) and this additional
memory (0.5 * W * H) doesn't add much overhead on the lossless encoder.
Change-Id: Ic1fe55cd7bc3d1afadc799e4c2c8786ec848ee66
Optimize 'VP8LCalculateEstimateForCacheSize' for lower quality ranges (Q < 50).
The entropy is generally lower for higher cache_bits, so start searching from
higher cache_bits and settle for a local minima, instead of evaluating all
values.
This speeds up the lossless encoding at lower qualities by 10-15%.
Change-Id: I33c1e958515a2549f2e6f64b1aab3f128660dcec
* simplify the endian logic
* remove the need for memset()
* write 16 or 32 at a time (likely aligned)
Makes the code a bit faster on ARM (~1%)
Change-Id: I650bc5654e8d0b0454318b7a78206b301c5f6c2c
-> remove the 'color_transform' multiplier, use more constants, etc.
This function is particularly critical, mostly because of
GetBestColorTransformForTile().
Loop is a bit faster (maybe ~1%)
Change-Id: I90c96a3437cafb184773acef55c77e40c224388f