* inverse transform is actually slower with intrinsics + gcc-4.6,
so is left disabled for now.
With gcc-4.8, it's a bit faster than inlined assembly.
* Sum of Square error function provide a 2-3% speed up
There's enabled by default (since there's no inlined-asm equivalent)
Change-Id: I361b3f0497bc935da4cf5b35e330e379e71f498a
+ misc cosmetics
* seems 4% slower than inlined-asm with gcc-4.6
* is a tad faster (<1%) with gcc-4.8
(disabled for now)
Change-Id: Iea6cd00053a2e9c1b1ccfdad1378be26584f1095
The nice trick is to pack 8 u + 8 v samples into a single uint8x16x_t
register, and re-use the previous (luma) functions
Change-Id: Idf50ed2d6b7137ea080d603062bc9e0c66d79f38
This change gains back 1% in compression density for method=3 and 0.5% for
method=4, at the expense of 10% slower compression speed.
Change-Id: I491aa1c726def934161d4a4377e009737fbeff82
+ added some work-around gcc-4.6 to make it compile (except one function).
+ lots of revamping
All variants tested ok.
Speed-up is ~5-7%
Change-Id: I5ceda2ee5debfada090907fe3696889eb66269c3
vertical only currently, 2.5-3% faster
placed under USE_INTRINSICS as this change depends on the simple
loopfilter
improves the simple loopfilter slightly thanks to some reorganization
Change-Id: I6611441fa54228549b21ea74c013cb78d53c7155
When 4 pixels are left, they should be processed with SSE2.
Decoding is marginally faster (~0.4%).
Encoding speed: No observable difference.
Change-Id: I3cf21c07145a560ff795451e65e64faf148d5c3e
new file: lossless_neon.c
speedup is ~5%
gcc 4.6.3 seems to be doing some sub-optimal things here,
storing register on stack using 'vstmia' and such.
Looks similar to gcc.gnu.org/bugzilla/show_bug.cgi?id=51509
I've tried adding -fno-split-wide-types and it does help
the generated assembly. But the overall speed gets worse with
this flag. We should only compile lossless_neon.c with it -> urk.
Change-Id: I2ccc0929f5ef9dfb0105960e65c0b79b5f18d3b0
It's disable for now, because it crashes gcc-4.6.3 during compilation
with -O2 or -O3. It's been tested OK with -O1.
Code is still globally disabled with USE_INTRINSICS, though.
Change-Id: I3ca6cf83f3b9545ad8909556f700758b3cefa61c
disabled for now (but tested OK), thanks to the USE_INTRINSICS #define
We'll activate the code when we're on par with non-intrinsics
Change-Id: Idbfb9cb01f4c7c9f5131b270f8c11b70d0d485ff
Tune HistogramCombineBin for hard images that are larger than 1-2 Mega
pixel and represent photographic images.
This speeds up lossless encoding on 1000 image corpus by 10-12% and compression
penalty of 0.1-0.2%.
Change-Id: Ifd03b75c503b9e886098e5fe6f86be0391ca8e81
there's still some malloc/free in the external example
This is an encoder API change because of the introduction
of WebPMemoryWriterClear() for symmetry reasons.
The MemoryWriter object should probably go in examples/ instead
of being in the main lib, though.
mux_types.h stil contain some inlined free()/malloc() that are
harder to remove (we need to put them in the libwebputils lib
and make sure link is ok). Left as a TODO for now.
Also: WebPDecodeRGB*() function are still returning a pointer
that needs to be free()'d. We should call WebPSafeFree() on
these, but it means exposing the whole mechanism. TODO(later).
Change-Id: Iad2c9060f7fa6040e3ba489c8b07f4caadfab77b
expose the predictor array as function pointers instead
of each individual sub-function
+ merged Average2() into ClampedAddSubtractHalf directly
+ unified the signature as "VP8LProcessBlueAndRedFunc"
no speed diff observed
Change-Id: Ic3c45dff11884a8330a9ad38c2c8e82491c6e044
With -bypass_filter switched on, the lossless-compressed
data is decoded ahead of time (before being transformed and
display). Hence, the last row was called twice.
http://code.google.com/p/webp/issues/detail?id=193
Change-Id: I9e13f495f6bd6f75fa84c4a21911f14c402d4b10
(and ~2-3% on ARM)
We don't need to store cost/score for each node, but only for
the current and previous one -> simplify code and save some memory.
Also made the 'Node' structure tighter.
Change-Id: Ie3ad7d3b678992b396242f56e2ac387fe43852e6
all the functions involved return double and later these locals are used
in double calculations. fixes a vs build warning
Change-Id: Idb547104ef00b48c71c124a774ef6f2ec5f30f14
Get back some of the compression gains by extending the search space for
GetBestGreenRedToBlue. Also removed the SkipRepeatedPixels call, as it was not
helping much in yielding better compression density.
Before:
1000 files, 63530337 pixels, 1 loops => 45.0s (45.0 ms/file/iterations)
Compression (output/input): 2.463/3.268 bpp, Encode rate (raw data): 1.347 MP/s
After:
1000 files, 63530337 pixels, 1 loops => 45.9s (45.9 ms/file/iterations)
Compression (output/input): 2.461/3.268 bpp, Encode rate (raw data): 1.321 MP/s
Change-Id: I044ba9d3f5bec088305e94a7c40c053ca237fd9d