Commit Graph

1350 Commits

Author SHA1 Message Date
Pascal Massimino
b3a616b356 make HistogramAdd() a pointer in dsp
* merged the two HistogramAdd/AddEval() into a single call
  (with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
  of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster

Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
2014-04-28 10:09:34 -07:00
James Zern
c8bbb636ea dec_neon: relocate some inline-asm defines
move simple loop filter defines closer to their use and LOAD* to a
location common with the intrinsics

Change-Id: Iaec506d27bbc9a01be20936e30b68a4b0e690ee3
2014-04-28 00:41:42 -07:00
James Zern
4e393bb9f1 dec_neon: enable intrinsics-only functions
the complex loop filter has no inline equivalent; the simple loop filter
remains conditional on USE_INTRINSICS: it's left undefined for now.

Change-Id: I4f258e10458df53a7a1819707c8f46b450e9d9d2
2014-04-28 00:39:46 -07:00
James Zern
ba99a922ab dec_neon: use positive tests for USE_INTRINSICS
makes Simple* layout consistent with the rest of the file

Change-Id: Ib3108b0f2c694c634210e22027c253ea6236a9c6
2014-04-28 00:38:47 -07:00
James Zern
a7828e8bdb dec_neon: make WORK_AROUND_GCC conditional on version
Change-Id: Ic1b95f8749988de90df7c1ff6c537a21981329db
2014-04-28 00:01:19 -07:00
pascal massimino
3f3d717a6c Merge "enc_neon: enable intrinsics-only functions" 2014-04-27 02:05:53 -07:00
pascal massimino
de3cb6c820 Merge "move LOCAL_GCC_VERSION def to dsp.h" 2014-04-27 02:04:08 -07:00
pascal massimino
ca49e7ad97 Merge "enc_neon: move Transpose4x4 to dsp/neon.h" 2014-04-27 01:11:05 -07:00
Pascal Massimino
ad900abddd Merge "fix warning about size_t -> int conversion" 2014-04-27 01:07:03 -07:00
Pascal Massimino
4825b4360d fix warning about size_t -> int conversion
+ re-order and add some const

Change-Id: I3746520b75699e56e20835d10d1dd9cd9fd6d85d
2014-04-27 00:50:07 -07:00
James Zern
42b35e086b enc_neon: enable intrinsics-only functions
CollectHistogram / SSE* / QuantizeBlock have no inline equivalents,
enable them where possible and use USE_INTRINSICS to control borderline
cases: it's left undefined for now.

Change-Id: I62235bc4ddb8aa0769d1ce18a90e0d7da1e18155
2014-04-26 19:09:04 -07:00
James Zern
f937e01261 move LOCAL_GCC_VERSION def to dsp.h
+ add LOCAL_GCC_PREREQ and use it in lossless_neon.c

Change-Id: Ic9fd99540bc3e19c027d1598e4530dfdc9b9de00
2014-04-26 19:09:04 -07:00
James Zern
5e1a17ef4b enc_neon: move Transpose4x4 to dsp/neon.h
+ reuse it in TransformWHT()

Change-Id: Idfbd0f9b58d6253ac3d65ba55b58989c427ee989
2014-04-26 14:06:04 -07:00
James Zern
c7b92a5a29 dec_neon: (WORK_AROUND_GCC) delete unused Load4x8
using this in Load4x16 was slightly slower and didn't help mitigate any
of the remaining build issues with 4.6.x.

Change-Id: Idabfe1b528842a514d14a85f4cefeb90abe08e51
2014-04-26 12:36:14 -07:00
Vikas Arora
0b896101b4 Reduce memory footprint for encoding WebP lossless.
Reduce calls to Malloc (WebPSafeMalloc/WebPSafeCalloc) for:
- Building HashChain data-structure used in creating the backward references.
- Creating Backward references for LZ77 or RLE coding.
- Creating Huffman tree for encoding the image.
For the above mentioned code-paths, allocate memory once and re-use it
subsequently.

Reduce the foorprint of VP8LHistogram struct by changing the Struct
field 'literal_' from an array of constant size to dynamically allocated
buffer based on the input parameter cache_bits.

Initialize BitWriter buffer corresponding to 16bpp (2*W*H).
There are some hard-files that are compressed at 12 bpp or more. The
realloc is costly and can be avoided for most of the WebP lossless
images by allocating some extra memory at the encoder initializaiton.

Change-Id: I1ea8cf60df727b8eb41547901f376c9a585e6095
2014-04-26 01:14:33 -07:00
Djordje Pesut
1d62acf6af MIPS: MIPS32r1: Added optimization for HuffmanCost functions.
HuffmanCost and HuffmanCostCombined optimized and added
'const' to some variables from ExtraCost functions.

Change-Id: I28b2b357a06766bee78bdab294b5fc8c05ac120d
2014-04-24 11:14:57 +02:00
James Zern
c0220460e9 Merge "Bugfix: Incremental decode of lossy-alpha" 2014-04-22 16:33:12 -07:00
Urvang Joshi
8c7cd722f6 Bugfix: Incremental decode of lossy-alpha
When remapping buffer, br->eos_ was wrongly being set to true for
certain
images.

Also, refactored the end-of-stream detection as a function.

Reported in http://crbug.com/364830

Change-Id: I716ce082ef2b505fe24246b9c14912d8e97b5d84
2014-04-22 16:06:32 -07:00
Djordje Pesut
7955152d58 MIPS: fix error with number of registers.
Some versions of compiler in debug build can't find
a register in class 'GR_REGS' while reloading 'asm'

Number of used registers is decreased in this fix.

Change-Id: I7d7b8172b8f37f1de4db3d8534a346d7a72c5065
2014-04-22 12:06:45 +02:00
skal
b1dabe3767 Merge "Move the HuffmanCost() function to dsp lib" 2014-04-18 12:08:22 -07:00
skal
75b12006e3 Move the HuffmanCost() function to dsp lib
This is to help further optimizations.
(like in https://gerrit.chromium.org/gerrit/#/c/69787/)

There's a small slowdown (~0.5% at -z 9 quality) due to
function pointer usage. Note that, for speed, it's important
to return VP8LStreaks by value, and not pass a pointer.

Change-Id: Id4167366765fb7fc5dff89c1fd75dee456737000
2014-04-18 11:59:48 -07:00
Djordje Pesut
2772b8bd98 MIPS: fix assembler error revealed by clang's debug build
.set at -  Indicates that macro expansions may clobber
           the assembler temporary ($at or $28) register.
           Some macros may not be expanded without this
           and will generate an error message if noat
           is in effect.

"at" also added to the clobber list.

Change-Id: I67feebbd9f2944fc7f26c28496e49e1e2348529d
2014-04-18 18:10:52 +02:00
James Zern
6653b601ef enc_mips32: fix unused symbol warning in debug
move kC1 / kC2 under __OPTIMIZE__
missed in:
8dec120 enc_mips32: disable ITransform(One) in debug builds

Change-Id: Ic9a12e6d73090c8c06b0e7a4bc56dd9c76b8e596
2014-04-17 23:35:36 -07:00
James Zern
8dec120975 enc_mips32: disable ITransform(One) in debug builds
avoids:
src/dsp/enc_mips32.c: In function 'ITransformOne':
src/dsp/enc_mips32.c:123:3: can't find a register in class 'GR_REGS' while reloading 'asm'
src/dsp/enc_mips32.c:123:3: 'asm' operand has impossible constraints

Change-Id: Ic469667ee572f25e502c9873c913643cf7bbe89d
2014-04-17 20:10:31 -07:00
James Zern
98519dd5c1 enc_neon: convert Disto4x4 to intrinsics
Change-Id: I0f00d5af2de2301e8237c2a38a9612d3645abad6
2014-04-17 18:29:31 -07:00
Pascal Massimino
fe9317c9bf cosmetics:
* remove MIPS32 suffix from static function names
* fix a long line in enc_neon.c

Change-Id: Ia1294ae46f471b3eb1e9ba43c6aa1b29a7aeb447
2014-04-16 00:36:19 -07:00
James Zern
953b074677 enc_neon: cosmetics
fix/remove incorrect comments
+ whitespace

Change-Id: Id1b86beb23e5bf946e73c34ab7066b6ca177f33b
2014-04-15 23:57:03 -07:00
skal
a9fc697cb6 Merge "WIP: extract the float-calculation of HuffmanCost from loop" 2014-04-15 11:33:11 -07:00
skal
3f84b5219d Merge "replace some mult-long (vmull_u8) with mult-long-accumulate (vmlal_u8)" 2014-04-15 07:09:12 -07:00
Djordje Pesut
4ae0533f39 MIPS: MIPS32r1: Added optimizations for ExtraCost functions.
ExtraCost and ExtraCostCombined

Change-Id: I7eceb9ce2807296c6b43b974e4216879ddcd79f2
2014-04-15 15:37:06 +02:00
skal
b30a04cf11 WIP: extract the float-calculation of HuffmanCost from loop
new function: VP8FinalHuffmanCost()

Change-Id: I42102f8e5ef6d7a7af66490af77b7dc2048a9cb9
2014-04-15 14:52:52 +02:00
skal
a8fe8ce231 Merge "NEON intrinsics version of CollectHistogram" 2014-04-15 03:00:45 -07:00
skal
95203d2d1b NEON intrinsics version of CollectHistogram
apparently faster, but we might save some load/store to/from memory
once we settle for the intrinsics-based FTransform()

(also: fixed some #ifdef USE_INTRINSICS problems)

Change-Id: I426dea299cea0c64eb21c4d81a04a960e0c263c7
2014-04-14 16:47:20 +02:00
skal
7ca2e74bb4 replace some mult-long (vmull_u8) with mult-long-accumulate (vmlal_u8)
saves few instructions

Change-Id: If8f464bb2894a209bba94825a4db9267df126d47
2014-04-14 15:14:45 +02:00
skal
41c6efbdc5 fix lossless_neon.c
* some extra {xx , 0 } in initializers
* replaced by vget_lane_u32() where appropriate

Change-Id: Iabcd8ec34d7c853920491fb147a10d4472280a36
2014-04-14 14:27:11 +02:00
skal
8ff96a027a NEON intrinsics version of FTransform
as little bit slower than inlined asm it seems.
So disabled for now.

Change-Id: I8c942846f9bedaed57275675ea9dbbcb8dfd9ccd
2014-04-14 09:58:35 +02:00
Jovan Zelincevic
0214f4a908 Merge "MIPS: MIPS32r1: Added optimizations for FastLog2" 2014-04-10 08:54:12 -07:00
Jovan Zelincevic
baabf1ea3a MIPS: MIPS32r1: Added optimizations for FastLog2
Functions VP8LFastLog2Slow and VP8LFastSLog2Slow

also: replaced some "% y" by "& (y-1)" in the C-version
(since y is a power-of-two)

Change-Id: I875170384e3c333812ca42d6ce7278aecabd60f0
2014-04-10 08:32:51 -07:00
skal
3d49871dbe NEON functions for lossless coding
Verified OK, but right now they don't seem faster.
So they are disabled behind a USE_INTRINSICS flag (off for now)

Change-Id: I72a1c4fa3798f98c1e034f7ca781914c36d3392c
2014-04-10 15:32:08 +02:00
Slobodan Prijic
3fe0291530 MIPS: MIPS32r1: Added optimizations for SSE functions.
Change-Id: I1287fa65064192cc2edc5c4be2b1974be665b9b4
2014-04-09 11:02:13 +02:00
skal
c503b485b6 Merge "fix the gcc-4.6.0 bug by implementing alternative method" 2014-04-08 23:25:59 -07:00
skal
abe6f48709 fix the gcc-4.6.0 bug by implementing alternative method
previous functions are a bit faster with gcc-4.8, so we keep them
for now.

Change-Id: I4081e5af66fbf606295d8a83875c1b889729b4dc
2014-04-09 07:53:55 +02:00
James Zern
5598bdecd8 enc_mips32.c: fix file mode
Change-Id: I5a43320e2ea2eebc88c65398acb9ea59b63af1fd
2014-04-08 15:12:54 -07:00
Slobodan Prijic
2b1b4d5ae9 MIPS: MIPS32r1: Add optimization for GetResidualCost
+ reorganize the cost-evaluation code by moving some functions
to cost.h/cost.c and exposing VP8Residual

Change-Id: Id976299b5d4484e65da8bed31b3d2eb9cb4c1f7d
2014-04-08 15:28:49 +02:00
pascal massimino
f0a1f3cd51 Merge "MIPS: MIPS32r1: Added optimization for FTransform" 2014-04-08 04:17:27 -07:00
Djordje Pesut
7231f610aa MIPS: MIPS32r1: Added optimization for FTransform
Change-Id: I9384dac483e8f98bcfdd277a0a3d6ec7c7a7b297
2014-04-08 04:16:44 -07:00
skal
869eaf6c60 ~30% encoding speedup: use NEON for QuantizeBlock()
also revamped the signature to avoid having to pass the 'first' parameter

Change-Id: Ief9af1747dcfb5db0700b595d0073cebd57542a5
2014-04-08 03:08:22 -07:00
James Zern
f758af6b73 enc_neon: convert FTransformWHT to intrinsics
slightly faster than the inline asm
in practice not much faster than the C-code in a full NEON build, but
still better overall in an Android-like one that only enables NEON for
certain files.

Change-Id: I69534016186064fd92476d5eabc0f53462d53146
2014-04-08 00:20:19 -07:00
Djordje Pesut
7dad095bb4 MIPS: MIPS32r1: Added optimization for Disto4x4 (TTransform)
Change-Id: Ieb20c5c52b964247cfe46f45f9a7415725bf7c02
2014-04-07 15:04:23 +02:00
Jovan Zelincevic
2298d5f301 MIPS: MIPS32r1: Added optimization for QuantizeBlock
Change-Id: I6047ab107e4d474e35b5af1dac391d5b3d8c049b
2014-04-07 09:22:35 +02:00
Djordje Pesut
e88150c9b6 Merge "MIPS: MIPS32r1: Add optimization for ITransform" 2014-04-05 10:36:05 -07:00
James Zern
de693f2502 lossless_neon: disable VP8LConvert* functions
due to breakage with NDK/gcc-4.6 builds

Change-Id: Id96258e710ee33e08a023354b3227f27da986620
2014-04-04 20:38:29 -07:00
skal
4143332b22 NEON intrinsics for encoding
* inverse transform is actually slower with intrinsics + gcc-4.6,
  so is left disabled for now.
  With gcc-4.8, it's a bit faster than inlined assembly.

* Sum of Square error function provide a 2-3% speed up
  There's enabled by default (since there's no inlined-asm equivalent)

Change-Id: I361b3f0497bc935da4cf5b35e330e379e71f498a
2014-04-04 15:02:56 -07:00
Djordje Pesut
0ca2914b23 MIPS: MIPS32r1: Add optimization for ITransform
Change-Id: Ie4c8b9bc3a7826bd443cdebf05386786fafe8c56
2014-04-04 10:50:35 +02:00
James Zern
71bca5ecf3 dec_neon: use vst_lane instead of vget_lane
results in fewer instructions, small speed improvement

Change-Id: I98de632d09ff09f295368c0d744cb4397b585084
2014-04-03 14:56:26 -07:00
skal
bf06105293 Intrinsics NEON version of TransformOne
+ misc cosmetics

* seems 4% slower than inlined-asm with gcc-4.6
* is a tad faster (<1%) with gcc-4.8
(disabled for now)

Change-Id: Iea6cd00053a2e9c1b1ccfdad1378be26584f1095
2014-04-03 14:41:56 -07:00
pascal massimino
19c6f1ba74 Merge "dec_neon: use vld?_lane instead of vset?_lane" 2014-04-03 01:16:29 -07:00
James Zern
7a94c0cf75 upsampling_neon: drop NEON suffix from local functions
Change-Id: I6583ad74aacf78dcbeb5a0ff0218a39bc3460e5a
2014-04-02 23:24:39 -07:00
James Zern
d14669c83c upsampling_sse2: drop SSE2 suffix from local functions
Change-Id: I2349c1a8e5e15e1d204642096f84f3202721c297
2014-04-02 23:24:39 -07:00
James Zern
2ca42a4fb7 enc_sse2: drop SSE2 suffix from local functions
Change-Id: I5d61605a9d410761d50b689b046114f0ab3ba24e
2014-04-02 23:24:36 -07:00
James Zern
d038e6193b dec_sse2: drop SSE2 suffix from local functions
Change-Id: Ie171778b84038d5b04c5dc6972f6015caf555882
2014-04-02 23:10:39 -07:00
James Zern
fa52d7525f dec_neon: use vld?_lane instead of vset?_lane
results in fewer instructions, small speed improvement

Change-Id: I61ab48d09a5ce7c5158eac8244d28287457edc7a
2014-04-02 23:03:18 -07:00
Pascal Massimino
c520e77d94 cosmetic: fix long line
Change-Id: Id04b368aea5784a98c705f323b32d35b362742ea
2014-04-02 23:00:50 -07:00
James Zern
4b0f2dae6f Merge "add intrinsics NEON code for chroma strong-filtering" 2014-04-02 22:57:44 -07:00
skal
e351ec0759 add intrinsics NEON code for chroma strong-filtering
The nice trick is to pack 8 u + 8 v samples into a single uint8x16x_t
register, and re-use the previous (luma) functions

Change-Id: Idf50ed2d6b7137ea080d603062bc9e0c66d79f38
2014-04-03 06:58:21 +02:00
pascal massimino
aaf734b8b0 Merge "Add SSE2 version of forward cross-color transform" 2014-04-02 14:18:59 -07:00
Urvang Joshi
c90a902eff Add SSE2 version of forward cross-color transform
Encoding speed is roughly the same.

Change-Id: I6b976d0eb24e1847714e719762cb8403768da66c
2014-04-02 12:21:20 -07:00
Vikas Arora
bc374ff39e Use histogram_bits to initalize transform_bits.
This change gains back 1% in compression density for method=3 and 0.5% for
method=4, at the expense of 10% slower compression speed.

Change-Id: I491aa1c726def934161d4a4377e009737fbeff82
2014-04-02 11:46:40 -07:00
James Zern
2132992d47 Merge "Add strong filtering intrinsics (inner and outer edges)" 2014-04-02 00:10:01 -07:00
skal
5fbff3a646 Add strong filtering intrinsics (inner and outer edges)
+ added some work-around gcc-4.6 to make it compile (except one function).
+ lots of revamping

All variants tested ok.
Speed-up is ~5-7%

Change-Id: I5ceda2ee5debfada090907fe3696889eb66269c3
2014-04-02 08:28:55 +02:00
Urvang Joshi
d4813f0cb2 Add SSE2 function for Inverse Cross-color Transform
Lossless decoding is now ~3% faster.

Change-Id: Idafb5c73e5cfb272cc3661d841f79971f9da0743
2014-04-01 15:52:25 -07:00
James Zern
26029568b7 dec_neon: add strong loopfilter intrinsics
vertical only currently, 2.5-3% faster
placed under USE_INTRINSICS as this change depends on the simple
loopfilter
improves the simple loopfilter slightly thanks to some reorganization

Change-Id: I6611441fa54228549b21ea74c013cb78d53c7155
2014-04-01 01:13:50 -07:00
James Zern
cca7d7ef0f Merge "add intrinsics version of SimpleHFilter16NEON()" 2014-04-01 00:57:11 -07:00
James Zern
1a05dfa7f5 windows: fix dll builds
WebPSafe* need to be marked external to allow mux/demux to access them
through libwebp.dll

Change-Id: Ib6620e00d376f7aa5a0550e1e244f759977f97a0
2014-03-31 17:46:12 -07:00
skal
d6c50d8ac2 Merge "add some colorspace conversion functions in NEON" 2014-03-31 13:15:18 -07:00
Urvang Joshi
4fd7c82e6a SSE2 variants of Subtract-Green: Rectify loop condition
When 4 pixels are left, they should be processed with SSE2.

Decoding is marginally faster (~0.4%).
Encoding speed: No observable difference.

Change-Id: I3cf21c07145a560ff795451e65e64faf148d5c3e
2014-03-31 10:51:45 -07:00
skal
97e5fac389 add some colorspace conversion functions in NEON
new file: lossless_neon.c
speedup is ~5%

gcc 4.6.3 seems to be doing some sub-optimal things here,
storing register on stack using 'vstmia' and such.
Looks similar to gcc.gnu.org/bugzilla/show_bug.cgi?id=51509

I've tried adding  -fno-split-wide-types and it does help
the generated assembly. But the overall speed gets worse with
this flag. We should only compile lossless_neon.c with it -> urk.

Change-Id: I2ccc0929f5ef9dfb0105960e65c0b79b5f18d3b0
2014-03-31 17:47:46 +02:00
skal
b9a7a45f1f add intrinsics version of SimpleHFilter16NEON()
It's disable for now, because it crashes gcc-4.6.3 during compilation
with -O2 or -O3. It's been tested OK with -O1.

Code is still globally disabled with USE_INTRINSICS, though.

Change-Id: I3ca6cf83f3b9545ad8909556f700758b3cefa61c
2014-03-31 16:31:31 +02:00
Pascal Massimino
daccbf400d add light filtering NEON intrinsics
disabled for now (but tested OK), thanks to the USE_INTRINSICS #define
We'll activate the code when we're on par with non-intrinsics

Change-Id: Idbfb9cb01f4c7c9f5131b270f8c11b70d0d485ff
2014-03-30 22:15:55 -07:00
Pascal Massimino
af44460880 fix typo in STORE_WHT
was working ok because dst == out

Change-Id: I27095129a11f468422250dd2b8fad8b3bd4e5bbd
2014-03-28 10:34:44 -07:00
Vikas Arora
6af6b8e1b6 Tune HistogramCombineBin for large images.
Tune HistogramCombineBin for hard images that are larger than 1-2 Mega
pixel and represent photographic images.

This speeds up lossless encoding on 1000 image corpus by 10-12% and compression
penalty of 0.1-0.2%.

Change-Id: Ifd03b75c503b9e886098e5fe6f86be0391ca8e81
2014-03-28 07:09:59 -07:00
skal
af93bdd6bc use WebPSafe[CM]alloc/WebPSafeFree instead of [cm]alloc/free
there's still some malloc/free in the external example
This is an encoder API change because of the introduction
of WebPMemoryWriterClear() for symmetry reasons.

The MemoryWriter object should probably go in examples/ instead
of being in the main lib, though.
mux_types.h stil contain some inlined free()/malloc() that are
harder to remove (we need to put them in the libwebputils lib
and make sure link is ok). Left as a TODO for now.

Also: WebPDecodeRGB*() function are still returning a pointer
that needs to be free()'d. We should call WebPSafeFree() on
these, but it means exposing the whole mechanism. TODO(later).

Change-Id: Iad2c9060f7fa6040e3ba489c8b07f4caadfab77b
2014-03-27 15:50:59 -07:00
James Zern
51f406a5d7 lossless_sse2: relocate VP8LDspInitSSE2 proto
this is in line with the other dsp files and silences a build warning.

Change-Id: I03ee3032c11d4c731cc10bfa0a2dcb6866756ba2
2014-03-27 15:07:43 -07:00
skal
0f4f721b12 separate SSE2 lossless functions into its own file
expose the predictor array as function pointers instead
of each individual sub-function

+ merged Average2() into ClampedAddSubtractHalf directly
+ unified the signature as "VP8LProcessBlueAndRedFunc"

no speed diff observed

Change-Id: Ic3c45dff11884a8330a9ad38c2c8e82491c6e044
2014-03-27 21:43:55 +01:00
skal
514fc251df VP8LConvertFromBGRA: use conversion function pointers
Change-Id: I863b97119d7487e4eef337e5df69e1ae2a911d4c
2014-03-27 09:00:35 +01:00
James Zern
6d2f35273d dsp/dec: TransformDCUV: use VP8TransformDC
rather than forcing the C version; this is similar to TransformUV

Change-Id: I2778194f05fca33e9b2b71323e92947c0b395e9a
2014-03-26 16:43:47 -07:00
skal
defc8e1b01 Merge "fix out-of-bound read during alpha-plane decoding" 2014-03-26 15:22:42 -07:00
James Zern
fbed36433d Merge "dsp: reuse wht transform from dec in encoder" 2014-03-26 15:13:07 -07:00
skal
d846708400 Merge "Add SSE2 version of ARGB -> BGR/RGB/... conversion functions" 2014-03-26 15:01:46 -07:00
skal
207d03b484 fix out-of-bound read during alpha-plane decoding
With -bypass_filter switched on, the lossless-compressed
data is decoded ahead of time (before being transformed and
display). Hence, the last row was called twice.

http://code.google.com/p/webp/issues/detail?id=193

Change-Id: I9e13f495f6bd6f75fa84c4a21911f14c402d4b10
2014-03-26 22:45:03 +01:00
skal
d1b33ad58b 2-5% faster trellis with clang/MacOS
(and ~2-3% on ARM)

We don't need to store cost/score for each node, but only for
the current and previous one -> simplify code and save some memory.

Also made the 'Node' structure tighter.

Change-Id: Ie3ad7d3b678992b396242f56e2ac387fe43852e6
2014-03-26 22:33:01 +01:00
skal
369c26dd3f Add SSE2 version of ARGB -> BGR/RGB/... conversion functions
~4-6% faster lossless decoding

Change-Id: I3ed1131ff2b2a0217da315fac143cd0d58293361
2014-03-26 22:19:00 +01:00
James Zern
df230f2723 dsp: reuse wht transform from dec in encoder
Change-Id: Ide663db9eaecb7a37fe0e6ad4cd5f37de190c717
2014-03-22 13:25:08 -07:00
Pascal Massimino
59daf08362 Merge "cosmetics:" 2014-03-18 04:02:33 -07:00
Pascal Massimino
536220084c cosmetics:
- use VP8ScanUV, separate from VP8Scan[] (for luma)
 - fix indentation
 - few missing consts
 - change TrellisQuantizeBlock() signature

Change-Id: I94b437d791cbf887015772b5923feb83dd145530
2014-03-18 03:34:56 -07:00
James Zern
3e7f34a3fb AssignSegments: quiet array-bounds warning
nb (enc->segment_hdr_.num_segments_) will be in the range
[1, NUM_MB_SEGMENTS].

Change-Id: I5c2bd0bb82b17c99aff39c98b6b1747fc040dc16
2014-03-14 18:47:52 -07:00
James Zern
3c2ebf58a4 Merge "UpdateHistogramCost: avoid implicit double->float" 2014-03-14 15:50:57 -07:00
James Zern
cf821c821f UpdateHistogramCost: avoid implicit double->float
all the functions involved return double and later these locals are used
in double calculations. fixes a vs build warning

Change-Id: Idb547104ef00b48c71c124a774ef6f2ec5f30f14
2014-03-14 11:18:52 -07:00
Vikas Arora
312e638f30 Extend the search space for GetBestGreenRedToBlue
Get back some of the compression gains by extending the search space for
GetBestGreenRedToBlue. Also removed the SkipRepeatedPixels call, as it was not
helping much in yielding better compression density.

Before:
 1000 files, 63530337 pixels, 1 loops => 45.0s (45.0 ms/file/iterations)
 Compression (output/input): 2.463/3.268 bpp, Encode rate (raw data): 1.347 MP/s

After:
1000 files, 63530337 pixels, 1 loops => 45.9s (45.9 ms/file/iterations)
 Compression (output/input): 2.461/3.268 bpp, Encode rate (raw data): 1.321 MP/s

Change-Id: I044ba9d3f5bec088305e94a7c40c053ca237fd9d
2014-03-14 09:56:00 -07:00
Vikas Arora
1c58526fe1 Fix few nits
Add/remove few casts, fixed indentation.

Change-Id: Icd141694201843c04e476f09142ce4be6e502dff
2014-03-13 13:57:39 -07:00
Vikas Arora
fef22704ec Optimize and re-structure VP8LGetHistoImageSymbols
Optimize and re-structured VP8LGetHistoImageSymbols method, by using the bin-hash
for merging the Histograms more efficiently, instead of the randomized
heuristic of existing method HistogramCombine.

This change speeds up the Lossless encoding by 40-50% (for method=4 and Q > 50)
with 0.8% penalty in compression density. For lower method, the speed up is 25-30%,
with 0.4% penalty in the compression density.

Change-Id: If61adadb1a041b95def6405aa1fe3b83c3cb25ce
2014-03-13 11:48:37 -07:00
Vikas Arora
068b14ac57 Optimize lossless decoding.
Restructure PredictorInverseTransform & ColorSpaceInverseTransform to remove
one if condition inside the main/critial loop. Also separated TransformColor &
TransformColorInverse into separate functions and avoid one 'if condition'
inside this critical method.

This change speeds up lossless decoding for Lenna image about 5% and 1000 image
corpus by 3-4%.

Change-Id: I4bd390ffa4d3bcf70ca37ef2ff2e81bedbba197d
2014-03-13 11:27:12 -07:00
Vikas Arora
5f0cfa80ff Do a binary search to get the optimum cache bits.
This speeds up the lossless encoder by a bit (1-2%), without impacting the
compression density.

Change-Id: Ied6fb38fab58eef9ded078697e0463fe7c560b26
2014-03-13 10:30:32 -07:00
skal
65b99f1c92 add a -z option to cwebp, and WebPConfigLosslessPreset() function
These are presets for lossless coding, similar to zlib.
The shortcut for lossless coding is now, e.g.:
   cwebp -z 5 in.png -o out_lossless.webp

There are 10 possible values for -z parameter:
   0 (fastest, lowest compression)
to 9 (slowest, best compression)

A reasonable tradeoff is -z 6, e.g.
-z 9 can be quite slow, so use with care.

This -z option is just a shortcut for some pre-defined
'-lossless -m xx -q yy' combinations.

Change-Id: I6ae716456456aea065469c916c2d5ca4d6c6cf04
2014-03-11 23:25:35 +01:00
skal
30176619c6 4-5% faster trellis by removing some unneeded calculations.
(We didn't need the exact value of the max_error properly.
We can work with relative values instead of absolute)

Output is bitwise the same as before.

Change-Id: I67aeaaea5f81bfd9ca8e1158387a5083a2b6c649
2014-03-06 15:57:25 +01:00
James Zern
687a58ecc3 histogram.c: reindent after b33e8a0
b33e8a0 Refactor code for HistogramCombine.

Change-Id: Ia1b4b545c5f4e29cc897339df2b58f18f83c15b3
2014-03-04 00:38:14 -08:00
skal
06d456f685 Merge "~3-4% faster lossless encoding" 2014-03-04 00:17:52 -08:00
skal
c60de26099 ~3-4% faster lossless encoding
by re-arranging some code from SkipRepeatedPixel()

Change-Id: I6c1fd7cd9af22cd9be4234217ff67d7b94f44137
2014-03-04 08:12:59 +01:00
James Zern
42eb06fc0e Merge "few cosmetics after patch #69079" 2014-03-03 15:13:25 -08:00
skal
82af82644b few cosmetics after patch #69079
Change-Id: Ifa758420421b5a05825a593f6b43504887603ee7
2014-03-03 23:53:08 +01:00
Vikas Arora
b33e8a05ee Refactor code for HistogramCombine.
Refactor code for HistogramCombine and optimize the code by calculating
the combined entropy and avoid un-necessary Histogram merges.

This speeds up lossless encoding by 1-2% and almost no impact on compression
density.

Change-Id: Iedfcf4c1f3e88077bc77fc7b8c780c4cd5d6362b
2014-03-03 13:50:42 -08:00
skal
ca1bfff53f Merge "5-10% encoding speedup with faster trellis (-m 6)" 2014-03-03 13:09:17 -08:00
skal
5aeeb087d6 5-10% encoding speedup with faster trellis (-m 6)
mostly by:
- storing a single rd-score instead of cost / distortion separately
- evaluating terminal cost only once
- getting some invariants out of the loops
- more consts behind fewer variables

Change-Id: I79451f3fd1143d6537200fb8b90d0ba252809f8c
2014-03-03 22:07:06 +01:00
James Zern
82ae1bf299 cosmetics: normalize VP8GetCPUInfo checks
- use '!= NULL'
+ dec_neon/STORE_WHT: align '\'s

Change-Id: I0f0ce49bd9c58e771bafb24c51c070d5ebd77e53
2014-02-28 18:47:41 -08:00
James Zern
e3dd9243cb Merge "Refactor GetBestPredictorForTile for future tuning." 2014-02-28 18:39:27 -08:00
Vikas Arora
206cc1be5a Refactor GetBestPredictorForTile for future tuning.
This change doesn't impact compression gain or compression speed.

Change-Id: Ia87d8a46c6f1ce0f8974178d75a6b0ba0a6e3696
2014-02-28 11:30:23 -08:00
James Zern
3cb8406262 Merge "speed-up trellis quant (~5-10% overall speed-up)" 2014-02-27 14:34:01 -08:00
Pascal Massimino
b66f2227c1 Merge "lossy encoding: ~3% speed-up" 2014-02-27 11:42:16 -08:00
Pascal Massimino
4287d0d49b speed-up trellis quant (~5-10% overall speed-up)
store costs[] in node instead of context

Change-Id: I6aeb0fd94af9e48580106c41408900fe3467cc54
also: various cosmetics
2014-02-27 00:06:00 -08:00
Pascal Massimino
390c8b316d lossy encoding: ~3% speed-up
incorporate non-last cost in per-level cost table

also: correct trellis-quant cost evaluation at nodes
(output a little bit different now). Method 6 is ~4% faster.

Change-Id: Ic48bd6d33f9193838216e7dc3a9f9c5508a1fbe8
2014-02-26 05:52:24 -08:00
James Zern
9a463c4a51 Merge "dec_neon: convert TransformWHT to intrinsics" 2014-02-25 14:36:44 -08:00
pascal massimino
e8605e9625 Merge "dec_neon: add ConvertU8ToS16" 2014-02-25 08:56:17 -08:00
Djordje Pesut
4aa3e4122b MIPS: MIPS32r1: rescaler bugfix
Change-Id: I6de6e2488bd5bd58c1f705739e4467feb211f8b4
2014-02-25 14:36:48 +01:00
Vikas Arora
c16cd99aba Speed up lossless encoder.
Speedup lossless encoder by 20-25% by optimizing:
- GetBestColorTransformForTile: Use techniques like binary search and
  local minima search to reduce the search space.
- VP8LFastSLog2Slow & VP8LFastLog2Slow: Adding the correction factor for
  log(1 + x) and increase the threshold for calling the approximate
  version of log_2 (compared to costly call to log()).

Change-Id: Ia2444c914521ac298492aafa458e617028fc2f9d
2014-02-21 22:13:50 -08:00
James Zern
9d6b5ff1e6 dec_neon: convert TransformWHT to intrinsics
Change-Id: I34dc1d75ddebab131cfed031764117e3f7b75c6b
2014-02-21 11:23:46 -08:00
James Zern
2ff0aae2fe dec_neon: add ConvertU8ToS16
Change-Id: Ifc4fb8e7f862e72154d2f2739811b1022d2b9416
2014-02-20 15:35:33 -08:00
skal
77a8f91981 fix compilation with USE_YUVj flag
(not that we'll ever need it, but...)

Change-Id: I9af993c62372097846c5ca6bae8362b59c3502dc
2014-02-20 13:23:18 +01:00
James Zern
4acbec1bef Merge changes I3b240ffb,Ia9370283,Ia2d28728
* changes:
  dec_neon: TransformAC3: work on packed vectors
  dec_neon: add SaturateAndStore4x4
  dec_neon.c: convert TransformDC to intrinsics
2014-02-19 14:47:33 -08:00
James Zern
2719bb7e98 dec_neon: TransformAC3: work on packed vectors
pack 2 rows in 1 vector similar to TransformDC

Change-Id: I3b240ffb4f51a632b5c8c2daf54d938333ed4b0d
2014-02-18 19:47:20 -08:00
James Zern
b7b60ca16c dec_neon: add SaturateAndStore4x4
converts 2 s16 vectors to 2 u8 and store to uint8_t destination;
TransformAC3 can reuse this after a rework

Change-Id: Ia9370283ee3d9bfbc8c008fa883412100ff483d0
2014-02-18 19:42:35 -08:00
Pascal Massimino
b7685d73fe Rescale: let ImportRow / ExportRow be pointer-to-function
Separate the C version from the MIPS32 version and have run-time
initialization during RescalerInit()

Change-Id: I93cfa5691c073a099fe62eda1333ad2bb749915b
2014-02-17 00:58:17 -08:00
James Zern
e02f16ef45 dec_neon.c: convert TransformDC to intrinsics
no noticeable difference in performance

Change-Id: Ia2d287289c3865ddd0fc99edaf7a030778aa7025
2014-02-14 12:11:58 -08:00
skal
9cba963f9a add missing file
Change-Id: I17eab2fedc64ee3bba941a592ecef765fcd2b402
2014-02-13 21:56:19 -08:00
skal
8992ddb756 use static clipping tables
(shared with mips32)
removed abs1[] table along the way

sub-1% speed-up, but still...

Change-Id: I8c29a8a0285076cb3423b01ffae9fcc465da6a81
2014-02-13 19:32:59 -08:00
skal
0235d5e44b 1-2% faster quantization in SSE2
C-version is a bit faster too (sub-1% faster on ARM)

Change-Id: I077262042f1d0937aba1ecf15174f2c51bf6cd97
2014-02-13 15:55:30 -08:00
Pascal Massimino
b2fbc36c26 fix VC12-x64 warning
"conversion from 'vp8l_atype_t' to 'uint8_t', possible loss of data"

Change-Id: I7607a688d16aca8fae8ce472450f8423c48f3a26
2014-02-12 12:19:32 -08:00
James Zern
6e37cb942f Merge "cosmetics: backward_references.c: reindent after a7d2ee3" 2014-02-11 16:28:55 -08:00
James Zern
a42ea9742a cosmetics: backward_references.c: reindent after a7d2ee3
a7d2ee3 Optimize cache estimate logic.

Change-Id: I81dd1eea49f603465dc5f3afae8a101e5205e963
2014-02-11 15:52:22 -08:00
skal
6c32744214 Merge "fix missing __BIG_ENDIAN__ definition on some platform" 2014-02-11 14:51:11 -08:00
skal
a8b6aad155 fix missing __BIG_ENDIAN__ definition on some platform
e.g: mips-gcc doesn't define __BIG_ENDIAN__

Change-Id: Ic06bf453164ddddc69a523e7845a4993e14a1af2
2014-02-11 14:43:44 -08:00
Vikas Arora
fde2904b8a Increase initial buffer size for VP8L Bit Writer.
Increase the initial buffer size for VP8L Bit Writer from 4bpp to 8bpp.
The resize buffer is expensive (requires realloc and copy) and this additional
memory (0.5 * W * H) doesn't add much overhead on the lossless encoder.

Change-Id: Ic1fe55cd7bc3d1afadc799e4c2c8786ec848ee66
2014-02-11 11:13:21 -08:00
Vikas Arora
a7d2ee39be Optimize cache estimate logic.
Optimize 'VP8LCalculateEstimateForCacheSize' for lower quality ranges (Q < 50).
The entropy is generally lower for higher cache_bits, so start searching from
higher cache_bits and settle for a local minima, instead of evaluating all
values.

This speeds up the lossless encoding at lower qualities by 10-15%.

Change-Id: I33c1e958515a2549f2e6f64b1aab3f128660dcec
2014-02-11 10:59:01 -08:00
pascal massimino
7fb6095b03 Merge "dec_neon.c: add TransformAC3" 2014-02-11 09:17:34 -08:00
skal
bf182e837e VP8LBitWriter: use a bit-accumulator
* simplify the endian logic
* remove the need for memset()
* write 16 or 32 at a time (likely aligned)

Makes the code a bit faster on ARM (~1%)

Change-Id: I650bc5654e8d0b0454318b7a78206b301c5f6c2c
2014-02-11 09:12:45 -08:00
Djordje Pesut
3f40b4a581 Merge "MIPS: MIPS32r1: clang macro warning resolved" 2014-02-11 00:06:36 -08:00
Urvang Joshi
1684f4ee37 WebP Decoder: Mark some truncated bitstreams as invalid
Specifically, check for truncated RIFF and/or VP8(L) chunks.

For more context, see:
https://code.google.com/p/webp/issues/detail?id=185

Change-Id: I91ca2dbf05080660fbc513244fc53adc57fc04b5
2014-02-10 16:35:27 -08:00
Jovan Zelincevic
acbedac475 MIPS: MIPS32r1: clang macro warning resolved
.set macro - Enables the expansion of macro instructions.

Change-Id: I1e44fe056798aeff803cc97171724d21da1fc2bf
2014-02-10 06:50:11 -08:00
James Zern
228e4877ab dec_neon.c: add TransformAC3
based on SSE2 version

Change-Id: Icc6782955253c98e83d5984153b596ef5f1c0d34
2014-02-08 12:47:54 -08:00
skal
32aeaf115a revamp VP8LColorSpaceTransform() a bit
-> remove the 'color_transform' multiplier, use more constants, etc.

This function is particularly critical, mostly because of
GetBestColorTransformForTile().
Loop is a bit faster (maybe ~1%)

Change-Id: I90c96a3437cafb184773acef55c77e40c224388f
2014-02-05 10:37:06 +01:00
James Zern
0c7cc4ca20 Merge "Don't dereference NULL, ensure HashChain fully initialized" 2014-02-03 22:58:37 -08:00