Commit Graph

2358 Commits

Author SHA1 Message Date
Vikas Arora
6d6865f0db Added SSE2 variants for Average2/3/4
The predictors based on Average2 are tad slower.

Following is the performance data for these predictors normalized to
number of instruction cycles (as per valgrind) per operation:
- Predictor6 & Predictor7 now takes 15 instruction cycles compared to 11
  instruction cycles for the C version.
- Predictor8 & Predictor9 now takes 15 instruction cycles compared to 12
  instruction cycles for the C version.

The predictors based on Average4 is faster and Average3 is tad slower:
- Predictor10 (Average4) now takes 23 instruction cycles compared to 25
  instruction cycles for the C version.
- Predictor5 (Average3) now takes 20 instruction cycles compared to 18
  instruction cycles for the C version.

Maybe SSE2 version of Average2 can be improved further. Otherwise, we can
remove the SSE2 version and always fallback to the C version.

Change-Id: I388b2871919985bc28faaad37c1d4beeb20ba029
2014-04-28 14:47:30 -07:00
Pascal Massimino
b3a616b356 make HistogramAdd() a pointer in dsp
* merged the two HistogramAdd/AddEval() into a single call
  (with detection of special case when b==out)
* added a SSE2 variant
* harmonize the histogram type to 'uint32_t' instead
  of just 'int'. This has a lot of ripples on signatures.
* 1-2% faster

Change-Id: I10299ff300f36cdbca5a560df1ae4d4df149d306
2014-04-28 10:09:34 -07:00
James Zern
c8bbb636ea dec_neon: relocate some inline-asm defines
move simple loop filter defines closer to their use and LOAD* to a
location common with the intrinsics

Change-Id: Iaec506d27bbc9a01be20936e30b68a4b0e690ee3
2014-04-28 00:41:42 -07:00
James Zern
4e393bb9f1 dec_neon: enable intrinsics-only functions
the complex loop filter has no inline equivalent; the simple loop filter
remains conditional on USE_INTRINSICS: it's left undefined for now.

Change-Id: I4f258e10458df53a7a1819707c8f46b450e9d9d2
2014-04-28 00:39:46 -07:00
James Zern
ba99a922ab dec_neon: use positive tests for USE_INTRINSICS
makes Simple* layout consistent with the rest of the file

Change-Id: Ib3108b0f2c694c634210e22027c253ea6236a9c6
2014-04-28 00:38:47 -07:00
pascal massimino
69058ff813 Merge "example_util: add ExUtilDecodeWebPIncremental" 2014-04-28 00:38:24 -07:00
James Zern
a7828e8bdb dec_neon: make WORK_AROUND_GCC conditional on version
Change-Id: Ic1b95f8749988de90df7c1ff6c537a21981329db
2014-04-28 00:01:19 -07:00
pascal massimino
3f3d717a6c Merge "enc_neon: enable intrinsics-only functions" 2014-04-27 02:05:53 -07:00
pascal massimino
de3cb6c820 Merge "move LOCAL_GCC_VERSION def to dsp.h" 2014-04-27 02:04:08 -07:00
James Zern
1b2fe14de5 example_util: add ExUtilDecodeWebPIncremental
simplifies ExUtilDecodeWebP() call

Change-Id: I33b42ec1c398b9fc0a895ad9deb89ab01a33b3da
2014-04-27 01:58:25 -07:00
pascal massimino
ca49e7ad97 Merge "enc_neon: move Transpose4x4 to dsp/neon.h" 2014-04-27 01:11:05 -07:00
Pascal Massimino
ad900abddd Merge "fix warning about size_t -> int conversion" 2014-04-27 01:07:03 -07:00
Pascal Massimino
4825b4360d fix warning about size_t -> int conversion
+ re-order and add some const

Change-Id: I3746520b75699e56e20835d10d1dd9cd9fd6d85d
2014-04-27 00:50:07 -07:00
James Zern
42b35e086b enc_neon: enable intrinsics-only functions
CollectHistogram / SSE* / QuantizeBlock have no inline equivalents,
enable them where possible and use USE_INTRINSICS to control borderline
cases: it's left undefined for now.

Change-Id: I62235bc4ddb8aa0769d1ce18a90e0d7da1e18155
2014-04-26 19:09:04 -07:00
James Zern
f937e01261 move LOCAL_GCC_VERSION def to dsp.h
+ add LOCAL_GCC_PREREQ and use it in lossless_neon.c

Change-Id: Ic9fd99540bc3e19c027d1598e4530dfdc9b9de00
2014-04-26 19:09:04 -07:00
James Zern
5e1a17ef4b enc_neon: move Transpose4x4 to dsp/neon.h
+ reuse it in TransformWHT()

Change-Id: Idfbd0f9b58d6253ac3d65ba55b58989c427ee989
2014-04-26 14:06:04 -07:00
James Zern
c7b92a5a29 dec_neon: (WORK_AROUND_GCC) delete unused Load4x8
using this in Load4x16 was slightly slower and didn't help mitigate any
of the remaining build issues with 4.6.x.

Change-Id: Idabfe1b528842a514d14a85f4cefeb90abe08e51
2014-04-26 12:36:14 -07:00
James Zern
8e5f90b086 Merge "make ExUtilLoadWebP() accept NULL bitstream param." 2014-04-26 12:22:53 -07:00
pascal massimino
05d4c1b7b3 Merge "cwebp: add webpdec" 2014-04-26 02:25:02 -07:00
James Zern
ddeb6ac802 cwebp: add webpdec
simple webp decoding, no metadata support

Change-Id: I2752fd1442c87513922878cbf72f001d45b631bc
2014-04-26 02:20:41 -07:00
pascal massimino
35d7d095dd Merge "Reduce memory footprint for encoding WebP lossless." 2014-04-26 01:28:43 -07:00
Vikas Arora
0b896101b4 Reduce memory footprint for encoding WebP lossless.
Reduce calls to Malloc (WebPSafeMalloc/WebPSafeCalloc) for:
- Building HashChain data-structure used in creating the backward references.
- Creating Backward references for LZ77 or RLE coding.
- Creating Huffman tree for encoding the image.
For the above mentioned code-paths, allocate memory once and re-use it
subsequently.

Reduce the foorprint of VP8LHistogram struct by changing the Struct
field 'literal_' from an array of constant size to dynamically allocated
buffer based on the input parameter cache_bits.

Initialize BitWriter buffer corresponding to 16bpp (2*W*H).
There are some hard-files that are compressed at 12 bpp or more. The
realloc is costly and can be avoided for most of the WebP lossless
images by allocating some extra memory at the encoder initializaiton.

Change-Id: I1ea8cf60df727b8eb41547901f376c9a585e6095
2014-04-26 01:14:33 -07:00
Pascal Massimino
f0b65c9a1e make ExUtilLoadWebP() accept NULL bitstream param.
Change-Id: Ifce379873ca39e46d011a1cb829beab89de5f15d
2014-04-26 01:11:44 -07:00
pascal massimino
9c0a60ccb3 Merge "dwebp: move webp decoding to example_util" 2014-04-26 01:05:48 -07:00
Djordje Pesut
1d62acf6af MIPS: MIPS32r1: Added optimization for HuffmanCost functions.
HuffmanCost and HuffmanCostCombined optimized and added
'const' to some variables from ExtraCost functions.

Change-Id: I28b2b357a06766bee78bdab294b5fc8c05ac120d
2014-04-24 11:14:57 +02:00
James Zern
4a0e73904d dwebp: move webp decoding to example_util
this will allow reuse by cwebp

Change-Id: I667252fdacfc5436112d21b040ca299273ec1515
2014-04-22 20:31:41 -07:00
James Zern
c0220460e9 Merge "Bugfix: Incremental decode of lossy-alpha" 2014-04-22 16:33:12 -07:00
Urvang Joshi
8c7cd722f6 Bugfix: Incremental decode of lossy-alpha
When remapping buffer, br->eos_ was wrongly being set to true for
certain
images.

Also, refactored the end-of-stream detection as a function.

Reported in http://crbug.com/364830

Change-Id: I716ce082ef2b505fe24246b9c14912d8e97b5d84
2014-04-22 16:06:32 -07:00
Djordje Pesut
7955152d58 MIPS: fix error with number of registers.
Some versions of compiler in debug build can't find
a register in class 'GR_REGS' while reloading 'asm'

Number of used registers is decreased in this fix.

Change-Id: I7d7b8172b8f37f1de4db3d8534a346d7a72c5065
2014-04-22 12:06:45 +02:00
skal
b1dabe3767 Merge "Move the HuffmanCost() function to dsp lib" 2014-04-18 12:08:22 -07:00
skal
75b12006e3 Move the HuffmanCost() function to dsp lib
This is to help further optimizations.
(like in https://gerrit.chromium.org/gerrit/#/c/69787/)

There's a small slowdown (~0.5% at -z 9 quality) due to
function pointer usage. Note that, for speed, it's important
to return VP8LStreaks by value, and not pass a pointer.

Change-Id: Id4167366765fb7fc5dff89c1fd75dee456737000
2014-04-18 11:59:48 -07:00
Djordje Pesut
2772b8bd98 MIPS: fix assembler error revealed by clang's debug build
.set at -  Indicates that macro expansions may clobber
           the assembler temporary ($at or $28) register.
           Some macros may not be expanded without this
           and will generate an error message if noat
           is in effect.

"at" also added to the clobber list.

Change-Id: I67feebbd9f2944fc7f26c28496e49e1e2348529d
2014-04-18 18:10:52 +02:00
James Zern
6653b601ef enc_mips32: fix unused symbol warning in debug
move kC1 / kC2 under __OPTIMIZE__
missed in:
8dec120 enc_mips32: disable ITransform(One) in debug builds

Change-Id: Ic9a12e6d73090c8c06b0e7a4bc56dd9c76b8e596
2014-04-17 23:35:36 -07:00
James Zern
8dec120975 enc_mips32: disable ITransform(One) in debug builds
avoids:
src/dsp/enc_mips32.c: In function 'ITransformOne':
src/dsp/enc_mips32.c:123:3: can't find a register in class 'GR_REGS' while reloading 'asm'
src/dsp/enc_mips32.c:123:3: 'asm' operand has impossible constraints

Change-Id: Ic469667ee572f25e502c9873c913643cf7bbe89d
2014-04-17 20:10:31 -07:00
James Zern
98519dd5c1 enc_neon: convert Disto4x4 to intrinsics
Change-Id: I0f00d5af2de2301e8237c2a38a9612d3645abad6
2014-04-17 18:29:31 -07:00
Pascal Massimino
fe9317c9bf cosmetics:
* remove MIPS32 suffix from static function names
* fix a long line in enc_neon.c

Change-Id: Ia1294ae46f471b3eb1e9ba43c6aa1b29a7aeb447
2014-04-16 00:36:19 -07:00
James Zern
953b074677 enc_neon: cosmetics
fix/remove incorrect comments
+ whitespace

Change-Id: Id1b86beb23e5bf946e73c34ab7066b6ca177f33b
2014-04-15 23:57:03 -07:00
skal
a9fc697cb6 Merge "WIP: extract the float-calculation of HuffmanCost from loop" 2014-04-15 11:33:11 -07:00
skal
3f84b5219d Merge "replace some mult-long (vmull_u8) with mult-long-accumulate (vmlal_u8)" 2014-04-15 07:09:12 -07:00
Djordje Pesut
4ae0533f39 MIPS: MIPS32r1: Added optimizations for ExtraCost functions.
ExtraCost and ExtraCostCombined

Change-Id: I7eceb9ce2807296c6b43b974e4216879ddcd79f2
2014-04-15 15:37:06 +02:00
skal
b30a04cf11 WIP: extract the float-calculation of HuffmanCost from loop
new function: VP8FinalHuffmanCost()

Change-Id: I42102f8e5ef6d7a7af66490af77b7dc2048a9cb9
2014-04-15 14:52:52 +02:00
skal
a8fe8ce231 Merge "NEON intrinsics version of CollectHistogram" 2014-04-15 03:00:45 -07:00
skal
95203d2d1b NEON intrinsics version of CollectHistogram
apparently faster, but we might save some load/store to/from memory
once we settle for the intrinsics-based FTransform()

(also: fixed some #ifdef USE_INTRINSICS problems)

Change-Id: I426dea299cea0c64eb21c4d81a04a960e0c263c7
2014-04-14 16:47:20 +02:00
skal
7ca2e74bb4 replace some mult-long (vmull_u8) with mult-long-accumulate (vmlal_u8)
saves few instructions

Change-Id: If8f464bb2894a209bba94825a4db9267df126d47
2014-04-14 15:14:45 +02:00
skal
41c6efbdc5 fix lossless_neon.c
* some extra {xx , 0 } in initializers
* replaced by vget_lane_u32() where appropriate

Change-Id: Iabcd8ec34d7c853920491fb147a10d4472280a36
2014-04-14 14:27:11 +02:00
skal
8ff96a027a NEON intrinsics version of FTransform
as little bit slower than inlined asm it seems.
So disabled for now.

Change-Id: I8c942846f9bedaed57275675ea9dbbcb8dfd9ccd
2014-04-14 09:58:35 +02:00
Jovan Zelincevic
0214f4a908 Merge "MIPS: MIPS32r1: Added optimizations for FastLog2" 2014-04-10 08:54:12 -07:00
Jovan Zelincevic
baabf1ea3a MIPS: MIPS32r1: Added optimizations for FastLog2
Functions VP8LFastLog2Slow and VP8LFastSLog2Slow

also: replaced some "% y" by "& (y-1)" in the C-version
(since y is a power-of-two)

Change-Id: I875170384e3c333812ca42d6ce7278aecabd60f0
2014-04-10 08:32:51 -07:00
skal
3d49871dbe NEON functions for lossless coding
Verified OK, but right now they don't seem faster.
So they are disabled behind a USE_INTRINSICS flag (off for now)

Change-Id: I72a1c4fa3798f98c1e034f7ca781914c36d3392c
2014-04-10 15:32:08 +02:00
Slobodan Prijic
3fe0291530 MIPS: MIPS32r1: Added optimizations for SSE functions.
Change-Id: I1287fa65064192cc2edc5c4be2b1974be665b9b4
2014-04-09 11:02:13 +02:00