Commit Graph

65 Commits

Author SHA1 Message Date
Pascal Massimino
693bf74ec0 move the SSIM calculation code in ssim.c / ssim_sse2.c
Change-Id: I63a63fa7f44f257f2e17e45358b206c23069c448
2017-02-21 12:53:35 +01:00
James Zern
668e1dd44f src/{dec,enc,utils}: give filenames a unique suffix
this avoids duplicates between these trees and dsp/, e.g., enc/tree.c,
dec/tree.c, making pulling the whole library source tree into one target
possible

BUG=webp:279

Change-Id: I060a614833c7c24ddd37bf641702ae6a5eef1775
2017-01-19 19:09:48 -08:00
Owen Rodley
67748b41db Improve latency of FTransform2.
Benchmarks from vrabaud@:
8BIT/GRAY                corpus speed: faster: -4.3 % , corpus size: unchanged
skal/sources_png_skal    corpus speed: faster: -5.2 % , corpus size: unchanged
images/png_rgb           corpus speed: faster: -5.1 % , corpus size: unchanged
images/lpcb              corpus speed: unchanged, corpus size: unchanged
images/png_big           corpus speed: faster: -1.7 % , corpus size: unchanged
images/png_doc           corpus speed: unchanged, corpus size: unchanged
images/png_1bit          corpus speed: faster: -1.2 % , corpus size: unchanged
images/jpeg_small        corpus speed: unchanged, corpus size: unchanged
images/icip_core1        corpus speed: unchanged, corpus size: unchanged
images/png_gray          corpus speed: faster: -2.5 % , corpus size: unchanged
images/jpeg_high_quality corpus speed: faster: -4.0 % , corpus size: unchanged
images/jpeg              corpus speed: faster: -2.3 % , corpus size: unchanged
images/png_translucent   corpus speed: faster: -2.8 % , corpus size: unchanged
images/gif               corpus speed: faster: -1.4 % , corpus size: unchanged
images/png_opaque        corpus speed: faster: -2.8 % , corpus size: unchanged
images/png_rgb_opaque    corpus speed: unchanged, corpus size: unchanged
images/png_indexed       corpus speed: faster: -2.0 % , corpus size: unchanged
images/all               corpus speed: faster: -1.5 % , corpus size: unchanged
images/png_small         corpus speed: unchanged, corpus size: unchanged
images/png               corpus speed: unchanged, corpus size: unchanged
images/gif_still         corpus speed: faster: -1.6 % , corpus size: unchanged

Change-Id: I69fe11baa188c5d32cbc77a84b8c0deae13d792b
2016-11-24 07:09:50 +00:00
Pascal Massimino
ba843a92e7 fix some SSIM calculations
* prevent 64bit overflow by controlling the 32b->64b conversions
  and preventively descaling by 8bit before the final multiply
* adjust the threshold constants C1 and C2 to de-emphasize the dark
  areas
* use a hat-like filter instead of box-filtering to avoid blockiness
  during averaging

SSIM distortion calc is actually *faster* now in SSE2, because of the
unrolling during the function rewrite.
The C version is noticeably slower because it is still unoptimized.

Change-Id: I96e2715827f79d26faae354cc28c7406c6800c90
2016-10-04 01:09:07 -07:00
Pascal Massimino
86a84b3598 2x faster SSE2 implementation of SSIMGet
Change-Id: I53705d7ddfa595389ff2d542e5088f96f948d351
2016-09-23 23:23:06 -07:00
Pascal Massimino
50c3d7da9a refactor the PSNR / SSIM calculation code
-print_psnr is now much faster because it doesn't use the SSIM code.
The SSIM speed-up and re-write will come later.

Change-Id: Iabf565e0a8b41651d8164df1266cfeded4ab4823
2016-09-14 06:13:24 +00:00
skal
5b60db5c9d FastMBAnalyze() for quick i16/i4 decision
The decision is based on the variance between DC values of each
sub-4x4 block. This heuristic is rather ok for predicting whether
the 2nd transform (intra-16) is going to help or not.
The decision threshold varies with quality (=quantization).

It's only used for -m 0 and -m 1, where no full RD-opt is performed.
It actually makes these modes noticeably faster, with an RD curve much
closer to that of the -m 2 mode.

Change-Id: I15f972db97ba4082cbd1dfd16bee3eb2eca701a8
2016-07-15 11:21:08 -07:00
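The i16/i4 heuristic described in 5b60db5c9d can be sketched roughly as follows. This is not the libwebp code: the function names, the `max_spread` parameter, and the assumption that a *low* spread of DC values favours intra-16 are all illustrative choices made here. The sketch sums each 4x4 sub-block of a 16x16 luma block and compares an integer-scaled variance of those sums against a caller-supplied threshold.

```c
#include <stdint.h>

/* Sum the pixels of each 4x4 sub-block of a 16x16 luma block.
 * 'stride' is the distance in bytes between two rows of 'src'. */
static void block_dc_sums(const uint8_t* src, int stride, uint32_t dc[16]) {
  for (int by = 0; by < 4; ++by) {
    for (int bx = 0; bx < 4; ++bx) {
      uint32_t sum = 0;
      for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
          sum += src[(4 * by + y) * stride + 4 * bx + x];
        }
      }
      dc[4 * by + bx] = sum;
    }
  }
}

/* Hypothetical decision helper: returns 1 when the 16 DC sums are close
 * enough to each other that intra-16 is assumed to be a good choice.
 * 'max_spread' would be derived from the quality setting by the caller. */
static int prefer_intra16(const uint8_t* src, int stride, uint64_t max_spread) {
  uint32_t dc[16];
  uint64_t m = 0, m2 = 0;
  block_dc_sums(src, stride, dc);
  for (int k = 0; k < 16; ++k) {
    m += dc[k];
    m2 += (uint64_t)dc[k] * dc[k];
  }
  /* 16 * m2 - m * m equals 256 * variance(dc), kept in integer arithmetic. */
  return (16 * m2 - m * m) <= max_spread;
}
```

A perfectly flat block has zero spread and always passes; a block with one bright 4x4 corner produces a large spread and fails any small threshold.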
James Zern
6b53ca876e cosmetics,(dec|enc)_sse2.c: fix indent
Change-Id: Ic3326136ddd325e911e96c2e5a7f06b3e1d60f66
2016-07-13 16:11:29 -07:00
Vincent Rabaud
7561d0c338 FTransformWHT optimization.
Data is packed sooner in the functions.

Change-Id: I018cfeca43f015ac755c7f209f9a97984cc0517b
2016-02-18 17:44:05 +01:00
Vincent Rabaud
8aa352b256 Merge "Remove an unnecessary transposition in TTransform."
2016-02-18 08:15:10 +00:00
Vincent Rabaud
9960c31685 Remove an unnecessary transposition in TTransform.
Change-Id: Ib715c2d5ba659cb2db9c6832875ba508cc2fca3e
2016-02-17 21:41:28 +01:00
Vincent Rabaud
6e36b51188 Small speedup in FTransform.
It removes two _mm_unpacklo_epi32 and two _mm_sub_epi16.

Change-Id: Icdf86259f796ba855d1cda5e9c0e99cb396cb351
2016-02-17 21:26:36 +01:00
Vincent Rabaud
bf2b4f114f Regroup common SSE code + optimization.
The transpose refactoring will help remove a transpose in a
later CL.

The horizontal add function helps remove a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19

Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
2016-02-16 18:34:34 +01:00
Pascal Massimino
2dee2966df remove a few obsolete TODOs about aligned loads in SSE2
Change-Id: I3628602942ea2ce34dbcb85975d15afc1041f76c
2015-12-15 23:00:41 -08:00
Pascal Massimino
2c08aac81a introduce WebPMemToUint32 and WebPUint32ToMem for memory access
it uses memcpy() when unaligned memory write is tricky

Change-Id: I5d966ca9d19e9b43ac90140fa487824116982874
2015-12-04 13:43:01 +00:00
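The memcpy()-based access mentioned in 2c08aac81a follows a common C idiom, sketched below. The helper names `load_u32` and `store_u32` are hypothetical stand-ins, not the bodies of WebPMemToUint32 / WebPUint32ToMem, which are not shown in this log.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly unaligned address.
 * memcpy() avoids the undefined behaviour of dereferencing a
 * misaligned uint32_t pointer; compilers typically turn it into
 * a single load where the target allows that. */
static uint32_t load_u32(const uint8_t* ptr) {
  uint32_t value;
  memcpy(&value, ptr, sizeof(value));
  return value;
}

/* Write a 32-bit value to a possibly unaligned address. */
static void store_u32(uint8_t* ptr, uint32_t value) {
  memcpy(ptr, &value, sizeof(value));
}
```

The value is stored and loaded in native byte order, so a round trip through the same buffer is exact on any platform.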
Pascal Massimino
25bf2ce5cc fix some warnings about unaligned 32b reads
on x86 + gcc, the assembly code is the same.

Change-Id: Ib0d23772ccf928f8d9ebcb0e157c0573d1f6a786
2015-10-28 15:51:55 -07:00
Pascal Massimino
0ae2c2e4b2 SSE2/SSE41: optimize SSE_16xN loops
After several trials at re-organizing the main loop and accumulation scheme,
this is apparently the fastest variant.

removed the SSE41 version, which is no longer faster now.
For some reason, the AVX variant seems to benefit most from the change.

Change-Id: Ib11ee18dbb69596cee1a3a289af8e2b4253de7b5
2015-07-02 20:55:04 +02:00
Pascal Massimino
8ef9a63b45 SSE2: slightly faster FTransformWHT
goes from 0.3% to 0.1% overall CPU time, but...

Change-Id: I4c9a92b1e1d6b58ed57c6b890366f1dbeaf84f84
2015-07-01 23:03:17 -07:00
skal
ac76801159 introduce FTransform2 to perform two transforms at a time.
FTransform goes from ~12.0% to 11.5% total CPU time.

Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
929a0fdccd enc_sse2/TTransform: simplify abs calculation
max(b, 0 - b) works as well as (b ^ sign) - sign

Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
James Zern
17dbd05819 enc_sse2/CollectHistogram: simplify abs calculation
max(out, 0 - out) works as well as (out ^ sign) - sign

Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
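Both 929a0fdccd and 17dbd05819 swap one absolute-value idiom for another. The scalar sketch below compares the two forms on 16-bit values: the sign-mask form, (b ^ sign) - sign with sign being 0 or -1, and the max form, max(b, 0 - b). The function names are illustrative, and the real code operates on SSE2 vectors rather than scalars.

```c
#include <stdint.h>

/* Sign-mask form: 'sign' is 0 for non-negative b and -1 (all ones)
 * for negative b, so (b ^ sign) - sign negates b only when needed. */
static int16_t abs_via_sign(int16_t b) {
  const int16_t sign = (int16_t)((b < 0) ? -1 : 0);
  return (int16_t)((b ^ sign) - sign);
}

/* Max form: the larger of b and its negation is the absolute value. */
static int16_t abs_via_max(int16_t b) {
  const int neg = 0 - b;  /* computed in int, so no 16-bit overflow here */
  return (int16_t)((b > neg) ? b : neg);
}
```

The two agree for every input in [-32767, 32767]. The value -32768 has no positive 16-bit counterpart, so it is left out of the comparison.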
James Zern
f274a96ce9 dsp/enc_sse2: add luma4 intra predictors
VP8EncPredLuma4 improvement over ~20M pixels: ~39%

Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6 dsp/enc_sse2: add chroma intra predictors
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1 dsp/enc_sse2: add luma16 intra predictors
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
b44eda3f60 dsp: add DSP_INIT_STUB
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning

Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147
2015-04-02 23:55:35 -07:00
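The stub idea in b44eda3f60 can be illustrated with a small macro. Everything named here is hypothetical: `EXAMPLE_INIT_STUB`, `EXAMPLE_USE_SSE2`, `ExampleInitSSE2`, and the `example_sse2_enabled` flag exist only for this sketch, and the actual DSP_INIT_STUB definition is not reproduced in this log.

```c
/* Flag used only to make the effect of the init function observable. */
int example_sse2_enabled = 0;

/* Emit a prototype plus an empty definition for 'func', so the
 * translation unit always exports the symbol. */
#define EXAMPLE_INIT_STUB(func) \
  extern void func(void);       \
  void func(void) {}

#if defined(EXAMPLE_USE_SSE2)
/* Architecture enabled: the real init would install optimized functions. */
extern void ExampleInitSSE2(void);
void ExampleInitSSE2(void) { example_sse2_enabled = 1; }
#else
/* Architecture disabled: keep the symbol, but as a no-op. */
EXAMPLE_INIT_STUB(ExampleInitSSE2)
#endif
```

Callers can then invoke the init function unconditionally, without repeating the architecture check at each call site.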
James Zern
67ba7c7acc enc_sse2: call local FTransform in CollectHistogram
allows the former to be inlined; negligible speed-up in most cases,
however this structure is consistent with the rest of the optimized
modules

Change-Id: Ib080240b06f7a995b47f1906627850c355b82901
2015-03-24 20:22:24 -07:00
James Zern
182497993b dsp: s/VP8LSetHistogramData/VP8SetHistogramData/
this function is for lossy encoding; the VP8L prefix is used by lossless

Change-Id: I147590a91477a77af51ed79cc640546dfe53abdb
2015-03-24 18:27:41 -07:00
James Zern
fbdcef2401 dsp/enc*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I0cf40b500f9b3eed55a3211213db180c7c0dd43b
2015-03-20 19:19:46 -07:00
Pascal Massimino
2a407092ab 4-5% faster encoding using SSE2 for GetResidualCost
new file: cost_sse2.c

Change-Id: I4896c07f5ff2443ef743f4435fe2758d95a672ed
2015-02-18 09:41:02 +01:00
James Zern
b969f5dfac dsp: normalize WEBP_TSAN_IGNORE_FUNCTION usage
the attribute is only necessary in one location; remove it from the
prototypes.

Change-Id: I3820a3c34fbb029fd7ac69a1b0a9b76091bdbde2
2015-02-13 15:23:40 -08:00
James Zern
183168f332 cosmetics: enc_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: Ib85d63abbb9fc33096f893c2524d3ce8ae3ebd03
2015-02-05 23:51:29 -08:00
Pascal Massimino
bad775715a simplify the Histogram struct, to only store max_value and last_nz
we don't need to store the whole distribution in order to compute the alpha

Later, we can incorporate the max_value / last_non_zero bookkeeping
in SSE2 directly.

Change-Id: I748ccea4ac17965d7afcab91845ef01be3aa3e15
2014-12-10 10:44:57 +01:00
James Zern
f85ec712b0 PrintReg: output to stderr
allows use of '-o -' while testing

Change-Id: Ibc02d7cede2df4eb8be0a28c0ca4bf5e91864191
2014-10-22 17:28:19 +02:00
James Zern
a4c3a31b8f WEBP_TSAN_IGNORE_FUNCTION: fix gcc compat warning
move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition

Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
2014-10-16 18:06:43 +02:00
Pascal Massimino
80247291c6 mark some init functions as being safe for thread_sanitizer.
introduces the macro WEBP_TSAN_IGNORE_FUNCTION

Change-Id: I3de2b6c1a2076fba4da7ae50322551e026b2082b
2014-10-16 16:34:07 +02:00
Pascal Massimino
fabc65da32 1-3% faster encoding optimizing SSE_NxN functions
got rid of the |a-b|^|b-a| method and went back
to just (a-b)^2 instead.

quality | size(bytes) after/before | time (ms) after/before

Change-Id: Ia3e0e6507b3f903deb1e182f78dad6df07380fd0
2014-10-09 07:20:00 -07:00
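The plain (a-b)^2 accumulation that fabc65da32 returns to looks like this in scalar form. It is a reference sketch only: `sum_squared_error` and its parameters are names chosen here, and the commit itself concerns the SSE2 versions of the SSE_NxN functions.

```c
#include <stdint.h>

/* Sum of squared differences over a w x h block.
 * Both blocks are assumed to share the same row stride. */
static uint32_t sum_squared_error(const uint8_t* a, const uint8_t* b,
                                  int w, int h, int stride) {
  uint32_t total = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int diff = (int)a[y * stride + x] - (int)b[y * stride + x];
      total += (uint32_t)(diff * diff);
    }
  }
  return total;
}
```

For a 16x16 block the worst case is 256 * 255^2, which fits comfortably in 32 bits.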
skal
73d361dd5f introduce VP8EncQuantize2Blocks to quantize two blocks at a time
No speed diff for now. We might reorder the instructions better later,
to speed things up.

Change-Id: I1949525a0b329c7fd861b8dbea7db4b23d37709c
2014-08-25 20:21:42 -07:00
Pascal Massimino
1f3e5f1e60 remove unused 'shift' argument and QFIX2 define
this will remove a warning about the shift amount not being
an immediate (=constant).

Change-Id: Ie9a00fefdb9a07ec8994fb113f24234518bc878a
Also: fix the NULL sharpen argument mismatch.
2014-06-26 00:44:12 -07:00
levytamar82
27bfeee43a QuantizeBlock SSE2 Optimization:
Another store-to-load forwarding block was detected coming from the function
FTransform.
FTransform saves its output data as four stores of 8 bytes each. When this
data is later loaded by the QuantizeBlock function in one chunk of 16 bytes,
it causes a store-to-load forwarding block.
The fix was made in the FTransform function: each pair of consecutive 8-byte
values is merged into one 16-byte register before being saved to memory.
This fix gives a ~21% function-level gain and a 1.6% user-level gain.

Change-Id: Idc27c307d5083f3ebe206d3ca19059e5bd465992
2014-06-18 16:22:00 -07:00
skal
69fce2ea78 remove the special casing for res->first in VP8SetResidualCoeffs
if res->first = 1, coeffs[0]=0 because of quant.c:749 and line
added at quant.c:744
So, no need for the extra case.
Going forward, TrellisQuantizeBlock() should also be calling
a variant of VP8SetResidualCoeffs() to set the 'last' field.

also: fixes a warning for win64
    + slight speed-up

Change-Id: Ib24b611f7396d24aeb5b56dc74d5c39160f048f0
2014-06-08 06:40:22 +02:00
James Zern
db4860b355 enc_sse2: prevent signed int overflow
_mm_movemask_epi8 returns a 16-bit mask; << 16 can overflow a signed
int.

Change-Id: Ia0bb0804fe548fb9b0edb3695e82727506066cda
2014-06-04 23:18:22 -07:00
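The overflow addressed in db4860b355 comes from shifting a signed int that may have bit 15 set. One way to avoid it, sketched here with a hypothetical helper name, is to convert to an unsigned type before shifting. The inputs stand in for two results of _mm_movemask_epi8, each in the range [0, 0xffff].

```c
#include <stdint.h>

/* Combine two 16-bit masks into one 32-bit mask.
 * Casting to uint32_t first keeps the shift well defined even when
 * bit 15 of 'hi16' is set; 'hi16 << 16' on a signed int would
 * overflow in that case. */
static uint32_t combine_masks(int lo16, int hi16) {
  return (uint32_t)lo16 | ((uint32_t)hi16 << 16);
}
```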
skal
6679f8996f Optimize VP8SetResidualCoeffs.
Brings down WebP lossy encoding timings by 5%

Change-Id: Ia4a2fab0a887aaaf7841ce6d9ee16270d3e15489
2014-06-03 06:44:04 +02:00
skal
869eaf6c60 ~30% encoding speedup: use NEON for QuantizeBlock()
also revamped the signature to avoid having to pass the 'first' parameter

Change-Id: Ief9af1747dcfb5db0700b595d0073cebd57542a5
2014-04-08 03:08:22 -07:00
James Zern
2ca42a4fb7 enc_sse2: drop SSE2 suffix from local functions
Change-Id: I5d61605a9d410761d50b689b046114f0ab3ba24e
2014-04-02 23:24:36 -07:00
skal
0235d5e44b 1-2% faster quantization in SSE2
C-version is a bit faster too (sub-1% faster on ARM)

Change-Id: I077262042f1d0937aba1ecf15174f2c51bf6cd97
2014-02-13 15:55:30 -08:00
James Zern
5227d99146 drop: ifdef __cplusplus checks from C files
the prototypes are already marked in the headers

Change-Id: I172fe742200c939ca32a70a2299809b8baf9b094
2013-12-13 11:42:13 -08:00
skal
73b731fb42 introduce a special quantization function for WHT
WHT is somewhat a special case: no sharpen[] bias, etc.
Will be useful in a later CL when precision of input is changed.

Change-Id: I851b06deb94abdfc1ef00acafb8aa731801b4299
2013-12-10 14:21:47 +01:00
skal
41c0cc4b9a Make Forward WHT transform use 32bit fixed-point calculation
This is in preparation for a future change where input will
be 16bit instead of 12bit

No speed diff observed.

Note that the NEON implementation was using 32bit calc already.

Change-Id: If06935db5c56a77fc9cefcb2dec617483f5f62b4
2013-12-10 06:10:52 +01:00
skal
d513bb62bc * fix off-by-one zthresh calculation
* remove the sharpening for non luma-AC coeffs
* adjust the bias a little bit to compensate for this

Using the multiply-by-reciprocal doesn't always give the same result
as the exact divide, given the QFIX fixed-point precision we use.
-> removed a few now-unneeded SSE2 instructions (and checked for
bit-exactness using -noasm)

Change-Id: Ib68057cbdd69c4e589af56a01a8e7085db762c24
2013-12-09 13:56:04 +01:00
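The remark in d513bb62bc that multiply-by-reciprocal does not always match an exact divide can be reproduced with a few lines of fixed-point arithmetic. The sketch assumes QFIX = 17 and applies no rounding bias; both choices, and the function names, are illustrative assumptions rather than a copy of the encoder's quantizer.

```c
#include <stdint.h>

enum { QFIX = 17 };  /* assumed number of fractional bits */

/* Reference result: plain integer division. */
static uint32_t div_exact(uint32_t n, uint32_t q) {
  return n / q;
}

/* Fixed-point approximation: multiply by a truncated reciprocal of q,
 * then shift the fractional bits away. */
static uint32_t div_by_reciprocal(uint32_t n, uint32_t q) {
  const uint32_t iq = (1u << QFIX) / q;  /* truncated 1/q in QFIX bits */
  return (uint32_t)(((uint64_t)n * iq) >> QFIX);
}
```

With q = 3 the reciprocal truncates to 43690, so 3 * 43690 = 131070 falls just short of 1 << 17 and the approximation yields 0 where the exact divide yields 1. Powers of two such as q = 4 have an exact reciprocal and agree.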
James Zern
4931c3294b cosmetics: fix some typos
Change-Id: I0d6efebd817815139db5ae87236fd8911df4d53c
2013-11-26 19:21:14 -08:00