75 Commits

Author SHA1 Message Date
Nozomi Isozaki
ac42dde1c5 Specialize and optimize ITransform_SSE2 using do_two
Change-Id: I976eb4a0cc4e669a02b55012d4aba1536f193781
2023-05-16 12:07:58 +09:00
James Zern
aff1c546ef dsp,x86: normalize types w/_mm_cvtsi128_si32 calls
fixes integer sanitizer warnings of the form:
implicit conversion from type 'int' of value -2122283647 (32-bit,
signed) to type 'uint32_t' (aka 'unsigned int') changed the value to
2172683649 (32-bit, unsigned)

implicit conversion from type 'uint32_t' (aka 'unsigned int') of value
3724541952 (32-bit, unsigned) to type 'int' changed the value to
-570425344 (32-bit, signed)

Bug: b/229626362
Change-Id: I79f68e3e2fcab7cc0402477d2e88d629348c9ff4
2022-08-04 11:26:23 -07:00
James Zern
835392393b dsp,x86: normalize types w/_mm_set* calls
fixes integer sanitizer warnings of the form:
runtime error: implicit conversion from type 'unsigned int' of value
4294967295 (32-bit, unsigned) to type 'int' changed the value to -1
(32-bit, signed)
runtime error: implicit conversion from type
'uint8_t' (aka 'unsigned char') of value 128 (8-bit, unsigned) to type
'char' changed the value to -128 (8-bit, signed)

Bug: b/229626362
Change-Id: I6be3c40407cf7a27b79d31ee32d3829ecb78ed66
2022-08-03 16:50:46 -07:00
James Zern
748e92bbb9 add WebPInt32ToMem
and use it in calls containing _mm_cvtsi32_si128; this calls
WebPUint32ToMem, but corrects the type to avoid runtime warnings with
clang -fsanitize=integer of the form:

implicit conversion from type 'int' of value -1904123502 (32-bit,
signed) to type 'uint32_t' (aka 'unsigned int') changed the value to
2390843794 (32-bit, unsigned)

Bug: b/229626362
Change-Id: I20545e822d8045fa44f688241879206055a0a148
2022-08-01 13:44:20 -07:00
James Zern
4f402f34a1 add WebPMemToInt32
and use it with calls to _mm_cvtsi32_si128 and _mm_set_epi32; this calls
WebPMemToUint32, but corrects the type to avoid runtime warnings with
clang -fsanitize=integer of the form:

implicit conversion from type 'uint32_t' (aka 'unsigned int') of value
2155905152 (32-bit, unsigned) to type 'int' changed the value to
-2139062144 (32-bit, signed)

Bug: b/229626362
Change-Id: I50101ba2b46dfaa852f02d46830f3511c80b02d9
2022-07-28 22:10:22 -07:00
James Zern
e78dea7587 (alpha_processing,enc}_sse2: quiet signed conv warnings
_mm_set1_epi8() takes a char argument
_mm_insert_epi16 takes a short argument

from clang-7 integer sanitizer:
implicit conversion from type 'int' of value 255 (32-bit, signed) to
type 'char' changed the value to -1 (8-bit, signed)
implicit conversion from type 'int' of value 33153 (32-bit, signed) to
type 'short' changed the value to -32383 (16-bit, signed)

Change-Id: Ic88c8ef3d00146d34f53a560582db673f818370d
2019-06-10 14:23:58 -07:00
Pascal Massimino
0a17f4712c Merge "WIP: list includes as descendants of the project dir" 2017-10-11 08:21:42 +00:00
James Zern
a439972175 WIP: list includes as descendants of the project dir
#include "(.|..)/..." -> #include "src/..."

Change-Id: I772880aa097a770722043c8a4393552ba38a89b6
2017-10-10 23:04:05 -07:00
James Zern
bc634d57c2 enc_sse2: harmonize function suffixes
BUG=webp:355

Change-Id: Idd2f289fcf99f12bf36494111b07a8906c99c826
2017-10-08 14:05:59 -07:00
skal
b09307dcde Encoder: harmonize function suffixes
BUG=webp:355

Change-Id: Ia2fe95db7dfb303f3f64e390d43bc41b8933256c
2017-08-09 02:41:01 +00:00
Pascal Massimino
693bf74ec0 move the SSIM calculation code in ssim.c / ssim_sse2.c
Change-Id: I63a63fa7f44f257f2e17e45358b206c23069c448
2017-02-21 12:53:35 +01:00
James Zern
668e1dd44f src/{dec,enc,utils}: give filenames a unique suffix
this avoids duplicates between these trees and dsp/, e.g., enc/tree.c,
dec/tree.c, making pulling the whole library source tree into one target
possible

BUG=webp:279

Change-Id: I060a614833c7c24ddd37bf641702ae6a5eef1775
2017-01-19 19:09:48 -08:00
Owen Rodley
67748b41db Improve latency of FTransform2.
Benchmarks from vrabaud@:
8BIT/GRAY                corpus speed: faster: -4.3 % , corpus size: unchanged
skal/sources_png_skal    corpus speed: faster: -5.2 % , corpus size: unchanged
images/png_rgb           corpus speed: faster: -5.1 % , corpus size: unchanged
images/lpcb              corpus speed: unchanged, corpus size: unchanged
images/png_big           corpus speed: faster: -1.7 % , corpus size: unchanged
images/png_doc           corpus speed: unchanged, corpus size: unchanged
images/png_1bit          corpus speed: faster: -1.2 % , corpus size: unchanged
images/jpeg_small        corpus speed: unchanged, corpus size: unchanged
images/icip_core1        corpus speed: unchanged, corpus size: unchanged
images/png_gray          corpus speed: faster: -2.5 % , corpus size: unchanged
images/jpeg_high_quality corpus speed: faster: -4.0 % , corpus size: unchanged
images/jpeg              corpus speed: faster: -2.3 % , corpus size: unchanged
images/png_translucent   corpus speed: faster: -2.8 % , corpus size: unchanged
images/gif               corpus speed: faster: -1.4 % , corpus size: unchanged
images/png_opaque        corpus speed: faster: -2.8 % , corpus size: unchanged
images/png_rgb_opaque    corpus speed: unchanged, corpus size: unchanged
images/png_indexed       corpus speed: faster: -2.0 % , corpus size: unchanged
images/all               corpus speed: faster: -1.5 % , corpus size: unchanged
images/png_small         corpus speed: unchanged, corpus size: unchanged
images/png               corpus speed: unchanged, corpus size: unchanged
images/gif_still         corpus speed: faster: -1.6 % , corpus size: unchanged

Change-Id: I69fe11baa188c5d32cbc77a84b8c0deae13d792b
2016-11-24 07:09:50 +00:00
Pascal Massimino
ba843a92e7 fix some SSIM calculations
* prevent 64bit overflow by controlling the 32b->64b conversions
  and preventively descaling by 8bit before the final multiply
* adjust the threshold constants C1 and C2 to de-emphasis the dark
  areas
* use a hat-like filter instead of box-filtering to avoid blockiness
  during averaging

SSIM distortion calc is actually *faster* now in SSE2, because of the
unrolling during the function rewrite.
The C-version is quite slower because still un-optimized.

Change-Id: I96e2715827f79d26faae354cc28c7406c6800c90
2016-10-04 01:09:07 -07:00
Pascal Massimino
86a84b3598 2x faster SSE2 implementation of SSIMGet
Change-Id: I53705d7ddfa595389ff2d542e5088f96f948d351
2016-09-23 23:23:06 -07:00
Pascal Massimino
50c3d7da9a refactor the PSNR / SSIM calculation code
-print_psnr is now much faster because it doesn't use the SSIM code.
The SSIM speed-up and re-write will come later.

Change-Id: Iabf565e0a8b41651d8164df1266cfeded4ab4823
2016-09-14 06:13:24 +00:00
skal
5b60db5c9d FastMBAnalyze() for quick i16/i4 decision
The decision is based on the variance between DC values of each
sub-4x4 block. This heuristic is rather ok for predicting whether
the 2nd transform (intra-16) is going to help or not.
The decision threshold varies with quality (=quantization).

It's only used for -m 0 and -m 1, where no full RD-opt is performed.
It actually makes these modes quite faster, with RD curve much
closer to the -m 2 mode.

Change-Id: I15f972db97ba4082cbd1dfd16bee3eb2eca701a8
2016-07-15 11:21:08 -07:00
James Zern
6b53ca876e cosmetics,(dec|enc)_sse2.c: fix indent
Change-Id: Ic3326136ddd325e911e96c2e5a7f06b3e1d60f66
2016-07-13 16:11:29 -07:00
Vincent Rabaud
7561d0c338 FTransformWHT optimization.
Data is packed sooner in the functions.

Change-Id: I018cfeca43f015ac755c7f209f9a97984cc0517b
2016-02-18 17:44:05 +01:00
Vincent Rabaud
8aa352b256 Merge "Remove an unnecessary transposition in TTransform." 2016-02-18 08:15:10 +00:00
Vincent Rabaud
9960c31685 Remove an unnecessary transposition in TTransform.
Change-Id: Ib715c2d5ba659cb2db9c6832875ba508cc2fca3e
2016-02-17 21:41:28 +01:00
Vincent Rabaud
6e36b51188 Small speedup in FTransform.
It removes two _mm_unpacklo_epi32 and two _mm_sub_epi16.

Change-Id: Icdf86259f796ba855d1cda5e9c0e99cb396cb351
2016-02-17 21:26:36 +01:00
Vincent Rabaud
bf2b4f114f Regroup common SSE code + optimization.
The transpose refactoring will help removing a transpose in a
later CL.

The horizontal add function helps removing a _mm_sad_epu8 in DC8uv
=> the latency/throughput went from 29/25 to 23/19

Change-Id: I5f3dfd4aad614eb079b1e83631e6a7cef49a3766
2016-02-16 18:34:34 +01:00
Pascal Massimino
2dee2966df remove few obsolete TODO about aligned loads in SSE2
Change-Id: I3628602942ea2ce34dbcb85975d15afc1041f76c
2015-12-15 23:00:41 -08:00
Pascal Massimino
2c08aac81a introduce WebPMemToUint32 and WebPUint32ToMem for memory access
it uses memcpy() when unaligned memory write is tricky

Change-Id: I5d966ca9d19e9b43ac90140fa487824116982874
2015-12-04 13:43:01 +00:00
Pascal Massimino
25bf2ce5cc fix some warning about unaligned 32b reads
on x86 + gcc, the assembly code is the same.

Change-Id: Ib0d23772ccf928f8d9ebcb0e157c0573d1f6a786
2015-10-28 15:51:55 -07:00
Pascal Massimino
0ae2c2e4b2 SSE2/SSE41: optimize SSE_16xN loops
After several trials at re-organizing the main loop and accumulation scheme,
this is apparently the faster variant.

removed the SSE41 version, which is no longer faster now.
For some reason, the AVX variant seems to benefit most for the change.

Change-Id: Ib11ee18dbb69596cee1a3a289af8e2b4253de7b5
2015-07-02 20:55:04 +02:00
Pascal Massimino
8ef9a63b45 SSE2: slightly faster FTransformWHT
goes from 0.3% to 0.1% overall CPU time, but...

Change-Id: I4c9a92b1e1d6b58ed57c6b890366f1dbeaf84f84
2015-07-01 23:03:17 -07:00
skal
ac76801159 introduce FTransform2 to perform two transforms at a time.
FTransform goes from ~12.0% to 11.5% total CPU time.

Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
929a0fdccd enc_sse2/TTransform: simplify abs calculation
max(b, 0 - b) works as well as (b ^ sign) - b

Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
James Zern
17dbd05819 enc_sse2/CollectHistogram: simplify abs calculation
max(out, 0 - out) works as well as (out ^ sign) - out

Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
James Zern
f274a96ce9 dsp/enc_sse2: add luma4 intra predictors
VP8EncPredLuma4 improvement over ~20M pixels: ~39%

Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6 dsp/enc_sse2: add chroma intra predictors
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1 dsp/enc_sse2: add luma16 intra predictors
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
b44eda3f60 dsp: add DSP_INIT_STUB
generates a stub function when the specific architecture is not enabled,
exposing a symbol in the module, avoiding a compiler warning

Change-Id: Ia9336e57466a9b5241b85c1c95838e91c9283147
2015-04-02 23:55:35 -07:00
James Zern
67ba7c7acc enc_sse2: call local FTransform in CollectHistogram
allows the former to be inlined; negligible speed-up in most cases,
however this is structure is consistent with the rest of the optimized
modules

Change-Id: Ib080240b06f7a995b47f1906627850c355b82901
2015-03-24 20:22:24 -07:00
James Zern
182497993b dsp: s/VP8LSetHistogramData/VP8SetHistogramData/
this function is for lossy encoding; the VP8L prefix is used by lossless

Change-Id: I147590a91477a77af51ed79cc640546dfe53abdb
2015-03-24 18:27:41 -07:00
James Zern
fbdcef2401 dsp/enc*.c: rework WEBP_USE_<arch> ifdef
add a dummy init rather than repeating the '#ifdef WEBP_USE_...'
pattern.

Change-Id: I0cf40b500f9b3eed55a3211213db180c7c0dd43b
2015-03-20 19:19:46 -07:00
Pascal Massimino
2a407092ab 4-5% faster encoding using SSE2 for GetResidualCost
new file: cost_sse2.c

Change-Id: I4896c07f5ff2443ef743f4435fe2758d95a672ed
2015-02-18 09:41:02 +01:00
James Zern
b969f5dfac dsp: normalize WEBP_TSAN_IGNORE_FUNCTION usage
the attribute is only necessary in one location; remove it from the
prototypes.

Change-Id: I3820a3c34fbb029fd7ac69a1b0a9b76091bdbde2
2015-02-13 15:23:40 -08:00
James Zern
183168f332 cosmetics: enc_sse2: add const to some casts
source pointers are often cast to __m128*, retain the const in those
cases

Change-Id: Ib85d63abbb9fc33096f893c2524d3ce8ae3ebd03
2015-02-05 23:51:29 -08:00
Pascal Massimino
bad775715a simplify the Histogram struct, to only store max_value and last_nz
we don't need to store the whole distribution in order to compute the alpha

Later, we can incorporate the max_value / last_non_zero bookkeeping
in SSE2 directly.

Change-Id: I748ccea4ac17965d7afcab91845ef01be3aa3e15
2014-12-10 10:44:57 +01:00
James Zern
f85ec712b0 PrintReg: output to stderr
allows use of '-o -' while testing

Change-Id: Ibc02d7cede2df4eb8be0a28c0ca4bf5e91864191
2014-10-22 17:28:19 +02:00
James Zern
a4c3a31b8f WEBP_TSAN_IGNORE_FUNCTION: fix gcc compat warning
move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition

Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
2014-10-16 18:06:43 +02:00
Pascal Massimino
80247291c6 mark some init function as being safe for thread_sanitizer.
introduces the macro WEBP_TSAN_IGNORE_FUNCTION

Change-Id: I3de2b6c1a2076fba4da7ae50322551e026b2082b
2014-10-16 16:34:07 +02:00
Pascal Massimino
fabc65da32 1-3% faster encoding optimizing SSE_NxN functions
got rid of the |a-b|^|b-a| method and went back
to just (a-b)^2 instead.

quality | size(bytes) after/before | time (ms) after/before

Change-Id: Ia3e0e6507b3f903deb1e182f78dad6df07380fd0
2014-10-09 07:20:00 -07:00
skal
73d361dd5f introduce VP8EncQuantize2Blocks to quantize two blocks at a time
No speed diff for now. We might reorder better the instructions later,
to speed things up.

Change-Id: I1949525a0b329c7fd861b8dbea7db4b23d37709c
2014-08-25 20:21:42 -07:00
Pascal Massimino
1f3e5f1e60 remove unused 'shift' argument and QFIX2 define
this will remove a warning about the shift amount not being
an immediate (=constant).

Change-Id: Ie9a00fefdb9a07ec8994fb113f24234518bc878a
Also: fix the NULL sharpen argument mismatch.
2014-06-26 00:44:12 -07:00
levytamar82
27bfeee43a QuantizeBlock SSE2 Optimization:
Another store to load forward block was detected coming from the function
FTransform.
FTransform save the output data 4 times 8 bytes each. when this data is
later being loaded by the QuantizeBlock function in one chunk of 16 bytes
that caused a store to load forward block.
The fix was done in the FTransform function where each two consecutive 8 bytes
were merged into one 16 bytes register and saved into the memory.
This fix gives ~21% function level gain and 1.6% user level gain.

Change-Id: Idc27c307d5083f3ebe206d3ca19059e5bd465992
2014-06-18 16:22:00 -07:00
skal
69fce2ea78 remove the special casing for res->first in VP8SetResidualCoeffs
if res->first = 1, coeffs[0]=0 because of quant.c:749 and line
added at quant.c:744
So, no need for the extra case.
Going forward, TrellisQuantizeBlock() should also be calling
a variant of VP8SetResidualCoeffs() to set the 'last' field.

also: fixes a warning for win64
    + slight speed-up

Change-Id: Ib24b611f7396d24aeb5b56dc74d5c39160f048f0
2014-06-08 06:40:22 +02:00