1981 Commits

Author SHA1 Message Date
Pascal Massimino
bfc300c7ff SSE4.1 implementation of some alpha-processing functions
DispatchAlpha* functions are hard to speed up, compared to SSE2.
ExtractAlpha sees a ~15% speed-up though.

Change-Id: I8715c2defecbc832f469eed7e6ffd012146b52de
2015-06-19 14:17:39 -07:00
Pascal Massimino
7f9c98f21d Merge "sse2 in-loop: simplify SignedShift8b() a bit" 2015-06-12 07:37:32 +00:00
James Zern
ef314a5d6c dec_sse2/GetNotHEV: micro optimization
trade 2 subtractions + logical or for 1 max + 1 subtraction

Change-Id: I7d1f25f7cda2a89bc8247f3d3d5417f6b0e3d96c
2015-06-11 22:46:24 -07:00
Pascal Massimino
a729cff987 sse2 in-loop: simplify SignedShift8b() a bit
Change-Id: Ida3e096bb41451194d03dc7a97753a222ff0135c
2015-06-11 15:26:31 -07:00
Pascal Massimino
422ec9fb62 simplify Load8x4() a bit
Change-Id: I68cf09c432f48e34bbe1d47dd091417cfd40cf4e
2015-06-10 12:35:50 -07:00
James Zern
8df238ec8a Merge "remove some duplicate FlipSign()" 2015-06-06 05:25:04 +00:00
Pascal Massimino
751506c484 remove some duplicate FlipSign()
ApplyFilter2NoFlip is the new variant of ApplyFilter2 without the sign-flip

Change-Id: I2af54bd1499118c8321183e42251d265ba76219c
2015-06-05 17:20:29 +02:00
James Zern
65ef5afc27 Merge "lossless: 0.13% compression density gain" 2015-06-03 03:02:09 +00:00
Jyrki Alakuijala
2beef2f245 lossless: 0.13% compression density gain
over a 1000 image corpus

Single photograph benchmark:
Before:
Q=20: 2.560 MP/s
Q=40: 2.593 MP/s
Q=60: 1.795 MP/s
Q=80: 1.603 MP/s
Q=99: 1.122 MP/s

After:
Q=20: 3.334 MP/s
Q=40: 2.464 MP/s
Q=60: 2.009 MP/s
Q=80: 1.871 MP/s
Q=99: 1.163 MP/s

This CL allows for some further improvements that would not be possible
otherwise.

Change-Id: I61ba154beca2266cb96469281cf96e84a4412586
2015-06-02 17:27:36 -07:00
Pascal Massimino
3033f24c26 lossless: 0.06 % compression density improvement
Change-Id: Ib662e6aec53b40d6bc736d3ecfd6475bb005c790
2015-06-02 14:51:51 +02:00
James Zern
64960da9e1 dec_neon: add VE8uv / VE16
VE8uv/VE16: ~25%/~33% faster over 20M pixels

Change-Id: Ifac1114091527a05ed10edfcc43852edff012d14
2015-05-30 13:40:00 -07:00
James Zern
14dbd87bed dec_neon: add HE8uv / HE16
HE8uv/HE16: ~91%/~83% faster over 20M pixels

Change-Id: Ib0a776f7c193593ea0993e92cfa6e6be000fb810
2015-05-30 13:39:24 -07:00
skal
ac76801159 introduce FTransform2 to perform two transforms at a time.
FTransform goes from ~12.0% to 11.5% total CPU time.

Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
aa6065aedd dec_neon: use vld1_dup(mem) rather than vdup(mem[0])
should result in slightly less general purpose register use

Change-Id: I6069f49541392e56c8db2c28c8d1fdf88c1a1726
2015-05-16 11:24:32 -07:00
Pascal Massimino
8b63ac78e0 Merge "dec_neon: add TM16" 2015-05-16 10:56:07 +00:00
Pascal Massimino
f51be09e1f Merge "dec_neon/TrueMotion: simply left border load" 2015-05-16 10:54:05 +00:00
James Zern
dc48196bd9 dec_neon: add TM16
over 20M pixels ~78% faster

Change-Id: I420d5d590f275f19e08f86df1d1caa6b82fffbde
2015-05-15 12:50:11 -07:00
James Zern
ea95b305ca dec_neon/TrueMotion: simply left border load
use vld1_dup_u8() rather than a separate ld+dup after the values were
zero extended; mildly faster at the function level

Change-Id: I1b3666a6aeb465722a1214dbc6d71c27689a7f89
2015-05-15 12:48:13 -07:00
Pascal Massimino
f262d6120e speed-up SetResidualSSE2
(was unnecessarily complicated)

Before:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 475 ms.

Change-Id: Ia54bef86c45f9f474622ff16e594bf1da4f67ebd
After:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 404 ms.
2015-05-14 21:24:24 -07:00
James Zern
bf46d0acff fix mips2 build target
tested with mips1 and mips2; this should cover 3/4 as well.
fixes an ftbfs reported on the debian issue tracker:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785000

Change-Id: I2458487c92bd638589fdfec5adb4f22102a5960c
2015-05-13 10:36:22 -07:00
James Zern
929a0fdccd enc_sse2/TTransform: simplify abs calculation
max(b, 0 - b) works as well as (b ^ sign) - b

Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
James Zern
17dbd05819 enc_sse2/CollectHistogram: simplify abs calculation
max(out, 0 - out) works as well as (out ^ sign) - out

Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
James Zern
a6c1593645 dec_neon: add DC16 intra predictors
improvement over 20M pixels:
DC16: ~77%
DC16NoTop: ~78%
DC16NoLeft: ~83%
DC16NoTopLeft: ~83%

Change-Id: I4c4ee16a8fa0eb466eee45dfa6f6bbce5ce64b99
2015-05-08 00:12:48 -07:00
James Zern
f274a96ce9 dsp/enc_sse2: add luma4 intra predictors
VP8EncPredLuma4 improvement over ~20M pixels: ~39%

Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6 dsp/enc_sse2: add chroma intra predictors
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1 dsp/enc_sse2: add luma16 intra predictors
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
4c9af02326 dec_neon: add DC8uvNoTopLeft
~93% faster

Change-Id: Icf0fd5f85ac53c306a1b69d84275023e5b24a602
2015-05-01 20:03:57 -07:00
Pascal Massimino
9287761d95 Merge "GetResidualCostSSE2: simplify abs calculation" 2015-04-30 06:30:58 +00:00
James Zern
0e009366f8 dsp/cpu.c(x86): check maximum supported cpuid feature
structured extended feature flags require eax = 7; avoids incorrectly
detecting avx2 on some older processors that support avx.
for completeness also check for value=1 support used by the other
checks.

from [1]:
INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor
Information and the Vendor Identification String

[1]
http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html

Change-Id: I60b20d661a978d551614dbf7acdc25db19cb6046
2015-04-29 23:22:53 -07:00
James Zern
b243a4bc30 GetResidualCostSSE2: simplify abs calculation
max(coeff, 0 - coeff) works as well as min/max/sub or
(coeff ^ sign) - coeff

Change-Id: I9b11715372e49cd83820677bf4beba4a1c04931c
2015-04-21 20:29:12 -07:00
Pascal Massimino
b83bd7c4ea Merge "populate 'libwebpextras' with: import gray, rgb565 and rgb4444 functions" 2015-04-17 15:30:52 -07:00
James Zern
dbba67d1e7 histogram.h: cosmetics: remove unnecessary includes
Change-Id: Ia8277d3587534c2a1af05d3df57a6973a68be16d
2015-04-17 12:23:06 -07:00
James Zern
e978fec61a Merge "VP8LBitReader: fix remaining ubsan error with large shifts" 2015-04-17 00:30:05 -07:00
James Zern
d6fe588469 Merge "ReconstructRow: move some one-time inits out of the main loop" 2015-04-16 14:51:36 -07:00
Pascal Massimino
a21d647c11 ReconstructRow: move some one-time inits out of the main loop
+ some cosmetics clean-up

Change-Id: Ifb34b914844bb7734137bacd61fcfc4a13971665
2015-04-16 14:31:19 -07:00
Pascal Massimino
7a01c3c3ec VP8LBitReader: fix remaining ubsan error with large shifts
* make VP8LPrefetchBits() safe wrt past-EOS reads
* set 'BitReader::bits_" to a safe shifting value upon EOS

no visible performance difference on x86

Change-Id: I0a4177928cfa81d5dfc9054b36a686eaa1bf8c65
2015-04-16 00:57:42 -07:00
Pascal Massimino
7fa67c9b9e change GetPixPairHash64() return type to uint32_t
Change-Id: Ibb61c1631d7a4bcda5417b5a85864d5e2c3f3858
2015-04-16 00:55:25 -07:00
pascal massimino
ec1fb9f8dd Merge "dsp/enc.c: cosmetics: move DST() def closer to use" 2015-04-16 00:17:37 -07:00
Pascal Massimino
7073bfb3ee Merge "split 64-mult hashing into two 32-bit multiplies" 2015-04-15 23:04:47 -07:00
James Zern
0768b252fa dsp/enc.c: cosmetics: move DST() def closer to use
Change-Id: Iccbcf046412426c2893b71eced517f611d2ffc3f
2015-04-15 20:03:39 -07:00
James Zern
6a48b8f003 Merge "fix MSVC size_t->int conversion warning" 2015-04-15 19:54:18 -07:00
James Zern
1db07cdeef Merge "anim_encode: cosmetics: fix alignment" 2015-04-15 15:32:12 -07:00
James Zern
e28271a394 anim_encode: cosmetics: fix alignment
Change-Id: I0a746421f5cceebbbecfb75d11d11ec5d86a1900
2015-04-15 15:03:17 -07:00
Pascal Massimino
7fe357b8c0 split 64-mult hashing into two 32-bit multiplies
Speed-wise equivalent on x86 and ARM (maybe a tad faster, hard to tell).

Note that the two 32-bit multiples are not strictly equivalent
to the 64-bit one, since we're missing one carry propagation.
In practice, no observable difference was seen because of this
slightly different hashing result.

Change-Id: I8f2381175eae1cb20dabf149e6b27e1768fba6ab
2015-04-15 17:45:19 +02:00
Pascal Massimino
af74c1453b populate 'libwebpextras' with: import gray, rgb565 and rgb4444 functions
update makefile.unix to provide 'make extras' building instructions.

note: input ordering depends on WEBP_SWAP_16BIT_CSP for rgb565 and rgb4444

Change-Id: I6f22d32189d9ba2619146a9714cedabfe28e2ad0
2015-04-15 02:54:44 -07:00
Pascal Massimino
6121413415 remove VP8Residual::cost unused field
Change-Id: Id494475b05c540b40fd104594acbcaa783b88d77
2015-04-15 01:56:31 -07:00
Pascal Massimino
e25448235a fix MSVC size_t->int conversion warning
use size_t for 'total_frames' and compute average with float arith.

Change-Id: Ibf16edb38405b0d525bec38c246cf874668c994e
2015-04-14 23:55:00 -07:00
Urvang Joshi
0ac29c5190 AnimEncoder API: Consistent use of trailing underscores in struct.
Change-Id: Ica361eee0059250a6800c6c43264e3bd5e5aa3e0
2015-04-14 15:44:41 -07:00
Urvang Joshi
d484555024 AnimEncoder API: Use timestamp instead of duration as input to Add().
When converting from video sources, the duration of current frame
is often unavailable until the next frame. So, we internally convert
timestamps to durations.

Change-Id: I20ad86361c22e014be7eb91f00d5d40108281351
2015-04-14 12:00:57 -07:00
James Zern
9904e365a8 dsp/dec_sse2: DC8uv / DC8uvNoLeft speedup
use psadbw to perform top row summation; left remains in C as repacking
it into a vector to apply the same operation is too costly.

DC8uv: ~19% faster
DC8uvNoLeft: ~12% faster

Change-Id: I707c4f6177a65b5d1f2d3deeca87d2bb740185e2
2015-04-08 23:12:53 -07:00