2724 Commits

Author SHA1 Message Date
James Zern
559e54ca60 Merge "SSE2: slightly faster FTransformWHT" 2015-07-02 06:36:33 +00:00
Pascal Massimino
8ef9a63b45 SSE2: slightly faster FTransformWHT
goes from 0.3% to 0.1% overall CPU time, but...

Change-Id: I4c9a92b1e1d6b58ed57c6b890366f1dbeaf84f84
2015-07-01 23:03:17 -07:00
James Zern
f27f773576 lossless_neon: enable VP8LAddGreenToBlueAndRed
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~70% faster at the
function level and 1-2% faster overall

Change-Id: I59fb4918ec86b1ac3a47cbd5d05ce62f007461cb
2015-07-01 22:50:54 -07:00
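
For readers unfamiliar with the function named in the commit above, here is a minimal scalar sketch of what "add green to blue and red" means for one packed ARGB pixel. The function name and layout below are illustrative assumptions, not the NEON code the commit enables: green is added back into the red and blue channels, each wrapping modulo 256, while alpha and green pass through.

```c
#include <stdint.h>

/* Hypothetical scalar helper (name chosen for illustration).
 * Pixel layout assumed: 0xAARRGGBB. */
static uint32_t add_green_to_blue_and_red(uint32_t argb) {
  const uint32_t green = (argb >> 8) & 0xffu;
  uint32_t red_blue = argb & 0x00ff00ffu;   /* isolate red and blue */
  red_blue += (green << 16) | green;        /* add green to both at once */
  red_blue &= 0x00ff00ffu;                  /* drop carries: wrap mod 256 */
  return (argb & 0xff00ff00u) | red_blue;   /* keep alpha and green */
}
```

Handling red and blue in a single 32-bit add is the same idea a SIMD version applies across several pixels per instruction.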
Pascal Massimino
36e9c4bc50 SSE2: minor cosmetics on in-loop filter code
Change-Id: Ic0e6502081d7063bb2841df74e05c450d708aaf2
2015-06-28 11:59:22 +02:00
James Zern
4741fac42e dsp/lossless_*sse2: remove some unnecessary inlines
TransformColor / TransformColorInverse are the top-level function
pointer calls

Change-Id: Ieabdb4005ff3e4f9bb3ebcb140ccb6bef5d28f8b
2015-06-25 21:02:01 -07:00
Pascal Massimino
1819965e0a fix warning ("left shift of negative value") using a cast
Change-Id: Ie99e8ff87924a1d15e2c5d83bd9adf07dab04e94
2015-06-24 23:46:09 -07:00
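
The commit above does not show where the cast was applied, so the following is only a sketch of the general pattern. In C, left-shifting a negative signed value is undefined behavior, which is what the warning flags; shifting the unsigned representation and converting back avoids it. The helper name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: left-shift a possibly negative value.
 * The shift itself is performed on an unsigned type, where it is well
 * defined. Converting an out-of-range result back to int32_t is
 * implementation-defined; mainstream compilers wrap (two's complement). */
static int32_t shift_left_signed(int32_t v, int bits) {
  return (int32_t)((uint32_t)v << bits);
}
```

For example, shift_left_signed(-3, 2) yields -12 on a two's complement target, without triggering the warning.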
Pascal Massimino
7017001462 SSE2: speed-up some lossless-encoding functions
optimized: CollectColorRedTransforms, CollectColorBlueTransforms, SubtractGreenFromBlueAndRed

overall effect is sub-1% speed-up, though.

Change-Id: I9cb49af5c56e4c03db417929b0a2cf575d60a5c6
2015-06-24 20:09:13 -07:00
Pascal Massimino
abcb012841 Merge "SSE2: slightly faster (~5%) AddGreenToBlueAndRed()" 2015-06-24 09:37:46 +00:00
Pascal Massimino
2df5bd30a6 Merge "Speedup to HuffmanCostCombinedCount" 2015-06-24 07:42:26 +00:00
Pascal Massimino
9e356d6b25 SSE2: slightly faster (~5%) AddGreenToBlueAndRed()
Change-Id: Ie147010b66544c4e959f26966ad588394302d418
2015-06-24 09:36:44 +02:00
Pascal Massimino
fc6c75a2a2 SSE2: 53% faster TransformColor[Inverse]
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.

Removed the SSE4.1 implementation which is now slower than SSE2.

Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
2015-06-23 14:52:01 -07:00
Pascal Massimino
49073da6d6 SSE2: 46% speed-up of TransformColor[Inverse]
Change-Id: If3bf26dc8ed32a7c03cb438e5d5fc996e2e96b5e
2015-06-23 20:09:04 +02:00
Pascal Massimino
32462a072c Speedup to HuffmanCostCombinedCount
~3% speedup for lossless encoding
Improves compression ratio by ~0.03%

Change-Id: Ic6d05fb0b1099b5ca56689b92b1c6515d54a5d6b
2015-06-23 16:41:03 +02:00
Pascal Massimino
f3d687e3fa SSE4.1 implementation of some lossless encoding functions
New implementations: SubtractGreenFromBlueAndRed and TransformColor

around 1-2% faster lossless encoding.

Change-Id: I1668e36fdc316ba55b3b798b91b4a3e36ce62861
2015-06-23 08:46:57 +02:00
Pascal Massimino
bfc300c7ff SSE4.1 implementation of some alpha-processing functions
DispatchAlpha* functions are hard to speed up, compared to SSE2.
ExtractAlpha sees a ~15% speed-up though.

Change-Id: I8715c2defecbc832f469eed7e6ffd012146b52de
2015-06-19 14:17:39 -07:00
Pascal Massimino
7f9c98f21d Merge "sse2 in-loop: simplify SignedShift8b() a bit" 2015-06-12 07:37:32 +00:00
James Zern
ef314a5d6c dec_sse2/GetNotHEV: micro optimization
trade 2 subtractions + logical or for 1 max + 1 subtraction

Change-Id: I7d1f25f7cda2a89bc8247f3d3d5417f6b0e3d96c
2015-06-11 22:46:24 -07:00
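
A scalar sketch of the trade described in the commit above, under these assumptions: d0 and d1 stand for the two absolute pixel differences compared against the high-edge-variance threshold, and the SIMD code uses unsigned saturating subtraction. All names are hypothetical. Both forms return non-zero when neither difference exceeds the threshold.

```c
#include <stdint.h>

/* Unsigned saturating subtraction, as the SIMD instruction would do per byte. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b) {
  return (uint8_t)(a > b ? a - b : 0);
}

static uint8_t max_u8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Older form: two saturating subtractions plus a logical or. */
static int not_hev_two_subs(uint8_t d0, uint8_t d1, uint8_t thresh) {
  return (sat_sub_u8(d0, thresh) | sat_sub_u8(d1, thresh)) == 0;
}

/* Newer form: one max plus one saturating subtraction. */
static int not_hev_max(uint8_t d0, uint8_t d1, uint8_t thresh) {
  return sat_sub_u8(max_u8(d0, d1), thresh) == 0;
}

/* Exhaustively compares both forms over all 8-bit inputs. */
static int not_hev_forms_agree(void) {
  for (int d0 = 0; d0 < 256; ++d0)
    for (int d1 = 0; d1 < 256; ++d1)
      for (int t = 0; t < 256; ++t)
        if (not_hev_two_subs((uint8_t)d0, (uint8_t)d1, (uint8_t)t) !=
            not_hev_max((uint8_t)d0, (uint8_t)d1, (uint8_t)t))
          return 0;
  return 1;
}
```

The equivalence holds because both differences are at most the threshold exactly when their maximum is.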
Pascal Massimino
a729cff987 sse2 in-loop: simplify SignedShift8b() a bit
Change-Id: Ida3e096bb41451194d03dc7a97753a222ff0135c
2015-06-11 15:26:31 -07:00
Pascal Massimino
422ec9fb62 simplify Load8x4() a bit
Change-Id: I68cf09c432f48e34bbe1d47dd091417cfd40cf4e
2015-06-10 12:35:50 -07:00
James Zern
8df238ec8a Merge "remove some duplicate FlipSign()" 2015-06-06 05:25:04 +00:00
Pascal Massimino
751506c484 remove some duplicate FlipSign()
ApplyFilter2NoFlip is the new variant of ApplyFilter2 without the sign-flip

Change-Id: I2af54bd1499118c8321183e42251d265ba76219c
2015-06-05 17:20:29 +02:00
James Zern
65ef5afc27 Merge "lossless: 0.13% compression density gain" 2015-06-03 03:02:09 +00:00
Jyrki Alakuijala
2beef2f245 lossless: 0.13% compression density gain
over a 1000 image corpus

Single photograph benchmark:
Before:
Q=20: 2.560 MP/s
Q=40: 2.593 MP/s
Q=60: 1.795 MP/s
Q=80: 1.603 MP/s
Q=99: 1.122 MP/s

After:
Q=20: 3.334 MP/s
Q=40: 2.464 MP/s
Q=60: 2.009 MP/s
Q=80: 1.871 MP/s
Q=99: 1.163 MP/s

This CL allows for some further improvements that would not be possible
otherwise.

Change-Id: I61ba154beca2266cb96469281cf96e84a4412586
2015-06-02 17:27:36 -07:00
Pascal Massimino
3033f24c26 lossless: 0.06 % compression density improvement
Change-Id: Ib662e6aec53b40d6bc736d3ecfd6475bb005c790
2015-06-02 14:51:51 +02:00
James Zern
64960da9e1 dec_neon: add VE8uv / VE16
VE8uv/VE16: ~25%/~33% faster over 20M pixels

Change-Id: Ifac1114091527a05ed10edfcc43852edff012d14
2015-05-30 13:40:00 -07:00
James Zern
14dbd87bed dec_neon: add HE8uv / HE16
HE8uv/HE16: ~91%/~83% faster over 20M pixels

Change-Id: Ib0a776f7c193593ea0993e92cfa6e6be000fb810
2015-05-30 13:39:24 -07:00
skal
ac76801159 introduce FTransform2 to perform two transforms at a time.
FTransform goes from ~12.0% to 11.5% total CPU time.

Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
aa6065aedd dec_neon: use vld1_dup(mem) rather than vdup(mem[0])
should result in slightly less general purpose register use

Change-Id: I6069f49541392e56c8db2c28c8d1fdf88c1a1726
2015-05-16 11:24:32 -07:00
Pascal Massimino
8b63ac78e0 Merge "dec_neon: add TM16" 2015-05-16 10:56:07 +00:00
Pascal Massimino
f51be09e1f Merge "dec_neon/TrueMotion: simply left border load" 2015-05-16 10:54:05 +00:00
James Zern
dc48196bd9 dec_neon: add TM16
over 20M pixels ~78% faster

Change-Id: I420d5d590f275f19e08f86df1d1caa6b82fffbde
2015-05-15 12:50:11 -07:00
James Zern
ea95b305ca dec_neon/TrueMotion: simply left border load
use vld1_dup_u8() rather than a separate ld+dup after the values were
zero extended; mildly faster at the function level

Change-Id: I1b3666a6aeb465722a1214dbc6d71c27689a7f89
2015-05-15 12:48:13 -07:00
Pascal Massimino
f262d6120e speed-up SetResidualSSE2
(was unnecessarily complicated)

Before:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 475 ms.

After:
VP8SetResidualCoeffs: checksum = 1127918   elapsed = 404 ms.

Change-Id: Ia54bef86c45f9f474622ff16e594bf1da4f67ebd
2015-05-14 21:24:24 -07:00
James Zern
bf46d0acff fix mips2 build target
tested with mips1 and mips2; this should cover 3/4 as well.
fixes an ftbfs reported on the debian issue tracker:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785000

Change-Id: I2458487c92bd638589fdfec5adb4f22102a5960c
2015-05-13 10:36:22 -07:00
James Zern
929a0fdccd enc_sse2/TTransform: simplify abs calculation
max(b, 0 - b) works as well as (b ^ sign) - sign

Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
James Zern
17dbd05819 enc_sse2/CollectHistogram: simplify abs calculation
max(out, 0 - out) works as well as (out ^ sign) - sign

Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
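
The two abs-calculation commits above rely on the same identity. The sketch below checks it in scalar C, with the sign-mask idiom written as (b ^ sign) - sign, where sign is -1 for negative inputs and 0 otherwise. Function names are hypothetical, and the range excludes -32768, whose negation does not fit in 16 bits.

```c
#include <stdint.h>

/* Sign-mask form: xor with the sign mask, then subtract the mask. */
static int abs_sign_mask(int b) {
  const int sign = (b < 0) ? -1 : 0;
  return (b ^ sign) - sign;
}

/* Max form: the larger of b and its negation. */
static int abs_max(int b) {
  const int neg = 0 - b;
  return (b > neg) ? b : neg;
}

/* Compares both forms over the 16-bit range, excluding -32768. */
static int abs_forms_agree(void) {
  for (int b = -32767; b <= 32767; ++b) {
    if (abs_sign_mask(b) != abs_max(b)) return 0;
  }
  return 1;
}
```

In SIMD code the max form needs no separate step to compute the sign mask, which is presumably the simplification the commits refer to.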
James Zern
a6c1593645 dec_neon: add DC16 intra predictors
improvement over 20M pixels:
DC16: ~77%
DC16NoTop: ~78%
DC16NoLeft: ~83%
DC16NoTopLeft: ~83%

Change-Id: I4c4ee16a8fa0eb466eee45dfa6f6bbce5ce64b99
2015-05-08 00:12:48 -07:00
Urvang Joshi
03b4f50d39 Makefile.vc: add anim_diff build support.
Change-Id: Ib5efc5cffea2d906640c81348db26ae28d28d3f1
2015-05-07 12:00:47 -07:00
Pascal Massimino
1b989874a7 Merge changes I9cd84125,Iee7e387f,I7548be72
* changes:
  dsp/enc_sse2: add luma4 intra predictors
  dsp/enc_sse2: add chroma intra predictors
  dsp/enc_sse2: add luma16 intra predictors
2015-05-07 11:19:12 +00:00
Urvang Joshi
acd7b5af0f Introduce a test tool anim_diff.
It can be used to test if given pair of animated images (GIF and/or
WebP) are identical in terms of pixel match and other animation
properties.

Change-Id: I84adea145e9d062be6ad06a0d4fcdc9658cf52d4
2015-05-06 17:17:03 -07:00
James Zern
f274a96ce9 dsp/enc_sse2: add luma4 intra predictors
VP8EncPredLuma4 improvement over ~20M pixels: ~39%

Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6 dsp/enc_sse2: add chroma intra predictors
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1 dsp/enc_sse2: add luma16 intra predictors
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%

based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary

Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
9e00a499a6 makefile.unix: remove superclean target
this target is out of date and there are better ways to make a clean
tree (the first 2 versioned, the last one not):
make distclean (when using autoconf)
git clean -fdx
git archive

Change-Id: I766b75e0adf566c6f7db1a087ff486020b031b3a
2015-05-02 11:31:40 -07:00
James Zern
cefc9c0964 makefile.unix: clean up after extras target
Change-Id: I3e2d259473db9f3649d18120513f8edcba64c5e6
2015-05-02 11:31:40 -07:00
James Zern
4c9af02326 dec_neon: add DC8uvNoTopLeft
~93% faster

Change-Id: Icf0fd5f85ac53c306a1b69d84275023e5b24a602
2015-05-01 20:03:57 -07:00
James Zern
dd55b8734a Merge "doc/webp-container-spec: update repo browser link" 2015-04-30 07:20:58 +00:00
James Zern
f0486968ba doc/webp-container-spec: update repo browser link
gerrit.chromium.org is deprecated, use chromium.googlesource.com.

Change-Id: Iaa6d6d18798dbd8cce908988287387f5cb8e8e64
2015-04-29 23:31:34 -07:00
Pascal Massimino
9287761d95 Merge "GetResidualCostSSE2: simplify abs calculation" 2015-04-30 06:30:58 +00:00
James Zern
0e009366f8 dsp/cpu.c(x86): check maximum supported cpuid feature
structured extended feature flags require eax = 7; avoids incorrectly
detecting avx2 on some older processors that support avx.
for completeness also check for value=1 support used by the other
checks.

from [1]:
INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor
Information and the Vendor Identification String

[1]
http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html

Change-Id: I60b20d661a978d551614dbf7acdc25db19cb6046
2015-04-29 23:22:53 -07:00
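
The check described in the commit above can be sketched as pure logic, leaving out the actual CPUID invocation. The caller is assumed to pass the highest basic leaf (EAX after CPUID with EAX = 0) and the EBX value from leaf 7, sub-leaf 0. The function and macro names are hypothetical; the AVX2 flag is bit 5 of that EBX value. A complete detection routine would also confirm operating system support for the wider registers, which is not shown here.

```c
#include <stdint.h>

#define CPUID_LEAF_EXT_FEATURES 7u       /* structured extended feature flags */
#define CPUID_LEAF7_EBX_AVX2 (1u << 5)   /* AVX2 bit in EBX of leaf 7 */

/* Returns non-zero only if leaf 7 is supported and its AVX2 bit is set.
 * When max_basic_leaf < 7, leaf7_ebx is not meaningful and is ignored. */
static int leaf7_reports_avx2(uint32_t max_basic_leaf, uint32_t leaf7_ebx) {
  if (max_basic_leaf < CPUID_LEAF_EXT_FEATURES) return 0;
  return (leaf7_ebx & CPUID_LEAF7_EBX_AVX2) != 0;
}
```

Gating on the maximum leaf first is what prevents register contents from an unsupported query being misread as feature flags on older processors.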