James Zern
a53c336919
lossless_neon: add VP8LTransformColorInverse
...
based on SSE2, only ~11% faster
Change-Id: I45434639d81e153f01f77c1f5d2da510b542170e
2015-08-04 23:22:36 -07:00
James Zern
99131e7f8c
Merge changes I9fb25a89,Ibc648e9e
...
* changes:
lossless_neon: remove predictors 5-13
ll_enc_neon: enable VP8LSubtractGreenFromBlueAndRed
2015-08-04 02:24:15 +00:00
Pascal Massimino
c455676680
simplify the main loop for downscaling
...
(part of bug #254 investigation)
no speed change observed.
Change-Id: Ie21b33171def367f37643fef6a0bd378e49468c7
2015-08-03 16:57:35 +02:00
James Zern
2a010f992a
lossless_neon: remove predictors 5-13
...
operating on single uint32's isn't helped by NEON.
this improves aarch64 performance by ~4%
Change-Id: I9fb25a8962de7b80e893e756ee7c76393cfd40c7
2015-07-28 19:44:58 -07:00
James Zern
ca221bbc48
ll_enc_neon: enable VP8LSubtractGreenFromBlueAndRed
...
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~54% faster at the
function level and mildly faster overall
Change-Id: Ibc648e9ee35021d48901e05aa596aa01067796a2
2015-07-28 19:44:45 -07:00
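For readers unfamiliar with the transform named in the two commits above, the following is a minimal scalar sketch of subtracting green from red and blue, together with its inverse. It assumes pixels packed as 0xAARRGGBB in a `uint32_t`; the function names are hypothetical and this is not the libwebp source, only an illustration of the arithmetic that a NEON or SSE2 version would vectorize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference (assumed 0xAARRGGBB layout): subtract the
 * green channel from red and blue, modulo 256. Alpha and green are kept. */
static void SubtractGreenSketch(uint32_t* argb, size_t num_pixels) {
  assert(argb != NULL || num_pixels == 0);
  for (size_t i = 0; i < num_pixels; ++i) {
    const uint32_t p = argb[i];
    const uint32_t green = (p >> 8) & 0xffu;
    const uint32_t red = (((p >> 16) & 0xffu) - green) & 0xffu;
    const uint32_t blue = ((p & 0xffu) - green) & 0xffu;
    argb[i] = (p & 0xff00ff00u) | (red << 16) | blue;
  }
}

/* Inverse: add green back. Red and blue live in separate byte lanes, so a
 * single 32-bit add updates both; the final mask discards the carries. */
static void AddGreenSketch(uint32_t* argb, size_t num_pixels) {
  assert(argb != NULL || num_pixels == 0);
  for (size_t i = 0; i < num_pixels; ++i) {
    const uint32_t p = argb[i];
    const uint32_t green = (p >> 8) & 0xffu;
    uint32_t red_blue = p & 0x00ff00ffu;
    red_blue += (green << 16) | green;
    red_blue &= 0x00ff00ffu;
    argb[i] = (p & 0xff00ff00u) | red_blue;
  }
}
```

Applying `SubtractGreenSketch` and then `AddGreenSketch` to the same buffer should restore the original pixels, since both steps work modulo 256 per channel.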
Jyrki Alakuijala
85b44d8a69
lossless: encoding, don't compute unnecessary histo
...
share the computation between different modes
3-5 % speedup for lossless alpha
1 % for lossy alpha
no change in compression density
Change-Id: I5e31413b3efcd4319121587da8320ac4f14550b2
2015-07-07 20:24:26 -07:00
Pascal Massimino
0ae2c2e4b2
SSE2/SSE41: optimize SSE_16xN loops
...
After several trials at re-organizing the main loop and accumulation scheme,
this is apparently the fastest variant.
removed the SSE41 version, which is no longer faster now.
For some reason, the AVX variant seems to benefit most from the change.
Change-Id: Ib11ee18dbb69596cee1a3a289af8e2b4253de7b5
2015-07-02 20:55:04 +02:00
James Zern
39216e59d9
cosmetics: fix indent after 32462a07
...
Change-Id: If9a5d91c25e981bc4cd81adb476244e63fc7c3c8
2015-07-01 23:49:20 -07:00
James Zern
559e54ca60
Merge "SSE2: slightly faster FTransformWHT"
2015-07-02 06:36:33 +00:00
Pascal Massimino
8ef9a63b45
SSE2: slightly faster FTransformWHT
...
goes from 0.3% to 0.1% overall CPU time, but...
Change-Id: I4c9a92b1e1d6b58ed57c6b890366f1dbeaf84f84
2015-07-01 23:03:17 -07:00
James Zern
f27f773576
lossless_neon: enable VP8LAddGreenToBlueAndRed
...
this moves the function outside the WEBP_USE_INTRINSICS check.
there's no alternative version and it's ~70% faster at the
function level and 1-2% faster overall
Change-Id: I59fb4918ec86b1ac3a47cbd5d05ce62f007461cb
2015-07-01 22:50:54 -07:00
Pascal Massimino
36e9c4bc50
SSE2: minor cosmetics on in-loop filter code
...
Change-Id: Ic0e6502081d7063bb2841df74e05c450d708aaf2
2015-06-28 11:59:22 +02:00
James Zern
4741fac42e
dsp/lossless_*sse2: remove some unnecessary inlines
...
TransformColor / TransformColorInverse are the top-level function
pointer calls
Change-Id: Ieabdb4005ff3e4f9bb3ebcb140ccb6bef5d28f8b
2015-06-25 21:02:01 -07:00
Pascal Massimino
1819965e0a
fix warning ("left shift of negative value") using a cast
...
Change-Id: Ie99e8ff87924a1d15e2c5d83bd9adf07dab04e94
2015-06-24 23:46:09 -07:00
Pascal Massimino
7017001462
SSE2: speed-up some lossless-encoding functions
...
optimized: CollectColorRedTransforms, CollectColorBlueTransforms, SubtractGreenFromBlueAndRed
overall effect is sub-1% speed-up, though.
Change-Id: I9cb49af5c56e4c03db417929b0a2cf575d60a5c6
2015-06-24 20:09:13 -07:00
Pascal Massimino
abcb012841
Merge "SSE2: slightly faster (~5%) AddGreenToBlueAndRed()"
2015-06-24 09:37:46 +00:00
Pascal Massimino
2df5bd30a6
Merge "Speedup to HuffmanCostCombinedCount"
2015-06-24 07:42:26 +00:00
Pascal Massimino
9e356d6b25
SSE2: slightly faster (~5%) AddGreenToBlueAndRed()
...
Change-Id: Ie147010b66544c4e959f26966ad588394302d418
2015-06-24 09:36:44 +02:00
Pascal Massimino
fc6c75a2a2
SSE2: 53% faster TransformColor[Inverse]
...
Changed the code (again) to process 4 pixels at a time. Loop is more
involved, but overall it's faster.
Removed the SSE4.1 implementation which is now slower than SSE2.
Change-Id: I7734e371033ad8929ace7f7e1373ba930d9bb5f1
2015-06-23 14:52:01 -07:00
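As background for the TransformColor[Inverse] commits, here is a scalar sketch of the lossless color transform in the form described by the WebP lossless format: each multiplier and channel is treated as a signed 8-bit value, and the product is scaled down by 32. The struct and function names are hypothetical, the 0xAARRGGBB layout is an assumption, and the code is a one-pixel-at-a-time illustration rather than the 4-pixel SSE2 loop the commit describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three transform multipliers. */
typedef struct {
  uint8_t green_to_red;
  uint8_t green_to_blue;
  uint8_t red_to_blue;
} ColorMultipliersSketch;

/* Portable sign extension of a byte to the range [-128, 127]. */
static int SignExtend8(uint8_t v) { return (int)(v ^ 0x80u) - 0x80; }

/* (multiplier * color) >> 5 on signed 8-bit inputs. Assumes >> on a negative
 * int is an arithmetic shift, which ISO C leaves implementation-defined. */
static int ColorDeltaSketch(uint8_t multiplier, uint8_t color) {
  return (SignExtend8(multiplier) * SignExtend8(color)) >> 5;
}

/* Forward transform: remove the green/red predictions from red and blue. */
static void TransformColorSketch(const ColorMultipliersSketch* m,
                                 uint32_t* argb, size_t num_pixels) {
  assert(m != NULL);
  for (size_t i = 0; i < num_pixels; ++i) {
    const uint32_t p = argb[i];
    const uint8_t green = (uint8_t)(p >> 8);
    const uint8_t red = (uint8_t)(p >> 16);
    uint32_t new_red = red;
    uint32_t new_blue = p & 0xffu;
    new_red -= (uint32_t)ColorDeltaSketch(m->green_to_red, green);
    new_blue -= (uint32_t)ColorDeltaSketch(m->green_to_blue, green);
    new_blue -= (uint32_t)ColorDeltaSketch(m->red_to_blue, red);
    argb[i] = (p & 0xff00ff00u) | ((new_red & 0xffu) << 16) | (new_blue & 0xffu);
  }
}

/* Inverse transform: red must be restored first, because the blue
 * prediction depends on the original (restored) red value. */
static void TransformColorInverseSketch(const ColorMultipliersSketch* m,
                                        uint32_t* argb, size_t num_pixels) {
  assert(m != NULL);
  for (size_t i = 0; i < num_pixels; ++i) {
    const uint32_t p = argb[i];
    const uint8_t green = (uint8_t)(p >> 8);
    uint32_t new_red = (p >> 16) & 0xffu;
    uint32_t new_blue = p & 0xffu;
    new_red += (uint32_t)ColorDeltaSketch(m->green_to_red, green);
    new_red &= 0xffu;
    new_blue += (uint32_t)ColorDeltaSketch(m->green_to_blue, green);
    new_blue += (uint32_t)ColorDeltaSketch(m->red_to_blue, (uint8_t)new_red);
    argb[i] = (p & 0xff00ff00u) | (new_red << 16) | (new_blue & 0xffu);
  }
}
```

The unsigned arithmetic is deliberate: it keeps every intermediate result well-defined modulo 2^32, so the final `& 0xff` gives the per-channel result modulo 256 without relying on signed overflow.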
Pascal Massimino
49073da6d6
SSE2: 46% speed-up of TransformColor[Inverse]
...
Change-Id: If3bf26dc8ed32a7c03cb438e5d5fc996e2e96b5e
2015-06-23 20:09:04 +02:00
Pascal Massimino
32462a072c
Speedup to HuffmanCostCombinedCount
...
~3% speedup for lossless encoding
Improves compression ratio by ~0.03%
Change-Id: Ic6d05fb0b1099b5ca56689b92b1c6515d54a5d6b
2015-06-23 16:41:03 +02:00
Pascal Massimino
f3d687e3fa
SSE4.1 implementation of some lossless encoding functions
...
New implementations: SubtractGreenFromBlueAndRed and TransformColor
around 1-2% faster lossless encoding.
Change-Id: I1668e36fdc316ba55b3b798b91b4a3e36ce62861
2015-06-23 08:46:57 +02:00
Pascal Massimino
bfc300c7ff
SSE4.1 implementation of some alpha-processing functions
...
DispatchAlpha* functions are hard to speed up, compared to SSE2.
ExtractAlpha sees a ~15% speed-up though.
Change-Id: I8715c2defecbc832f469eed7e6ffd012146b52de
2015-06-19 14:17:39 -07:00
Pascal Massimino
7f9c98f21d
Merge "sse2 in-loop: simplify SignedShift8b() a bit"
2015-06-12 07:37:32 +00:00
James Zern
ef314a5d6c
dec_sse2/GetNotHEV: micro optimization
...
trade 2 subtractions + logical or for 1 max + 1 subtraction
Change-Id: I7d1f25f7cda2a89bc8247f3d3d5417f6b0e3d96c
2015-06-11 22:46:24 -07:00
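The trade described in the GetNotHEV commit can be illustrated in scalar form. This sketch assumes the test in question is "does either of two absolute differences exceed a threshold", evaluated with unsigned saturating byte subtraction (the behavior of SSE2's `_mm_subs_epu8`); that reading of the commit, and all helper names, are assumptions rather than the actual dec_sse2 code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for an unsigned saturating byte subtract. */
static uint8_t SubSatU8(uint8_t a, uint8_t b) {
  return (uint8_t)(a > b ? a - b : 0);
}

static uint8_t MaxU8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Variant 1: two saturating subtractions combined with a bitwise OR.
 * The result is nonzero when either difference exceeds the threshold. */
static int ExceedsTwoSubs(uint8_t d_p, uint8_t d_q, uint8_t thresh) {
  return (SubSatU8(d_p, thresh) | SubSatU8(d_q, thresh)) != 0;
}

/* Variant 2: one max followed by a single saturating subtraction. */
static int ExceedsMaxSub(uint8_t d_p, uint8_t d_q, uint8_t thresh) {
  return SubSatU8(MaxU8(d_p, d_q), thresh) != 0;
}

/* Exhaustive check over all byte inputs that the two variants agree. */
static void CheckHevVariantsAgree(void) {
  for (int p = 0; p < 256; ++p) {
    for (int q = 0; q < 256; ++q) {
      for (int t = 0; t < 256; ++t) {
        assert(ExceedsTwoSubs((uint8_t)p, (uint8_t)q, (uint8_t)t) ==
               ExceedsMaxSub((uint8_t)p, (uint8_t)q, (uint8_t)t));
      }
    }
  }
}
```

The two variants agree because the larger of the two differences exceeds the threshold exactly when at least one of them does.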
Pascal Massimino
a729cff987
sse2 in-loop: simplify SignedShift8b() a bit
...
Change-Id: Ida3e096bb41451194d03dc7a97753a222ff0135c
2015-06-11 15:26:31 -07:00
Pascal Massimino
422ec9fb62
simplify Load8x4() a bit
...
Change-Id: I68cf09c432f48e34bbe1d47dd091417cfd40cf4e
2015-06-10 12:35:50 -07:00
James Zern
8df238ec8a
Merge "remove some duplicate FlipSign()"
2015-06-06 05:25:04 +00:00
Pascal Massimino
751506c484
remove some duplicate FlipSign()
...
ApplyFilter2NoFlip is the new variant of ApplyFilter2 without the sign-flip
Change-Id: I2af54bd1499118c8321183e42251d265ba76219c
2015-06-05 17:20:29 +02:00
James Zern
65ef5afc27
Merge "lossless: 0.13% compression density gain"
2015-06-03 03:02:09 +00:00
Jyrki Alakuijala
2beef2f245
lossless: 0.13% compression density gain
...
over a 1000 image corpus
Single photograph benchmark:
Before:
Q=20: 2.560 MP/s
Q=40: 2.593 MP/s
Q=60: 1.795 MP/s
Q=80: 1.603 MP/s
Q=99: 1.122 MP/s
After:
Q=20: 3.334 MP/s
Q=40: 2.464 MP/s
Q=60: 2.009 MP/s
Q=80: 1.871 MP/s
Q=99: 1.163 MP/s
This CL allows for some further improvements that would not be possible
otherwise.
Change-Id: I61ba154beca2266cb96469281cf96e84a4412586
2015-06-02 17:27:36 -07:00
Pascal Massimino
3033f24c26
lossless: 0.06 % compression density improvement
...
Change-Id: Ib662e6aec53b40d6bc736d3ecfd6475bb005c790
2015-06-02 14:51:51 +02:00
James Zern
64960da9e1
dec_neon: add VE8uv / VE16
...
VE8uv/VE16: ~25%/~33% faster over 20M pixels
Change-Id: Ifac1114091527a05ed10edfcc43852edff012d14
2015-05-30 13:40:00 -07:00
James Zern
14dbd87bed
dec_neon: add HE8uv / HE16
...
HE8uv/HE16: ~91%/~83% faster over 20M pixels
Change-Id: Ib0a776f7c193593ea0993e92cfa6e6be000fb810
2015-05-30 13:39:24 -07:00
skal
ac76801159
introduce FTransform2 to perform two transforms at a time.
...
FTransform goes from ~12.0% to 11.5% total CPU time.
Change-Id: Ibcb23155324f4fd8b235563f80668531c781f624
2015-05-18 21:06:15 -07:00
James Zern
aa6065aedd
dec_neon: use vld1_dup(mem) rather than vdup(mem[0])
...
should result in slightly less general purpose register use
Change-Id: I6069f49541392e56c8db2c28c8d1fdf88c1a1726
2015-05-16 11:24:32 -07:00
Pascal Massimino
8b63ac78e0
Merge "dec_neon: add TM16"
2015-05-16 10:56:07 +00:00
Pascal Massimino
f51be09e1f
Merge "dec_neon/TrueMotion: simplify left border load"
2015-05-16 10:54:05 +00:00
James Zern
dc48196bd9
dec_neon: add TM16
...
over 20M pixels ~78% faster
Change-Id: I420d5d590f275f19e08f86df1d1caa6b82fffbde
2015-05-15 12:50:11 -07:00
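For context on what TM16 computes, the following scalar sketch shows VP8's TrueMotion prediction for a 16x16 block: each predicted pixel is the left neighbor plus the top neighbor minus the top-left corner, clipped to [0, 255]. The explicit `top`, `left` and `top_left` parameters and the function names are hypothetical conveniences for illustration and do not reflect how the decoder actually addresses its borders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t Clip8Sketch(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical signature: dst is a 16x16 block with the given row stride,
 * top holds the 16 pixels above it, left the 16 pixels to its left, and
 * top_left the corner pixel. */
static void TrueMotion16Sketch(uint8_t* dst, int stride, const uint8_t* top,
                               const uint8_t* left, uint8_t top_left) {
  assert(dst != NULL && top != NULL && left != NULL && stride >= 16);
  for (int y = 0; y < 16; ++y) {
    for (int x = 0; x < 16; ++x) {
      dst[y * stride + x] = Clip8Sketch(left[y] + top[x] - top_left);
    }
  }
}
```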
James Zern
ea95b305ca
dec_neon/TrueMotion: simplify left border load
...
use vld1_dup_u8() rather than a separate ld+dup after the values were
zero extended; mildly faster at the function level
Change-Id: I1b3666a6aeb465722a1214dbc6d71c27689a7f89
2015-05-15 12:48:13 -07:00
Pascal Massimino
f262d6120e
speed-up SetResidualSSE2
...
(was unnecessarily complicated)
Before:
VP8SetResidualCoeffs: checksum = 1127918 elapsed = 475 ms.
After:
VP8SetResidualCoeffs: checksum = 1127918 elapsed = 404 ms.
Change-Id: Ia54bef86c45f9f474622ff16e594bf1da4f67ebd
2015-05-14 21:24:24 -07:00
James Zern
bf46d0acff
fix mips2 build target
...
tested with mips1 and mips2; this should cover 3/4 as well.
fixes an ftbfs reported on the debian issue tracker:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785000
Change-Id: I2458487c92bd638589fdfec5adb4f22102a5960c
2015-05-13 10:36:22 -07:00
James Zern
929a0fdccd
enc_sse2/TTransform: simplify abs calculation
...
max(b, 0 - b) works as well as (b ^ sign) - sign
Change-Id: Iad923236fd70db85ff58a64d3c8e25e4f42a525d
2015-05-08 19:50:29 -07:00
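The two absolute-value formulations mentioned in this commit and the CollectHistogram one that follows can be compared in scalar C. This is a sketch on `int16_t` values, mirroring the 16-bit lanes an SSE2 version would use; the sign-mask form is written here as `(b ^ sign) - sign`, where `sign` is all ones for negative inputs and zero otherwise. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute value as max(b, 0 - b): one subtraction and one max. */
static int16_t AbsMaxSketch(int16_t b) {
  assert(b != INT16_MIN); /* -INT16_MIN is not representable in 16 bits */
  const int16_t neg = (int16_t)(0 - b);
  return b > neg ? b : neg;
}

/* Absolute value via a sign mask: the XOR flips all bits of a negative
 * input, and subtracting the mask (-1) then adds the missing one. */
static int16_t AbsSignSketch(int16_t b) {
  assert(b != INT16_MIN);
  const int16_t sign = (int16_t)(b < 0 ? -1 : 0);
  return (int16_t)((b ^ sign) - sign);
}
```

The max form needs no separate sign-mask computation, which is presumably where the simplification comes from.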
James Zern
17dbd05819
enc_sse2/CollectHistogram: simplify abs calculation
...
max(out, 0 - out) works as well as (out ^ sign) - sign
Change-Id: Id820ab9b296512cb0d56c8026b986bf98e3d3909
2015-05-08 19:49:08 -07:00
James Zern
a6c1593645
dec_neon: add DC16 intra predictors
...
improvement over 20M pixels:
DC16: ~77%
DC16NoTop: ~78%
DC16NoLeft: ~83%
DC16NoTopLeft: ~83%
Change-Id: I4c4ee16a8fa0eb466eee45dfa6f6bbce5ce64b99
2015-05-08 00:12:48 -07:00
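The four DC16 variants listed above differ only in which borders are available. A scalar sketch of the idea follows: average whichever of the top and left borders exist, with rounding, and fall back to mid-gray (0x80) when neither does. The single function with nullable `top`/`left` pointers is a hypothetical consolidation for illustration; the decoder exposes these as separate predictors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a 16x16 block with one value. */
static void Fill16x16Sketch(uint8_t* dst, int stride, uint8_t v) {
  for (int y = 0; y < 16; ++y) memset(dst + y * stride, v, 16);
}

/* Hypothetical combined DC predictor: top and/or left may be NULL to model
 * the NoTop, NoLeft and NoTopLeft cases. */
static void DC16Sketch(uint8_t* dst, int stride, const uint8_t* top,
                       const uint8_t* left) {
  assert(dst != NULL && stride >= 16);
  int sum = 0;
  uint8_t dc;
  if (top != NULL && left != NULL) {
    for (int i = 0; i < 16; ++i) sum += top[i] + left[i];
    dc = (uint8_t)((sum + 16) >> 5); /* rounded mean of 32 samples */
  } else if (top != NULL || left != NULL) {
    const uint8_t* const edge = (top != NULL) ? top : left;
    for (int i = 0; i < 16; ++i) sum += edge[i];
    dc = (uint8_t)((sum + 8) >> 4); /* rounded mean of 16 samples */
  } else {
    dc = 0x80; /* no neighbors available: mid-gray */
  }
  Fill16x16Sketch(dst, stride, dc);
}
```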
James Zern
f274a96ce9
dsp/enc_sse2: add luma4 intra predictors
...
VP8EncPredLuma4 improvement over ~20M pixels: ~39%
Change-Id: I9cd841250771276d2d1bef3991215a56e83f7f20
2015-05-05 23:51:19 -07:00
James Zern
040b11bdf6
dsp/enc_sse2: add chroma intra predictors
...
VP8EncPredChroma8 improvements over ~20M pixels
left/top: ~67%
left-only: ~52%
top-only: ~57%
none: ~61%
based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary
Change-Id: Iee7e387fb2570b4eb5af5bfd123e9c2e9ea49c76
2015-05-05 23:51:14 -07:00
James Zern
aee021bbb1
dsp/enc_sse2: add luma16 intra predictors
...
VP8EncPredLuma16 improvements over ~20M pixels
left/top: ~75%
left-only: ~47%
top-only: ~59%
none: ~63%
based on dec_sse2 versions with minor changes to benefit from the linear
storage of the left boundary
Change-Id: I7548be7214fa85c38fd11d30f5b8b271f437657d
2015-05-05 23:51:07 -07:00
James Zern
4c9af02326
dec_neon: add DC8uvNoTopLeft
...
~93% faster
Change-Id: Icf0fd5f85ac53c306a1b69d84275023e5b24a602
2015-05-01 20:03:57 -07:00
Pascal Massimino
9287761d95
Merge "GetResidualCostSSE2: simplify abs calculation"
2015-04-30 06:30:58 +00:00