move the attribute to the front of the function to quiet clang warning:
GCC does not allow no_sanitize_thread attribute in this position on a
function definition
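
A minimal sketch of the placement in question. The NO_TSAN macro and the
ReadFlag function are hypothetical names, not taken from the source; the point
is only that the attribute precedes the definition instead of following the
declarator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macro (an assumption for this sketch): expands to the
 * attribute only when the compiler reports ThreadSanitizer support, and to
 * nothing otherwise. */
#if defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define NO_TSAN __attribute__((no_sanitize_thread))
#endif
#endif
#ifndef NO_TSAN
#define NO_TSAN
#endif

/* The attribute sits in front of the function definition. */
NO_TSAN static int ReadFlag(const volatile int* const flag) {
  assert(flag != NULL);
  return *flag;
}
```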
Change-Id: Ie4cc6e35a07bd00eab67d9cd6801bd2be9cfe676
Extract loop invariant and avoid storing/loading samples
if they can be re-used. This is particularly interesting when
a transpose is involved (HFilter16i).
Change-Id: I93274620f6da220a35025ff8708ff0c9ee8c4139
move simple loop filter defines closer to their use and LOAD* to a
location common with the intrinsics
Change-Id: Iaec506d27bbc9a01be20936e30b68a4b0e690ee3
the complex loop filter has no inline equivalent; the simple loop filter
remains conditional on USE_INTRINSICS: it's left undefined for now.
Change-Id: I4f258e10458df53a7a1819707c8f46b450e9d9d2
using this in Load4x16 was slightly slower and didn't help mitigate any
of the remaining build issues with 4.6.x.
Change-Id: Idabfe1b528842a514d14a85f4cefeb90abe08e51
+ misc cosmetics
* seems 4% slower than inlined-asm with gcc-4.6
* is a tad faster (<1%) with gcc-4.8
(disabled for now)
Change-Id: Iea6cd00053a2e9c1b1ccfdad1378be26584f1095
The nice trick is to pack 8 u + 8 v samples into a single uint8x16_t
register, and re-use the previous (luma) functions
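
The packing idea can be sketched in plain C, with a 16-byte struct standing in
for the vector register. Vec16, Halve16 and HalveUV8 are hypothetical names,
and the halving kernel is only a placeholder for a real 16-lane luma routine:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C stand-in for a 16-lane byte register (hypothetical type). */
typedef struct { uint8_t lane[16]; } Vec16;

/* Placeholder 16-lane kernel standing in for a luma routine:
 * halves every sample. */
static Vec16 Halve16(Vec16 v) {
  int i;
  for (i = 0; i < 16; ++i) v.lane[i] = (uint8_t)(v.lane[i] >> 1);
  return v;
}

/* Pack 8 u + 8 v samples, run the 16-wide kernel once, unpack the results. */
static void HalveUV8(uint8_t* const u, uint8_t* const v) {
  Vec16 packed;
  assert(u != NULL && v != NULL);
  memcpy(packed.lane + 0, u, 8);   /* lanes 0..7  <- u */
  memcpy(packed.lane + 8, v, 8);   /* lanes 8..15 <- v */
  packed = Halve16(packed);
  memcpy(u, packed.lane + 0, 8);
  memcpy(v, packed.lane + 8, 8);
}
```

One call to the 16-wide kernel then covers both chroma planes, so no separate
8-wide variant is needed.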
Change-Id: Idf50ed2d6b7137ea080d603062bc9e0c66d79f38
+ added some workarounds for gcc-4.6 to make it compile (except for one function).
+ lots of revamping
All variants tested ok.
Speed-up is ~5-7%
Change-Id: I5ceda2ee5debfada090907fe3696889eb66269c3
vertical only currently, 2.5-3% faster
placed under USE_INTRINSICS as this change depends on the simple
loopfilter
improves the simple loopfilter slightly thanks to some reorganization
Change-Id: I6611441fa54228549b21ea74c013cb78d53c7155
It's disabled for now, because it crashes gcc-4.6.3 during compilation
with -O2 or -O3. It's been tested OK with -O1.
Code is still globally disabled with USE_INTRINSICS, though.
Change-Id: I3ca6cf83f3b9545ad8909556f700758b3cefa61c
disabled for now (but tested OK), thanks to the USE_INTRINSICS #define
We'll activate the code when we're on par with non-intrinsics
Change-Id: Idbfb9cb01f4c7c9f5131b270f8c11b70d0d485ff
converts 2 s16 vectors to 2 u8 and stores them to a uint8_t destination;
TransformAC3 can reuse this after a rework
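
A scalar sketch of what such a helper does, assuming 8 lanes per vector, one
vector per output row, and saturation to [0, 255]. The names, the lane count
and the stride layout are assumptions, not the actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturate a signed 16-bit value to the unsigned 8-bit range. */
static uint8_t SaturateU8(int16_t x) {
  return (x < 0) ? 0 : (x > 255) ? 255 : (uint8_t)x;
}

/* Convert two rows of s16 samples to u8 and store them at dst and
 * dst + stride (hypothetical layout). */
static void StoreTwoRowsU8(const int16_t row0[8], const int16_t row1[8],
                           uint8_t* const dst, int stride) {
  int i;
  assert(row0 != NULL && row1 != NULL && dst != NULL);
  for (i = 0; i < 8; ++i) {
    dst[i] = SaturateU8(row0[i]);
    dst[i + stride] = SaturateU8(row1[i]);
  }
}
```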
Change-Id: Ia9370283ee3d9bfbc8c008fa883412100ff483d0
add TransformDC special case, and make the switch function inlined.
Recovers some of the CPU time lost during the addition of TransformAC3
(only on ARM)
Change-Id: I21c1f0c6a9cb9d1dfc1e307b4f473a2791273bd6
rather than symlink the webm/vpx terms, use the same header as libvpx to
reference in-tree files
based on the discussion in:
https://codereview.chromium.org/12771026/
Change-Id: Ia3067ecddefaa7ee01550136e00f7b3f086d4af4
Contributed by Wayne Chen (datoudatou at gmail dot com)
+ some header cleanup
+ remove the NEON suffix in static functions
Change-Id: I75bf5e9b54cf5e1acc53764c6f081d61690f8e3d
this will avoid the "dec_neon.o has no symbol" warning
no change in binary size observed on linux.
Change-Id: Ia27ae2bc5a03d714afa7e46671fdcf4cb630784d
Defining LOCAL_ARM_NEON = true can result in neon instructions being
used in portions unprotected by the cpu check.
This change defines a WEBP_USE_NEON/WEBP_ANDROID_NEON pair similar to
the SSE2 code and MSVC.
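
A rough sketch of how such a define pair might be wired up. The preprocessor
conditions and the UseNeonPath helper are assumptions for illustration; only
the two macro names come from this change:

```c
/* Assumed condition: an Android ARMv7-A target that might support NEON. */
#if defined(__ANDROID__) && defined(__ARM_ARCH_7A__)
#define WEBP_ANDROID_NEON
#endif

/* NEON code is compiled in if the compiler targets NEON directly, or if the
 * Android-specific define above is set. */
#if defined(__ARM_NEON__) || defined(WEBP_ANDROID_NEON)
#define WEBP_USE_NEON
#endif

/* Hypothetical gate: the NEON path is taken only when it was compiled in
 * AND the runtime cpu check (passed in by the caller) succeeded. */
static int UseNeonPath(int cpu_has_neon) {
#if defined(WEBP_USE_NEON)
  return cpu_has_neon != 0;
#else
  (void)cpu_has_neon;
  return 0;
#endif
}
```

Keeping the NEON code behind its own define, rather than enabling NEON for the
whole module, leaves the rest of the library free of NEON instructions on
devices where the cpu check fails.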
Change-Id: Ifac010b06e42c73d5aca529baa2198c6796674bd