This reverts commit 61e5c391d6.
When `WEBP_USE_THREAD` is not defined the assignments of *_SSE and their
unsuffixed counterparts may race. Assigning *_SSE directly rather than
relying on the unsuffixed values avoids a case where the *_SSE variants
may refer to the calling function (i.e., AVX2) resulting in infinite
recursion.
Defining `WEBP_USE_THREAD` is recommended when decode/encode calls can
be made from different threads.
In the previous commit (2246828b) not all indices of
`VP8LPredictorsAdd_SSE[]` were assigned a value, namely 14 and 15. These
are used in handling invalid bitstreams to avoid a branch in a hot
function. The indices are now assigned to `PredictorAdd0_SSE2` which
mimics `VP8LPredictorsSub[]` in lossless_enc_sse2.c.
Bug: 435213378, 438295348, 438294044, 438264629, 438294033
Change-Id: I3623717597f0ac6b0d60429adfbb20c611fe6742
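The table fix described above can be pictured with a small sketch. All names here (`PredictorFunc`, `predictors`, `Add0`, `Add1`, `Predict`) are hypothetical stand-ins rather than libwebp's code. The point is only that a 4-bit mode selects one of 16 slots, so every slot has to hold a callable function if the hot path does no range check.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*PredictorFunc)(uint32_t left, uint32_t top);

// Hypothetical predictors: mode 0 ignores its neighbors, mode 1 returns left.
static uint32_t Add0(uint32_t left, uint32_t top) {
  (void)left;
  (void)top;
  return 0xff000000u;
}

static uint32_t Add1(uint32_t left, uint32_t top) {
  (void)top;
  return left;
}

#define NUM_PREDICTORS 16
static PredictorFunc predictors[NUM_PREDICTORS];

static void InitPredictors(void) {
  int i;
  predictors[0] = Add0;
  // Stand-in for the valid modes 1..13.
  for (i = 1; i < 14; ++i) predictors[i] = Add1;
  // Modes 14 and 15 only show up in invalid bitstreams; point them at a
  // real function so the lookup below never calls through NULL.
  predictors[14] = Add0;
  predictors[15] = Add0;
}

// Hot path: the mode is masked to 4 bits and used without a branch.
static uint32_t Predict(int mode, uint32_t left, uint32_t top) {
  return predictors[mode & 0xf](left, top);
}
```

With only entries 0..13 assigned, `Predict(14, ...)` would dereference a NULL pointer, which matches the fuzzer reports mentioned in the revert.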
This reverts commit 2246828be3.
Reason for revert: NULL dereferences in the fuzzers.
The `VP8LPredictorsAdd_SSE` table is not completely initialized (indices
14 and 15 are unset), and those entries may be accessed with an invalid
bitstream.
Bug: 435213378
Original change's description:
> dsp/lossless{,_enc}_sse2.c: reorder *_SSE assignments
>
> When `WEBP_USE_THREAD` is not defined the assignments of *_SSE and their
> unsuffixed counterparts may race. Assigning *_SSE directly rather than
> relying on the unsuffixed values avoids a case where the *_SSE variants
> may refer to the calling function (i.e., AVX2) resulting in infinite
> recursion.
>
> Defining `WEBP_USE_THREAD` is recommended when decode/encode calls can
> be made from different threads.
>
> Bug: 435213378
> Change-Id: Id5549730cb72be99b3014ed8e4e355f3ea988659
Bug: 435213378, 438295348, 438294044, 438264629, 438294033
Change-Id: I3299d6fbb29c45872e2ea1f8f1c3d0ebbda64a69
When `WEBP_USE_THREAD` is not defined the assignments of *_SSE and their
unsuffixed counterparts may race. Assigning *_SSE directly rather than
relying on the unsuffixed values avoids a case where the *_SSE variants
may refer to the calling function (i.e., AVX2) resulting in infinite
recursion.
Defining `WEBP_USE_THREAD` is recommended when decode/encode calls can
be made from different threads.
Bug: 435213378
Change-Id: Id5549730cb72be99b3014ed8e4e355f3ea988659
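A minimal sketch of the recursion hazard, assuming a single function pointer instead of the real tables. The names `Add`, `Add_SSE`, `Add_SSE2`, `Add_AVX2`, `InitSSE2` and `InitAVX2` are made up for illustration and do not mirror the actual init code.

```c
#include <stddef.h>

typedef int (*AddFunc)(int a, int b);

static AddFunc Add;      // best available implementation
static AddFunc Add_SSE;  // SSE2 version kept as a fallback for AVX2

static int Add_SSE2(int a, int b) { return a + b; }

// Hypothetical AVX2 version: handles even inputs itself and defers
// everything else to the SSE2 fallback.
static int Add_AVX2(int a, int b) {
  if (((a | b) & 1) == 0) return a + b;
  return Add_SSE(a, b);
}

static void InitSSE2(void) {
  Add = Add_SSE2;
  // Assign the fallback directly. Writing `Add_SSE = Add;` instead would
  // copy Add_AVX2 if another thread ran InitAVX2() between the two
  // statements, and Add_AVX2 would then call itself forever.
  Add_SSE = Add_SSE2;
}

static void InitAVX2(void) { Add = Add_AVX2; }
```

The direct assignment does not make the unlocked initialization race-free; it only guarantees that the fallback pointer can never end up referring to the AVX2 function.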
Assume the CONDITION_VARIABLE added in Windows Vista is available.
Remove an unneeded WaitForSingleObject() macro that converts
WaitForSingleObject() calls to WaitForSingleObjectEx() calls with
bAlertable=FALSE. The WaitForSingleObject() function does not enter an
alertable wait state, so it is equivalent to WaitForSingleObjectEx()
with bAlertable=FALSE.
Remove code for Windows older than Vista in src/dsp/cpu.h.
Change-Id: I7df95557713923e05a7bfb62e095ec6172cfd708
Import bounds_safety.h across all of webputils, with one exception being
dsp.h, since it's imported by webputils.h in one place. Also prepend
WEBP_ASSUME_UNSAFE_INDEXABLE_ABI to every webputil file to indicate to
the compiler that every pointer should be treated as __unsafe_indexable.
We also need to replace memcpy/memset/memmove with the unsafe variants
WEBP_UNSAFE_*, as memcpy/memset/memmove require bounded/sized pointers.
With this change, all of libwebputils (and libwebp) should build with
-DWEBP_ENABLE_FBOUNDS_SAFETY=true
Change-Id: Iad87be0455182d534c074ef6dc1a30fa66b74b6c
A slim reader/writer (SRW) lock can be initialized statically with the
constant SRWLOCK_INIT. It is the only Windows synchronization object I
can find with this property.
Note: On old Windows versions that don't have SRWLOCK, use the fallback,
thread-unsafe implementation.
Change a NOLINT comment to a NOLINTNEXTLINE comment to prevent
clang-format from aligning the #else and #endif comments in an undesired
way.
Bug: 435213378
Change-Id: Iecff615a14a1905aedd2c05ad9444889f711cc17
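A sketch of what static initialization buys, assuming a Windows Vista or later target for the SRW lock branch. The pthread branch and the names `init_lock`, `INIT_LOCK`, `INIT_UNLOCK` and `InitOnce` are assumptions added for illustration so the example also builds on non-Windows systems.

```c
#if defined(_WIN32)
#include <windows.h>
// No runtime initialization call is needed before first use.
static SRWLOCK init_lock = SRWLOCK_INIT;
#define INIT_LOCK() AcquireSRWLockExclusive(&init_lock)
#define INIT_UNLOCK() ReleaseSRWLockExclusive(&init_lock)
#else
#include <pthread.h>
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
#define INIT_LOCK() pthread_mutex_lock(&init_lock)
#define INIT_UNLOCK() pthread_mutex_unlock(&init_lock)
#endif

static int initialized = 0;
static int init_count = 0;  // counts how often the guarded body ran

static void InitOnce(void) {
  INIT_LOCK();
  if (!initialized) {
    ++init_count;  // stand-in for the real one-time setup
    initialized = 1;
  }
  INIT_UNLOCK();
}
```

Because the lock is usable from program start, there is no window in which two threads could race to create it.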
Function static variables are initialized on the first call to the
function. In C the initialization of function static variables is not
thread-safe. Use file static variables instead in the
WEBP_DSP_INIT_FUNC() macro.
Remove the volatile qualifier for the pthread version of the
func##_last_cpuinfo_used variable because the variable is only accessed
while holding the mutex.
Change-Id: I1237904a49d2467d7ce79fc53f9e7f966aa7a5c1
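The layout described above might look roughly like the following pthread-only sketch. `MY_DSP_INIT_FUNC`, `CPUInfoFunc`, `FakeCPUInfo`, `GetCPUInfo` and the NULL initial value are all hypothetical; the real WEBP_DSP_INIT_FUNC() macro may differ in its details.

```c
#include <pthread.h>
#include <stddef.h>

typedef int (*CPUInfoFunc)(int feature);

static int FakeCPUInfo(int feature) {
  (void)feature;
  return 1;
}

// Stand-in for the global CPU detection hook.
static CPUInfoFunc GetCPUInfo = FakeCPUInfo;

// The state lives at file scope rather than inside func(). The
// last_cpuinfo_used variable is only touched while the mutex is held, so
// it does not need a volatile qualifier.
#define MY_DSP_INIT_FUNC(func)                                     \
  static void func##_body(void);                                   \
  static CPUInfoFunc func##_last_cpuinfo_used = NULL;              \
  static pthread_mutex_t func##_lock = PTHREAD_MUTEX_INITIALIZER;  \
  static void func(void) {                                         \
    pthread_mutex_lock(&func##_lock);                              \
    if (func##_last_cpuinfo_used != GetCPUInfo) func##_body();     \
    func##_last_cpuinfo_used = GetCPUInfo;                         \
    pthread_mutex_unlock(&func##_lock);                            \
  }                                                                \
  static void func##_body(void)

static int body_runs = 0;

// The braces after the macro become the body of MyDspInit_body().
MY_DSP_INIT_FUNC(MyDspInit) { ++body_runs; }
```

Calling `MyDspInit()` repeatedly runs the body once, and again only if `GetCPUInfo` is changed in between.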
(Debian clang-format version 19.1.7 (3+build4)) with `--style=Google`.
Manual changes:
* clang-format disabled around macros with stringification (mostly
assembly)
* some inline assembly strings were adjusted to avoid awkward line
breaks
* trailing commas, `//` or suffixes (`ull`) added to help array
formatting
* thread_utils.c: parameter comments were changed to the more common
/*...=*/ style to improve formatting
The automatically generated code under swig/ was skipped.
Bug: 433996651
Change-Id: Iea3f24160d78d2a2653971cdf13fa932e47ff1b3
Rename them to WebPConvertRGBToY/WebPConvertBGRToY and accept the
'step' parameter (3 for RGB, 4 for ARGB).
Change-Id: I930a23894e4135a34fff2174e6a5bbee1eac2ba0
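A sketch of how a `step` parameter lets one routine serve both packed layouts. The function names, the fixed-point constants and the assumption that the caller passes a pointer to the first R sample are illustrative; the actual signatures are not shown in this message.

```c
#include <stdint.h>

enum { YUV_FIX = 16, YUV_HALF = 1 << (YUV_FIX - 1) };

// BT.601-style fixed-point luma; the coefficients are assumed here.
static int RGBToY(int r, int g, int b) {
  const int luma = 16839 * r + 33059 * g + 6420 * b;
  return (luma + YUV_HALF + (16 << YUV_FIX)) >> YUV_FIX;
}

// step is the distance in bytes between consecutive pixels: 3 for packed
// 3-byte pixels, 4 when each pixel carries an extra byte to skip.
static void ConvertRGBToY(const uint8_t* rgb, uint8_t* y, int width,
                          int step) {
  int i;
  for (i = 0; i < width; ++i, rgb += step) {
    y[i] = (uint8_t)RGBToY(rgb[0], rgb[1], rgb[2]);
  }
}
```

With these constants a white pixel maps to 235 and a black pixel to 16, regardless of `step`.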
Fixes:
dsp/enc_neon.c:1192:11: warning: implicit declaration of function
'vld1_u8_x2'; did you mean 'vld1_u32'? [-Wimplicit-function-declaration]
inner = vld1_u8_x2(top);
^~~~~~~~~~
vld1_u32
Change-Id: I8d0175561efd69bc9614a68dca1d0fc19cdf91be
Semi-automatically taking the misc-include-cleaner warnings
from clang-tidy and fixing files to be self-contained.
Change-Id: Iaaa2b2ec9d6dcce547fa5cb6b4f056dfc8c781ff
- move HistogramAdd to histogram_enc.cc: it is too high level
- homogenize the argument naming (e.g. h for histogram, p for
population)
- separate the data from the stats a bit (only used within
VP8LGetHistoImageSymbols)
Change-Id: I274546e3ff96297383bcae0a95696c11f18decbf
After:
44f91b0d Speed DispatchAlpha_SSE2 up
_mm_set1_epi8 takes a char argument; add a `char` cast for 0xff.
from clang-14 integer sanitizer:
implicit conversion from type 'int' of value 255 (32-bit, signed) to
type 'char' changed the value to -1 (8-bit, signed)
Change-Id: I0f4ed092eddc0beb311f44bf3d4b74a4d1177040
This is a follow up to:
ee8e8c62 Fix member naming for VP8LHistogram
This better matches Google style and clears some clang-tidy warnings.
This is the final change in this set. It is rather large due to the
shared dependencies between dec/enc.
Change-Id: I89de06b5653ae0bb627f904fa6060334831f7e3b
On some datasets, this was taking 2.5%; 2% after switching to
_mm_maskmoveu_si128 and 1.7% when using _mm_loadu_si128.
Confirmed by IACA: going from throughput of 4.26 to 3.5 and then
to 6.26 for twice the input.
Change-Id: I409f901aaad9d39bf55a1aac28cc25f126876b01
This restores the use of the function after
980b708e enc_neon: fix build w/aarch64 gcc < 9.4.0
The intrinsic was added to llvm for aarch64 in:
5e4ce1ae9dad Implement the newly added AArch64 ACLE functions for
ld1/st1 with 2/3/4 vectors. The functions are like:
vst1_s8_x2 ...
llvmorg-3.4.0-rc1~101
https://github.com/llvm/llvm-project/commit/5e4ce1ae9dad
Visual Studio 2019 and 2022 also support the function (2017 is still
disabled for this path due to it relying on arm64_neon.h).
Change-Id: I6ff10e22deb3968a48738a4458d2d3d55410b5ec
The values for the R/G/B floating point formulas resembled
https://fourcc.org/fccyvrgb.php and Video Demystified, but the fixed
point values are more closely aligned to rounded values from
https://en.wikipedia.org/wiki/YCbCr and BT.601.
The R/G/B formulas with the values prior to this change are added to
sharpyuv_csp.c as they align with the fixed values. The origin of those
coefficients is unclear. For consistency between library versions we'll
leave them as is.
Bug: webp:375011696
Change-Id: Id3f2a57530eee700cc52a899b32b25b5c015e89b
Take advantage of the known sizes used by VP8LHistogramAdd() and
remove loop for the remainder. The loop was being auto-vectorized making
the code larger and slower than the vectorized C code.
For larger sizes the new code is ~3-4.5% faster than the old code with
about the same improvement against the vectorized C code. For the
minimal size (40), the new code is ~30% faster than the C and old SSE2
code.
The LINE_SIZE==8 option is removed with this change. It had been set
to 16 for its entire life and clang-16 was unrolling the LINE_SIZE==8
case by 2 in any case; they both profile similarly.
Change-Id: I6dfedfd57474f44d15e2ce510a48e5252221077a
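The idea of dropping the remainder loop can be shown in portable C. This is a sketch, not the SSE2 code: the block size of 8 and the name `AddVectorNoTail` are assumptions, chosen so that the minimal size of 40 mentioned above divides evenly.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 8  // hypothetical block size; 40 is a multiple of it

// Adds two histograms element-wise. The caller guarantees that size is a
// multiple of BLOCK, so there is no scalar tail loop for the compiler to
// auto-vectorize a second time.
static void AddVectorNoTail(const uint32_t* a, const uint32_t* b,
                            uint32_t* out, int size) {
  int i, j;
  assert(size % BLOCK == 0);
  for (i = 0; i < size; i += BLOCK) {
    for (j = 0; j < BLOCK; ++j) out[i + j] = a[i + j] + b[i + j];
  }
}
```

The assert documents the contract that replaces the tail handling; a size that is not a multiple of the block would need the remainder loop back.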
Take advantage of the known sizes used by VP8LHistogramAdd() and remove
loop for the remainder. The loop was being auto-vectorized making the
code larger and slower than the vectorized C code.
For larger sizes the new code is ~4-7% faster than the old code with
about the same improvement against the vectorized C code. For the
minimal size (40), the new code is ~30% faster than the C and old SSE2
code.
The LINE_SIZE==8 option is removed with this change. It had been set to
16 for its entire life and clang-16 was unrolling the LINE_SIZE==8 case
by 2 in any case; they both profile similarly.
Change-Id: I2376e2dca3bffa38477b4a432f4c533419e3be0e
Extend VP8EncIterator::i4_boundary_ by 3 bytes to avoid Intra4Preds_NEON
reading deeper into the struct (likely padding) when top is positioned
at offset 29. This data is memset with MSan to prevent a warning due to
its incorrect modeling of tbl instructions.
Prior to:
169dfbf9 disable Intra4Preds_NEON
there was a mismatch in the preprocessor checks for enabling the
function in NEON and removing the C version; NEON used `BPS == 32` while
the C code was removed unconditionally when building for aarch64. This
patch also normalizes those checks to look for `BPS == 32` and `BPS !=
32` as appropriate.
Bug: b:366668849,webp:372109644
Change-Id: Ic9e6ad4b2d844cb446decd63aec0b2676a89c8d0
These appear as warnings under VS15 (16 and 17 are silent) and were
missed in:
a32b436b dsp/lossless*: use WEBP_RESTRICT qualifier
Change-Id: Ia7cffafc166f2da93b51714363558798cda71b67
* changes:
dsp/yuv*: use WEBP_RESTRICT qualifier
dsp/upsampling*: use WEBP_RESTRICT qualifier
dsp/rescaler*: use WEBP_RESTRICT qualifier
dsp/lossless*: use WEBP_RESTRICT qualifier
dsp/filters*: use WEBP_RESTRICT qualifier
dsp/enc*: use WEBP_RESTRICT qualifier
dsp/dec*: use WEBP_RESTRICT qualifier
dsp/cost*: use WEBP_RESTRICT qualifier
The load of the `top` parameter may over read causing MSan errors:
==7373==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0xfff891d52ad4 in Intra4Preds_NEON src/dsp/enc_neon.c:1003:12
#1 0xfff892d87618 in MakeIntra4Preds src/enc/quant_enc.c:484:3
Bug: b:366668849
Change-Id: I29cf3b2f402ee79ea93c1ee2a4fdd95083aeed68
Better vectorization in the C code, fewer instructions / comparisons in
NEON, and fewer reloads in SSE2/SSE4 w/ndk r27/gcc-13/clang-16.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I07a7e36a2dce8632c71c0fbbeef94dc51453eaf7
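For readers unfamiliar with the qualifier, here is a minimal sketch of the kind of signature change involved. `MY_RESTRICT` and `AddRows` are hypothetical; the definition of WEBP_RESTRICT itself is not shown in this message.

```c
#include <stdint.h>

// Hypothetical portability wrapper around the compiler's restrict keyword.
#if defined(__GNUC__) || defined(__clang__)
#define MY_RESTRICT __restrict__
#elif defined(_MSC_VER)
#define MY_RESTRICT __restrict
#else
#define MY_RESTRICT
#endif

// The qualifier promises that dst does not overlap a or b, so the compiler
// may load several inputs before storing and can vectorize the loop
// without emitting a runtime overlap check.
static void AddRows(const uint8_t* MY_RESTRICT a, const uint8_t* MY_RESTRICT b,
                    uint8_t* MY_RESTRICT dst, int len) {
  int i;
  for (i = 0; i < len; ++i) dst[i] = (uint8_t)(a[i] + b[i]);
}
```

Passing overlapping buffers to such a function is undefined behavior, so the qualifier is only appropriate where callers already keep the buffers distinct.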
Better vectorization in the C code, fewer instructions in NEON, and some
code reordering / better register usage in SSE2/SSE4 w/ndk
r27/gcc-13/clang-16.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: Ib29980f778ad3dbb952178ad8dee39b8673c4ff8
Some improvement in the C code. No changes in NEON or SSE2 w/ndk
r27/gcc-13/clang-16.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I2316122db893f48f0afda90a147c83cac7f07526
lossless_enc: better vectorization, most benefits seen in AddVector/Eq
w/ndk r27/gcc-13/clang-16
lossless: minor reordering and some improvement to PredictorAdd5_SSE2
w/gcc-13
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I2356e314f391ee2f2c71f00bc6ee10097d3881e7
Better stack/register usage in SSE2/NEON code and improved vectorization
of the C code with ndk r27/gcc-13/clang-16.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I32b53dd38bfc7e2231d875409e7dfda7c513cfb6
This allows for better vectorization of the C code, inlining of
TrueMotion_SSE2, better load usage in aarch64 and other minor
reordering with ndk r27/gcc-13/clang-16.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I07e9944d5c0aa5a079b22883ac5a2d649695e4a0
A minor improvement for arm targets with ndk r27/gcc-13 in H/VFilter8 (a
couple fewer moves w/aarch64) and much better vectorization of
DitherCombine8x8_C in most targets.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I03e73e6d6404261bb8408a9ae76a4b6ef142f8f0
on SetResidualCoeffs_*. This results in some minor code reordering when
targeting armv7 with ndk r27 and other recent versions of clang. No
changes in the x86 compilations with clang-16 / gcc-13.
This only affects non-vector pointers; any vector pointers are left as a
follow up.
Change-Id: I7c3554ece848fafbc5ac9c4944f1dc85129f6fd8
The row parameter became a constant in:
2102ccd update the Unfilter API in dsp to process one row independently
num_rows is always equal to height.
Change-Id: Ie43dc5ef222e442ce8c92766da0b9824ccbca236
The inverse parameter became a constant in:
2102ccd update the Unfilter API in dsp to process one row independently
The row parameter to these functions is in a similar state; it will be
removed in a follow up.
Change-Id: I94cd8babe0e42474ff794ba5fa29dd48039de5f8
Replace vmovl_u8 -> s16 + signed vaddq with unsigned vaddw.
No change in assembly with clang-16 (armv7 & aarch64) and gcc-13
(aarch64). armv7 gcc-13 had kept the vmovl instructions; those are now
gone.
Change-Id: Ibb4fbdd5680d3e9dd06933c100528a6f363de472
This needs to be done with signed saturation as the sum may be negative.
Fixes a mismatch with the C code after:
3bfb05e3 Add AArch64 Neon implementation of Intra16Preds
Change-Id: I017e939d7155cc3489ceb76fc8ad50ac9917f23d
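A scalar model of why the signedness of the saturating narrow matters. It assumes, for illustration only, a predictor of the form top + left - top_left; the helper names are hypothetical. The two narrowing helpers mimic what NEON's unsigned narrow (such as `vqmovn_u16`) and signed-to-unsigned narrow (such as `vqmovun_s16`) do per lane.

```c
#include <stdint.h>

// Unsigned saturating narrow: the 16-bit lane is read as uint16_t.
static uint8_t NarrowUnsigned(uint16_t v) {
  return (v > 255) ? 255 : (uint8_t)v;
}

// Signed-to-unsigned saturating narrow: the lane is read as int16_t, so
// negative values clamp to 0.
static uint8_t NarrowSignedToUnsigned(int16_t v) {
  return (v < 0) ? 0 : (v > 255) ? 255 : (uint8_t)v;
}

// The intermediate sum can drop below zero when top_left is large.
static int16_t PredictorSum(uint8_t top, uint8_t left, uint8_t top_left) {
  return (int16_t)(top + left - top_left);
}
```

For top = 10, left = 5 and top_left = 100 the sum is -85. Read as unsigned, the same bits are 65451 and saturate to 255, while the signed path clamps to 0 as a C reference that clips to [0, 255] would.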
This needs to be done with signed saturation as the sum may be negative.
Fixes a mismatch with the C code after:
baa93808 Add AArch64 Neon implementation of Intra4Preds
Change-Id: I190c3d7f78cfd2c7ae83fb7059de41e307abda36
* changes:
Use QuantizeBlock_NEON for VP8EncQuantizeBlockWHT on Arm
Add AArch64 Neon implementation of Intra16Preds
Add AArch64 Neon implementation of Intra4Preds