From patchwork Mon Sep 13 23:05:04 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 44962
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex
Date: Mon, 13 Sep 2021 18:05:04 -0500
Message-Id: <20210913230506.546749-1-goldstein.w.n@gmail.com>

No bug. This commit adds support for an optimized bcmp implementation.
Support is for sse2, sse4_1, avx2, and evex. All string tests pass and
the build succeeds.
---
This commit exists essentially because compilers will optimize the
idiomatic use of memcmp's return value as a boolean into a bcmp call:
https://godbolt.org/z/Tbhefh6cv
so it seems reasonable to have an optimized bcmp implementation, as we
can get a ~0-25% improvement (generally a larger improvement for the
smaller size ranges, which are ultimately the most important to
optimize for). Numbers for the new implementations are attached in a
reply.

Tests were run on the following CPUs:

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Some notes on the numbers: there are some regressions in the
sse2/sse4_1 versions. I didn't optimize these versions beyond defining
out obviously irrelevant code for bcmp. My intuition is that the
slowdowns are alignment related. I am not sure whether these issues
would translate to architectures that would actually use sse2/sse4_1.
I added the sse2/sse4_1 implementations mostly so that the ifunc would
have something to fall back on. Given the lackluster numbers it may not
be worth it, especially after factoring in the code size costs.
Thoughts?

On Tigerlake and Skylake the evex and avx2 versions are basically
universal improvements. I opted to align bcmp to 64 bytes as opposed to
16. The rationale is that to optimize for frontend behavior on either
machine, a 16-byte alignment guarantee is not enough. I think that in
any function where throughput might be important (which I think is the
case for bcmp), good frontend behavior is important.
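To make the motivation concrete, here is a minimal illustrative C
sketch (not part of the patch; keys_equal and ref_bcmp are made-up
names used only for this example). The first function shows the
memcmp-as-boolean idiom that compilers may lower to a bcmp call, and
the second shows the weaker contract bcmp has to satisfy, which is what
the assembly versions below exploit.

#include <string.h>

/* Idiomatic memcmp-as-boolean; a compiler may lower this call to bcmp
   (see the godbolt link above), since only equal/not-equal is needed.  */
int
keys_equal (const void *a, const void *b, size_t n)
{
  return memcmp (a, b, n) == 0;
}

/* Portable reference for the bcmp contract: zero iff the buffers are
   equal, any nonzero value otherwise.  No -1/0/1 ordering of the
   mismatching bytes has to be computed.  */
int
ref_bcmp (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1, *p2 = s2;
  while (n--)
    if (*p1++ != *p2++)
      return 1;
  return 0;
}

Because only equality matters, a bcmp implementation can return as soon
as any difference is found, without working out which buffer compares
lower, which is what the ifdef'd-out code paths in the diffs below
avoid.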
benchtests/Makefile | 2 +- benchtests/bench-bcmp.c | 20 ++++++++ benchtests/bench-memcmp.c | 4 +- string/Makefile | 4 +- string/test-bcmp.c | 21 +++++++++ string/test-memcmp.c | 27 +++++++---- sysdeps/x86_64/memcmp.S | 2 - sysdeps/x86_64/multiarch/Makefile | 3 ++ sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S | 12 +++++ sysdeps/x86_64/multiarch/bcmp-avx2.S | 23 ++++++++++ sysdeps/x86_64/multiarch/bcmp-evex.S | 23 ++++++++++ sysdeps/x86_64/multiarch/bcmp-sse2.S | 23 ++++++++++ sysdeps/x86_64/multiarch/bcmp-sse4.S | 23 ++++++++++ sysdeps/x86_64/multiarch/bcmp.c | 35 ++++++++++++++ sysdeps/x86_64/multiarch/ifunc-bcmp.h | 53 ++++++++++++++++++++++ sysdeps/x86_64/multiarch/ifunc-impl-list.c | 23 ++++++++++ sysdeps/x86_64/multiarch/memcmp-sse2.S | 4 +- sysdeps/x86_64/multiarch/memcmp.c | 2 - 18 files changed, 286 insertions(+), 18 deletions(-) create mode 100644 benchtests/bench-bcmp.c create mode 100644 string/test-bcmp.c create mode 100644 sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S create mode 100644 sysdeps/x86_64/multiarch/bcmp-avx2.S create mode 100644 sysdeps/x86_64/multiarch/bcmp-evex.S create mode 100644 sysdeps/x86_64/multiarch/bcmp-sse2.S create mode 100644 sysdeps/x86_64/multiarch/bcmp-sse4.S create mode 100644 sysdeps/x86_64/multiarch/bcmp.c create mode 100644 sysdeps/x86_64/multiarch/ifunc-bcmp.h diff --git a/benchtests/Makefile b/benchtests/Makefile index 1530939a8c..5fc495eb57 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -47,7 +47,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}}) endif # String function benchmarks. -string-benchset := memccpy memchr memcmp memcpy memmem memmove \ +string-benchset := bcmp memccpy memchr memcmp memcpy memmem memmove \ mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \ strcat strchr strchrnul strcmp strcpy strcspn strlen \ strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \ diff --git a/benchtests/bench-bcmp.c b/benchtests/bench-bcmp.c new file mode 100644 index 0000000000..1023639787 --- /dev/null +++ b/benchtests/bench-bcmp.c @@ -0,0 +1,20 @@ +/* Measure bcmp functions. + Copyright (C) 2015-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define TEST_BCMP 1 +#include "bench-memcmp.c" diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c index 744c7ec5ba..4d5f8fb766 100644 --- a/benchtests/bench-memcmp.c +++ b/benchtests/bench-memcmp.c @@ -17,7 +17,9 @@ . 
*/ #define TEST_MAIN -#ifdef WIDE +#ifdef TEST_BCMP +# define TEST_NAME "bcmp" +#elif defined WIDE # define TEST_NAME "wmemcmp" #else # define TEST_NAME "memcmp" diff --git a/string/Makefile b/string/Makefile index f0fce2a0b8..f1f67ee157 100644 --- a/string/Makefile +++ b/string/Makefile @@ -35,7 +35,7 @@ routines := strcat strchr strcmp strcoll strcpy strcspn \ strncat strncmp strncpy \ strrchr strpbrk strsignal strspn strstr strtok \ strtok_r strxfrm memchr memcmp memmove memset \ - mempcpy bcopy bzero ffs ffsll stpcpy stpncpy \ + mempcpy bcmp bcopy bzero ffs ffsll stpcpy stpncpy \ strcasecmp strncase strcasecmp_l strncase_l \ memccpy memcpy wordcopy strsep strcasestr \ swab strfry memfrob memmem rawmemchr strchrnul \ @@ -52,7 +52,7 @@ strop-tests := memchr memcmp memcpy memmove mempcpy memset memccpy \ stpcpy stpncpy strcat strchr strcmp strcpy strcspn \ strlen strncmp strncpy strpbrk strrchr strspn memmem \ strstr strcasestr strnlen strcasecmp strncasecmp \ - strncat rawmemchr strchrnul bcopy bzero memrchr \ + strncat rawmemchr strchrnul bcmp bcopy bzero memrchr \ explicit_bzero tests := tester inl-tester noinl-tester testcopy test-ffs \ tst-strlen stratcliff tst-svc tst-inlcall \ diff --git a/string/test-bcmp.c b/string/test-bcmp.c new file mode 100644 index 0000000000..6d19a4a87c --- /dev/null +++ b/string/test-bcmp.c @@ -0,0 +1,21 @@ +/* Test and measure bcmp functions. + Copyright (C) 2012-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define BAD_RESULT(result, expec) ((!(result)) != (!(expec))) +#define TEST_BCMP 1 +#include "test-memcmp.c" diff --git a/string/test-memcmp.c b/string/test-memcmp.c index 6ddbc05d2f..c630e6799d 100644 --- a/string/test-memcmp.c +++ b/string/test-memcmp.c @@ -17,11 +17,14 @@ . 
*/ #define TEST_MAIN -#ifdef WIDE +#ifdef TEST_BCMP +# define TEST_NAME "bcmp" +#elif defined WIDE # define TEST_NAME "wmemcmp" #else # define TEST_NAME "memcmp" #endif + #include "test-string.h" #ifdef WIDE # include @@ -35,6 +38,7 @@ # define CHARBYTES 4 # define CHAR__MIN WCHAR_MIN # define CHAR__MAX WCHAR_MAX + int simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n) { @@ -48,8 +52,11 @@ simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n) } #else # include - -# define MEMCMP memcmp +# ifdef TEST_BCMP +# define MEMCMP bcmp +# else +# define MEMCMP memcmp +# endif # define MEMCPY memcpy # define SIMPLE_MEMCMP simple_memcmp # define CHAR char @@ -69,6 +76,12 @@ simple_memcmp (const char *s1, const char *s2, size_t n) } #endif +# ifndef BAD_RESULT +# define BAD_RESULT(result, expec) \ + (((result) == 0 && (expec)) || ((result) < 0 && (expec) >= 0) || \ + ((result) > 0 && (expec) <= 0)) +# endif + typedef int (*proto_t) (const CHAR *, const CHAR *, size_t); IMPL (SIMPLE_MEMCMP, 0) @@ -79,9 +92,7 @@ check_result (impl_t *impl, const CHAR *s1, const CHAR *s2, size_t len, int exp_result) { int result = CALL (impl, s1, s2, len); - if ((exp_result == 0 && result != 0) - || (exp_result < 0 && result >= 0) - || (exp_result > 0 && result <= 0)) + if (BAD_RESULT(result, exp_result)) { error (0, 0, "Wrong result in function %s %d %d", impl->name, result, exp_result); @@ -186,9 +197,7 @@ do_random_tests (void) { r = CALL (impl, (CHAR *) p1 + align1, (const CHAR *) p2 + align2, len); - if ((r == 0 && result) - || (r < 0 && result >= 0) - || (r > 0 && result <= 0)) + if (BAD_RESULT(r, result)) { error (0, 0, "Iteration %zd - wrong result in function %s (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p", n, impl->name, align1 * CHARBYTES & 63, align2 * CHARBYTES & 63, len, pos, r, result, p1, p2); diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S index 870e15c5a0..dfd0269db2 100644 --- a/sysdeps/x86_64/memcmp.S +++ b/sysdeps/x86_64/memcmp.S @@ -356,6 +356,4 @@ L(ATR32res): .p2align 4,, 4 END(memcmp) -#undef bcmp -weak_alias (memcmp, bcmp) libc_hidden_builtin_def (memcmp) diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile index 26be40959c..9dd0d8c3ff 100644 --- a/sysdeps/x86_64/multiarch/Makefile +++ b/sysdeps/x86_64/multiarch/Makefile @@ -1,6 +1,7 @@ ifeq ($(subdir),string) sysdep_routines += strncat-c stpncpy-c strncpy-c \ + bcmp-sse2 bcmp-sse4 bcmp-avx2 \ strcmp-sse2 strcmp-sse2-unaligned strcmp-ssse3 \ strcmp-sse4_2 strcmp-avx2 \ strncmp-sse2 strncmp-ssse3 strncmp-sse4_2 strncmp-avx2 \ @@ -40,6 +41,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \ memset-sse2-unaligned-erms \ memset-avx2-unaligned-erms \ memset-avx512-unaligned-erms \ + bcmp-avx2-rtm \ memchr-avx2-rtm \ memcmp-avx2-movbe-rtm \ memmove-avx-unaligned-erms-rtm \ @@ -59,6 +61,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \ strncpy-avx2-rtm \ strnlen-avx2-rtm \ strrchr-avx2-rtm \ + bcmp-evex \ memchr-evex \ memcmp-evex-movbe \ memmove-evex-unaligned-erms \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S new file mode 100644 index 0000000000..d742257e4e --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S @@ -0,0 +1,12 @@ +#ifndef MEMCMP +# define MEMCMP __bcmp_avx2_rtm +#endif + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#include "bcmp-avx2.S" diff --git 
a/sysdeps/x86_64/multiarch/bcmp-avx2.S b/sysdeps/x86_64/multiarch/bcmp-avx2.S new file mode 100644 index 0000000000..93a9a20b17 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-avx2.S @@ -0,0 +1,23 @@ +/* bcmp optimized with AVX2. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef MEMCMP +# define MEMCMP __bcmp_avx2 +#endif + +#include "bcmp-avx2.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S new file mode 100644 index 0000000000..ade52e8c68 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-evex.S @@ -0,0 +1,23 @@ +/* bcmp optimized with EVEX. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef MEMCMP +# define MEMCMP __bcmp_evex +#endif + +#include "memcmp-evex-movbe.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-sse2.S b/sysdeps/x86_64/multiarch/bcmp-sse2.S new file mode 100644 index 0000000000..b18d570386 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-sse2.S @@ -0,0 +1,23 @@ +/* bcmp optimized with SSE2 + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +# ifndef memcmp +# define memcmp __bcmp_sse2 +# endif +# define USE_AS_BCMP 1 +#include "memcmp-sse2.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-sse4.S b/sysdeps/x86_64/multiarch/bcmp-sse4.S new file mode 100644 index 0000000000..ed9804053f --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-sse4.S @@ -0,0 +1,23 @@ +/* bcmp optimized with SSE4.1 + Copyright (C) 2017-2021 Free Software Foundation, Inc. 
+ This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +# ifndef MEMCMP +# define MEMCMP __bcmp_sse4_1 +# endif +# define USE_AS_BCMP 1 +#include "memcmp-sse4.S" diff --git a/sysdeps/x86_64/multiarch/bcmp.c b/sysdeps/x86_64/multiarch/bcmp.c new file mode 100644 index 0000000000..6e26b73ecc --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp.c @@ -0,0 +1,35 @@ +/* Multiple versions of bcmp. + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +/* Define multiple versions only for the definition in libc. */ +#if IS_IN (libc) +# define bcmp __redirect_bcmp +# include +# undef bcmp + +# define SYMBOL_NAME bcmp +# include "ifunc-bcmp.h" + +libc_ifunc_redirected (__redirect_bcmp, bcmp, IFUNC_SELECTOR ()); + +# ifdef SHARED +__hidden_ver1 (bcmp, __GI_bcmp, __redirect_bcmp) + __attribute__ ((visibility ("hidden"))) __attribute_copy__ (bcmp); +# endif +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h new file mode 100644 index 0000000000..b0dacd8526 --- /dev/null +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -0,0 +1,53 @@ +/* Common definition for bcmp ifunc selections. + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +# include + +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden; + +static inline void * +IFUNC_SELECTOR (void) +{ + const struct cpu_features* cpu_features = __get_cpu_features (); + + if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) + && CPU_FEATURE_USABLE_P (cpu_features, BMI2) + && CPU_FEATURE_USABLE_P (cpu_features, MOVBE) + && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) + { + if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) + return OPTIMIZE (evex); + + if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) + return OPTIMIZE (avx2_rtm); + + if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)) + return OPTIMIZE (avx2); + } + + if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_1)) + return OPTIMIZE (sse4_1); + + return OPTIMIZE (sse2); +} diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index 39ab10613b..dd0c393c7d 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -38,6 +38,29 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, size_t i = 0; + /* Support sysdeps/x86_64/multiarch/bcmp.c. */ + IFUNC_IMPL (i, name, bcmp, + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX2) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (BMI2)), + __bcmp_avx2) + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX2) + && CPU_FEATURE_USABLE (BMI2) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (RTM)), + __bcmp_avx2_rtm) + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX512VL) + && CPU_FEATURE_USABLE (AVX512BW) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (BMI2)), + __bcmp_evex) + IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1), + __bcmp_sse4_1) + IFUNC_IMPL_ADD (array, i, bcmp, 1, __bcmp_sse2)) + /* Support sysdeps/x86_64/multiarch/memchr.c. */ IFUNC_IMPL (i, name, memchr, IFUNC_IMPL_ADD (array, i, memchr, diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S index b135fa2d40..2a4867ad18 100644 --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S @@ -17,7 +17,9 @@ . 
*/ #if IS_IN (libc) -# define memcmp __memcmp_sse2 +# ifndef memcmp +# define memcmp __memcmp_sse2 +# endif # ifdef SHARED # undef libc_hidden_builtin_def diff --git a/sysdeps/x86_64/multiarch/memcmp.c b/sysdeps/x86_64/multiarch/memcmp.c index fe725f3563..1760e045df 100644 --- a/sysdeps/x86_64/multiarch/memcmp.c +++ b/sysdeps/x86_64/multiarch/memcmp.c @@ -27,8 +27,6 @@ # include "ifunc-memcmp.h" libc_ifunc_redirected (__redirect_memcmp, memcmp, IFUNC_SELECTOR ()); -# undef bcmp -weak_alias (memcmp, bcmp) # ifdef SHARED __hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp) From patchwork Mon Sep 13 23:05:05 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44963 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 1BD093858002 for ; Mon, 13 Sep 2021 23:22:08 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 1BD093858002 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631575328; bh=OFtg0gPtEGHJuCJcm7UANpGqrn4yBP/l84mFCGodt3k=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=jQLNKszw3BkgugqdGsAqqN/fuQ0wsLB1/oJ/tKAeI6zIRpJomLU81MM3iJg4zOTHU Y4N9Xi0loQ6jWBqYYLSlNnplWf7D696cgq0f34TBAy7A2kCGax7Z+H8GOUxUgTq/4Z ZnryrQrAx9t35sQBcE6fPHoYZ2pydRZ0yzcLZZgU= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-io1-xd35.google.com (mail-io1-xd35.google.com [IPv6:2607:f8b0:4864:20::d35]) by sourceware.org (Postfix) with ESMTPS id C4C603858413 for ; Mon, 13 Sep 2021 23:21:03 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C4C603858413 Received: by mail-io1-xd35.google.com with SMTP id a22so14413633iok.12 for ; Mon, 13 Sep 2021 16:21:03 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=OFtg0gPtEGHJuCJcm7UANpGqrn4yBP/l84mFCGodt3k=; b=MsjEfFLOcKOfGQxwQU0r17TYqhltm+wOs/c3YwYxhoXxCsyfhmDqbUFQG1UfCZShK3 K8JX2DUMvZtQkSSi3wB9LpNc1dhLHjirNDgbXZ/NpPy/hYjM9eEeCMBbVgcIDoSRYwEz kxhMTc/JkiKaMFLa+kQyweOZdfeGwt3a3YGNAKT1KFY9MFV3ffopOmsF0dWDnCkIcXxc dOFQDlnKvg7R+fta3+fDGG1vBp0S0y1Ikr25w3niQGMfKNMD9peQavQ+35vby1MS323K JWZgs6zEOD4Qz42EX+6gM/jm5rcO17rtBegcg4lCzkh7BzU8iU554YHyDurbrp6egI+n im0w== X-Gm-Message-State: AOAM530Kxyw9fOYyrKywKPQSZa5AhGHOptcv+L6EclUWCUT7OwkoOWU2 U/PuskQsxe8avtbUL1CjPGT3gCnTn4k= X-Google-Smtp-Source: ABdhPJxyU4gc5Dk8174eTTU5Qqg3JBw88+lHqtvbKnzpw6rB44cELoRZGOSuWpuzSOuLmOQ+vOvcbg== X-Received: by 2002:a5d:914b:: with SMTP id y11mr11401063ioq.6.1631575263021; Mon, 13 Sep 2021 16:21:03 -0700 (PDT) Received: from localhost.localdomain (mobile-130-126-255-38.near.illinois.edu. 
[130.126.255.38]) by smtp.googlemail.com with ESMTPSA id s5sm5508857iol.33.2021.09.13.16.21.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Sep 2021 16:21:02 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH 2/5] x86_64: Add sse2 optimized bcmp implementation in memcmp.S Date: Mon, 13 Sep 2021 18:05:05 -0500 Message-Id: <20210913230506.546749-2-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210913230506.546749-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" No bug. This commit does not modify any of the memcmp implementation. It just adds bcmp ifdefs to skip obvious cases where computing the proper 1/-1 required by memcmp is not needed. test-memcmp, test-bcmp, and test-wmemcmp are all passing. --- sysdeps/x86_64/memcmp.S | 55 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 4 deletions(-) diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S index dfd0269db2..21607e7c91 100644 --- a/sysdeps/x86_64/memcmp.S +++ b/sysdeps/x86_64/memcmp.S @@ -49,34 +49,63 @@ L(s2b): movzwl (%rdi), %eax movzwl (%rdi, %rsi), %edx subq $2, %r10 +#ifdef USE_AS_BCMP + je L(finz1) +#else je L(fin2_7) +#endif addq $2, %rdi cmpl %edx, %eax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s4b): testq $4, %r10 jz L(s8b) movl (%rdi), %eax movl (%rdi, %rsi), %edx subq $4, %r10 +#ifdef USE_AS_BCMP + je L(finz1) +#else je L(fin2_7) +#endif addq $4, %rdi cmpl %edx, %eax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s8b): testq $8, %r10 jz L(s16b) movq (%rdi), %rax movq (%rdi, %rsi), %rdx subq $8, %r10 +#ifdef USE_AS_BCMP + je L(sub_return8) +#else je L(fin2_7) +#endif addq $8, %rdi cmpq %rdx, %rax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s16b): movdqu (%rdi), %xmm1 movdqu (%rdi, %rsi), %xmm0 pcmpeqb %xmm0, %xmm1 +#ifdef USE_AS_BCMP + pmovmskb %xmm1, %eax + subl $0xffff, %eax + ret +#else pmovmskb %xmm1, %edx xorl %eax, %eax subl $0xffff, %edx @@ -86,7 +115,7 @@ L(s16b): movzbl (%rcx), %eax movzbl (%rsi, %rcx), %edx jmp L(finz1) - +#endif .p2align 4,, 4 L(finr1b): movzbl (%rdi), %eax @@ -95,7 +124,15 @@ L(finz1): subl %edx, %eax L(exit): ret - +#ifdef USE_AS_BCMP + .p2align 4,, 4 +L(sub_return8): + subq %rdx, %rax + movl %eax, %edx + shrq $32, %rax + orl %edx, %eax + ret +#else .p2align 4,, 4 L(fin2_7): cmpq %rdx, %rax @@ -111,12 +148,17 @@ L(fin2_7): movzbl %dl, %edx subl %edx, %eax ret - +#endif .p2align 4,, 4 L(finz): xorl %eax, %eax ret - +#ifdef USE_AS_BCMP + .p2align 4,, 4 +L(neq_early): + movl $1, %eax + ret +#endif /* For blocks bigger than 32 bytes 1. Advance one of the addr pointer to be 16B aligned. 2. 
Treat the case of both addr pointers aligned to 16B @@ -246,11 +288,16 @@ L(mt16): .p2align 4,, 4 L(neq): +#ifdef USE_AS_BCMP + movl $1, %eax + ret +#else bsfl %edx, %ecx movzbl (%rdi, %rcx), %eax addq %rdi, %rsi movzbl (%rsi,%rcx), %edx jmp L(finz1) +#endif .p2align 4,, 4 L(ATR): From patchwork Mon Sep 13 23:05:06 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44965 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0529E3858039 for ; Mon, 13 Sep 2021 23:23:40 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0529E3858039 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631575420; bh=FFM0d0/seXcNtzd5Y/BrYUfl1OzBKHIyixwHMMVQKJM=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=eHTICr6apBp0r+ZF5WesUavu2Ofn68O9tfpnQTnl1iWlbK9TQfuZPlfDd+JKGdeGc deX7YfJtV3dfdG8Q/INroOnSBqOBHogjXN73hXk49nS7JZwiWwGwvjtcPKBLVmQWvu 6u1zPDeYqeRXcifKLTrmQZijvxHpCHxZ2irC/tHw= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-io1-xd2d.google.com (mail-io1-xd2d.google.com [IPv6:2607:f8b0:4864:20::d2d]) by sourceware.org (Postfix) with ESMTPS id AE4B93858408 for ; Mon, 13 Sep 2021 23:21:05 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AE4B93858408 Received: by mail-io1-xd2d.google.com with SMTP id q3so14400350iot.3 for ; Mon, 13 Sep 2021 16:21:05 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=FFM0d0/seXcNtzd5Y/BrYUfl1OzBKHIyixwHMMVQKJM=; b=AunZ4CADbuOlCP6glszJ/1e3arYovsn+7gSX8f5aTLmaIIkJzFBBqaMb/KnC2lmGQV z9p5zRPDjH+JLeS3BnSfOOsLkLi2lnx66rhSH2goLD/axvUKsyBbHEFysbgTMVyxM+ly zLvu59EfbgAavXrj9BbSd5UhWN7fQSsoeU3s5jhdrrwtzt4FjFL/5DP/C9VNtcFhL6Cr XlOdL0VW6af1w4FFKfIYcIaDSOTAYbJqx+ObC35AY0OR25QzrOgiQ3Lka264o49OOjJ0 r8wVshhONVVWWseA0dtlLglfEaJ0ZMFlFq8w9LeHQ8Cwxy0fuz8kNMSZsCPZpXSihVST zzUw== X-Gm-Message-State: AOAM530FSUsE4yNsn4wMsX6fJY61BiD6sJyH71cgGF2RZy+r9p80Kyfh ti6zgRHxBbyH2Wu0wsK7h3XrMXkmjVM= X-Google-Smtp-Source: ABdhPJzTr9HHOIJdlMCPD21N2V5gEMarXdNO8oKKFMJq0ApitbkGgTpbl7wOcpAcli25tFtvtclixQ== X-Received: by 2002:a5e:a81a:: with SMTP id c26mr11174726ioa.15.1631575264223; Mon, 13 Sep 2021 16:21:04 -0700 (PDT) Received: from localhost.localdomain (mobile-130-126-255-38.near.illinois.edu. 
[130.126.255.38]) by smtp.googlemail.com with ESMTPSA id s5sm5508857iol.33.2021.09.13.16.21.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Sep 2021 16:21:03 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH 3/5] x86_64: Add sse4_1 optimized bcmp implementation in memcmp-sse4.S Date: Mon, 13 Sep 2021 18:05:06 -0500 Message-Id: <20210913230506.546749-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210913230506.546749-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-10.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SCC_10_SHORT_WORD_LINES, SCC_20_SHORT_WORD_LINES, SCC_35_SHORT_WORD_LINES, SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" No bug. This commit does not modify any of the memcmp implementation. It just adds bcmp ifdefs to skip obvious cases where computing the proper 1/-1 required by memcmp is not needed. test-memcmp, test-bcmp, and test-wmemcmp are all passing. --- sysdeps/x86_64/multiarch/memcmp-sse4.S | 761 ++++++++++++++++++++++++- 1 file changed, 746 insertions(+), 15 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memcmp-sse4.S b/sysdeps/x86_64/multiarch/memcmp-sse4.S index b82adcd5fa..b9528ed58e 100644 --- a/sysdeps/x86_64/multiarch/memcmp-sse4.S +++ b/sysdeps/x86_64/multiarch/memcmp-sse4.S @@ -72,7 +72,11 @@ L(79bytesormore): movdqu (%rdi), %xmm2 pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif mov %rsi, %rcx and $-16, %rsi add $16, %rsi @@ -91,34 +95,58 @@ L(less128bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif cmp $32, %rdx jb L(less32bytesin64) movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -140,42 +168,74 @@ L(less256bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 
pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqu 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqu 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif add $128, %rsi add $128, %rdi @@ -189,12 +249,20 @@ L(less256bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -208,82 +276,146 @@ L(less512bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqu 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqu 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif movdqu 128(%rdi), %xmm2 pxor 128(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(144bytesin256) +# endif movdqu 144(%rdi), %xmm2 pxor 144(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(160bytesin256) +# endif movdqu 160(%rdi), %xmm2 pxor 160(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(176bytesin256) +# endif movdqu 176(%rdi), %xmm2 pxor 176(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(192bytesin256) +# endif movdqu 192(%rdi), %xmm2 pxor 192(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(208bytesin256) +# endif movdqu 208(%rdi), %xmm2 pxor 208(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(224bytesin256) +# endif movdqu 224(%rdi), %xmm2 pxor 224(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(240bytesin256) +# endif 
movdqu 240(%rdi), %xmm2 pxor 240(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(256bytesin256) +# endif add $256, %rsi add $256, %rdi @@ -300,12 +432,20 @@ L(less512bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -346,7 +486,11 @@ L(64bytesormore_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -380,7 +524,11 @@ L(L2_L3_unaligned_128bytes_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -404,34 +552,58 @@ L(less128bytesin2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif cmp $32, %rdx jb L(less32bytesin64in2alinged) movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -454,42 +626,74 @@ L(less256bytesin2alinged): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqa 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqa 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif add $128, %rsi add $128, %rdi @@ -503,12 +707,20 @@ L(less256bytesin2alinged): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef 
USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -524,82 +736,146 @@ L(256bytesormorein2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqa 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqa 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif movdqa 128(%rdi), %xmm2 pxor 128(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(144bytesin256) +# endif movdqa 144(%rdi), %xmm2 pxor 144(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(160bytesin256) +# endif movdqa 160(%rdi), %xmm2 pxor 160(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(176bytesin256) +# endif movdqa 176(%rdi), %xmm2 pxor 176(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(192bytesin256) +# endif movdqa 192(%rdi), %xmm2 pxor 192(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(208bytesin256) +# endif movdqa 208(%rdi), %xmm2 pxor 208(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(224bytesin256) +# endif movdqa 224(%rdi), %xmm2 pxor 224(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(240bytesin256) +# endif movdqa 240(%rdi), %xmm2 pxor 240(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(256bytesin256) +# endif add $256, %rsi add $256, %rdi @@ -616,12 +892,20 @@ L(256bytesormorein2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -663,7 +947,11 @@ L(64bytesormore_loopin2aligned): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -697,7 +985,11 @@ L(L2_L3_aligned_128bytes_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -708,7 +1000,7 @@ L(L2_L3_aligned_128bytes_loop): add %rdx, %rdi BRANCH_TO_JMPTBL_ENTRY(L(table_64bytes), %rdx, 4) - +# 
ifndef USE_AS_BCMP .p2align 4 L(64bytesormore_loop_end): add $16, %rdi @@ -791,17 +1083,29 @@ L(32bytesin256): L(16bytesin256): add $16, %rdi add $16, %rsi +# endif L(16bytes): mov -16(%rdi), %rax mov -16(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(8bytes): mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -809,16 +1113,26 @@ L(12bytes): mov -12(%rdi), %rax mov -12(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(4bytes): mov -4(%rsi), %ecx -# ifndef USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp %eax, %ecx + sub %ecx, %eax + ret # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) +# endif L(0bytes): xor %eax, %eax ret @@ -832,31 +1146,51 @@ L(65bytes): mov $-65, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(49bytes): movdqu -49(%rdi), %xmm1 movdqu -49(%rsi), %xmm2 mov $-49, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(33bytes): movdqu -33(%rdi), %xmm1 movdqu -33(%rsi), %xmm2 mov $-33, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(17bytes): mov -17(%rdi), %rax mov -17(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(9bytes): mov -9(%rdi), %rax mov -9(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %edx sub %edx, %eax @@ -867,12 +1201,23 @@ L(13bytes): mov -13(%rdi), %rax mov -13(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -880,7 +1225,11 @@ L(5bytes): mov -5(%rdi), %eax mov -5(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %edx sub %edx, %eax @@ -893,37 +1242,59 @@ L(66bytes): mov $-66, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(50bytes): movdqu -50(%rdi), %xmm1 movdqu -50(%rsi), %xmm2 mov $-50, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(34bytes): movdqu -34(%rdi), %xmm1 movdqu -34(%rsi), %xmm2 mov $-34, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(18bytes): mov -18(%rdi), %rax mov -18(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(10bytes): mov -10(%rdi), %rax mov -10(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzwl -2(%rdi), %eax movzwl -2(%rsi), %ecx +# ifndef USE_AS_BCMP cmp %cl, %al jne L(end) and $0xffff, %eax and $0xffff, %ecx +# endif sub 
%ecx, %eax ret @@ -932,12 +1303,23 @@ L(14bytes): mov -14(%rdi), %rax mov -14(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -945,14 +1327,20 @@ L(6bytes): mov -6(%rdi), %eax mov -6(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif L(2bytes): movzwl -2(%rsi), %ecx movzwl -2(%rdi), %eax +# ifndef USE_AS_BCMP cmp %cl, %al jne L(end) and $0xffff, %eax and $0xffff, %ecx +# endif sub %ecx, %eax ret @@ -963,36 +1351,60 @@ L(67bytes): mov $-67, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(51bytes): movdqu -51(%rdi), %xmm2 movdqu -51(%rsi), %xmm1 mov $-51, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(35bytes): movdqu -35(%rsi), %xmm1 movdqu -35(%rdi), %xmm2 mov $-35, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(19bytes): mov -19(%rdi), %rax mov -19(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(11bytes): mov -11(%rdi), %rax mov -11(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1000,12 +1412,23 @@ L(15bytes): mov -15(%rdi), %rax mov -15(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1013,12 +1436,20 @@ L(7bytes): mov -7(%rdi), %eax mov -7(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1026,7 +1457,11 @@ L(3bytes): movzwl -3(%rdi), %eax movzwl -3(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin2bytes) +# endif L(1bytes): movzbl -1(%rdi), %eax movzbl -1(%rsi), %ecx @@ -1041,38 +1476,58 @@ L(68bytes): mov $-68, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(52bytes): movdqu -52(%rdi), %xmm2 movdqu -52(%rsi), %xmm1 mov $-52, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(36bytes): movdqu -36(%rdi), %xmm2 movdqu -36(%rsi), %xmm1 mov $-36, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(20bytes): movdqu -20(%rdi), %xmm2 movdqu -20(%rsi), %xmm1 mov $-20, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -4(%rsi), %ecx - -# ifndef USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp 
%eax, %ecx + sub %ecx, %eax # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) xor %eax, %eax +# endif ret # ifndef USE_AS_WMEMCMP @@ -1084,32 +1539,52 @@ L(69bytes): mov $-69, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(53bytes): movdqu -53(%rsi), %xmm1 movdqu -53(%rdi), %xmm2 mov $-53, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(37bytes): movdqu -37(%rsi), %xmm1 movdqu -37(%rdi), %xmm2 mov $-37, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(21bytes): movdqu -21(%rsi), %xmm1 movdqu -21(%rdi), %xmm2 mov $-21, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1120,32 +1595,52 @@ L(70bytes): mov $-70, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(54bytes): movdqu -54(%rsi), %xmm1 movdqu -54(%rdi), %xmm2 mov $-54, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(38bytes): movdqu -38(%rsi), %xmm1 movdqu -38(%rdi), %xmm2 mov $-38, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(22bytes): movdqu -22(%rsi), %xmm1 movdqu -22(%rdi), %xmm2 mov $-22, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1156,32 +1651,52 @@ L(71bytes): mov $-71, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(55bytes): movdqu -55(%rdi), %xmm2 movdqu -55(%rsi), %xmm1 mov $-55, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(39bytes): movdqu -39(%rdi), %xmm2 movdqu -39(%rsi), %xmm1 mov $-39, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(23bytes): movdqu -23(%rdi), %xmm2 movdqu -23(%rsi), %xmm1 mov $-23, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret # endif @@ -1193,33 +1708,53 @@ L(72bytes): mov $-72, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(56bytes): movdqu -56(%rdi), %xmm2 movdqu -56(%rsi), %xmm1 mov $-56, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(40bytes): movdqu -40(%rdi), %xmm2 movdqu -40(%rsi), %xmm1 mov $-40, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(24bytes): movdqu -24(%rdi), 
%xmm2 movdqu -24(%rsi), %xmm1 mov $-24, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rsi), %rcx mov -8(%rdi), %rax cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1232,32 +1767,52 @@ L(73bytes): mov $-73, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(57bytes): movdqu -57(%rdi), %xmm2 movdqu -57(%rsi), %xmm1 mov $-57, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(41bytes): movdqu -41(%rdi), %xmm2 movdqu -41(%rsi), %xmm1 mov $-41, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(25bytes): movdqu -25(%rdi), %xmm2 movdqu -25(%rsi), %xmm1 mov $-25, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -9(%rdi), %rax mov -9(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %ecx sub %ecx, %eax @@ -1270,35 +1825,60 @@ L(74bytes): mov $-74, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(58bytes): movdqu -58(%rdi), %xmm2 movdqu -58(%rsi), %xmm1 mov $-58, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(42bytes): movdqu -42(%rdi), %xmm2 movdqu -42(%rsi), %xmm1 mov $-42, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(26bytes): movdqu -26(%rdi), %xmm2 movdqu -26(%rsi), %xmm1 mov $-26, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -10(%rdi), %rax mov -10(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzwl -2(%rdi), %eax movzwl -2(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax + ret +# else jmp L(diffin2bytes) +# endif .p2align 4 L(75bytes): @@ -1307,37 +1887,61 @@ L(75bytes): mov $-75, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(59bytes): movdqu -59(%rdi), %xmm2 movdqu -59(%rsi), %xmm1 mov $-59, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(43bytes): movdqu -43(%rdi), %xmm2 movdqu -43(%rsi), %xmm1 mov $-43, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(27bytes): movdqu -27(%rdi), %xmm2 movdqu -27(%rsi), %xmm1 mov $-27, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -11(%rdi), %rax mov -11(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret # endif .p2align 4 @@ -1347,41 +1951,66 @@ L(76bytes): mov $-76, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif 
L(60bytes): movdqu -60(%rdi), %xmm2 movdqu -60(%rsi), %xmm1 mov $-60, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(44bytes): movdqu -44(%rdi), %xmm2 movdqu -44(%rsi), %xmm1 mov $-44, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(28bytes): movdqu -28(%rdi), %xmm2 movdqu -28(%rsi), %xmm1 mov $-28, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -12(%rdi), %rax mov -12(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rsi), %ecx -# ifndef USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp %eax, %ecx + sub %ecx, %eax # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) xor %eax, %eax +# endif ret # ifndef USE_AS_WMEMCMP @@ -1393,38 +2022,62 @@ L(77bytes): mov $-77, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(61bytes): movdqu -61(%rdi), %xmm2 movdqu -61(%rsi), %xmm1 mov $-61, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(45bytes): movdqu -45(%rdi), %xmm2 movdqu -45(%rsi), %xmm1 mov $-45, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(29bytes): movdqu -29(%rdi), %xmm2 movdqu -29(%rsi), %xmm1 mov $-29, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -13(%rdi), %rax mov -13(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1435,36 +2088,60 @@ L(78bytes): mov $-78, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(62bytes): movdqu -62(%rdi), %xmm2 movdqu -62(%rsi), %xmm1 mov $-62, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(46bytes): movdqu -46(%rdi), %xmm2 movdqu -46(%rsi), %xmm1 mov $-46, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(30bytes): movdqu -30(%rdi), %xmm2 movdqu -30(%rsi), %xmm1 mov $-30, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -14(%rdi), %rax mov -14(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1475,36 +2152,60 @@ L(79bytes): mov $-79, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(63bytes): movdqu -63(%rdi), %xmm2 movdqu -63(%rsi), %xmm1 mov $-63, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(47bytes): movdqu -47(%rdi), %xmm2 movdqu 
-47(%rsi), %xmm1 mov $-47, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(31bytes): movdqu -31(%rdi), %xmm2 movdqu -31(%rsi), %xmm1 mov $-31, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -15(%rdi), %rax mov -15(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret # endif @@ -1515,37 +2216,58 @@ L(64bytes): mov $-64, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(48bytes): movdqu -48(%rdi), %xmm2 movdqu -48(%rsi), %xmm1 mov $-48, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(32bytes): movdqu -32(%rdi), %xmm2 movdqu -32(%rsi), %xmm1 mov $-32, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -16(%rdi), %rax mov -16(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret /* * Aligned 8 bytes to avoid 2 branch "taken" in one 16 alinged code block. */ +# ifndef USE_AS_BCMP .p2align 3 L(less16bytes): movsbq %dl, %rdx @@ -1561,16 +2283,16 @@ L(diffin8bytes): shr $32, %rcx shr $32, %rax -# ifdef USE_AS_WMEMCMP +# ifdef USE_AS_WMEMCMP /* for wmemcmp */ cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax ret -# endif +# endif L(diffin4bytes): -# ifndef USE_AS_WMEMCMP +# ifndef USE_AS_WMEMCMP cmp %cx, %ax jne L(diffin2bytes) shr $16, %ecx @@ -1589,7 +2311,7 @@ L(end): and $0xff, %ecx sub %ecx, %eax ret -# else +# else /* for wmemcmp */ mov $1, %eax @@ -1601,6 +2323,15 @@ L(end): L(nequal_bigger): ret +L(unreal_case): + xor %eax, %eax + ret +# endif +# else + .p2align 4 +L(return_not_equals): + mov $1, %eax + ret L(unreal_case): xor %eax, %eax ret From patchwork Mon Sep 13 23:05:07 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44964 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 652B53858013 for ; Mon, 13 Sep 2021 23:22:57 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 652B53858013 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631575377; bh=kdvQOjsgXLyf0hgfOQw6tpKx/23I+8dDjwajFRnbWe4=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=XZuKbg3ZSMbUZUqN3Pnm+jcnv9/+jbfV20LiA63xBke0/lxs2ZURKh9pjHH2aXOpC 37qNPhq+xct9NSyELJR6z7D1b7JgXPZy/+pvSVRA1t+UJykIHtpwaTMbyxHtBy/V5y +Or/MGk/r3D9h/YN73lwK7m/BR5JSRz7bohfofec= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-il1-x133.google.com (mail-il1-x133.google.com [IPv6:2607:f8b0:4864:20::133]) by sourceware.org (Postfix) with ESMTPS id 31C313858413 for ; Mon, 13 Sep 2021 23:21:06 +0000 (GMT) DMARC-Filter: 
OpenDMARC Filter v1.4.1 sourceware.org 31C313858413 To: libc-alpha@sourceware.org Subject: [PATCH 4/5] x86_64: Add avx2 optimized bcmp implementation in bcmp-avx2.S Date: Mon, 13 Sep 2021 18:05:07 -0500 Message-Id: <20210913230506.546749-4-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210913230506.546749-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 From: Noah Goldstein Reply-To: Noah Goldstein Sender: "Libc-alpha"

No bug. This commit adds a new optimized bcmp implementation for avx2. The primary optimizations are 1) skipping the logic that finds the difference of the first mismatched byte and 2) not updating the src/dst addresses, as the not-equal logic does not need to be reused by different code areas. The entry alignment has been fixed at 64 bytes. In throughput-sensitive functions, which bcmp can potentially be, frontend loop performance is important to optimize for, and that is impossible or difficult to achieve and maintain with only a 16-byte alignment guarantee. test-memcmp, test-bcmp, and test-wmemcmp are all passing.
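As a rough illustration of why these optimizations are legal: bcmp only has to report equality, so any nonzero return value is acceptable on a mismatch and the position/sign of the first differing byte never needs to be computed. Below is a minimal C sketch of that contract only; it is not the glibc implementation, and the name generic_bcmp is purely illustrative.

#include <stddef.h>

/* Sketch of the bcmp contract: return zero iff the first n bytes of
   s1 and s2 are identical; any nonzero value is acceptable otherwise.  */
int
generic_bcmp (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1;
  const unsigned char *b = s2;
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      /* Unlike memcmp, there is no need to compute a[i] - b[i] or to
         remember where the first mismatch occurred.  */
      return 1;
  return 0;
}

A vectorized implementation can therefore combine the compare results of several vectors and test the combined result once, rather than branching per vector to locate the first differing byte.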
--- sysdeps/x86/sysdep.h | 6 +- sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S | 4 +- sysdeps/x86_64/multiarch/bcmp-avx2.S | 304 ++++++++++++++++++++- sysdeps/x86_64/multiarch/ifunc-bcmp.h | 4 +- sysdeps/x86_64/multiarch/ifunc-impl-list.c | 2 - 5 files changed, 308 insertions(+), 12 deletions(-) diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h index cac1d762fb..4895179c10 100644 --- a/sysdeps/x86/sysdep.h +++ b/sysdeps/x86/sysdep.h @@ -78,15 +78,17 @@ enum cf_protection_level #define ASM_SIZE_DIRECTIVE(name) .size name,.-name; /* Define an entry point visible from C. */ -#define ENTRY(name) \ +#define ENTRY_P2ALIGN(name, alignment) \ .globl C_SYMBOL_NAME(name); \ .type C_SYMBOL_NAME(name),@function; \ - .align ALIGNARG(4); \ + .align ALIGNARG(alignment); \ C_LABEL(name) \ cfi_startproc; \ _CET_ENDBR; \ CALL_MCOUNT +#define ENTRY(name) ENTRY_P2ALIGN (name, 4) + #undef END #define END(name) \ cfi_endproc; \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S index d742257e4e..28976daff0 100644 --- a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S @@ -1,5 +1,5 @@ -#ifndef MEMCMP -# define MEMCMP __bcmp_avx2_rtm +#ifndef BCMP +# define BCMP __bcmp_avx2_rtm #endif #define ZERO_UPPER_VEC_REGISTERS_RETURN \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2.S b/sysdeps/x86_64/multiarch/bcmp-avx2.S index 93a9a20b17..eb77ae5c4a 100644 --- a/sysdeps/x86_64/multiarch/bcmp-avx2.S +++ b/sysdeps/x86_64/multiarch/bcmp-avx2.S @@ -16,8 +16,304 @@ License along with the GNU C Library; if not, see . */ -#ifndef MEMCMP -# define MEMCMP __bcmp_avx2 -#endif +#if IS_IN (libc) + +/* bcmp is implemented as: + 1. Use ymm vector compares when possible. The only case where + vector compares is not possible for when size < VEC_SIZE + and loading from either s1 or s2 would cause a page cross. + 2. Use xmm vector compare when size >= 8 bytes. + 3. Optimistically compare up to first 4 * VEC_SIZE one at a + to check for early mismatches. Only do this if its guranteed the + work is not wasted. + 4. If size is 8 * VEC_SIZE or less, unroll the loop. + 5. Compare 4 * VEC_SIZE at a time with the aligned first memory + area. + 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. + 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. + 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. */ + +# include + +# ifndef BCMP +# define BCMP __bcmp_avx2 +# endif + +# define VPCMPEQ vpcmpeqb + +# ifndef VZEROUPPER +# define VZEROUPPER vzeroupper +# endif + +# ifndef SECTION +# define SECTION(p) p##.avx +# endif + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (BCMP, 6) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %edx +# endif + cmp $VEC_SIZE, %RDX_LP + jb L(less_vec) + + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */ + vmovdqu (%rsi), %ymm1 + VPCMPEQ (%rdi), %ymm1, %ymm1 + vpmovmskb %ymm1, %eax + incl %eax + jnz L(return_neq0) + cmpq $(VEC_SIZE * 2), %rdx + jbe L(last_1x_vec) + + /* Check second VEC no matter what. */ + vmovdqu VEC_SIZE(%rsi), %ymm2 + VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + /* If all 4 VEC where equal eax will be all 1s so incl will overflow + and set zero flag. */ + incl %eax + jnz L(return_neq0) + + /* Less than 4 * VEC. */ + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_2x_vec) + + /* Check third and fourth VEC no matter what. 
*/ + vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + vpmovmskb %ymm3, %eax + incl %eax + jnz L(return_neq0) + + vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax + jnz L(return_neq0) + + /* Go to 4x VEC loop. */ + cmpq $(VEC_SIZE * 8), %rdx + ja L(more_8x_vec) + + /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any + branches. */ + + /* Adjust rsi and rdi to avoid indexed address mode. This end up + saving a 16 bytes of code, prevents unlamination, and bottlenecks in + the AGU. */ + addq %rdx, %rsi + vmovdqu -(VEC_SIZE * 4)(%rsi), %ymm1 + vmovdqu -(VEC_SIZE * 3)(%rsi), %ymm2 + addq %rdx, %rdi + + VPCMPEQ -(VEC_SIZE * 4)(%rdi), %ymm1, %ymm1 + VPCMPEQ -(VEC_SIZE * 3)(%rdi), %ymm2, %ymm2 + + vmovdqu -(VEC_SIZE * 2)(%rsi), %ymm3 + VPCMPEQ -(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + vmovdqu -VEC_SIZE(%rsi), %ymm4 + VPCMPEQ -VEC_SIZE(%rdi), %ymm4, %ymm4 -#include "bcmp-avx2.S" + /* Reduce VEC0 - VEC4. */ + vpand %ymm1, %ymm2, %ymm2 + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax +L(return_neq0): +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + + /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte + aligned. */ + .p2align 5 +L(less_vec): + /* Check if one or less char. This is necessary for size = 0 but is + also faster for size = 1. */ + cmpl $1, %edx + jbe L(one_or_less) + + /* Check if loading one VEC from either s1 or s2 could cause a page + cross. This can have false positives but is by far the fastest + method. */ + movl %edi, %eax + orl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(page_cross_less_vec) + + /* No page cross possible. */ + vmovdqu (%rsi), %ymm2 + VPCMPEQ (%rdi), %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + incl %eax + /* Result will be zero if s1 and s2 match. Otherwise first set bit + will be first mismatch. */ + bzhil %edx, %eax, %eax + VZEROUPPER_RETURN + + /* Relatively cold but placing close to L(less_vec) for 2 byte jump + encoding. */ + .p2align 4 +L(one_or_less): + jb L(zero) + movzbl (%rsi), %ecx + movzbl (%rdi), %eax + subl %ecx, %eax + /* No ymm register was touched. */ + ret + /* Within the same 16 byte block is L(one_or_less). */ +L(zero): + xorl %eax, %eax + ret + + .p2align 4 +L(last_1x_vec): + vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm1 + VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm1, %ymm1 + vpmovmskb %ymm1, %eax + incl %eax + VZEROUPPER_RETURN + + .p2align 4 +L(last_2x_vec): + vmovdqu -(VEC_SIZE * 2)(%rsi, %rdx), %ymm1 + VPCMPEQ -(VEC_SIZE * 2)(%rdi, %rdx), %ymm1, %ymm1 + vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm2 + VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm2, %ymm2 + vpand %ymm1, %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + incl %eax + VZEROUPPER_RETURN + + .p2align 4 +L(more_8x_vec): + /* Set end of s1 in rdx. */ + leaq -(VEC_SIZE * 4)(%rdi, %rdx), %rdx + /* rsi stores s2 - s1. This allows loop to only update one pointer. + */ + subq %rdi, %rsi + /* Align s1 pointer. */ + andq $-VEC_SIZE, %rdi + /* Adjust because first 4x vec where check already. */ + subq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(loop_4x_vec): + /* rsi has s2 - s1 so get correct address by adding s1 (in rdi). 
*/ + vmovdqu (%rsi, %rdi), %ymm1 + VPCMPEQ (%rdi), %ymm1, %ymm1 + + vmovdqu VEC_SIZE(%rsi, %rdi), %ymm2 + VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2 + + vmovdqu (VEC_SIZE * 2)(%rsi, %rdi), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + + vmovdqu (VEC_SIZE * 3)(%rsi, %rdi), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4 + + vpand %ymm1, %ymm2, %ymm2 + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax + jnz L(return_neq1) + subq $-(VEC_SIZE * 4), %rdi + /* Check if s1 pointer at end. */ + cmpq %rdx, %rdi + jb L(loop_4x_vec) + + vmovdqu (VEC_SIZE * 3)(%rsi, %rdx), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm4, %ymm4 + subq %rdx, %rdi + /* rdi has 4 * VEC_SIZE - remaining length. */ + cmpl $(VEC_SIZE * 3), %edi + jae L(8x_last_1x_vec) + /* Load regardless of branch. */ + vmovdqu (VEC_SIZE * 2)(%rsi, %rdx), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm3, %ymm3 + cmpl $(VEC_SIZE * 2), %edi + jae L(8x_last_2x_vec) + /* Check last 4 VEC. */ + vmovdqu VEC_SIZE(%rsi, %rdx), %ymm1 + VPCMPEQ VEC_SIZE(%rdx), %ymm1, %ymm1 + + vmovdqu (%rsi, %rdx), %ymm2 + VPCMPEQ (%rdx), %ymm2, %ymm2 + + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm1, %ymm2, %ymm3 +L(8x_last_2x_vec): + vpand %ymm3, %ymm4, %ymm4 +L(8x_last_1x_vec): + vpmovmskb %ymm4, %eax + /* Restore s1 pointer to rdi. */ + incl %eax +L(return_neq1): + VZEROUPPER_RETURN + + /* Relatively cold case as page cross are unexpected. */ + .p2align 4 +L(page_cross_less_vec): + cmpl $16, %edx + jae L(between_16_31) + cmpl $8, %edx + ja L(between_9_15) + cmpl $4, %edx + jb L(between_2_3) + /* From 4 to 8 bytes. No branch when size == 4. */ + movl (%rdi), %eax + movl (%rsi), %ecx + subl %ecx, %eax + movl -4(%rdi, %rdx), %ecx + movl -4(%rsi, %rdx), %esi + subl %esi, %ecx + orl %ecx, %eax + ret + + .p2align 4,, 8 +L(between_9_15): + vmovq (%rdi), %xmm1 + vmovq (%rsi), %xmm2 + VPCMPEQ %xmm1, %xmm2, %xmm3 + vmovq -8(%rdi, %rdx), %xmm1 + vmovq -8(%rsi, %rdx), %xmm2 + VPCMPEQ %xmm1, %xmm2, %xmm2 + vpand %xmm2, %xmm3, %xmm3 + vpmovmskb %xmm3, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_16_31): + /* From 16 to 31 bytes. No branch when size == 16. */ + vmovdqu (%rsi), %xmm1 + VPCMPEQ (%rdi), %xmm1, %xmm1 + vmovdqu -16(%rsi, %rdx), %xmm2 + VPCMPEQ -16(%rdi, %rdx), %xmm2, %xmm2 + vpand %xmm1, %xmm2, %xmm2 + vpmovmskb %xmm2, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_2_3): + /* From 2 to 3 bytes. No branch when size == 2. */ + movzwl (%rdi), %eax + movzwl (%rsi), %ecx + subl %ecx, %eax + movzbl -1(%rdi, %rdx), %edi + movzbl -1(%rsi, %rdx), %esi + subl %edi, %esi + orl %esi, %eax + /* No ymm register was touched. 
*/ + ret +END (BCMP) +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h index b0dacd8526..f94516e5ee 100644 --- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -32,11 +32,11 @@ IFUNC_SELECTOR (void) if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && CPU_FEATURE_USABLE_P (cpu_features, BMI2) - && CPU_FEATURE_USABLE_P (cpu_features, MOVBE) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) { if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) + && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) return OPTIMIZE (evex); if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index dd0c393c7d..cda0316928 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -42,13 +42,11 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL (i, name, bcmp, IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX2) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (BMI2)), __bcmp_avx2) IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX2) && CPU_FEATURE_USABLE (BMI2) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (RTM)), __bcmp_avx2_rtm) IFUNC_IMPL_ADD (array, i, bcmp, From patchwork Mon Sep 13 23:05:08 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44966 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 413C03857C47 for ; Mon, 13 Sep 2021 23:24:22 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 413C03857C47 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631575462; bh=oC2IBbwzXpPhRKim4R2u3FAiw0euTHifnt6rWHEqIGY=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=NsX4kiA5PUGuIKBfaZSHn8U/cqZOYh/c/UgL3+F4gQxnjJlHJHCb85yK378J85d3S UNGmucMl96Xno5xKDNR/rkMey/lm3qNyjJwm/UXXPaw8pH1Pp2I51paqYmtHPDsH9O cJLWF5al1pDMcd2GZK5BjcWbrQP5VXo0VuyabliE= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-io1-xd31.google.com (mail-io1-xd31.google.com [IPv6:2607:f8b0:4864:20::d31]) by sourceware.org (Postfix) with ESMTPS id CF1A43858002 for ; Mon, 13 Sep 2021 23:21:07 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org CF1A43858002 Received: by mail-io1-xd31.google.com with SMTP id y18so14474367ioc.1 for ; Mon, 13 Sep 2021 16:21:07 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=oC2IBbwzXpPhRKim4R2u3FAiw0euTHifnt6rWHEqIGY=; b=5NsbuftGQoA9bD1uwxh51Fhi0OgOprm0HhOozc2b/50gQFhcVE13DhWObjaEui3P+h j5ZXyU3+i84hKeEjQPdWU2JFzypGHGLwWPWaLpmxHopNbgOKTTPDNFaB2FU70ICW9tq5 aqhjVSrt6qzBQuOLxSVTgg7nTx8qNImk2S2nfDi4DEVxPayOhHD9UTG8L/pjXHJBbf6U ps2z117T8S6vS3smbJq7Pzobi25+jKsQ14agS3wRYz2kZOr0xIbhU40kqRaHpmcSbAeD oRcdlS4YBSWhYv0astTudB4vrqadaMgLReEaELpKc5wx4x9XudQzzB05R96V/lHT76DH 8btA== X-Gm-Message-State: 
AOAM532/RRJaGsQEIf25ZacRemRYKAUf1OVIKqk7QmZom8/zTUctTOnw W2KSbufFVMF/f72hP8iPS/p2fRw0u70= To: libc-alpha@sourceware.org Subject: [PATCH 5/5] x86_64: Add evex optimized bcmp implementation in bcmp-evex.S Date: Mon, 13 Sep 2021 18:05:08 -0500 Message-Id: <20210913230506.546749-5-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210913230506.546749-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 From: Noah Goldstein Reply-To: Noah Goldstein Sender: "Libc-alpha"

No bug. This commit adds a new optimized bcmp implementation for evex. The primary optimizations are 1) skipping the logic that finds the difference of the first mismatched byte and 2) not updating the src/dst addresses, as the not-equal logic does not need to be reused by different code areas. The entry alignment has been fixed at 64 bytes. In throughput-sensitive functions, which bcmp can potentially be, frontend loop performance is important to optimize for, and that is impossible or difficult to achieve and maintain with only a 16-byte alignment guarantee. test-memcmp, test-bcmp, and test-wmemcmp are all passing. --- sysdeps/x86_64/multiarch/bcmp-evex.S | 305 ++++++++++++++++++++- sysdeps/x86_64/multiarch/ifunc-bcmp.h | 3 +- sysdeps/x86_64/multiarch/ifunc-impl-list.c | 1 - 3 files changed, 302 insertions(+), 7 deletions(-) diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S index ade52e8c68..1bfe824eb4 100644 --- a/sysdeps/x86_64/multiarch/bcmp-evex.S +++ b/sysdeps/x86_64/multiarch/bcmp-evex.S @@ -16,8 +16,305 @@ License along with the GNU C Library; if not, see . */ -#ifndef MEMCMP -# define MEMCMP __bcmp_evex -#endif +#if IS_IN (libc) + +/* bcmp is implemented as: + 1. Use ymm vector compares when possible. The only case where + vector compares is not possible for when size < VEC_SIZE + and loading from either s1 or s2 would cause a page cross. + 2. Use xmm vector compare when size >= 8 bytes. + 3. Optimistically compare up to first 4 * VEC_SIZE one at a + to check for early mismatches. Only do this if its guranteed the + work is not wasted. + 4. If size is 8 * VEC_SIZE or less, unroll the loop. + 5. Compare 4 * VEC_SIZE at a time with the aligned first memory + area. + 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. + 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. + 8.
Use 8 vector compares when size is 8 * VEC_SIZE or less. */ + +# include + +# ifndef BCMP +# define BCMP __bcmp_evex +# endif + +# define VMOVU vmovdqu64 +# define VPCMP vpcmpub +# define VPTEST vptestmb + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + +# define YMM0 ymm16 +# define YMM1 ymm17 +# define YMM2 ymm18 +# define YMM3 ymm19 +# define YMM4 ymm20 +# define YMM5 ymm21 +# define YMM6 ymm22 + + + .section .text.evex, "ax", @progbits +ENTRY_P2ALIGN (BCMP, 6) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %edx +# endif + cmp $VEC_SIZE, %RDX_LP + jb L(less_vec) + + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */ + VMOVU (%rsi), %YMM1 + /* Use compare not equals to directly check for mismatch. */ + VPCMP $4, (%rdi), %YMM1, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + cmpq $(VEC_SIZE * 2), %rdx + jbe L(last_1x_vec) + + /* Check second VEC no matter what. */ + VMOVU VEC_SIZE(%rsi), %YMM2 + VPCMP $4, VEC_SIZE(%rdi), %YMM2, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Less than 4 * VEC. */ + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_2x_vec) + + /* Check third and fourth VEC no matter what. */ + VMOVU (VEC_SIZE * 2)(%rsi), %YMM3 + VPCMP $4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + VMOVU (VEC_SIZE * 3)(%rsi), %YMM4 + VPCMP $4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Go to 4x VEC loop. */ + cmpq $(VEC_SIZE * 8), %rdx + ja L(more_8x_vec) + + /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any + branches. */ + + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %YMM1 + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %YMM2 + addq %rdx, %rdi + + /* Wait to load from s1 until addressed adjust due to unlamination. + */ + + /* vpxor will be all 0s if s1 and s2 are equal. Otherwise it will + have some 1s. */ + vpxorq -(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1 + vpxorq -(VEC_SIZE * 3)(%rdi), %YMM2, %YMM2 + + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + vpxorq -(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + /* Or together YMM1, YMM2, and YMM3 into YMM3. */ + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 -#include "memcmp-evex-movbe.S" + VMOVU -(VEC_SIZE)(%rsi, %rdx), %YMM4 + /* Ternary logic to xor (VEC_SIZE * 3)(%rdi) with YMM4 while oring + with YMM3. Result is stored in YMM4. */ + vpternlogd $0xde, -(VEC_SIZE)(%rdi), %YMM3, %YMM4 + /* Compare YMM4 with 0. If any 1s s1 and s2 don't match. */ + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq0): + ret + + /* Fits in padding needed to .p2align 5 L(less_vec). */ +L(last_1x_vec): + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM1 + VPCMP $4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1 + kmovd %k1, %eax + ret + + /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte + aligned. */ + .p2align 5 +L(less_vec): + /* Check if one or less char. This is necessary for size = 0 but is + also faster for size = 1. */ + cmpl $1, %edx + jbe L(one_or_less) + + /* Check if loading one VEC from either s1 or s2 could cause a page + cross. This can have false positives but is by far the fastest + method. */ + movl %edi, %eax + orl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(page_cross_less_vec) + + /* No page cross possible. */ + VMOVU (%rsi), %YMM2 + VPCMP $4, (%rdi), %YMM2, %k1 + kmovd %k1, %eax + /* Result will be zero if s1 and s2 match. Otherwise first set bit + will be first mismatch. 
*/ + bzhil %edx, %eax, %eax + ret + + /* Relatively cold but placing close to L(less_vec) for 2 byte jump + encoding. */ + .p2align 4 +L(one_or_less): + jb L(zero) + movzbl (%rsi), %ecx + movzbl (%rdi), %eax + subl %ecx, %eax + /* No ymm register was touched. */ + ret + /* Within the same 16 byte block is L(one_or_less). */ +L(zero): + xorl %eax, %eax + ret + + .p2align 4 +L(last_2x_vec): + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM1 + vpxorq -(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1 + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM2 + vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2 + VPTEST %YMM2, %YMM2, %k1 + kmovd %k1, %eax + ret + + .p2align 4 +L(more_8x_vec): + /* Set end of s1 in rdx. */ + leaq -(VEC_SIZE * 4)(%rdi, %rdx), %rdx + /* rsi stores s2 - s1. This allows loop to only update one pointer. + */ + subq %rdi, %rsi + /* Align s1 pointer. */ + andq $-VEC_SIZE, %rdi + /* Adjust because first 4x vec where check already. */ + subq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(loop_4x_vec): + VMOVU (%rsi, %rdi), %YMM1 + vpxorq (%rdi), %YMM1, %YMM1 + + VMOVU VEC_SIZE(%rsi, %rdi), %YMM2 + vpxorq VEC_SIZE(%rdi), %YMM2, %YMM2 + + VMOVU (VEC_SIZE * 2)(%rsi, %rdi), %YMM3 + vpxorq (VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 + + VMOVU (VEC_SIZE * 3)(%rsi, %rdi), %YMM4 + vpternlogd $0xde, (VEC_SIZE * 3)(%rdi), %YMM3, %YMM4 + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq2) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rdx, %rdi + jb L(loop_4x_vec) + + subq %rdx, %rdi + VMOVU (VEC_SIZE * 3)(%rsi, %rdx), %YMM4 + vpxorq (VEC_SIZE * 3)(%rdx), %YMM4, %YMM4 + /* rdi has 4 * VEC_SIZE - remaining length. */ + cmpl $(VEC_SIZE * 3), %edi + jae L(8x_last_1x_vec) + /* Load regardless of branch. */ + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + /* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while oring + with YMM4. Result is stored in YMM4. */ + vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4 + cmpl $(VEC_SIZE * 2), %edi + jae L(8x_last_2x_vec) + + VMOVU VEC_SIZE(%rsi, %rdx), %YMM2 + vpxorq VEC_SIZE(%rdx), %YMM2, %YMM2 + + VMOVU (%rsi, %rdx), %YMM1 + vpxorq (%rdx), %YMM1, %YMM1 + + vpternlogd $0xfe, %YMM1, %YMM2, %YMM4 +L(8x_last_1x_vec): +L(8x_last_2x_vec): + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq2): + ret + + /* Relatively cold case as page cross are unexpected. */ + .p2align 4 +L(page_cross_less_vec): + cmpl $16, %edx + jae L(between_16_31) + cmpl $8, %edx + ja L(between_9_15) + cmpl $4, %edx + jb L(between_2_3) + /* From 4 to 8 bytes. No branch when size == 4. */ + movl (%rdi), %eax + movl (%rsi), %ecx + subl %ecx, %eax + movl -4(%rdi, %rdx), %ecx + movl -4(%rsi, %rdx), %esi + subl %esi, %ecx + orl %ecx, %eax + ret + + .p2align 4,, 8 +L(between_9_15): + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovq (%rdi), %xmm1 + vmovq (%rsi), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm3 + vmovq -8(%rdi, %rdx), %xmm1 + vmovq -8(%rsi, %rdx), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm2 + vpand %xmm2, %xmm3, %xmm3 + vpmovmskb %xmm3, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_16_31): + /* From 16 to 31 bytes. No branch when size == 16. */ + + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovdqu (%rsi), %xmm1 + vpcmpeqb (%rdi), %xmm1, %xmm1 + vmovdqu -16(%rsi, %rdx), %xmm2 + vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2 + vpand %xmm1, %xmm2, %xmm2 + vpmovmskb %xmm2, %eax + subl $0xffff, %eax + /* No ymm register was touched. 
*/ + ret + + .p2align 4,, 8 +L(between_2_3): + /* From 2 to 3 bytes. No branch when size == 2. */ + movzwl (%rdi), %eax + movzwl (%rsi), %ecx + subl %ecx, %eax + movzbl -1(%rdi, %rdx), %edi + movzbl -1(%rsi, %rdx), %esi + subl %edi, %esi + orl %esi, %eax + /* No ymm register was touched. */ + ret +END (BCMP) +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h index f94516e5ee..51f251d0c9 100644 --- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -35,8 +35,7 @@ IFUNC_SELECTOR (void) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) { if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) - && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) return OPTIMIZE (evex); if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index cda0316928..abbb4e407f 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX512VL) && CPU_FEATURE_USABLE (AVX512BW) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (BMI2)), __bcmp_evex) IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1),