From patchwork Mon Dec 29 02:41:49 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zheng Ziyang <zheng.ziyang@zte.com.cn>
X-Patchwork-Id: 127137
From: Zheng Ziyang <zheng.ziyang@zte.com.cn>
To: libc-alpha@sourceware.org
Cc: adhemerval.zanella@linaro.org, bergner@tenstorrent.com,
 palmer@dabbelt.com,
 darius@bluespec.com, schwab@linux-m68k.org,
 zhengziyang <zheng.ziyang@zte.com.cn>
Subject: [PATCH] riscv: Add optimised memcmp implementation using RVV
 extension
Date: Mon, 29 Dec 2025 10:41:49 +0800
Message-Id: <20251229024149.832-1-zheng.ziyang@zte.com.cn>
X-Mailer: git-send-email 2.21.0.windows.1
MIME-Version: 1.0

From: zhengziyang <zheng.ziyang@zte.com.cn>

This patch adds an optimised memcmp implementation for RISC-V using the
RVV extension.

It dispatches based on buffer length N:
- Fully unrolled scalar path for 1<=N<=4 to avoid vector setup overhead
  (N=0 returns 0 immediately)
- Vector processing with LMUL=m2 for N>4; the tail is handled by one
  more pass with VL reduced via vsetvli (no mask register is needed)
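
As a rough C model of this dispatch (a sketch only, not the assembly in
this patch): `memcmp_small` and `memcmp_dispatch_model` are hypothetical
names used for illustration, and the N>4 branch simply calls the libc
memcmp as a stand-in for the vector path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the unrolled scalar path; valid for 1 <= n <= 4.
   Each step returns as soon as a byte differs or the length is reached.  */
static int
memcmp_small (const unsigned char *a, const unsigned char *b, size_t n)
{
  assert (n >= 1 && n <= 4);
  if (a[0] != b[0] || n == 1)
    return a[0] - b[0];
  if (a[1] != b[1] || n == 2)
    return a[1] - b[1];
  if (a[2] != b[2] || n == 3)
    return a[2] - b[2];
  return a[3] - b[3];
}

/* Hypothetical model of the length dispatch described above.  */
static int
memcmp_dispatch_model (const void *s1, const void *s2, size_t n)
{
  if (n == 0)
    return 0;
  if (n < 5)
    return memcmp_small (s1, s2, n);
  /* Stand-in for the vector path (assumption: any correct memcmp).  */
  return memcmp (s1, s2, n);
}
```

Bytes are compared as unsigned char, so the sign of the result follows
the usual memcmp contract.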

Key optimisation techniques:
1. Fast Mismatch Detection: Uses vmsne.vv (set mask bit where elements
   differ) combined with vfirst.m to obtain the index of the first
   differing byte within a vector register group, eliminating scalar
   comparison loops.

2. Efficient Vector Loop: Processes up to 2 * VLEN / 8 bytes per
   iteration with LMUL=m2, to improve memory throughput for medium and
   large buffers.

3. Low-overhead Scalar Path: For tiny buffers (1-4 bytes), bypasses
   vector setup entirely with a fully unrolled linear instruction
   sequence, avoiding loop control overhead (increment/compare/branch).

4. Clean Tail Handling: Remaining bytes that do not fill a full vector
   register group are processed by a single vector load/compare with
   VL set to the remaining length, avoiding separate tail loops.
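
The loop structure can be modelled in plain C as below. This is a
sketch under stated assumptions, not the patch's code: `first_mismatch`
and `memcmp_chunked_model` are hypothetical names, `vlmax` stands in
for the vector length granted by vsetvli, and the model assumes
vl = min(n, vlmax), which is a simplification of what the RVV
specification permits.

```c
#include <assert.h>
#include <stddef.h>

/* Models vmsne.vv followed by vfirst.m: index of the first differing
   byte in [0, vl), or -1 if the two chunks are equal.  */
static long
first_mismatch (const unsigned char *a, const unsigned char *b, size_t vl)
{
  for (size_t i = 0; i < vl; i++)
    if (a[i] != b[i])
      return (long) i;
  return -1;
}

/* Hypothetical chunked model; vlmax is an assumed stand-in for the
   number of bytes one e8/m2 register group holds.  */
static int
memcmp_chunked_model (const void *s1, const void *s2, size_t n,
                      size_t vlmax)
{
  const unsigned char *a = s1;
  const unsigned char *b = s2;
  assert (vlmax > 0);
  while (n > 0)
    {
      /* Models vsetvli: the last pass gets a shorter vl (the tail).  */
      size_t vl = n < vlmax ? n : vlmax;
      long idx = first_mismatch (a, b, vl);
      if (idx >= 0)
        return a[idx] - b[idx];
      a += vl;
      b += vl;
      n -= vl;
    }
  return 0;
}
```

The tail needs no special case here: it is simply the final iteration
with a smaller vl, which is the same idea the assembly expresses with a
second vsetvli.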

The implementation assumes RVV 1.0 with VLEN >= 128 and is otherwise
VLEN-agnostic (the vector length always comes from vsetvli). It works
on both RV32 and RV64 platforms. No page-size assumptions are made.
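
For reference, the largest chunk one iteration can cover follows from
the RVV 1.0 relation VLMAX = LMUL * VLEN / SEW. A small sketch of that
arithmetic (`vlmax_bytes` is a hypothetical helper, not part of the
patch):

```c
#include <assert.h>
#include <stddef.h>

/* With SEW=8 each element is one byte, so VLMAX in elements equals the
   number of bytes per register group: LMUL * VLEN / 8.  */
static size_t
vlmax_bytes (size_t vlen_bits, size_t lmul)
{
  assert (vlen_bits % 8 == 0);
  return lmul * vlen_bits / 8;
}
```

With LMUL=2 this gives at most 32 bytes per iteration for VLEN=128 and
64 bytes for VLEN=256.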

Performance improvements (relative speedup %) over __memcmp_generic
baseline:

| Test Category      | Config (VLEN, bits)   | vs. __memcmp_generic |
|--------------------|-----------------------|----------------------|
| **memcmp-default** | XuanTie C920 (128)    | +54.6%               |
|                    | Spacemit(R) X60 (256) | +44.8%               |

Signed-off-by: Zheng Ziyang <zheng.ziyang@zte.com.cn>
---
 string/memcmp.c                               |   4 +-
 sysdeps/riscv/multiarch/memcmp-generic.c      |  26 +++
 sysdeps/riscv/multiarch/memcmp_vector.S       | 161 ++++++++++++++++++
 .../unix/sysv/linux/riscv/multiarch/Makefile  |   3 +
 .../linux/riscv/multiarch/ifunc-impl-list.c   |   5 +
 .../unix/sysv/linux/riscv/multiarch/memcmp.c  |  57 +++++++
 6 files changed, 254 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/memcmp-generic.c
 create mode 100644 sysdeps/riscv/multiarch/memcmp_vector.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c

-- 
2.21.0.windows.1

diff --git a/string/memcmp.c b/string/memcmp.c
index cd595ce95e..5f8b0698f0 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -353,9 +353,9 @@ MEMCMP (const void *s1, const void *s2, size_t len)
 libc_hidden_builtin_def(memcmp)
 #ifdef weak_alias
 # undef bcmp
-weak_alias (memcmp, bcmp)
+weak_alias (MEMCMP, bcmp)
 #endif
 
 #undef __memcmpeq
-strong_alias (memcmp, __memcmpeq)
+strong_alias (MEMCMP, __memcmpeq)
 libc_hidden_def(__memcmpeq)
diff --git a/sysdeps/riscv/multiarch/memcmp-generic.c b/sysdeps/riscv/multiarch/memcmp-generic.c
new file mode 100644
index 0000000000..a5ddb2bec8
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcmp-generic.c
@@ -0,0 +1,26 @@
+/* Re-include the default memcmp implementation.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#if IS_IN(libc)
+# define MEMCMP __memcmp_generic
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(x)
+#endif
+#include <string/memcmp.c>
diff --git a/sysdeps/riscv/multiarch/memcmp_vector.S b/sysdeps/riscv/multiarch/memcmp_vector.S
new file mode 100644
index 0000000000..0fdbe28fd6
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcmp_vector.S
@@ -0,0 +1,161 @@
+/* memcmp for RISC-V, ignoring buffer alignment
+   Copyright (C) 2024-2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.
+*/
+
+/* Optimised memcmp for riscv with vector extension */
+/*
+ * Core Design Logic:
+ * 1. Vector Processing: Use the Vector Extension (RVV) with LMUL=m2 to
+ * process data in large chunks, improving memory throughput for buffers
+ * larger than 4 bytes.
+ * 2. Fast Mismatch Detection: Use 'vmsne.vv' (set mask if not equal) and
+ * 'vfirst.m' to obtain the index of the first differing byte within a
+ * vector register group without a scalar loop.
+ * 3. Early Scalar Threshold: For very small buffers (1 to 4 bytes), skip
+ * vector setup overhead entirely and jump to a specialized scalar path.
+ * 4. Fully Unrolled Scalar Path: The scalar handler for small sizes avoids
+ * loop control overhead (increment/compare/branch) by unrolling the
+ * comparison logic into a linear sequence of instructions.
+ * 5. Tail Handling: Remaining bytes that do not fill a full vector group are
+ * handled by one more load/compare with VL reduced by vsetvli (no mask used).
+ */
+
+#include <sysdep.h>
+
+.text
+.p2align 2
+ENTRY(__memcmp_vector)
+
+    /* Function Signature: int memcmp(const void *s1, const void *s2, size_t n)
+     * a0: const void *s1
+     * a1: const void *s2
+     * a2: size_t n
+     */
+
+    /* ----------------------------------------------------
+     * Initial Checks & Optimizations
+     * ---------------------------------------------------- */
+    /* If length is 0, return 0 immediately.  */
+    beqz        a2, .Lret_zero
+
+    /* Small Size Optimization:
+     * If length < 5 (i.e., 1, 2, 3, 4 bytes), the overhead of configuring
+     * the vector unit (vsetvli) outweighs the benefit. Jump to the
+     * unrolled scalar path. */
+    li          t1, 5
+    bltu        a2, t1, .Lscalar
+
+    /* ----------------------------------------------------
+     * Vector Loop (Main Body)
+     * Processes blocks of (VLENB * 2) bytes per iteration.
+     * ---------------------------------------------------- */
+    vsetvli     t1, a2, e8, m2, ta, ma
+
+1:
+    vle8.v      v2, (a0)        /* Load vector from s1. */
+    vle8.v      v4, (a1)        /* Load vector from s2. */
+    vmsne.vv    v0, v2, v4      /* Compare: v0[i] = 1 if v2[i] != v4[i]. */
+    vfirst.m    t3, v0          /* Find index of first mismatch (-1 if none). */
+    bgez        t3, .Lvec_diff  /* If t3 >= 0, a difference was found. */
+
+    /* No difference in this block. Advance pointers and counters. */
+    add         a0, a0, t1
+    add         a1, a1, t1
+    sub         a2, a2, t1
+    bgeu        a2, t1, 1b      /* Continue loop if enough data remains. */
+
+    /* Check for remaining tail data. */
+    bnez        a2, .Ltail
+    mv          a0, zero        /* No tail, fully equal. Return 0. */
+    ret
+
+    /* ----------------------------------------------------
+     * Vector Tail Handling
+     * We know a2 > 0 and a2 < current VL. Process the rest.
+     * ---------------------------------------------------- */
+.Ltail:
+    vsetvli     t1, a2, e8, m2, ta, ma
+    vle8.v      v2, (a0)
+    vle8.v      v4, (a1)
+    vmsne.vv    v0, v2, v4
+    vfirst.m    t3, v0
+    bgez        t3, .Lvec_diff
+    mv          a0, zero        /* Tail matched. Return 0. */
+    ret
+
+    /* ----------------------------------------------------
+     * Difference Calculation (Vector Path)
+     * ---------------------------------------------------- */
+.Lvec_diff:
+    /* Mismatch found at index t3 inside the vector.
+     * Calculate absolute addresses, load bytes, and return difference. */
+    add         a0, a0, t3
+    lbu         t0, (a0)
+    add         a1, a1, t3
+    lbu         t1, (a1)
+    sub         a0, t0, t1      /* return (s1[i] - s2[i]) */
+    ret
+
+    /* ----------------------------------------------------
+     * Scalar Path (Fully Unrolled)
+     * Optimized for n = 1, 2, 3, 4.
+     * Eliminates loop branching penalties.
+     * ---------------------------------------------------- */
+.Lscalar:
+
+    /* Compare Byte 0 */
+    lbu         t0, 0(a0)
+    addi        a2, a2, -1
+    lbu         t1, 0(a1)
+    bne         t0, t1, .Lscalar_diff
+
+    /* Check termination for n=1 */
+    beqz        a2, .Lret_zero
+
+    /* Compare Byte 1 */
+    lbu         t0, 1(a0)
+    addi        a2, a2, -1
+    lbu         t1, 1(a1)
+    bne         t0, t1, .Lscalar_diff
+
+    /* Check termination for n=2 */
+    beqz        a2, .Lret_zero
+
+    /* Compare Byte 2 */
+    lbu         t0, 2(a0)
+    addi        a2, a2, -1
+    lbu         t1, 2(a1)
+    bne         t0, t1, .Lscalar_diff
+
+    /* Check termination for n=3 */
+    beqz        a2, .Lret_zero
+
+    /* Compare Byte 3 (Implicitly n=4 here) */
+    lbu         t0, 3(a0)
+    lbu         t1, 3(a1)
+    /* Fall through to subtraction logic */
+
+.Lscalar_diff:
+    sub         a0, t0, t1
+    ret
+
+.Lret_zero:
+    li          a0, 0
+    ret
+
+END(__memcmp_vector)
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
index 1d26966ded..95027df6b1 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
@@ -6,6 +6,9 @@ sysdep_routines += \
   memset \
   memset-generic \
   memset-vector \
+  memcmp \
+  memcmp-generic \
+  memcmp_vector \
   # sysdep_routines
 
 CFLAGS-memcpy_noalignment.c += -mno-strict-align
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
index 87456f3370..d5cb49aa5c 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
@@ -53,5 +53,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      __memset_vector)
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
+  IFUNC_IMPL (i, name, memcmp,
+	      IFUNC_IMPL_ADD (array, i, memcmp, rvv_enabled,
+			      __memcmp_vector)
+	      IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_generic))
+
   return 0;
 }
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c b/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c
new file mode 100644
index 0000000000..aa7db13f1f
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c
@@ -0,0 +1,57 @@
+/* Multiple versions of memcmp.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2017-2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if IS_IN (libc)
+/* Redefine memcmp so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memcmp
+# define memcmp __redirect_memcmp
+# include <stdint.h>
+# include <string.h>
+# include <ifunc-init.h>
+# include <riscv-ifunc.h>
+# include <sys/hwprobe.h>
+
+extern __typeof (__redirect_memcmp) __libc_memcmp;
+
+extern __typeof (__redirect_memcmp) __memcmp_generic attribute_hidden;
+extern __typeof (__redirect_memcmp) __memcmp_vector attribute_hidden;
+
+static inline __typeof (__redirect_memcmp) *
+select_memcmp_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
+{
+  unsigned long long int v;
+  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_IMA_EXT_0, &v) == 0
+      && (v & RISCV_HWPROBE_IMA_V) == RISCV_HWPROBE_IMA_V)
+    return __memcmp_vector;
+
+  return __memcmp_generic;
+}
+
+riscv_libc_ifunc (__libc_memcmp, select_memcmp_ifunc);
+
+# undef memcmp
+strong_alias (__libc_memcmp, memcmp);
+# ifdef SHARED
+__hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp)
+  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (memcmp);
+# endif
+#else
+# include <string/memcmp.c>
+#endif