From patchwork Mon Dec 29 02:41:49 2025
X-Patchwork-Id: 127137
From: Zheng Ziyang
To: libc-alpha@sourceware.org
Cc: adhemerval.zanella@linaro.org, bergner@tenstorrent.com, palmer@dabbelt.com, darius@bluespec.com, schwab@linux-m68k.org, zhengziyang
Subject: [PATCH] riscv: Add optimised memcmp implementation using RVV extension
Date: Mon, 29 Dec 2025 10:41:49 +0800
Message-Id: <20251229024149.832-1-zheng.ziyang@zte.com.cn>

From: zhengziyang

This patch adds an optimised memcmp implementation for RISC-V using the
RVV extension. It dispatches on the buffer length N:

- N ≤ 4: a fully unrolled scalar path that avoids vector setup overhead.
- N > 4: vector processing with LMUL=m2; the final partial chunk is
  handled by one extra iteration with a reduced vector length (VL).

Key optimisation techniques:

1. Fast mismatch detection: vmsne.vv (set if not equal) combined with
   vfirst.m locates the index of the first differing byte within the
   vector registers, eliminating scalar comparison loops.
2. Efficient vector loop: data is processed in large chunks with
   LMUL=m2, maximizing memory throughput for medium and large buffers.
3. Unrolled scalar path: for tiny buffers (1-4 bytes), vector setup is
   bypassed entirely in favour of a fully unrolled, straight-line
   instruction sequence with fixed load offsets, avoiding pointer
   increments and a backward loop branch.
4. Clean tail handling: remaining bytes that do not fill a full vector
   are processed by a single vector load/compare with a reduced VL,
   avoiding a separate tail loop.

The implementation assumes RVV 1.0 with VLEN >= 128 but is otherwise
VLEN-agnostic, and works on both RV32 and RV64. It makes no page-size
assumptions: no load ever reads past the end of either buffer.

Performance improvement (relative speedup) over the __memcmp_generic
baseline:

| Test Category      | Config (VLEN, bits)   | vs. __memcmp_generic |
|--------------------|-----------------------|----------------------|
| **memcmp-default** | XuanTie C920 (128)    | +54.6%               |
|                    | Spacemit(R) X60 (256) | +44.8%               |

Signed-off-by: Zheng Ziyang
---
 string/memcmp.c                               |   4 +-
 sysdeps/riscv/multiarch/memcmp-generic.c      |  26 +++
 sysdeps/riscv/multiarch/memcmp_vector.S       | 161 ++++++++++++++++++
 .../unix/sysv/linux/riscv/multiarch/Makefile  |   3 +
 .../linux/riscv/multiarch/ifunc-impl-list.c   |   5 +
 .../unix/sysv/linux/riscv/multiarch/memcmp.c  |  57 +++++++
 6 files changed, 254 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/memcmp-generic.c
 create mode 100644 sysdeps/riscv/multiarch/memcmp_vector.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c

-- 
2.21.0.windows.1

diff --git a/string/memcmp.c b/string/memcmp.c
index cd595ce95e..5f8b0698f0 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -353,9 +353,9 @@ MEMCMP (const void *s1, const void *s2, size_t len)
 libc_hidden_builtin_def(memcmp)
 #ifdef weak_alias
 # undef bcmp
-weak_alias (memcmp, bcmp)
+weak_alias (MEMCMP, bcmp)
 #endif
 
 #undef __memcmpeq
-strong_alias (memcmp, __memcmpeq)
+strong_alias (MEMCMP, __memcmpeq)
 libc_hidden_def(__memcmpeq)
diff --git a/sysdeps/riscv/multiarch/memcmp-generic.c b/sysdeps/riscv/multiarch/memcmp-generic.c
new file mode 100644
index 0000000000..a5ddb2bec8
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcmp-generic.c
@@ -0,0 +1,26 @@
+/* Re-include the default memcmp implementation.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#include
+
+#if IS_IN(libc)
+# define MEMCMP __memcmp_generic
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(x)
+#endif
+#include
diff --git a/sysdeps/riscv/multiarch/memcmp_vector.S b/sysdeps/riscv/multiarch/memcmp_vector.S
new file mode 100644
index 0000000000..0fdbe28fd6
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcmp_vector.S
@@ -0,0 +1,161 @@
+/* memcmp for RISC-V, ignoring buffer alignment
+   Copyright (C) 2024-2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .
+*/
+
+/* Optimised memcmp for riscv with vector extension */
+/*
+ * Core Design Logic:
+ * 1. Vector Processing: Use the Vector Extension (RVV) with LMUL=m2 to
+ *    process data in large chunks, maximizing memory throughput for buffers
+ *    larger than 4 bytes.
+ * 2. Fast Mismatch Detection: Utilize the 'vmsne.vv' (set if not equal) and
+ *    'vfirst.m' instructions to locate the index of the first
+ *    differing byte within a vector register without a scalar loop.
+ * 3. Early Scalar Threshold: For very small buffers (1 to 4 bytes), skip
+ *    vector setup overhead entirely and jump to a specialized scalar path.
+ * 4. Fully Unrolled Scalar Path: The scalar handler for small sizes avoids
+ *    loop control overhead (increment/compare/branch) by unrolling the
+ *    comparison logic into a linear sequence of instructions.
+ * 5. Tail Handling: Remaining bytes that do not fill a full vector are
+ *    handled by a single vector load/compare with a reduced VL.
+ */
+
+#include
+
+.text
+.p2align 2
+ENTRY(__memcmp_vector)
+
+	/* Function Signature: int memcmp(const void *s1, const void *s2, size_t n)
+	 * a0: const void *s1
+	 * a1: const void *s2
+	 * a2: size_t n
+	 */
+
+	/* ----------------------------------------------------
+	 * Initial Checks & Optimizations
+	 * ---------------------------------------------------- */
+	mv t2, zero
+	beqz a2, .Lret_zero	/* If length is 0, return 0 immediately. */
+
+	/* Small Size Optimization:
+	 * If length < 5 (i.e., 1, 2, 3, 4 bytes), the overhead of configuring
+	 * the vector unit (vsetvli) outweighs the benefit. Jump to the
+	 * unrolled scalar path. */
+	li t1, 5
+	bltu a2, t1, .Lscalar
+
+	/* ----------------------------------------------------
+	 * Vector Loop (Main Body)
+	 * Processes up to (VLENB * 2) bytes per iteration.
+	 * ---------------------------------------------------- */
+	vsetvli t1, a2, e8, m2, ta, ma
+
+1:
+	vle8.v v2, (a0)		/* Load vector from s1. */
+	vle8.v v4, (a1)		/* Load vector from s2. */
+	vmsne.vv v0, v2, v4	/* Compare: v0[i] = 1 if v2[i] != v4[i]. */
+	vfirst.m t3, v0		/* Find index of first mismatch (-1 if none). */
+	bgez t3, .Lvec_diff	/* If t3 >= 0, a difference was found. */
+
+	/* No difference in this block. Advance pointers and counters. */
+	add a0, a0, t1
+	add a1, a1, t1
+	sub a2, a2, t1
+	bgeu a2, t1, 1b		/* Continue loop if enough data remains. */
+
+	/* Check for remaining tail data. */
+	bnez a2, .Ltail
+	mv a0, zero		/* No tail, fully equal. Return 0. */
+	ret
+
+	/* ----------------------------------------------------
+	 * Vector Tail Handling
+	 * We know a2 > 0 and a2 < current VL. Process the rest.
+	 * ---------------------------------------------------- */
+.Ltail:
+	vsetvli t1, a2, e8, m2, ta, ma
+	vle8.v v2, (a0)
+	vle8.v v4, (a1)
+	vmsne.vv v0, v2, v4
+	vfirst.m t3, v0
+	bgez t3, .Lvec_diff
+	mv a0, zero		/* Tail matched. Return 0. */
+	ret
+
+	/* ----------------------------------------------------
+	 * Difference Calculation (Vector Path)
+	 * ---------------------------------------------------- */
+.Lvec_diff:
+	/* Mismatch found at index t3 inside the vector.
+	 * Calculate absolute addresses, load bytes, and return difference. */
+	add a0, a0, t3
+	lbu t0, (a0)
+	add a1, a1, t3
+	lbu t1, (a1)
+	sub a0, t0, t1		/* return (s1[i] - s2[i]) */
+	ret
+
+	/* ----------------------------------------------------
+	 * Scalar Path (Fully Unrolled)
+	 * Optimized for n = 1, 2, 3, 4.
+	 * Eliminates loop branching penalties.
+	 * ---------------------------------------------------- */
+.Lscalar:
+
+	/* Compare Byte 0 */
+	lbu t0, 0(a0)
+	addi a2, a2, -1
+	lbu t1, 0(a1)
+	bne t0, t1, .Lscalar_diff
+
+	/* Check termination for n=1 */
+	beqz a2, .Lret_zero
+
+	/* Compare Byte 1 */
+	lbu t0, 1(a0)
+	addi a2, a2, -1
+	lbu t1, 1(a1)
+	bne t0, t1, .Lscalar_diff
+
+	/* Check termination for n=2 */
+	beqz a2, .Lret_zero
+
+	/* Compare Byte 2 */
+	lbu t0, 2(a0)
+	addi a2, a2, -1
+	lbu t1, 2(a1)
+	bne t0, t1, .Lscalar_diff
+
+	/* Check termination for n=3 */
+	beqz a2, .Lret_zero
+
+	/* Compare Byte 3 (Implicitly n=4 here) */
+	lbu t0, 3(a0)
+	lbu t1, 3(a1)
+	/* Fall through to subtraction logic */
+
+.Lscalar_diff:
+	sub a0, t0, t1
+	ret
+
+.Lret_zero:
+	li a0, 0
+	ret
+
+END(__memcmp_vector)
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
index 1d26966ded..95027df6b1 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
@@ -6,6 +6,9 @@ sysdep_routines += \
   memset \
   memset-generic \
   memset-vector \
+  memcmp \
+  memcmp-generic \
+  memcmp_vector \
   # sysdep_routines
 
 CFLAGS-memcpy_noalignment.c += -mno-strict-align
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
index 87456f3370..d5cb49aa5c 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
@@ -53,5 +53,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      __memset_vector)
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
+  IFUNC_IMPL (i, name, memcmp,
+	      IFUNC_IMPL_ADD (array, i, memcmp, rvv_enabled,
+			      __memcmp_vector)
+	      IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_generic))
+
   return 0;
 }
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c b/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c
new file mode 100644
index 0000000000..aa7db13f1f
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/memcmp.c
@@ -0,0 +1,57 @@
+/* Multiple versions of memcmp.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2017-2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   .  */
+
+#if IS_IN (libc)
+/* Redefine memcmp so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memcmp
+# define memcmp __redirect_memcmp
+# include
+# include
+# include
+# include
+# include
+
+extern __typeof (__redirect_memcmp) __libc_memcmp;
+
+extern __typeof (__redirect_memcmp) __memcmp_generic attribute_hidden;
+extern __typeof (__redirect_memcmp) __memcmp_vector attribute_hidden;
+
+static inline __typeof (__redirect_memcmp) *
+select_memcmp_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
+{
+  unsigned long long int v;
+  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_IMA_EXT_0, &v) == 0
+      && (v & RISCV_HWPROBE_IMA_V) == RISCV_HWPROBE_IMA_V)
+    return __memcmp_vector;
+
+  return __memcmp_generic;
+}
+
+riscv_libc_ifunc (__libc_memcmp, select_memcmp_ifunc);
+
+# undef memcmp
+strong_alias (__libc_memcmp, memcmp);
+# ifdef SHARED
+__hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp)
+  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (memcmp);
+# endif
+#else
+# include
+#endif
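A portable C model of the control flow in __memcmp_vector may help reviewers check the strip-mining and tail logic without RISC-V hardware. This is a sketch under stated assumptions, not the patch's code. The names model_vsetvli, model_first_mismatch, memcmp_model and count_sign_mismatches are hypothetical, and vlmax stands for the e8/m2 VLMAX of 2 * VLENB.

RVV 1.0 lets vsetvli return any vl in [ceil(AVL/2), VLMAX] when VLMAX < AVL < 2 * VLMAX. The model deliberately picks the smallest such value. This exercises the loop's reliance on the vl returned in t1 rather than on a full VLMAX step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for `vsetvli t1, avl, e8, m2, ta, ma`.
   RVV 1.0: vl = avl if avl <= vlmax; vl = vlmax if avl >= 2 * vlmax;
   otherwise any ceil(avl / 2) <= vl <= vlmax.  This model picks the
   smallest legal value in that middle range.  */
static size_t
model_vsetvli (size_t avl, size_t vlmax)
{
  assert (vlmax > 0);
  if (avl <= vlmax)
    return avl;
  if (avl >= 2 * vlmax)
    return vlmax;
  return (avl + 1) / 2;
}

/* Models `vmsne.vv v0, v2, v4` + `vfirst.m t3, v0`: index of the first
   differing byte among the first vl bytes, or -1 if none differ.  */
static long
model_first_mismatch (const unsigned char *a, const unsigned char *b,
                      size_t vl)
{
  for (size_t i = 0; i < vl; i++)
    if (a[i] != b[i])
      return (long) i;
  return -1;
}

/* Hypothetical model of __memcmp_vector's control flow.  */
static int
memcmp_model (const void *s1, const void *s2, size_t n, size_t vlmax)
{
  const unsigned char *a = s1, *b = s2;
  size_t vl;
  long idx;

  if (n == 0)
    return 0;
  if (n < 5)
    {
      /* Same result as the unrolled .Lscalar path; lbu zero-extends,
         so bytes compare as unsigned char.  */
      for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
          return a[i] - b[i];
      return 0;
    }

  vl = model_vsetvli (n, vlmax);
  do
    {
      idx = model_first_mismatch (a, b, vl);
      if (idx >= 0)
        return a[idx] - b[idx];         /* .Lvec_diff */
      a += vl;                          /* Advance by the returned vl.  */
      b += vl;
      n -= vl;
    }
  while (n >= vl);                      /* bgeu a2, t1, 1b */

  if (n == 0)
    return 0;
  vl = model_vsetvli (n, vlmax);        /* .Ltail: n < vl <= vlmax, so vl == n.  */
  idx = model_first_mismatch (a, b, vl);
  return idx >= 0 ? a[idx] - b[idx] : 0;
}

static int
sign_of (int x)
{
  return (x > 0) - (x < 0);
}

/* Counts lengths/mismatch positions where the model's sign disagrees
   with the C library memcmp.  Bytes around 0x80 would expose a signed
   (lb instead of lbu) comparison.  */
static int
count_sign_mismatches (size_t vlmax, size_t max_n)
{
  unsigned char a[512], b[512];
  int failures = 0;

  assert (max_n <= sizeof a);
  for (size_t n = 0; n <= max_n; n++)
    for (size_t pos = 0; pos <= n; pos++)       /* pos == n: equal buffers.  */
      for (int delta = -1; delta <= 1; delta += 2)
        {
          memset (a, 0x80, n);
          memset (b, 0x80, n);
          if (pos < n)
            b[pos] = (unsigned char) (0x80 + delta);
          if (sign_of (memcmp (a, b, n))
              != sign_of (memcmp_model (a, b, n, vlmax)))
            failures++;
        }
  return failures;
}
```

The check compares only the sign of the result with the C library's memcmp, which is all the C standard specifies. It validates the algorithm, not the assembly itself. On RISC-V hardware with the patch applied, a similar loop could call the ifunc-resolved memcmp directly. Useful vlmax values are 32 and 64 (VLEN = 128 and 256 bits). Lengths between vlmax and 2 * vlmax hit the reduced-vl first iteration.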