From patchwork Wed Nov 26 01:54:58 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein <goldstein.w.n@gmail.com>
X-Patchwork-Id: 125294
Return-Path: <libc-alpha-bounces~patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 93ED13858C55
	for <patchwork@sourceware.org>; Wed, 26 Nov 2025 01:57:26 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 93ED13858C55
Authentication-Results: sourceware.org;
	dkim=pass (2048-bit key,
 unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256
 header.s=20230601 header.b=ZJ5t+yun
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com
 [IPv6:2607:f8b0:4864:20::432])
 by sourceware.org (Postfix) with ESMTPS id 75A943858D26
 for <libc-alpha@sourceware.org>; Wed, 26 Nov 2025 01:55:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 75A943858D26
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 75A943858D26
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2607:f8b0:4864:20::432
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122107; cv=none;
 b=xPZglPeOVie/wrmX0j4QkYUxmPOrxzDTU6Cr89D7HdCSNr1gNpI4dOe2S/7ACwA1TgTdh1Z6WqmHYq1OCqO+TBy+uuKZw/N4Vf1lZZZTf7ECrEf6ZjXQmnsONXss0BMsfM8JY6JZ/RI2t/fw00hIdd/X91ElwWd8EParokmc85A=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1764122107; c=relaxed/simple;
 bh=0DqgpxHaYNMgRJUKP0SSfLCD+IetrXuoFMWhrdQX78A=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=BrGMOwblMvM4MajscxAN4yJvfsHMOreCam2pIP6PiuliApxOFMxJ+lYU9vwBCQ7e0zy2TtD6TLFpc7BdZOtCgJESX9wQug68fQZElG0/SiORVmM5kP5ox/McEy1L0EnyzhojrrNAGJ8rr1x4GGiUdp8dFQrZR3NRNBWUUeXd7Yc=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 75A943858D26
Received: by mail-pf1-x432.google.com with SMTP id
 d2e1a72fcca58-7aab061e7cbso7121488b3a.1
 for <libc-alpha@sourceware.org>; Tue, 25 Nov 2025 17:55:06 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1764122105; x=1764726905; darn=sourceware.org;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=iW3Gk7KW12bG7XwYEd0YCRsrZYhb8l/nvGtkQ3fPqW8=;
 b=ZJ5t+yunSXm2uqw0YMptHSWV/BCKJQFWEBwGS0sfy9zFuFv608+bfaRqTcmLq3l18M
 31ecrUQpJ+BSfudM9IEraVQkPxr87ZnYRhR8dE2lR8MVphAeVwOkIahjokSBORoXqdqR
 XJCV+fiP65ZYYbUQS122VHYNmM3sY3o3WPGwv3K46je3MnKYpUIMS6iWVcuNO2xm50tN
 YvBd0Dj6j3dG494laAEyDEulPZuPIxWEWEhcfnNB4SwMB4gl8RcvgUXzs61qqydq8Do0
 Aanob8HCcxYi+EXLvZhYs8yjMkAC6RMvw0qb8+gKjOPxKflu+J/MPVnnlHU4v0MI9iV4
 266g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1764122105; x=1764726905;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
 :to:cc:subject:date:message-id:reply-to;
 bh=iW3Gk7KW12bG7XwYEd0YCRsrZYhb8l/nvGtkQ3fPqW8=;
 b=CULVTGEp9z8nJQ78kGlXtg5drLsanPpozuZU9qNUMPNjErKu1hlZauP4yY9LL1gMM8
 JflpVcj6r2SR+gClm6lF5vhnDL4jtu0kjIh5V7wO6zeh+/hiJk44t/8PFo1gkkf5CWpS
 xIDWHGixsdXOqbTNOUHmbhhU4emzTqvUA5e9I26MO7LTK9eBrvZy3weK+aRKff/xacJO
 AbeW0/fibmtFyZIT9WuwybURQ8d39V3mXzabWu0bAOifoD4K++A4ZDIGAr0r5mOqWqc+
 4pOQEN3S2U6vnMSRCe/TNYJEezJHPHFf1olF5/Oxn7MV2T5W7aazY4P4OHE/oeEoZ7KF
 cKSw==
X-Gm-Message-State: AOJu0YyfuSW0Ic8FPDbqUL+OS5dkssI0zuQJycmUM2eo0LC7ZSw5xZ4r
 +73LyTV7PNR7rrfIRPp6/1mFNyWK+1BWiRX/BzrWizSNnq9LQLZz6sS78W4FpGrVUfA=
X-Gm-Gg: ASbGncusEaqVwn5rVD9mm/cVk9Y0d5T8kWMEbcwaHmiNxW/T6u3ucxiK3lC0yhDy+/6
 4Lij1zXvSS8R7nKfR1KKhFfPY5P9/pjWkKBoLlaX7rkEEzAV01/UdaDveMiGR3e4YXdYAo/1lQF
 FHxcjcFq31BgHFgDYiTXdSsJu/CHdNhX0sYV/U4yYQD+9rNVdnxk0sxPdv6stuc9/s/L3eZtHX9
 TAP5bFs7LgDcWcZsARhbV8f/eS9lUX2AHGf2uSmcIHzGcGxXawtg9BzfrNvQYTLKoFFnOT1l746
 Jsf/JRRP5ijgo9XadnQNPm2qZ/5f5nGKvc/1Y4zKRSlKVLjRfMNahHV+HOFY5+YUUXjzTB0dSvs
 JNpFm/jTFKCWy4Vcx0LwYoWuYqNr8ycZalGKOJYDhM7qsG3cix6tjMP1HfRYs+ZlF37hqcTxJwp
 2f5kO150QbM5WIYIQ2g1x2Po+S1UaFDuEeO86aFJYkvLn8aWyci7Ff/fkyPWQgRQ==
X-Google-Smtp-Source: 
 AGHT+IHGA1ruzRP3HzY4SlbOaR/FZCQqM0vEnzsexvbADuCjK5ChcJ6isoy9RpJ7wybSH1SfevyrTA==
X-Received: by 2002:a05:6a20:729c:b0:334:8d1f:fa8d with SMTP id
 adf61e73a8af0-3614eb84247mr17912591637.18.1764122105034;
 Tue, 25 Nov 2025 17:55:05 -0800 (PST)
Received: from localhost.localdomain ([203.149.208.29])
 by smtp.gmail.com with ESMTPSA id
 41be03b00d2f7-bd760dbc99asm17508540a12.30.2025.11.25.17.55.03
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 25 Nov 2025 17:55:04 -0800 (PST)
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org,
 DJ Delorie <dj@redhat.com>
Subject: [PATCH v4 1/3] x86/string: Factor out large memmove implementation to
 separate file
Date: Tue, 25 Nov 2025 20:54:58 -0500
Message-ID: <20251126015500.82591-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20251115093318.830179-1-goldstein.w.n@gmail.com>
References: <20251115093318.830179-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org

This enables us to support multiple implementations for large copies
(size greater than the non-temporal threshold).

This patch has no effect on the resulting libc.so library.

Reviewed-by: DJ Delorie <dj@redhat.com>
---
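
Note for reviewers: the path selection at L(large_memcpy) in the moved code
can be modelled in C roughly as below. This is only a sketch read off the
assembly comments, not code from glibc; classify_large_copy, enum large_path
and the explicit vec_size and nt_threshold parameters are hypothetical names
introduced for illustration, and a 4 KiB page is assumed. As in the assembly,
the case dst > src with overlap is assumed to have been handled earlier.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define LOG_4X_MEMCPY_THRESH 4

/* Hypothetical labels for the four possible outcomes.  */
enum large_path { NOT_LARGE, FORWARD_COPY, LARGE_2X, LARGE_4X };

static enum large_path
classify_large_copy (uintptr_t dst, uintptr_t src, size_t size,
                     size_t nt_threshold, unsigned vec_size)
{
  /* Below the non-temporal threshold: not a large copy.  */
  if (size < nt_threshold)
    return NOT_LARGE;

  /* src - dst with unsigned wraparound.  If size exceeds it, src > dst
     and the regions overlap, so a plain forward copy is used.  */
  uintptr_t src_minus_dst = src - dst;
  if (size > src_minus_dst)
    return FORWARD_COPY;

  /* The first 64 bytes are stored unaligned, then dst is rounded up to
     the next 64-byte boundary and the length shrinks accordingly.  */
  size -= 64 - (dst & 63);

  /* ~(src - dst) == dst - src - 1; only the low 32 bits are tested.  */
  uint32_t diff = (uint32_t) ~src_minus_dst;
  if ((diff & (PAGE_SIZE - vec_size * 8)) == 0)
    return LARGE_4X;	/* Page aliasing.  */

  if (size >= (nt_threshold << LOG_4X_MEMCPY_THRESH))
    return LARGE_4X;	/* Far beyond the threshold.  */

  return LARGE_2X;
}
```

One consequence of the length adjustment is that a copy of exactly
16 * threshold bytes still takes the 2x path, since the comparison is made
after the alignment bytes have been subtracted.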
 .../memmove-vec-large-page-unrolled.S         | 290 ++++++++++++++++++
 .../multiarch/memmove-vec-unaligned-erms.S    | 272 +---------------
 2 files changed, 297 insertions(+), 265 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S

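The loop bookkeeping in the 2x and 4x paths (outer counter in r10, inner
counter in ecx, tail remainder in edx) amounts to the arithmetic sketched
below. Again this is a hedged illustration rather than glibc code:
plan_large_copy and struct large_plan are hypothetical, pages_log2 is 1 for
the 2x path and 2 for the 4x path, and size is the length after the
destination has been aligned.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u
#define LOG_PAGE_SIZE 12

/* Hypothetical container for the three loop quantities.  */
struct large_plan
{
  size_t outer_iters;	/* r10: blocks of 2 or 4 pages.  */
  unsigned inner_iters;	/* ecx: loads per page.  */
  unsigned tail_bytes;	/* edx: bytes left for the tail.  */
};

static struct large_plan
plan_large_copy (size_t size, unsigned pages_log2, unsigned vec_size)
{
  /* LARGE_LOAD_SIZE: 2 vectors for 64-byte vectors, otherwise 4.  */
  unsigned large_load_size = vec_size == 64 ? vec_size * 2 : vec_size * 4;
  size_t block = (size_t) PAGE_SIZE << pages_log2;
  struct large_plan p;

  p.outer_iters = size >> (LOG_PAGE_SIZE + pages_log2);
  p.inner_iters = PAGE_SIZE / large_load_size;
  p.tail_bytes = (unsigned) (size & (block - 1));
  return p;
}
```

For example, a 2 MiB copy to a 64-byte aligned destination with 32-byte
vectors on the 2x path has size 2097152 - 64 after alignment, giving 255
outer iterations of 32 inner iterations each and an 8128-byte tail.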
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S
new file mode 100644
index 0000000000..21ae89e800
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S
@@ -0,0 +1,290 @@
+/* Non-Temporal page unrolled large memmove implementation.
+   Copyright (C) 2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifdef MEMMOVE_LARGE_IMPL
+# error "Multiple large memmove impls included!"
+#endif
+#define MEMMOVE_LARGE_IMPL	1
+
+/* Copies large regions by copying multiple pages at once.  This is
+   beneficial on some older Intel hardware (Broadwell, Skylake, and
+   Icelake).
+   1. If size < 16 * __x86_shared_non_temporal_threshold and
+      source and destination do not page alias, copy from 2 pages
+      at once using non-temporal stores. Page aliasing in this case is
+      considered true if destination's page alignment - source's page
+      alignment is less than 8 * VEC_SIZE.
+   2. If size >= 16 * __x86_shared_non_temporal_threshold or source
+      and destination do page alias, copy from 4 pages at once using
+      non-temporal stores.  */
+
+#ifndef LOG_PAGE_SIZE
+# define LOG_PAGE_SIZE	12
+#endif
+
+#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
+# error Invalid LOG_PAGE_SIZE
+#endif
+
+/* Bytes per page for large_memcpy inner loop.  */
+#if VEC_SIZE == 64
+# define LARGE_LOAD_SIZE	(VEC_SIZE * 2)
+#else
+# define LARGE_LOAD_SIZE	(VEC_SIZE * 4)
+#endif
+
+/* Amount to shift __x86_shared_non_temporal_threshold left by to
+   get the bound for L(large_memcpy_4x). This is essentially used to
+   indicate that the copy is far beyond the scope of L3
+   (assuming no user config x86_non_temporal_threshold) and to
+   use a more aggressively unrolled loop.  NB: before
+   increasing the value also update initialization of
+   x86_non_temporal_threshold.  */
+#ifndef LOG_4X_MEMCPY_THRESH
+# define LOG_4X_MEMCPY_THRESH	4
+#endif
+
+#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, ...)	\
+	VMOVU	(offset)base, vec0;	\
+	VMOVU	((offset) + VEC_SIZE)base, vec1;
+# define STORE_ONE_SET(base, offset, vec0, vec1, ...)	\
+	VMOVNT	vec0, (offset)base;	\
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base;
+#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
+# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3)	\
+	VMOVU	(offset)base, vec0;	\
+	VMOVU	((offset) + VEC_SIZE)base, vec1;	\
+	VMOVU	((offset) + VEC_SIZE * 2)base, vec2;	\
+	VMOVU	((offset) + VEC_SIZE * 3)base, vec3;
+# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3)	\
+	VMOVNT	vec0, (offset)base;	\
+	VMOVNT	vec1, ((offset) + VEC_SIZE)base;	\
+	VMOVNT	vec2, ((offset) + VEC_SIZE * 2)base;	\
+	VMOVNT	vec3, ((offset) + VEC_SIZE * 3)base;
+#else
+# error Invalid LARGE_LOAD_SIZE
+#endif
+
+	.p2align 4,, 10
+#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+L(large_memcpy_check):
+	/* Entry from L(large_memcpy) has a redundant load of
+	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy)
+	   is only used for the non-erms memmove which is generally less
+	   common.  */
+L(large_memcpy):
+	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
+	cmp	%R11_LP, %RDX_LP
+	jb	L(more_8x_vec_check)
+	/* To reach this point it is impossible for dst > src and
+	   overlap. Remaining to check is src > dst and overlap. rcx
+	   already contains dst - src. Negate rcx to get src - dst. If
+	   length > rcx then there is overlap and forward copy is best.  */
+	negq	%rcx
+	cmpq	%rcx, %rdx
+	ja	L(more_8x_vec_forward)
+
+	/* Cache align destination. First store the first 64 bytes then
+	   adjust alignments.  */
+
+	/* First vec was also loaded into VEC(0).  */
+# if VEC_SIZE < 64
+	VMOVU	VEC_SIZE(%rsi), %VMM(1)
+#  if VEC_SIZE < 32
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
+#  endif
+# endif
+	VMOVU	%VMM(0), (%rdi)
+# if VEC_SIZE < 64
+	VMOVU	%VMM(1), VEC_SIZE(%rdi)
+#  if VEC_SIZE < 32
+	VMOVU	%VMM(2), (VEC_SIZE * 2)(%rdi)
+	VMOVU	%VMM(3), (VEC_SIZE * 3)(%rdi)
+#  endif
+# endif
+
+	/* Adjust source, destination, and size.  */
+	movq	%rdi, %r8
+	andq	$63, %r8
+	/* Get the negative of offset for alignment.  */
+	subq	$64, %r8
+	/* Adjust source.  */
+	subq	%r8, %rsi
+	/* Adjust destination which should be aligned now.  */
+	subq	%r8, %rdi
+	/* Adjust length.  */
+	addq	%r8, %rdx
+
+	/* Test if source and destination addresses will alias. If they
+	   do, the larger pipeline in large_memcpy_4x alleviates the
+	   performance drop.  */
+
+	/* ecx contains -(dst - src). not ecx will return dst - src - 1
+	   which works for testing aliasing.  */
+	notl	%ecx
+	movq	%rdx, %r10
+	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
+	jz	L(large_memcpy_4x)
+
+	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
+	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.  */
+	shlq	$LOG_4X_MEMCPY_THRESH, %r11
+	cmp	%r11, %rdx
+	jae	L(large_memcpy_4x)
+
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 2 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$(LOG_PAGE_SIZE + 1), %r10
+	/* Copy 4x VEC at a time from 2 pages.  */
+	.p2align 4
+L(loop_large_memcpy_2x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_2x_inner):
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
+	PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
+	LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
+	subq	$-LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
+	STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
+	subq	$-LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_2x_inner)
+	addq	$PAGE_SIZE, %rdi
+	addq	$PAGE_SIZE, %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_2x_outer)
+	sfence
+
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_2x_end)
+
+	/* Handle the last 2 * PAGE_SIZE bytes.  */
+L(loop_large_memcpy_2x_tail):
+	/* Copy 4 * VEC at a time forward with non-temporal stores.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
+	VMOVU	(%rsi), %VMM(0)
+	VMOVU	VEC_SIZE(%rsi), %VMM(1)
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$-(VEC_SIZE * 4), %edx
+	VMOVA	%VMM(0), (%rdi)
+	VMOVA	%VMM(1), VEC_SIZE(%rdi)
+	VMOVA	%VMM(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VMM(3), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_2x_tail)
+
+L(large_memcpy_2x_end):
+	/* Store the last 4 * VEC.  */
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VMM(3)
+
+	VMOVU	%VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VMM(3), -VEC_SIZE(%rdi, %rdx)
+	VZEROUPPER_RETURN
+
+	.p2align 4
+L(large_memcpy_4x):
+	/* edx will store remainder size for copying tail.  */
+	andl	$(PAGE_SIZE * 4 - 1), %edx
+	/* r10 stores outer loop counter.  */
+	shrq	$(LOG_PAGE_SIZE + 2), %r10
+	/* Copy 4x VEC at a time from 4 pages.  */
+	.p2align 4
+L(loop_large_memcpy_4x_outer):
+	/* ecx stores inner loop counter.  */
+	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
+L(loop_large_memcpy_4x_inner):
+	/* Only one prefetch set per page as doing 4 pages gives more
+	   time for prefetcher to keep up.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
+	/* Load vectors from rsi.  */
+	LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
+	LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
+	LOAD_ONE_SET ((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11))
+	LOAD_ONE_SET ((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15))
+	subq	$-LARGE_LOAD_SIZE, %rsi
+	/* Non-temporal store vectors to rdi.  */
+	STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
+	STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
+	STORE_ONE_SET ((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11))
+	STORE_ONE_SET ((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15))
+	subq	$-LARGE_LOAD_SIZE, %rdi
+	decl	%ecx
+	jnz	L(loop_large_memcpy_4x_inner)
+	addq	$(PAGE_SIZE * 3), %rdi
+	addq	$(PAGE_SIZE * 3), %rsi
+	decq	%r10
+	jne	L(loop_large_memcpy_4x_outer)
+	sfence
+	/* Check if only last 4 loads are needed.  */
+	cmpl	$(VEC_SIZE * 4), %edx
+	jbe	L(large_memcpy_4x_end)
+
+	/* Handle the last 4 * PAGE_SIZE bytes.  */
+L(loop_large_memcpy_4x_tail):
+	/* Copy 4 * VEC at a time forward with non-temporal stores.  */
+	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
+	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
+	VMOVU	(%rsi), %VMM(0)
+	VMOVU	VEC_SIZE(%rsi), %VMM(1)
+	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
+	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
+	subq	$-(VEC_SIZE * 4), %rsi
+	addl	$-(VEC_SIZE * 4), %edx
+	VMOVA	%VMM(0), (%rdi)
+	VMOVA	%VMM(1), VEC_SIZE(%rdi)
+	VMOVA	%VMM(2), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VMM(3), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpl	$(VEC_SIZE * 4), %edx
+	ja	L(loop_large_memcpy_4x_tail)
+
+L(large_memcpy_4x_end):
+	/* Store the last 4 * VEC.  */
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0)
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1)
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2)
+	VMOVU	-VEC_SIZE(%rsi, %rdx), %VMM(3)
+
+	VMOVU	%VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx)
+	VMOVU	%VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx)
+	VMOVU	%VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx)
+	VMOVU	%VMM(3), -VEC_SIZE(%rdi, %rdx)
+	VZEROUPPER_RETURN
+#endif
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 5cd8a6286e..70d303687c 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -34,17 +34,8 @@
       __x86_rep_movsb_threshold and less than
       __x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
    7. If size >= __x86_shared_non_temporal_threshold and there is no
-      overlap between destination and source, use non-temporal store
-      instead of aligned store copying from either 2 or 4 pages at
-      once.
-   8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
-      and source and destination do not page alias, copy from 2 pages
-      at once using non-temporal stores. Page aliasing in this case is
-      considered true if destination's page alignment - sources' page
-      alignment is less than 8 * VEC_SIZE.
-   9. If size >= 16 * __x86_shared_non_temporal_threshold or source
-      and destination do page alias copy from 4 pages at once using
-      non-temporal stores.  */
+      overlap between destination and source, the exact method varies
+      and is set with MEMMOVE_VEC_LARGE_IMPL.  */
 
 #include <sysdep.h>
 
@@ -95,31 +86,6 @@
 # error Unsupported PAGE_SIZE
 #endif
 
-#ifndef LOG_PAGE_SIZE
-# define LOG_PAGE_SIZE 12
-#endif
-
-#if PAGE_SIZE != (1 << LOG_PAGE_SIZE)
-# error Invalid LOG_PAGE_SIZE
-#endif
-
-/* Byte per page for large_memcpy inner loop.  */
-#if VEC_SIZE == 64
-# define LARGE_LOAD_SIZE (VEC_SIZE * 2)
-#else
-# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
-#endif
-
-/* Amount to shift __x86_shared_non_temporal_threshold by for
-   bound for memcpy_large_4x. This is essentially use to to
-   indicate that the copy is far beyond the scope of L3
-   (assuming no user config x86_non_temporal_threshold) and to
-   use a more aggressively unrolled loop.  NB: before
-   increasing the value also update initialization of
-   x86_non_temporal_threshold.  */
-#ifndef LOG_4X_MEMCPY_THRESH
-# define LOG_4X_MEMCPY_THRESH 4
-#endif
 
 /* Avoid short distance rep movsb only with non-SSE vector.  */
 #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
@@ -160,26 +126,8 @@
 # error Unsupported PREFETCH_SIZE!
 #endif
 
-#if LARGE_LOAD_SIZE == (VEC_SIZE * 2)
-# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \
-	VMOVU	(offset)base, vec0; \
-	VMOVU	((offset) + VEC_SIZE)base, vec1;
-# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \
-	VMOVNT  vec0, (offset)base; \
-	VMOVNT  vec1, ((offset) + VEC_SIZE)base;
-#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4)
-# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
-	VMOVU	(offset)base, vec0; \
-	VMOVU	((offset) + VEC_SIZE)base, vec1; \
-	VMOVU	((offset) + VEC_SIZE * 2)base, vec2; \
-	VMOVU	((offset) + VEC_SIZE * 3)base, vec3;
-# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \
-	VMOVNT	vec0, (offset)base; \
-	VMOVNT	vec1, ((offset) + VEC_SIZE)base; \
-	VMOVNT	vec2, ((offset) + VEC_SIZE * 2)base; \
-	VMOVNT	vec3, ((offset) + VEC_SIZE * 3)base;
-#else
-# error Invalid LARGE_LOAD_SIZE
+#ifndef MEMMOVE_VEC_LARGE_IMPL
+# define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large-page-unrolled.S"
 #endif
 
 #ifndef SECTION
@@ -426,7 +374,7 @@ L(more_8x_vec):
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
 	/* Check non-temporal store threshold.  */
 	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
-	ja	L(large_memcpy_2x)
+	ja	L(large_memcpy)
 #endif
 	/* To reach this point there cannot be overlap and dst > src. So
 	   check for overlap and src > dst in which case correctness
@@ -613,7 +561,7 @@ L(movsb):
 	/* If above __x86_rep_movsb_stop_threshold most likely is
 	   candidate for NT moves as well.  */
 	cmp	__x86_rep_movsb_stop_threshold(%rip), %RDX_LP
-	jae	L(large_memcpy_2x_check)
+	jae	L(large_memcpy_check)
 # if AVOID_SHORT_DISTANCE_REP_MOVSB || ALIGN_MOVSB
 	/* Only avoid short movsb if CPU has FSRM.  */
 #  if X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB < 256
@@ -673,214 +621,8 @@ L(skip_short_movsb_check):
 # endif
 #endif
 
-	.p2align 4,, 10
-#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
-L(large_memcpy_2x_check):
-	/* Entry from L(large_memcpy_2x) has a redundant load of
-	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
-	   is only use for the non-erms memmove which is generally less
-	   common.  */
-L(large_memcpy_2x):
-	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
-	cmp	%R11_LP, %RDX_LP
-	jb	L(more_8x_vec_check)
-	/* To reach this point it is impossible for dst > src and
-	   overlap. Remaining to check is src > dst and overlap. rcx
-	   already contains dst - src. Negate rcx to get src - dst. If
-	   length > rcx then there is overlap and forward copy is best.  */
-	negq	%rcx
-	cmpq	%rcx, %rdx
-	ja	L(more_8x_vec_forward)
-
-	/* Cache align destination. First store the first 64 bytes then
-	   adjust alignments.  */
-
-	/* First vec was also loaded into VEC(0).  */
-# if VEC_SIZE < 64
-	VMOVU	VEC_SIZE(%rsi), %VMM(1)
-#  if VEC_SIZE < 32
-	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
-	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
-#  endif
-# endif
-	VMOVU	%VMM(0), (%rdi)
-# if VEC_SIZE < 64
-	VMOVU	%VMM(1), VEC_SIZE(%rdi)
-#  if VEC_SIZE < 32
-	VMOVU	%VMM(2), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VMM(3), (VEC_SIZE * 3)(%rdi)
-#  endif
-# endif
+#include MEMMOVE_VEC_LARGE_IMPL
 
-	/* Adjust source, destination, and size.  */
-	movq	%rdi, %r8
-	andq	$63, %r8
-	/* Get the negative of offset for alignment.  */
-	subq	$64, %r8
-	/* Adjust source.  */
-	subq	%r8, %rsi
-	/* Adjust destination which should be aligned now.  */
-	subq	%r8, %rdi
-	/* Adjust length.  */
-	addq	%r8, %rdx
-
-	/* Test if source and destination addresses will alias. If they
-	   do the larger pipeline in large_memcpy_4x alleviated the
-	   performance drop.  */
-
-	/* ecx contains -(dst - src). not ecx will return dst - src - 1
-	   which works for testing aliasing.  */
-	notl	%ecx
-	movq	%rdx, %r10
-	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
-	jz	L(large_memcpy_4x)
-
-	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
-	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
-	 */
-	shlq	$LOG_4X_MEMCPY_THRESH, %r11
-	cmp	%r11, %rdx
-	jae	L(large_memcpy_4x)
-
-	/* edx will store remainder size for copying tail.  */
-	andl	$(PAGE_SIZE * 2 - 1), %edx
-	/* r10 stores outer loop counter.  */
-	shrq	$(LOG_PAGE_SIZE + 1), %r10
-	/* Copy 4x VEC at a time from 2 pages.  */
-	.p2align 4
-L(loop_large_memcpy_2x_outer):
-	/* ecx stores inner loop counter.  */
-	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
-L(loop_large_memcpy_2x_inner):
-	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
-	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2)
-	/* Load vectors from rsi.  */
-	LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
-	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
-	subq	$-LARGE_LOAD_SIZE, %rsi
-	/* Non-temporal store vectors to rdi.  */
-	STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
-	STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
-	subq	$-LARGE_LOAD_SIZE, %rdi
-	decl	%ecx
-	jnz	L(loop_large_memcpy_2x_inner)
-	addq	$PAGE_SIZE, %rdi
-	addq	$PAGE_SIZE, %rsi
-	decq	%r10
-	jne	L(loop_large_memcpy_2x_outer)
-	sfence
-
-	/* Check if only last 4 loads are needed.  */
-	cmpl	$(VEC_SIZE * 4), %edx
-	jbe	L(large_memcpy_2x_end)
-
-	/* Handle the last 2 * PAGE_SIZE bytes.  */
-L(loop_large_memcpy_2x_tail):
-	/* Copy 4 * VEC a time forward with non-temporal stores.  */
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
-	VMOVU	(%rsi), %VMM(0)
-	VMOVU	VEC_SIZE(%rsi), %VMM(1)
-	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
-	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
-	subq	$-(VEC_SIZE * 4), %rsi
-	addl	$-(VEC_SIZE * 4), %edx
-	VMOVA	%VMM(0), (%rdi)
-	VMOVA	%VMM(1), VEC_SIZE(%rdi)
-	VMOVA	%VMM(2), (VEC_SIZE * 2)(%rdi)
-	VMOVA	%VMM(3), (VEC_SIZE * 3)(%rdi)
-	subq	$-(VEC_SIZE * 4), %rdi
-	cmpl	$(VEC_SIZE * 4), %edx
-	ja	L(loop_large_memcpy_2x_tail)
-
-L(large_memcpy_2x_end):
-	/* Store the last 4 * VEC.  */
-	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0)
-	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1)
-	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2)
-	VMOVU	-VEC_SIZE(%rsi, %rdx), %VMM(3)
-
-	VMOVU	%VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx)
-	VMOVU	%VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx)
-	VMOVU	%VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx)
-	VMOVU	%VMM(3), -VEC_SIZE(%rdi, %rdx)
-	VZEROUPPER_RETURN
-
-	.p2align 4
-L(large_memcpy_4x):
-	/* edx will store remainder size for copying tail.  */
-	andl	$(PAGE_SIZE * 4 - 1), %edx
-	/* r10 stores outer loop counter.  */
-	shrq	$(LOG_PAGE_SIZE + 2), %r10
-	/* Copy 4x VEC at a time from 4 pages.  */
-	.p2align 4
-L(loop_large_memcpy_4x_outer):
-	/* ecx stores inner loop counter.  */
-	movl	$(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx
-L(loop_large_memcpy_4x_inner):
-	/* Only one prefetch set per page as doing 4 pages give more
-	   time for prefetcher to keep up.  */
-	PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE)
-	/* Load vectors from rsi.  */
-	LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
-	LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
-	LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11))
-	LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15))
-	subq	$-LARGE_LOAD_SIZE, %rsi
-	/* Non-temporal store vectors to rdi.  */
-	STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3))
-	STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
-	STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11))
-	STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15))
-	subq	$-LARGE_LOAD_SIZE, %rdi
-	decl	%ecx
-	jnz	L(loop_large_memcpy_4x_inner)
-	addq	$(PAGE_SIZE * 3), %rdi
-	addq	$(PAGE_SIZE * 3), %rsi
-	decq	%r10
-	jne	L(loop_large_memcpy_4x_outer)
-	sfence
-	/* Check if only last 4 loads are needed.  */
-	cmpl	$(VEC_SIZE * 4), %edx
-	jbe	L(large_memcpy_4x_end)
-
-	/* Handle the last 4  * PAGE_SIZE bytes.  */
-L(loop_large_memcpy_4x_tail):
-	/* Copy 4 * VEC a time forward with non-temporal stores.  */
-	PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE)
-	PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE)
-	VMOVU	(%rsi), %VMM(0)
-	VMOVU	VEC_SIZE(%rsi), %VMM(1)
-	VMOVU	(VEC_SIZE * 2)(%rsi), %VMM(2)
-	VMOVU	(VEC_SIZE * 3)(%rsi), %VMM(3)
-	subq	$-(VEC_SIZE * 4), %rsi
-	addl	$-(VEC_SIZE * 4), %edx
-	VMOVA	%VMM(0), (%rdi)
-	VMOVA	%VMM(1), VEC_SIZE(%rdi)
-	VMOVA	%VMM(2), (VEC_SIZE * 2)(%rdi)
-	VMOVA	%VMM(3), (VEC_SIZE * 3)(%rdi)
-	subq	$-(VEC_SIZE * 4), %rdi
-	cmpl	$(VEC_SIZE * 4), %edx
-	ja	L(loop_large_memcpy_4x_tail)
-
-L(large_memcpy_4x_end):
-	/* Store the last 4 * VEC.  */
-	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0)
-	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1)
-	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2)
-	VMOVU	-VEC_SIZE(%rsi, %rdx), %VMM(3)
-
-	VMOVU	%VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx)
-	VMOVU	%VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx)
-	VMOVU	%VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx)
-	VMOVU	%VMM(3), -VEC_SIZE(%rdi, %rdx)
-	VZEROUPPER_RETURN
-#endif
 END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
 
 #if IS_IN (libc)

From patchwork Wed Nov 26 01:54:59 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein <goldstein.w.n@gmail.com>
X-Patchwork-Id: 125293
Return-Path: <libc-alpha-bounces~patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 6441E3858C39
	for <patchwork@sourceware.org>; Wed, 26 Nov 2025 01:57:18 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6441E3858C39
Authentication-Results: sourceware.org;
	dkim=pass (2048-bit key,
 unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256
 header.s=20230601 header.b=OktalpZk
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com
 [IPv6:2607:f8b0:4864:20::102e])
 by sourceware.org (Postfix) with ESMTPS id 209C33858D29
 for <libc-alpha@sourceware.org>; Wed, 26 Nov 2025 01:55:10 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 209C33858D29
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 209C33858D29
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2607:f8b0:4864:20::102e
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122110; cv=none;
 b=lV1nrwsbeW8U+rXGgZ2ZUkxxet/DaF5Az4M1nLlFe3DaVOiisvNTWFIc3QMjHO6ppWDtq3WBr5Ra97ZZPkv5/UJWbmdHAU2kXanR+Ma+8EMvNnjTd/G8Dr52gJ2glxg4WIl5VSKJLzuK9niv1rdo+wREm7Aa/R3dQiBOyxTgVMw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1764122110; c=relaxed/simple;
 bh=hQJDDmAN+G9yABT/+lXnSOyKQFq/Dwoht25VW7XydW0=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=rozVG1NTyfffejawEpgsYJisZ5SuSoW+UH4qtcf3+shm99ZWppBxMTpgnAfP/rIj5/vdKI+lL6jCSceCD19MTDE2Fu+CTx7kzsxydDRbFEQYb5RjAT3OKrw8DZ7/fZC/LgyfnfS38zMpWwoxAC90hubGht3lmEPe0grYaRsRQa4=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 209C33858D29
Received: by mail-pj1-x102e.google.com with SMTP id
 98e67ed59e1d1-343f52d15efso5793528a91.3
 for <libc-alpha@sourceware.org>; Tue, 25 Nov 2025 17:55:10 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1764122109; x=1764726909; darn=sourceware.org;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=elA4nuFIItH3qJUFITTMkisr3OQPTAmvZDM7L3jMZ1w=;
 b=OktalpZk355xPjy9BsKBEsGdxaRyC1R6aFrIeV9SoGhw9sbhocB44WCPUwLTJJF9rA
 TkvMOevdiWPZEGCSSgx56AFTkPv45q9D/e+iqtSxOjYRhoV5bz5BTtv2gkI+KgMoEoYf
 XkNsGr2CEUTFcIjIAHWFi/twm6P938TDfCM2WEi7VLLxT23rcpNKTFT4YMAU75ASD+6W
 nhHKAeAkNfzxs9iFzWQf0P5VTPm0HP1iNnAmlWVZLtkLxa5Igx1tuFbUfj2BVtVm+b8V
 nHpl278D738DMMDje46Gwvg9S3a1zvuKhtrW+nVzIc9ayM9PCneHefKAWyugeOo5Muxr
 RyQw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1764122109; x=1764726909;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
 :to:cc:subject:date:message-id:reply-to;
 bh=elA4nuFIItH3qJUFITTMkisr3OQPTAmvZDM7L3jMZ1w=;
 b=p5rLEWGdfH5DuNdnr29+gfQVmmGw9Q88bxHbi1mJ5cT1nW4hfQzMxnOq4JKSpat6zD
 KGABnIb3I9wtqCLwcAgAZVgZDZRd2H8mYh07dM3N+fawXNCxwFROgbZqUJXXiLNPfrSE
 3chnRC06HBdp1RGiihSHHr/105989XnSlWHFRniFbFAQajR0QmETp3aHUFKlfrDIBRDI
 2ANXbyW4qh6eRMk1DXtn0dpTvJRR7Z3/zQgV6H0+7umgdw582q01dmwhGllbQw6lgIdB
 nTHFfq7vEtDIpdIJgzJGogrmh9rUCKUZR6w7skKwysUI+cUzLRDZopVjLKow9F/KKYfF
 a3PQ==
X-Gm-Message-State: AOJu0Yz2axu3d7iJcgwdtWGI7U6RquxWGhghah9cIMQl2NMUGaCIntYS
 FZmI2gja+mWZ2xSrgRO6NcaJXtQK+4fOYgPrxVzZxFdunY1lJLh3PJtTiorT7JgdaPg=
X-Gm-Gg: ASbGncvlF9esxrlSpjQLoSViH0qNqUA+jR6AXpNvT/iIIAQiblCYqOFUx425Rdf5xKf
 M/ILfHLn/UrMA7g33Sr66ycFxCrbUlae+jYUdR94K94gAsaFobvvhXNSm5QQXzqaofMh1xFHnN3
 SRzg75wJcftnT79UjpD+Q5iJjnEFCtDT6I/cTSQnHqyC1SZfOY9jXWOpkGmemJCNPuPjbhwWjgW
 uPTMvsPucR9PW1lnwaPnps87Xcly9MkM6dKArEkpK+Cnj7rS3orO5R5ZWGGSbSb3Nmb9fqNI1sO
 1uUnY5W9/Wou035zSGTfgDrSORMvybqbwSqok//kTmIsxAk8onCgp1JXK9/rlsNPCLBsk0Nd9xJ
 3GTp6jfB2AxREFYE08z0nrzAbsBSlF5HDpyVyKE4Rt3BvkGhd3f/ixXgZMkT5JYA9sB0qhGbLh3
 jYVO81evkeqS9uD7Mts59UPa5zLWlrqnFBQN4XopBrzK+0kCIPSqwapWPfEvlfcg==
X-Google-Smtp-Source: 
 AGHT+IGWt7Xol0N/Y95Q2uTM/Zqs7eMeSe1ZvtRKvuMV9iqbDJcKArv9ev4z0KyY8B893p2qrIlrFQ==
X-Received: by 2002:a17:90b:4a91:b0:33d:a0fd:257b with SMTP id
 98e67ed59e1d1-3475ed7d931mr4490806a91.36.1764122108694;
 Tue, 25 Nov 2025 17:55:08 -0800 (PST)
Received: from localhost.localdomain ([203.149.208.29])
 by smtp.gmail.com with ESMTPSA id
 41be03b00d2f7-bd760dbc99asm17508540a12.30.2025.11.25.17.55.06
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 25 Nov 2025 17:55:08 -0800 (PST)
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org,
 DJ Delorie <dj@redhat.com>
Subject: [PATCH v4 2/3] x86/string: Use simpler approach for large memcpy [BZ
 #32475]
Date: Tue, 25 Nov 2025 20:54:59 -0500
Message-ID: <20251126015500.82591-2-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20251126015500.82591-1-goldstein.w.n@gmail.com>
References: <20251115093318.830179-1-goldstein.w.n@gmail.com>
 <20251126015500.82591-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org

The new approach does a simple 4x non-temporal loop (forwards or
backwards to avoid 4k aliasing). This is similar to what we used to
do prior to:

commit 1a8605b6cd257e8a74e29b5b71c057211f5fb847
Author: noah <goldstein.w.n@gmail.com>
Date:   Sat Apr 3 04:12:15 2021 -0400

    x86: Update large memcpy case in memmove-vec-unaligned-erms.S

But with 4k aliasing detection to avoid a known pathological slow
case.

The multi-page approach yielded 5-15% better performance for the size
ranges covered by bench-memcpy-large (roughly 64KB-32MB) on the tested
platforms but has some notable drawbacks.

The drawbacks stem from the fact that the multi-page approach is
significantly less "canonical" a form of memcpy and thus is likely to
have less reliably "good" performance on untested platforms (including
future ones) and configurations (e.g. > 2GB copies from BZ #32475).

Since there are known slow cases with the multi-page approach (that
far exceed 15%) and the multi-page approach is much more brittle, it
seems prudent to switch to this simpler, more reliable, better
future-proofed implementation.

Tested on x86_64.

Reviewed-by: DJ Delorie <dj@redhat.com>
---
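Reviewer note (below the cut line, so not part of the commit message):
the direction selection done at L(large_memcpy) can be modelled in C
roughly as follows.  This is only an illustrative sketch under stated
assumptions.  The names choose_large_copy_dir, enum large_copy_dir and
SKETCH_PAGE_SIZE are hypothetical and do not exist in glibc; the 4096
page size and the 512-byte margin are read off the assembly in this
patch.  As in the assembly, the caller is assumed to have already
ruled out the "dst > src and overlapping" case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant for this sketch; mirrors PAGE_SIZE in the asm.  */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical result type: which copy path the asm would branch to.  */
enum large_copy_dir
{
  LARGE_COPY_OVERLAP_FORWARD,	/* L(more_8x_vec_forward)  */
  LARGE_COPY_NT_FORWARD,	/* L(loop_4x_nt_forward)  */
  LARGE_COPY_NT_BACKWARD	/* L(large_backward)  */
};

/* Model of the checks at L(large_memcpy).  Precondition (assumed, not
   checked): it is not the case that dst > src with overlap.  */
static enum large_copy_dir
choose_large_copy_dir (uintptr_t dst, uintptr_t src, size_t len)
{
  /* Unsigned src - dst, like `negq %rcx` applied to dst - src.  If
     dst > src this wraps to a huge value, so the overlap test below
     is false, matching the unsigned `ja` in the asm.  */
  uintptr_t diff = src - dst;

  if (len > diff)
    return LARGE_COPY_OVERLAP_FORWARD;

  /* Only the low 12 bits matter for 4k aliasing.  A page offset
     difference just under a full page risks aliasing when copying
     forward, so copy backward instead.  */
  if ((diff & (SKETCH_PAGE_SIZE - 1)) > SKETCH_PAGE_SIZE - 512)
    return LARGE_COPY_NT_BACKWARD;

  return LARGE_COPY_NT_FORWARD;
}
```

For example, with non-overlapping buffers where (src - dst) % 4096 is
4032 the sketch picks the backward loop, while a difference of 64
picks the forward loop.  The comparison is strict, so a difference of
exactly 4096 - 512 still goes forward.
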
 sysdeps/x86_64/multiarch/memmove-vec-large.S  | 125 ++++++++++++++++++
 .../multiarch/memmove-vec-unaligned-erms.S    |   2 +-
 2 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large.S

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large.S b/sysdeps/x86_64/multiarch/memmove-vec-large.S
new file mode 100644
index 0000000000..4c398d4602
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-vec-large.S
@@ -0,0 +1,125 @@
+/* Non-Temporal large memmove implementation.
+   Copyright (C) 2025 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifdef MEMMOVE_LARGE_IMPL
+# error "Multiple large memmove impls included!"
+#endif
+#define MEMMOVE_LARGE_IMPL	1
+
+/* Copies large regions with a 4x unrolled loop of non-temporal
+   stores.  */
+
+#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+L(large_memcpy_check):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jb	L(more_8x_vec_check)
+L(large_memcpy):
+	/* To reach this point it is impossible for dst > src and
+	   overlap. Remaining to check is src > dst and overlap. rcx
+	   already contains dst - src. Negate rcx to get src - dst. If
+	   length > rcx then there is overlap and forward copy is best.  */
+	negq	%rcx
+	cmpq	%rcx, %rdx
+	ja	L(more_8x_vec_forward)
+
+	/* We are doing non-temporal copy and no overlap. Choose forward
+	   or backward copy based on avoiding 4k aliasing. ecx already
+	   contains src - dst. We check if:
+	   ((src - dst) % 4096) > (4096 - 512)
+	   If true then we risk aliasing.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - 512), %ecx
+	ja	L(large_backward)
+
+	subq	%rdi, %rsi
+
+	/* Store the first VEC.  */
+	VMOVU	%VMM(0), (%rdi)
+
+	/* Store end of buffer minus tail in rdx.  */
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+
+	/* Align DST.  */
+	orq	$(VEC_SIZE - 1), %rdi
+	incq	%rdi
+	leaq	(%rdi, %rsi), %rcx
+	/* Don't use multi-byte nop to align.  */
+	.p2align 4,, 11
+L(loop_4x_nt_forward):
+	PREFETCH_ONE_SET (1, (%rcx), VEC_SIZE * 8)
+	/* Copy 4 * VEC at a time forward.  */
+	VMOVU	(VEC_SIZE * 0)(%rcx), %VMM(1)
+	VMOVU	(VEC_SIZE * 1)(%rcx), %VMM(2)
+	VMOVU	(VEC_SIZE * 2)(%rcx), %VMM(3)
+	VMOVU	(VEC_SIZE * 3)(%rcx), %VMM(4)
+	subq	$-(VEC_SIZE * 4), %rcx
+	VMOVNT	%VMM(1), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(2), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(3), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(4), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rdi, %rdx
+	ja	L(loop_4x_nt_forward)
+	sfence
+
+	VMOVU	(VEC_SIZE * 0)(%rsi, %rdx), %VMM(1)
+	VMOVU	(VEC_SIZE * 1)(%rsi, %rdx), %VMM(2)
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdx), %VMM(3)
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdx), %VMM(4)
+	VMOVU	%VMM(1), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(2), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(3), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(4), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+
+	.p2align 4,, 10
+L(large_backward):
+	leaq	(VEC_SIZE * -4 - 1)(%rdi, %rdx), %rcx
+	VMOVU	(VEC_SIZE * -1)(%rsi, %rdx), %VMM(5)
+	VMOVU	%VMM(5), (VEC_SIZE * -1)(%rdi, %rdx)
+	andq	$-(VEC_SIZE), %rcx
+	subq	%rdi, %rsi
+	leaq	(%rsi, %rcx), %rdx
+	/* Don't use multi-byte nop to align.  */
+	.p2align 4,, 11
+L(loop_4x_nt_backward):
+	PREFETCH_ONE_SET (-1, (%rdx), -VEC_SIZE * 8)
+	VMOVU	(VEC_SIZE * 3)(%rdx), %VMM(1)
+	VMOVU	(VEC_SIZE * 2)(%rdx), %VMM(2)
+	VMOVU	(VEC_SIZE * 1)(%rdx), %VMM(3)
+	VMOVU	(VEC_SIZE * 0)(%rdx), %VMM(4)
+	addq	$(VEC_SIZE * -4), %rdx
+	VMOVNT	%VMM(1), (VEC_SIZE * 3)(%rcx)
+	VMOVNT	%VMM(2), (VEC_SIZE * 2)(%rcx)
+	VMOVNT	%VMM(3), (VEC_SIZE * 1)(%rcx)
+	VMOVNT	%VMM(4), (VEC_SIZE * 0)(%rcx)
+	addq	$(VEC_SIZE * -4), %rcx
+	cmpq	%rcx, %rdi
+	jb	L(loop_4x_nt_backward)
+
+	sfence
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdi), %VMM(4)
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdi), %VMM(3)
+	VMOVU	(VEC_SIZE * 1)(%rsi, %rdi), %VMM(2)
+	/* We already loaded VMM(0).  */
+	VMOVU	%VMM(4), (VEC_SIZE * 3)(%rdi)
+	VMOVU	%VMM(3), (VEC_SIZE * 2)(%rdi)
+	VMOVU	%VMM(2), (VEC_SIZE * 1)(%rdi)
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VZEROUPPER_RETURN
+#endif
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 70d303687c..7c4765286d 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -127,7 +127,7 @@
 #endif
 
 #ifndef MEMMOVE_VEC_LARGE_IMPL
-# define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large-page-unrolled.S"
+# define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large.S"
 #endif
 
 #ifndef SECTION

From patchwork Wed Nov 26 01:55:00 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein <goldstein.w.n@gmail.com>
X-Patchwork-Id: 125295
Return-Path: <libc-alpha-bounces~patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 0B7413858CB6
	for <patchwork@sourceware.org>; Wed, 26 Nov 2025 01:58:19 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0B7413858CB6
Authentication-Results: sourceware.org;
	dkim=pass (2048-bit key,
 unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256
 header.s=20230601 header.b=BiLlrPK6
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com
 [IPv6:2607:f8b0:4864:20::435])
 by sourceware.org (Postfix) with ESMTPS id 123333858D32
 for <libc-alpha@sourceware.org>; Wed, 26 Nov 2025 01:55:14 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 123333858D32
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 123333858D32
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2607:f8b0:4864:20::435
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122114; cv=none;
 b=cAmjIMrMyWTzRvGji4qPQ0zXu9kNjJosff8oXHliWnH6hFuGVZ+iu091Wg5CXZdo/V6W60/VqBfOVH9ImZQkovwvcsLYxgOvPpKV+0Je1dq0MAcDWRwHzIYyksuN1a2D1yIxryMoWMV6f3WFf90P24/5uNE5dYh91MEfJADcZ4w=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1764122114; c=relaxed/simple;
 bh=RFxxHY3PWY2+WIfyOUm2l5rGiDWBt3gJfJeJcnyUoFA=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=LhPNwQOm22h/YRIAAWKP7dYv8pTZg+S6gUKUPHAe5YblKa4S5o/+PAt71Ekt2yro6ubtY4Ofg1THnY5huIsNK7ji/XV/F5pENL6zQrPIct4YR/GQtWEqD9WfkgNFRsUys+m8sltcjZxbS0tD+0iNU9gz0q1l1mGS5KQwK0on2wg=
ARC-Authentication-Results: i=1; server2.sourceware.org
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 123333858D32
Received: by mail-pf1-x435.google.com with SMTP id
 d2e1a72fcca58-7bab7c997eeso6842533b3a.0
 for <libc-alpha@sourceware.org>; Tue, 25 Nov 2025 17:55:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1764122113; x=1764726913; darn=sourceware.org;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=oo8zCwLLSy8CNUWcRxlakkQ55QLD/aYZepRLcoSGyLQ=;
 b=BiLlrPK6LR/SuwDjgE5asWatjS5nJ1/DHcmbOFqrLkf1fky/eIWjytEwOSFtqS5SZ2
 VukoozQRseNAqjiSdpHxMpBQnvPxOWKAECFJS7z1m08EY2D8Z2zhkfXvlcJ+u71T5Ld2
 EfWUCw3kd2IjDYyB9gEqeFAb33sHDFyB2EEj4HPeDMyADGqooSbqR2ZU6ELWShmTIJpN
 qr3A6HQBnbUwMGakgzdHmnQWAMg0gKFVfmYHyPmIc01MSOsN3Rt5ANEdsYulQafo6Pv2
 h6WFBH3+NTfR7X5kyr0p38UnbngCSJUCruFSQE8wOABw0zsCrgrDdnSA0d5hRlIyxD1V
 /Kjg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1764122113; x=1764726913;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
 :to:cc:subject:date:message-id:reply-to;
 bh=oo8zCwLLSy8CNUWcRxlakkQ55QLD/aYZepRLcoSGyLQ=;
 b=pcD506TRKi8k+mMdfuHaki8QBz75yYjiHGvIOqp/UiT/pCw+Zd2tfhBiGC1SykRU0x
 NL8JEUaD0UOMiEcjEE12wPjZ+7W4REF89QKivKTJx8PlbDGrSxgqfesZDmA7JBkCYWja
 xSnTNukM9lekNeJWIp5KnVTS6owp3ZbdZQNJCNWiIT7VcPm0LZKgwWTB/m1PUbIGFj3Y
 jrFwi9t8KISsY2b86GEp45nwiIJ1F7ojq+uElek09Z+Wkm2MU0s2E2AHwBn4Z0y5MgUU
 gf7ATO4JUlPlX4I1ccevBxJkHW6zHUSw6Tx0wk9LeP4QtkdYCv7bmFpzWhL1LexNVEs2
 iSQQ==
X-Gm-Message-State: AOJu0YzXbOtHpsZBimQmSNx1i3ftUs6XRsvDIKA04fYhOt0bTN8sVSiD
 fqnGKE3VzXWDPpRqgjPVW9PMriXSfZAMvMAOmwtUKWsyFUM/2Qhazq1tbX6DGrRav/4=
X-Gm-Gg: ASbGncu5GCy6/0Qm+DVXoCv9ScQIQ5nHnsCL+/ZnK2RvFyEyQyb6FAkNflcJ3AS4lNq
 666rGVjI66gUr5NJjkCMR+eNuvXVuZ2nHbN8aARHqPYcXMaypDpwH3O0zu6YPTffRFZBMl6U9UI
 1xgSqG9mANsilhOOPE4CAKUjq6gJKoSCtBP0JplPHehCjSmmpKoVZmCsFKkF7UYzs715PBc4B9V
 35U9ytYLQ8ZGpnoUdg/dWRktbE/8ZhhavBF83xtDZhlA+71jgyaGEPZt2jWNGphMkbjOygCIvi3
 xMRy47Hg3KVVU0D5GXsu1Ul3FMo6vph6DOJQgQa+LDlmnKQxC6ByyrMPWzh8jKo7HMMQdxplfLI
 k/IJT0sYg98+wIbJE1aoXBPgoCl2RA+tvDjkZhm6eOi5GGTdnkyLXURqrq4FD2wOxJWOiEhhm6h
 zwczZjkgBtijS1vRtZBj1K0EUxs6TRrmcC2L3FbPMWEZBHzXI43cofxUEQQIUetA==
X-Google-Smtp-Source: 
 AGHT+IHgVYoy/OO64vFvKjwAMofuRFRhPTzIF3KbpNS4I6/13IFHueDBV5dscDb/5CJcZl09Y7SaDQ==
X-Received: by 2002:a05:6a20:939f:b0:359:9d33:df08 with SMTP id
 adf61e73a8af0-36150e72016mr17605107637.18.1764122112574;
 Tue, 25 Nov 2025 17:55:12 -0800 (PST)
Received: from localhost.localdomain ([203.149.208.29])
 by smtp.gmail.com with ESMTPSA id
 41be03b00d2f7-bd760dbc99asm17508540a12.30.2025.11.25.17.55.10
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 25 Nov 2025 17:55:12 -0800 (PST)
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org,
 DJ Delorie <dj@redhat.com>
Subject: [PATCH v4 3/3] x86/string: Add version of memmove with page unrolled
 large impl
Date: Tue, 25 Nov 2025 20:55:00 -0500
Message-ID: <20251126015500.82591-3-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20251126015500.82591-1-goldstein.w.n@gmail.com>
References: <20251115093318.830179-1-goldstein.w.n@gmail.com>
 <20251126015500.82591-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org

The page unrolled version has been shown to be the best performing on
Intel SnB through ICX hardware.

Reviewed-by: DJ Delorie <dj@redhat.com>
---
 sysdeps/x86/cpu-features.c                    |  10 ++
 sysdeps/x86/cpu-tunables.c                    |   6 +
 ...cpu-features-preferred_feature_index_1.def |   1 +
 sysdeps/x86/tst-hwcap-tunables.c              |   4 +-
 sysdeps/x86_64/multiarch/Makefile             |   3 +
 sysdeps/x86_64/multiarch/ifunc-impl-list.c    | 120 ++++++++++++++++++
 sysdeps/x86_64/multiarch/ifunc-memmove.h      |  75 +++++++----
 ...ove-avx-unaligned-erms-page-unrolled-rtm.S |   5 +
 ...memmove-avx-unaligned-erms-page-unrolled.S |   5 +
 .../memmove-avx-unaligned-erms-rtm.S          |   2 +
 ...emmove-evex-unaligned-erms-page-unrolled.S |   5 +
 11 files changed, 211 insertions(+), 25 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S
 create mode 100644 sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index ecf10ce44d..36803aa53f 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -924,6 +924,11 @@ disable_tsx:
 	case INTEL_BIGCORE_HASWELL:
 	case INTEL_BIGCORE_BROADWELL:
 	  cpu_features->cachesize_non_temporal_divisor = 8;
+	  /* Benchmarks indicate page unrolled large implementation
+	 performs better than standard copy loop on HSW (and
+	 presumably SnB).  */
+	  cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy]
+	      |= bit_arch_Prefer_Page_Unrolled_Large_Copy;
 	  goto default_tuning;
 
 	  /* Newer Bigcore microarch (larger non-temporal store
@@ -944,6 +949,11 @@ disable_tsx:
 	case INTEL_BIGCORE_ICELAKE:
 	case INTEL_BIGCORE_TIGERLAKE:
 	case INTEL_BIGCORE_ROCKETLAKE:
+	  /* Benchmarks indicate page unrolled large implementation
+	 performs better than standard copy loop on Skylake/SKX/ICX.  */
+	  cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy]
+	      |= bit_arch_Prefer_Page_Unrolled_Large_Copy;
+	  [[fallthrough]];
 	case INTEL_BIGCORE_RAPTORLAKE:
 	case INTEL_BIGCORE_METEORLAKE:
 	case INTEL_BIGCORE_LUNARLAKE:
diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c
index 74cd5b9377..17fdbf2ff3 100644
--- a/sysdeps/x86/cpu-tunables.c
+++ b/sysdeps/x86/cpu-tunables.c
@@ -259,6 +259,12 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp)
 		(n, cpu_features, Prefer_PMINUB_for_stringop, SSE2, 26);
 	    }
 	  break;
+	case 31:
+	  {
+	    CHECK_GLIBC_IFUNC_PREFERRED_BOTH (
+		n, cpu_features, Prefer_Page_Unrolled_Large_Copy, 31);
+	  }
+	  break;
 	}
     }
 }
diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
index 0f14aaf071..7bff2b0441 100644
--- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
+++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def
@@ -35,3 +35,4 @@ BIT (Prefer_FSRM)
 BIT (Avoid_Short_Distance_REP_MOVSB)
 BIT (Avoid_Non_Temporal_Memset)
 BIT (Avoid_STOSB)
+BIT (Prefer_Page_Unrolled_Large_Copy)
diff --git a/sysdeps/x86/tst-hwcap-tunables.c b/sysdeps/x86/tst-hwcap-tunables.c
index 3e06048dcc..985153fb38 100644
--- a/sysdeps/x86/tst-hwcap-tunables.c
+++ b/sysdeps/x86/tst-hwcap-tunables.c
@@ -61,7 +61,7 @@ static const struct test_t
     "-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL,"
     "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,-ERMS,"
     "-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset,"
-    "-Avoid_STOSB",
+    "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy",
     test_1,
     array_length (test_1)
   },
@@ -70,7 +70,7 @@ static const struct test_t
     ",-,-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL,"
     "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,,-,"
     "-ERMS,-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset,"
-    "-Avoid_STOSB,-,",
+    "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy,-,",
     test_1,
     array_length (test_1)
   }
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index 696cb66991..381eaef455 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -16,11 +16,14 @@ sysdep_routines += \
   memcmpeq-evex \
   memcmpeq-sse2 \
   memmove-avx-unaligned-erms \
+  memmove-avx-unaligned-erms-page-unrolled \
+  memmove-avx-unaligned-erms-page-unrolled-rtm \
   memmove-avx-unaligned-erms-rtm \
   memmove-avx512-no-vzeroupper \
   memmove-avx512-unaligned-erms \
   memmove-erms \
   memmove-evex-unaligned-erms \
+  memmove-evex-unaligned-erms-page-unrolled \
   memmove-sse2-unaligned-erms \
   memmove-ssse3 \
   memrchr-avx2 \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c2dcadd1a9..f9add65d24 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -133,23 +133,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memmove_chk_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memmove_chk_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memmove_chk_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memmove_chk_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memmove_chk_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memmove_chk_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memmove_chk_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memmove_chk_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memmove_chk_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memmove_chk_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memmove_chk_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memmove_chk_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, __memmove_chk,
 				     CPU_FEATURE_USABLE (SSSE3),
@@ -180,23 +200,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, memmove,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memmove_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, memmove,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memmove_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, memmove,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memmove_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, memmove,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memmove_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memmove_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memmove_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memmove_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memmove_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memmove_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memmove_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memmove_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memmove_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, memmove,
 				     CPU_FEATURE_USABLE (SSSE3),
@@ -1140,23 +1180,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memcpy_chk_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memcpy_chk_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memcpy_chk_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memcpy_chk_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memcpy_chk_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memcpy_chk_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memcpy_chk_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memcpy_chk_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memcpy_chk_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memcpy_chk_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memcpy_chk_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memcpy_chk_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, __memcpy_chk,
 				     CPU_FEATURE_USABLE (SSSE3),
@@ -1187,23 +1247,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memcpy_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memcpy_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __memcpy_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __memcpy_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memcpy_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memcpy_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
 				     CPU_FEATURE_USABLE (AVX),
 				     __memcpy_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
+				     CPU_FEATURE_USABLE (AVX),
+				     __memcpy_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memcpy_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memcpy_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __memcpy_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __memcpy_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, memcpy,
 				     CPU_FEATURE_USABLE (SSSE3),
@@ -1234,23 +1314,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __mempcpy_chk_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __mempcpy_chk_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __mempcpy_chk_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __mempcpy_chk_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __mempcpy_chk_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __mempcpy_chk_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
 				     CPU_FEATURE_USABLE (AVX),
 				     __mempcpy_chk_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
+				     CPU_FEATURE_USABLE (AVX),
+				     __mempcpy_chk_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __mempcpy_chk_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __mempcpy_chk_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __mempcpy_chk_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __mempcpy_chk_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, __mempcpy_chk,
 				     CPU_FEATURE_USABLE (SSSE3),
@@ -1281,23 +1381,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __mempcpy_evex_unaligned)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __mempcpy_evex_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy,
 				     CPU_FEATURE_USABLE (AVX512VL),
 				     __mempcpy_evex_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy,
+				     CPU_FEATURE_USABLE (AVX512VL),
+				     __mempcpy_evex_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
 				     CPU_FEATURE_USABLE (AVX),
 				     __mempcpy_avx_unaligned)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
+				     CPU_FEATURE_USABLE (AVX),
+				     __mempcpy_avx_unaligned_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
 				     CPU_FEATURE_USABLE (AVX),
 				     __mempcpy_avx_unaligned_erms)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
+				     CPU_FEATURE_USABLE (AVX),
+				     __mempcpy_avx_unaligned_erms_page_unrolled)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __mempcpy_avx_unaligned_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __mempcpy_avx_unaligned_page_unrolled_rtm)
 	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
 				     (CPU_FEATURE_USABLE (AVX)
 				      && CPU_FEATURE_USABLE (RTM)),
 				     __mempcpy_avx_unaligned_erms_rtm)
+	      X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
+				     (CPU_FEATURE_USABLE (AVX)
+				      && CPU_FEATURE_USABLE (RTM)),
+				     __mempcpy_avx_unaligned_erms_page_unrolled_rtm)
 	      /* By V3 we assume fast aligned copy.  */
 	      X86_IFUNC_IMPL_ADD_V2 (array, i, mempcpy,
 				     CPU_FEATURE_USABLE (SSSE3),
diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h
index de0ac73a2a..6d5df8a9eb 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memmove.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h
@@ -28,18 +28,27 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms)
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper)
   attribute_hidden;
 
-extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms)
-  attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (evex_unaligned_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (evex_unaligned_erms) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (evex_unaligned_erms_page_unrolled) attribute_hidden;
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm)
-  attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (avx_unaligned_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (avx_unaligned_erms_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (avx_unaligned_page_unrolled_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (avx_unaligned_erms_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+    OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm) attribute_hidden;
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
 
@@ -71,40 +80,60 @@ IFUNC_SELECTOR (void)
       return OPTIMIZE (avx512_no_vzeroupper);
     }
 
-  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
-				   AVX_Fast_Unaligned_Load, ))
+  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load, ))
     {
       if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
 	{
 	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-	    return OPTIMIZE (evex_unaligned_erms);
-
+	    {
+	      if (CPU_FEATURES_ARCH_P (cpu_features,
+				       Prefer_Page_Unrolled_Large_Copy))
+		return OPTIMIZE (evex_unaligned_erms_page_unrolled);
+	      return OPTIMIZE (evex_unaligned_erms);
+	    }
+
+	  if (CPU_FEATURES_ARCH_P (cpu_features,
+				   Prefer_Page_Unrolled_Large_Copy))
+	    return OPTIMIZE (evex_unaligned_page_unrolled);
 	  return OPTIMIZE (evex_unaligned);
 	}
 
       if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
 	{
 	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-	    return OPTIMIZE (avx_unaligned_erms_rtm);
-
+	    {
+	      if (CPU_FEATURES_ARCH_P (cpu_features,
+				       Prefer_Page_Unrolled_Large_Copy))
+		return OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm);
+	      return OPTIMIZE (avx_unaligned_erms_rtm);
+	    }
+	  if (CPU_FEATURES_ARCH_P (cpu_features,
+				   Prefer_Page_Unrolled_Large_Copy))
+	    return OPTIMIZE (avx_unaligned_page_unrolled_rtm);
 	  return OPTIMIZE (avx_unaligned_rtm);
 	}
 
-      if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
-				       Prefer_No_VZEROUPPER, !))
+      if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER, !))
 	{
 	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-	    return OPTIMIZE (avx_unaligned_erms);
-
+	    {
+	      if (CPU_FEATURES_ARCH_P (cpu_features,
+				       Prefer_Page_Unrolled_Large_Copy))
+		return OPTIMIZE (avx_unaligned_erms_page_unrolled);
+	      return OPTIMIZE (avx_unaligned_erms);
+	    }
+	  if (CPU_FEATURES_ARCH_P (cpu_features,
+				   Prefer_Page_Unrolled_Large_Copy))
+	    return OPTIMIZE (avx_unaligned_page_unrolled);
 	  return OPTIMIZE (avx_unaligned);
 	}
     }
 
   if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3)
       /* Leave this as runtime check.  The SSSE3 is optimized almost
-         exclusively for avoiding unaligned memory access during the
-         copy and by and large is not better than the sse2
-         implementation as a general purpose memmove.  */
+	 exclusively for avoiding unaligned memory access during the
+	 copy and by and large is not better than the sse2
+	 implementation as a general purpose memmove.  */
       && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy))
     {
       return OPTIMIZE (ssse3);
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S
new file mode 100644
index 0000000000..683d903243
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_page_unrolled_rtm
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large-page-unrolled.S"
+#include "memmove-avx-unaligned-erms-rtm.S"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S
new file mode 100644
index 0000000000..57b518e16f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_page_unrolled
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large-page-unrolled.S"
+#include "memmove-avx-unaligned-erms.S"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
index 20746e6713..36e864e935 100644
--- a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
@@ -2,7 +2,9 @@
 
 # include "x86-avx-rtm-vecs.h"
 
+#ifndef MEMMOVE_SYMBOL
 # define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_rtm
+#endif
 
 # include "memmove-vec-unaligned-erms.S"
 #endif
diff --git a/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S
new file mode 100644
index 0000000000..371b454819
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_evex_##s##_page_unrolled
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL	"memmove-vec-large-page-unrolled.S"
+#include "memmove-evex-unaligned-erms.S"