From patchwork Wed Nov 26 01:54:58 2025
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 125294
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org, DJ Delorie
Subject: [PATCH v4 1/3] x86/string: Factor out large memmove implementation to separate file
Date: Tue, 25 Nov 2025 20:54:58 -0500
Message-ID: <20251126015500.82591-1-goldstein.w.n@gmail.com>
In-Reply-To: <20251115093318.830179-1-goldstein.w.n@gmail.com>
References: <20251115093318.830179-1-goldstein.w.n@gmail.com>

This is to enable us to support multiple large (size greater than non-temporal threshold)
implementations. This patch has no effect on the resulting libc.so library.

Reviewed-by: DJ Delorie
---
 .../memmove-vec-large-page-unrolled.S | 290 ++++++++++++++++++
 .../multiarch/memmove-vec-unaligned-erms.S | 272 +--------------
 2 files changed, 297 insertions(+), 265 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S new file mode 100644 index 0000000000..21ae89e800 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S @@ -0,0 +1,290 @@ +/* Non-Temporal page unrolled large memmove implementation. + Copyright (C) 2025 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifdef MEMMOVE_LARGE_IMPL +# error "Multiple large memmove impls included!" +#endif +#define MEMMOVE_LARGE_IMPL 1 + +/* Copies large regions by copying multiple pages at once. This is + beneficial on some older Intel hardware (Broadwell, Skylake, and + Icelake). + 1. If size < 16 * __x86_shared_non_temporal_threshold and + source and destination do not page alias, copy from 2 pages + at once using non-temporal stores. 
Page aliasing in this case is + considered true if destination's page alignment - sources' page + alignment is less than 8 * VEC_SIZE. + 2. If size >= 16 * __x86_shared_non_temporal_threshold or source + and destination do page alias copy from 4 pages at once using + non-temporal stores. */ + +#ifndef LOG_PAGE_SIZE +# define LOG_PAGE_SIZE 12 +#endif + +#if PAGE_SIZE != (1 << LOG_PAGE_SIZE) +# error Invalid LOG_PAGE_SIZE +#endif + +/* Byte per page for large_memcpy inner loop. */ +#if VEC_SIZE == 64 +# define LARGE_LOAD_SIZE (VEC_SIZE * 2) +#else +# define LARGE_LOAD_SIZE (VEC_SIZE * 4) +#endif + +/* Amount to shift __x86_shared_non_temporal_threshold by for + bound for memcpy_large_4x. This is essentially use to to + indicate that the copy is far beyond the scope of L3 + (assuming no user config x86_non_temporal_threshold) and to + use a more aggressively unrolled loop. NB: before + increasing the value also update initialization of + x86_non_temporal_threshold. */ +#ifndef LOG_4X_MEMCPY_THRESH +# define LOG_4X_MEMCPY_THRESH 4 +#endif + +#if LARGE_LOAD_SIZE == (VEC_SIZE * 2) +# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; +# define STORE_ONE_SET(base, offset, vec0, vec1, ...) 
\ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; +#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4) +# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; \ + VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \ + VMOVU ((offset) + VEC_SIZE * 3)base, vec3; +# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; \ + VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \ + VMOVNT vec3, ((offset) + VEC_SIZE * 3)base; +#else +# error Invalid LARGE_LOAD_SIZE +#endif + + .p2align 4,, 10 +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) +L(large_memcpy_check): + /* Entry from L(large_memcpy_2x) has a redundant load of + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x) + is only use for the non-erms memmove which is generally less + common. */ +L(large_memcpy): + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP + cmp %R11_LP, %RDX_LP + jb L(more_8x_vec_check) + /* To reach this point it is impossible for dst > src and + overlap. Remaining to check is src > dst and overlap. rcx + already contains dst - src. Negate rcx to get src - dst. If + length > rcx then there is overlap and forward copy is best. */ + negq %rcx + cmpq %rcx, %rdx + ja L(more_8x_vec_forward) + + /* Cache align destination. First store the first 64 bytes then + adjust alignments. */ + + /* First vec was also loaded into VEC(0). */ +# if VEC_SIZE < 64 + VMOVU VEC_SIZE(%rsi), %VMM(1) +# if VEC_SIZE < 32 + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) +# endif +# endif + VMOVU %VMM(0), (%rdi) +# if VEC_SIZE < 64 + VMOVU %VMM(1), VEC_SIZE(%rdi) +# if VEC_SIZE < 32 + VMOVU %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVU %VMM(3), (VEC_SIZE * 3)(%rdi) +# endif +# endif + + /* Adjust source, destination, and size. */ + movq %rdi, %r8 + andq $63, %r8 + /* Get the negative of offset for alignment. 
*/ + subq $64, %r8 + /* Adjust source. */ + subq %r8, %rsi + /* Adjust destination which should be aligned now. */ + subq %r8, %rdi + /* Adjust length. */ + addq %r8, %rdx + + /* Test if source and destination addresses will alias. If they + do the larger pipeline in large_memcpy_4x alleviated the + performance drop. */ + + /* ecx contains -(dst - src). not ecx will return dst - src - 1 + which works for testing aliasing. */ + notl %ecx + movq %rdx, %r10 + testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx + jz L(large_memcpy_4x) + + /* r11 has __x86_shared_non_temporal_threshold. Shift it left + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold. */ + shlq $LOG_4X_MEMCPY_THRESH, %r11 + cmp %r11, %rdx + jae L(large_memcpy_4x) + + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 2 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $(LOG_PAGE_SIZE + 1), %r10 + /* Copy 4x VEC at a time from 2 pages. */ + .p2align 4 +L(loop_large_memcpy_2x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_2x_inner): + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2) + /* Load vectors from rsi. */ + LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + subq $-LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + subq $-LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_2x_inner) + addq $PAGE_SIZE, %rdi + addq $PAGE_SIZE, %rsi + decq %r10 + jne L(loop_large_memcpy_2x_outer) + sfence + + /* Check if only last 4 loads are needed. 
*/ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_2x_end) + + /* Handle the last 2 * PAGE_SIZE bytes. */ +L(loop_large_memcpy_2x_tail): + /* Copy 4 * VEC a time forward with non-temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) + VMOVU (%rsi), %VMM(0) + VMOVU VEC_SIZE(%rsi), %VMM(1) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) + subq $-(VEC_SIZE * 4), %rsi + addl $-(VEC_SIZE * 4), %edx + VMOVA %VMM(0), (%rdi) + VMOVA %VMM(1), VEC_SIZE(%rdi) + VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_2x_tail) + +L(large_memcpy_2x_end): + /* Store the last 4 * VEC. */ + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) + + VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) + VZEROUPPER_RETURN + + .p2align 4 +L(large_memcpy_4x): + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 4 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $(LOG_PAGE_SIZE + 2), %r10 + /* Copy 4x VEC at a time from 4 pages. */ + .p2align 4 +L(loop_large_memcpy_4x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_4x_inner): + /* Only one prefetch set per page as doing 4 pages give more + time for prefetcher to keep up. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE) + /* Load vectors from rsi. 
*/ + LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) + subq $-LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + STORE_ONE_SET ((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) + STORE_ONE_SET ((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) + subq $-LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_4x_inner) + addq $(PAGE_SIZE * 3), %rdi + addq $(PAGE_SIZE * 3), %rsi + decq %r10 + jne L(loop_large_memcpy_4x_outer) + sfence + /* Check if only last 4 loads are needed. */ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_4x_end) + + /* Handle the last 4 * PAGE_SIZE bytes. */ +L(loop_large_memcpy_4x_tail): + /* Copy 4 * VEC a time forward with non-temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) + VMOVU (%rsi), %VMM(0) + VMOVU VEC_SIZE(%rsi), %VMM(1) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) + subq $-(VEC_SIZE * 4), %rsi + addl $-(VEC_SIZE * 4), %edx + VMOVA %VMM(0), (%rdi) + VMOVA %VMM(1), VEC_SIZE(%rdi) + VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_4x_tail) + +L(large_memcpy_4x_end): + /* Store the last 4 * VEC. 
*/ + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) + + VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) + VZEROUPPER_RETURN +#endif diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S index 5cd8a6286e..70d303687c 100644 --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S @@ -34,17 +34,8 @@ __x86_rep_movsb_threshold and less than __x86_rep_movsb_stop_threshold, then REP MOVSB will be used. 7. If size >= __x86_shared_non_temporal_threshold and there is no - overlap between destination and source, use non-temporal store - instead of aligned store copying from either 2 or 4 pages at - once. - 8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold - and source and destination do not page alias, copy from 2 pages - at once using non-temporal stores. Page aliasing in this case is - considered true if destination's page alignment - sources' page - alignment is less than 8 * VEC_SIZE. - 9. If size >= 16 * __x86_shared_non_temporal_threshold or source - and destination do page alias copy from 4 pages at once using - non-temporal stores. */ + overlap between destination and source, the exact method varies + and is set with MEMMOVE_VEC_LARGE_IMPL". */ #include @@ -95,31 +86,6 @@ # error Unsupported PAGE_SIZE #endif -#ifndef LOG_PAGE_SIZE -# define LOG_PAGE_SIZE 12 -#endif - -#if PAGE_SIZE != (1 << LOG_PAGE_SIZE) -# error Invalid LOG_PAGE_SIZE -#endif - -/* Byte per page for large_memcpy inner loop. 
*/ -#if VEC_SIZE == 64 -# define LARGE_LOAD_SIZE (VEC_SIZE * 2) -#else -# define LARGE_LOAD_SIZE (VEC_SIZE * 4) -#endif - -/* Amount to shift __x86_shared_non_temporal_threshold by for - bound for memcpy_large_4x. This is essentially use to to - indicate that the copy is far beyond the scope of L3 - (assuming no user config x86_non_temporal_threshold) and to - use a more aggressively unrolled loop. NB: before - increasing the value also update initialization of - x86_non_temporal_threshold. */ -#ifndef LOG_4X_MEMCPY_THRESH -# define LOG_4X_MEMCPY_THRESH 4 -#endif /* Avoid short distance rep movsb only with non-SSE vector. */ #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB @@ -160,26 +126,8 @@ # error Unsupported PREFETCH_SIZE! #endif -#if LARGE_LOAD_SIZE == (VEC_SIZE * 2) -# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \ - VMOVU (offset)base, vec0; \ - VMOVU ((offset) + VEC_SIZE)base, vec1; -# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \ - VMOVNT vec0, (offset)base; \ - VMOVNT vec1, ((offset) + VEC_SIZE)base; -#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4) -# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ - VMOVU (offset)base, vec0; \ - VMOVU ((offset) + VEC_SIZE)base, vec1; \ - VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \ - VMOVU ((offset) + VEC_SIZE * 3)base, vec3; -# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ - VMOVNT vec0, (offset)base; \ - VMOVNT vec1, ((offset) + VEC_SIZE)base; \ - VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \ - VMOVNT vec3, ((offset) + VEC_SIZE * 3)base; -#else -# error Invalid LARGE_LOAD_SIZE +#ifndef MEMMOVE_VEC_LARGE_IMPL +# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" #endif #ifndef SECTION @@ -426,7 +374,7 @@ L(more_8x_vec): #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) /* Check non-temporal store threshold. 
*/ cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP - ja L(large_memcpy_2x) + ja L(large_memcpy) #endif /* To reach this point there cannot be overlap and dst > src. So check for overlap and src > dst in which case correctness @@ -613,7 +561,7 @@ L(movsb): /* If above __x86_rep_movsb_stop_threshold most likely is candidate for NT moves as well. */ cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP - jae L(large_memcpy_2x_check) + jae L(large_memcpy_check) # if AVOID_SHORT_DISTANCE_REP_MOVSB || ALIGN_MOVSB /* Only avoid short movsb if CPU has FSRM. */ # if X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB < 256 @@ -673,214 +621,8 @@ L(skip_short_movsb_check): # endif #endif - .p2align 4,, 10 -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) -L(large_memcpy_2x_check): - /* Entry from L(large_memcpy_2x) has a redundant load of - __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x) - is only use for the non-erms memmove which is generally less - common. */ -L(large_memcpy_2x): - mov __x86_shared_non_temporal_threshold(%rip), %R11_LP - cmp %R11_LP, %RDX_LP - jb L(more_8x_vec_check) - /* To reach this point it is impossible for dst > src and - overlap. Remaining to check is src > dst and overlap. rcx - already contains dst - src. Negate rcx to get src - dst. If - length > rcx then there is overlap and forward copy is best. */ - negq %rcx - cmpq %rcx, %rdx - ja L(more_8x_vec_forward) - - /* Cache align destination. First store the first 64 bytes then - adjust alignments. */ - - /* First vec was also loaded into VEC(0). */ -# if VEC_SIZE < 64 - VMOVU VEC_SIZE(%rsi), %VMM(1) -# if VEC_SIZE < 32 - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) -# endif -# endif - VMOVU %VMM(0), (%rdi) -# if VEC_SIZE < 64 - VMOVU %VMM(1), VEC_SIZE(%rdi) -# if VEC_SIZE < 32 - VMOVU %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVU %VMM(3), (VEC_SIZE * 3)(%rdi) -# endif -# endif +#include MEMMOVE_VEC_LARGE_IMPL - /* Adjust source, destination, and size. 
*/ - movq %rdi, %r8 - andq $63, %r8 - /* Get the negative of offset for alignment. */ - subq $64, %r8 - /* Adjust source. */ - subq %r8, %rsi - /* Adjust destination which should be aligned now. */ - subq %r8, %rdi - /* Adjust length. */ - addq %r8, %rdx - - /* Test if source and destination addresses will alias. If they - do the larger pipeline in large_memcpy_4x alleviated the - performance drop. */ - - /* ecx contains -(dst - src). not ecx will return dst - src - 1 - which works for testing aliasing. */ - notl %ecx - movq %rdx, %r10 - testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx - jz L(large_memcpy_4x) - - /* r11 has __x86_shared_non_temporal_threshold. Shift it left - by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold. - */ - shlq $LOG_4X_MEMCPY_THRESH, %r11 - cmp %r11, %rdx - jae L(large_memcpy_4x) - - /* edx will store remainder size for copying tail. */ - andl $(PAGE_SIZE * 2 - 1), %edx - /* r10 stores outer loop counter. */ - shrq $(LOG_PAGE_SIZE + 1), %r10 - /* Copy 4x VEC at a time from 2 pages. */ - .p2align 4 -L(loop_large_memcpy_2x_outer): - /* ecx stores inner loop counter. */ - movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx -L(loop_large_memcpy_2x_inner): - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2) - /* Load vectors from rsi. */ - LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - subq $-LARGE_LOAD_SIZE, %rsi - /* Non-temporal store vectors to rdi. 
*/ - STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - subq $-LARGE_LOAD_SIZE, %rdi - decl %ecx - jnz L(loop_large_memcpy_2x_inner) - addq $PAGE_SIZE, %rdi - addq $PAGE_SIZE, %rsi - decq %r10 - jne L(loop_large_memcpy_2x_outer) - sfence - - /* Check if only last 4 loads are needed. */ - cmpl $(VEC_SIZE * 4), %edx - jbe L(large_memcpy_2x_end) - - /* Handle the last 2 * PAGE_SIZE bytes. */ -L(loop_large_memcpy_2x_tail): - /* Copy 4 * VEC a time forward with non-temporal stores. */ - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) - VMOVU (%rsi), %VMM(0) - VMOVU VEC_SIZE(%rsi), %VMM(1) - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) - subq $-(VEC_SIZE * 4), %rsi - addl $-(VEC_SIZE * 4), %edx - VMOVA %VMM(0), (%rdi) - VMOVA %VMM(1), VEC_SIZE(%rdi) - VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) - subq $-(VEC_SIZE * 4), %rdi - cmpl $(VEC_SIZE * 4), %edx - ja L(loop_large_memcpy_2x_tail) - -L(large_memcpy_2x_end): - /* Store the last 4 * VEC. */ - VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) - VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) - VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) - VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) - - VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) - VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) - VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) - VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) - VZEROUPPER_RETURN - - .p2align 4 -L(large_memcpy_4x): - /* edx will store remainder size for copying tail. */ - andl $(PAGE_SIZE * 4 - 1), %edx - /* r10 stores outer loop counter. */ - shrq $(LOG_PAGE_SIZE + 2), %r10 - /* Copy 4x VEC at a time from 4 pages. */ - .p2align 4 -L(loop_large_memcpy_4x_outer): - /* ecx stores inner loop counter. 
*/ - movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx -L(loop_large_memcpy_4x_inner): - /* Only one prefetch set per page as doing 4 pages give more - time for prefetcher to keep up. */ - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE) - /* Load vectors from rsi. */ - LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) - LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) - subq $-LARGE_LOAD_SIZE, %rsi - /* Non-temporal store vectors to rdi. */ - STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) - STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) - subq $-LARGE_LOAD_SIZE, %rdi - decl %ecx - jnz L(loop_large_memcpy_4x_inner) - addq $(PAGE_SIZE * 3), %rdi - addq $(PAGE_SIZE * 3), %rsi - decq %r10 - jne L(loop_large_memcpy_4x_outer) - sfence - /* Check if only last 4 loads are needed. */ - cmpl $(VEC_SIZE * 4), %edx - jbe L(large_memcpy_4x_end) - - /* Handle the last 4 * PAGE_SIZE bytes. */ -L(loop_large_memcpy_4x_tail): - /* Copy 4 * VEC a time forward with non-temporal stores. 
*/ - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) - VMOVU (%rsi), %VMM(0) - VMOVU VEC_SIZE(%rsi), %VMM(1) - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) - subq $-(VEC_SIZE * 4), %rsi - addl $-(VEC_SIZE * 4), %edx - VMOVA %VMM(0), (%rdi) - VMOVA %VMM(1), VEC_SIZE(%rdi) - VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) - subq $-(VEC_SIZE * 4), %rdi - cmpl $(VEC_SIZE * 4), %edx - ja L(loop_large_memcpy_4x_tail) - -L(large_memcpy_4x_end): - /* Store the last 4 * VEC. */ - VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) - VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) - VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) - VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) - - VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) - VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) - VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) - VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) - VZEROUPPER_RETURN -#endif END (MEMMOVE_SYMBOL (__memmove, unaligned_erms)) #if IS_IN (libc) From patchwork Wed Nov 26 01:54:59 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 125293 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 6441E3858C39 for ; Wed, 26 Nov 2025 01:57:18 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6441E3858C39 Authentication-Results: sourceware.org; dkim=pass (2048-bit key, unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20230601 header.b=OktalpZk X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by sourceware.org (Postfix) with ESMTPS id 209C33858D29 for ; Wed, 26 Nov 2025 01:55:10 +0000 (GMT) 
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 209C33858D29 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 209C33858D29 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::102e ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122110; cv=none; b=lV1nrwsbeW8U+rXGgZ2ZUkxxet/DaF5Az4M1nLlFe3DaVOiisvNTWFIc3QMjHO6ppWDtq3WBr5Ra97ZZPkv5/UJWbmdHAU2kXanR+Ma+8EMvNnjTd/G8Dr52gJ2glxg4WIl5VSKJLzuK9niv1rdo+wREm7Aa/R3dQiBOyxTgVMw= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122110; c=relaxed/simple; bh=hQJDDmAN+G9yABT/+lXnSOyKQFq/Dwoht25VW7XydW0=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=rozVG1NTyfffejawEpgsYJisZ5SuSoW+UH4qtcf3+shm99ZWppBxMTpgnAfP/rIj5/vdKI+lL6jCSceCD19MTDE2Fu+CTx7kzsxydDRbFEQYb5RjAT3OKrw8DZ7/fZC/LgyfnfS38zMpWwoxAC90hubGht3lmEPe0grYaRsRQa4= ARC-Authentication-Results: i=1; server2.sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 209C33858D29 Received: by mail-pj1-x102e.google.com with SMTP id 98e67ed59e1d1-343f52d15efso5793528a91.3 for ; Tue, 25 Nov 2025 17:55:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764122109; x=1764726909; darn=sourceware.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=elA4nuFIItH3qJUFITTMkisr3OQPTAmvZDM7L3jMZ1w=; b=OktalpZk355xPjy9BsKBEsGdxaRyC1R6aFrIeV9SoGhw9sbhocB44WCPUwLTJJF9rA TkvMOevdiWPZEGCSSgx56AFTkPv45q9D/e+iqtSxOjYRhoV5bz5BTtv2gkI+KgMoEoYf XkNsGr2CEUTFcIjIAHWFi/twm6P938TDfCM2WEi7VLLxT23rcpNKTFT4YMAU75ASD+6W nhHKAeAkNfzxs9iFzWQf0P5VTPm0HP1iNnAmlWVZLtkLxa5Igx1tuFbUfj2BVtVm+b8V nHpl278D738DMMDje46Gwvg9S3a1zvuKhtrW+nVzIc9ayM9PCneHefKAWyugeOo5Muxr RyQw== X-Google-DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764122109; x=1764726909; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=elA4nuFIItH3qJUFITTMkisr3OQPTAmvZDM7L3jMZ1w=; b=p5rLEWGdfH5DuNdnr29+gfQVmmGw9Q88bxHbi1mJ5cT1nW4hfQzMxnOq4JKSpat6zD KGABnIb3I9wtqCLwcAgAZVgZDZRd2H8mYh07dM3N+fawXNCxwFROgbZqUJXXiLNPfrSE 3chnRC06HBdp1RGiihSHHr/105989XnSlWHFRniFbFAQajR0QmETp3aHUFKlfrDIBRDI 2ANXbyW4qh6eRMk1DXtn0dpTvJRR7Z3/zQgV6H0+7umgdw582q01dmwhGllbQw6lgIdB nTHFfq7vEtDIpdIJgzJGogrmh9rUCKUZR6w7skKwysUI+cUzLRDZopVjLKow9F/KKYfF a3PQ== X-Gm-Message-State: AOJu0Yz2axu3d7iJcgwdtWGI7U6RquxWGhghah9cIMQl2NMUGaCIntYS FZmI2gja+mWZ2xSrgRO6NcaJXtQK+4fOYgPrxVzZxFdunY1lJLh3PJtTiorT7JgdaPg= X-Gm-Gg: ASbGncvlF9esxrlSpjQLoSViH0qNqUA+jR6AXpNvT/iIIAQiblCYqOFUx425Rdf5xKf M/ILfHLn/UrMA7g33Sr66ycFxCrbUlae+jYUdR94K94gAsaFobvvhXNSm5QQXzqaofMh1xFHnN3 SRzg75wJcftnT79UjpD+Q5iJjnEFCtDT6I/cTSQnHqyC1SZfOY9jXWOpkGmemJCNPuPjbhwWjgW uPTMvsPucR9PW1lnwaPnps87Xcly9MkM6dKArEkpK+Cnj7rS3orO5R5ZWGGSbSb3Nmb9fqNI1sO 1uUnY5W9/Wou035zSGTfgDrSORMvybqbwSqok//kTmIsxAk8onCgp1JXK9/rlsNPCLBsk0Nd9xJ 3GTp6jfB2AxREFYE08z0nrzAbsBSlF5HDpyVyKE4Rt3BvkGhd3f/ixXgZMkT5JYA9sB0qhGbLh3 jYVO81evkeqS9uD7Mts59UPa5zLWlrqnFBQN4XopBrzK+0kCIPSqwapWPfEvlfcg== X-Google-Smtp-Source: AGHT+IGWt7Xol0N/Y95Q2uTM/Zqs7eMeSe1ZvtRKvuMV9iqbDJcKArv9ev4z0KyY8B893p2qrIlrFQ== X-Received: by 2002:a17:90b:4a91:b0:33d:a0fd:257b with SMTP id 98e67ed59e1d1-3475ed7d931mr4490806a91.36.1764122108694; Tue, 25 Nov 2025 17:55:08 -0800 (PST) Received: from localhost.localdomain ([203.149.208.29]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd760dbc99asm17508540a12.30.2025.11.25.17.55.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 Nov 2025 17:55:08 -0800 (PST) From: Noah Goldstein To: libc-alpha@sourceware.org Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org, DJ Delorie Subject: [PATCH v4 
2/3] x86/string: Use simpler approach for large memcpy [BZ #32475] Date: Tue, 25 Nov 2025 20:54:59 -0500 Message-ID: <20251126015500.82591-2-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20251126015500.82591-1-goldstein.w.n@gmail.com> References: <20251115093318.830179-1-goldstein.w.n@gmail.com> <20251126015500.82591-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org

The new approach does a simple 4x non-temporal loop (forwards or backwards to avoid 4k aliasing). This is similar to what we used to do prior to:

    commit 1a8605b6cd257e8a74e29b5b71c057211f5fb847
    Author: noah
    Date:   Sat Apr 3 04:12:15 2021 -0400

        x86: Update large memcpy case in memmove-vec-unaligned-erms.S

but with 4k aliasing detection to avoid a known pathological slow case.

The multi-page approach yielded 5-15% better performance for the size ranges covered by bench-memcpy-large (roughly 64KB-32MB) on the tested platforms but has some notable drawbacks. The drawbacks stem from the fact that the multi-page approach is a significantly less "canonical" form of memcpy and thus is likely to have less reliably "good" performance on untested platforms (including future ones) and configurations (e.g. > 2GB copies from BZ #32475).

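For reviewers who prefer C to assembly, the path selection at L(large_memcpy) can be modelled roughly as below. This is only a sketch under stated assumptions: the function, enum and macro names are made up for illustration and do not exist in glibc, the page size is assumed to be 4096, and the caller is assumed to have already ruled out the "dst > src and overlap" case, as the assembly does before reaching this point.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; the assembly uses PAGE_SIZE.  */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical names, for illustration only.  */
enum large_copy_path
{
  PATH_REGULAR_FORWARD,	/* Overlap: use the ordinary forward loop.  */
  PATH_NT_FORWARD,	/* Non-temporal 4x loop, low to high.  */
  PATH_NT_BACKWARD	/* Non-temporal 4x loop, high to low.  */
};

/* Model of the checks at L(large_memcpy).  Precondition (assumed, as
   in the assembly): dst > src with overlap has already been handled
   elsewhere.  */
static enum large_copy_path
choose_large_copy_path (uintptr_t dst, uintptr_t src, size_t len)
{
  /* Unsigned src - dst.  If dst > src this wraps to a huge value, so
     the overlap test below is false, matching the precondition.  */
  uintptr_t diff = src - dst;

  /* src > dst and the regions overlap: forward copy is best.  */
  if (len > diff)
    return PATH_REGULAR_FORWARD;

  /* 4k aliasing risk: the page offset of (src - dst) falls in the
     top 512 bytes of a page.  Copy backward in that case.  */
  if ((diff & (SKETCH_PAGE_SIZE - 1)) > SKETCH_PAGE_SIZE - 512)
    return PATH_NT_BACKWARD;

  return PATH_NT_FORWARD;
}
```

Note the sketch tests the page offset of the difference, which is what the andl/cmpl pair computes; it says nothing about how the loops themselves are scheduled.
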
Since there are known slow cases with the multi-page approach (that far exceed 15%) and the multi-page approach is much more brittle, it seems prudent to switch to this simpler, more reliable, better future-proofed implementation. Tested on x86_64. Reviewed-by: DJ Delorie --- sysdeps/x86_64/multiarch/memmove-vec-large.S | 125 ++++++++++++++++++ .../multiarch/memmove-vec-unaligned-erms.S | 2 +- 2 files changed, 126 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large.S diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large.S b/sysdeps/x86_64/multiarch/memmove-vec-large.S new file mode 100644 index 0000000000..4c398d4602 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-vec-large.S @@ -0,0 +1,125 @@ +/* Non-Temporal large memmove implementation. + Copyright (C) 2025 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifdef MEMMOVE_LARGE_IMPL +# error "Multiple large memmove impls included!" +#endif +#define MEMMOVE_LARGE_IMPL 1 + +/* Copies large regions with a 4x unrolled loop of non-temporal + stores. */ + +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) +L(large_memcpy_check): + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP + jb L(more_8x_vec_check) +L(large_memcpy): + /* To reach this point it is impossible for dst > src and + overlap. 
Remaining to check is src > dst and overlap. rcx + already contains dst - src. Negate rcx to get src - dst. If + length > rcx then there is overlap and forward copy is best. */ + negq %rcx + cmpq %rcx, %rdx + ja L(more_8x_vec_forward) + + /* We are doing non-temporal copy and no overlap. Choose forward + or backward copy based on avoiding 4k aliasing. ecx already + contains src - dst. We check if: + (src % 4096) - (dst % 4096) > (4096 - 512) + If true then we risk aliasing. */ + andl $(PAGE_SIZE - 1), %ecx + cmpl $(PAGE_SIZE - 512), %ecx + ja L(large_backward) + + subq %rdi, %rsi + + /* Store the first VEC. */ + VMOVU %VMM(0), (%rdi) + + /* Store end of buffer minus tail in rdx. */ + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx + + /* Align DST. */ + orq $(VEC_SIZE - 1), %rdi + incq %rdi + leaq (%rdi, %rsi), %rcx + /* Don't use multi-byte nop to align. */ + .p2align 4,, 11 +L(loop_4x_nt_forward): + PREFETCH_ONE_SET (1, (%rcx), VEC_SIZE * 8) + /* Copy 4 * VEC at a time forward. */ + VMOVU (VEC_SIZE * 0)(%rcx), %VMM(1) + VMOVU (VEC_SIZE * 1)(%rcx), %VMM(2) + VMOVU (VEC_SIZE * 2)(%rcx), %VMM(3) + VMOVU (VEC_SIZE * 3)(%rcx), %VMM(4) + subq $-(VEC_SIZE * 4), %rcx + VMOVNT %VMM(1), (VEC_SIZE * 0)(%rdi) + VMOVNT %VMM(2), (VEC_SIZE * 1)(%rdi) + VMOVNT %VMM(3), (VEC_SIZE * 2)(%rdi) + VMOVNT %VMM(4), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rdi, %rdx + ja L(loop_4x_nt_forward) + sfence + + VMOVU (VEC_SIZE * 0)(%rsi, %rdx), %VMM(1) + VMOVU (VEC_SIZE * 1)(%rsi, %rdx), %VMM(2) + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %VMM(3) + VMOVU (VEC_SIZE * 3)(%rsi, %rdx), %VMM(4) + VMOVU %VMM(1), (VEC_SIZE * 0)(%rdx) + VMOVU %VMM(2), (VEC_SIZE * 1)(%rdx) + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdx) + VMOVU %VMM(4), (VEC_SIZE * 3)(%rdx) + VZEROUPPER_RETURN + + .p2align 4,, 10 +L(large_backward): + leaq (VEC_SIZE * -4 - 1)(%rdi, %rdx), %rcx + VMOVU (VEC_SIZE * -1)(%rsi, %rdx), %VMM(5) + VMOVU %VMM(5), (VEC_SIZE * -1)(%rdi, %rdx) + andq $-(VEC_SIZE), %rcx + subq %rdi, %rsi + leaq (%rsi, %rcx), 
%rdx + /* Don't use multi-byte nop to align. */ + .p2align 4,, 11 +L(loop_4x_nt_backward): + PREFETCH_ONE_SET (-1, (%rdx), -VEC_SIZE * 8) + VMOVU (VEC_SIZE * 3)(%rdx), %VMM(1) + VMOVU (VEC_SIZE * 2)(%rdx), %VMM(2) + VMOVU (VEC_SIZE * 1)(%rdx), %VMM(3) + VMOVU (VEC_SIZE * 0)(%rdx), %VMM(4) + addq $(VEC_SIZE * -4), %rdx + VMOVNT %VMM(1), (VEC_SIZE * 3)(%rcx) + VMOVNT %VMM(2), (VEC_SIZE * 2)(%rcx) + VMOVNT %VMM(3), (VEC_SIZE * 1)(%rcx) + VMOVNT %VMM(4), (VEC_SIZE * 0)(%rcx) + addq $(VEC_SIZE * -4), %rcx + cmpq %rcx, %rdi + jb L(loop_4x_nt_backward) + + sfence + VMOVU (VEC_SIZE * 3)(%rsi, %rdi), %VMM(4) + VMOVU (VEC_SIZE * 2)(%rsi, %rdi), %VMM(3) + VMOVU (VEC_SIZE * 1)(%rsi, %rdi), %VMM(2) + /* We already loaded VMM(0). */ + VMOVU %VMM(4), (VEC_SIZE * 3)(%rdi) + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi) + VMOVU %VMM(2), (VEC_SIZE * 1)(%rdi) + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) + VZEROUPPER_RETURN +#endif diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S index 70d303687c..7c4765286d 100644 --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S @@ -127,7 +127,7 @@ #endif #ifndef MEMMOVE_VEC_LARGE_IMPL -# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large.S" #endif #ifndef SECTION From patchwork Wed Nov 26 01:55:00 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 125295 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0B7413858CB6 for ; Wed, 26 Nov 2025 01:58:19 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0B7413858CB6 Authentication-Results: sourceware.org; dkim=pass (2048-bit key, unprotected) header.d=gmail.com 
header.i=@gmail.com header.a=rsa-sha256 header.s=20230601 header.b=BiLlrPK6 X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by sourceware.org (Postfix) with ESMTPS id 123333858D32 for ; Wed, 26 Nov 2025 01:55:14 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 123333858D32 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 123333858D32 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::435 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122114; cv=none; b=cAmjIMrMyWTzRvGji4qPQ0zXu9kNjJosff8oXHliWnH6hFuGVZ+iu091Wg5CXZdo/V6W60/VqBfOVH9ImZQkovwvcsLYxgOvPpKV+0Je1dq0MAcDWRwHzIYyksuN1a2D1yIxryMoWMV6f3WFf90P24/5uNE5dYh91MEfJADcZ4w= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1764122114; c=relaxed/simple; bh=RFxxHY3PWY2+WIfyOUm2l5rGiDWBt3gJfJeJcnyUoFA=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=LhPNwQOm22h/YRIAAWKP7dYv8pTZg+S6gUKUPHAe5YblKa4S5o/+PAt71Ekt2yro6ubtY4Ofg1THnY5huIsNK7ji/XV/F5pENL6zQrPIct4YR/GQtWEqD9WfkgNFRsUys+m8sltcjZxbS0tD+0iNU9gz0q1l1mGS5KQwK0on2wg= ARC-Authentication-Results: i=1; server2.sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 123333858D32 Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-7bab7c997eeso6842533b3a.0 for ; Tue, 25 Nov 2025 17:55:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764122113; x=1764726913; darn=sourceware.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=oo8zCwLLSy8CNUWcRxlakkQ55QLD/aYZepRLcoSGyLQ=; 
b=BiLlrPK6LR/SuwDjgE5asWatjS5nJ1/DHcmbOFqrLkf1fky/eIWjytEwOSFtqS5SZ2 VukoozQRseNAqjiSdpHxMpBQnvPxOWKAECFJS7z1m08EY2D8Z2zhkfXvlcJ+u71T5Ld2 EfWUCw3kd2IjDYyB9gEqeFAb33sHDFyB2EEj4HPeDMyADGqooSbqR2ZU6ELWShmTIJpN qr3A6HQBnbUwMGakgzdHmnQWAMg0gKFVfmYHyPmIc01MSOsN3Rt5ANEdsYulQafo6Pv2 h6WFBH3+NTfR7X5kyr0p38UnbngCSJUCruFSQE8wOABw0zsCrgrDdnSA0d5hRlIyxD1V /Kjg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764122113; x=1764726913; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=oo8zCwLLSy8CNUWcRxlakkQ55QLD/aYZepRLcoSGyLQ=; b=pcD506TRKi8k+mMdfuHaki8QBz75yYjiHGvIOqp/UiT/pCw+Zd2tfhBiGC1SykRU0x NL8JEUaD0UOMiEcjEE12wPjZ+7W4REF89QKivKTJx8PlbDGrSxgqfesZDmA7JBkCYWja xSnTNukM9lekNeJWIp5KnVTS6owp3ZbdZQNJCNWiIT7VcPm0LZKgwWTB/m1PUbIGFj3Y jrFwi9t8KISsY2b86GEp45nwiIJ1F7ojq+uElek09Z+Wkm2MU0s2E2AHwBn4Z0y5MgUU gf7ATO4JUlPlX4I1ccevBxJkHW6zHUSw6Tx0wk9LeP4QtkdYCv7bmFpzWhL1LexNVEs2 iSQQ== X-Gm-Message-State: AOJu0YzXbOtHpsZBimQmSNx1i3ftUs6XRsvDIKA04fYhOt0bTN8sVSiD fqnGKE3VzXWDPpRqgjPVW9PMriXSfZAMvMAOmwtUKWsyFUM/2Qhazq1tbX6DGrRav/4= X-Gm-Gg: ASbGncu5GCy6/0Qm+DVXoCv9ScQIQ5nHnsCL+/ZnK2RvFyEyQyb6FAkNflcJ3AS4lNq 666rGVjI66gUr5NJjkCMR+eNuvXVuZ2nHbN8aARHqPYcXMaypDpwH3O0zu6YPTffRFZBMl6U9UI 1xgSqG9mANsilhOOPE4CAKUjq6gJKoSCtBP0JplPHehCjSmmpKoVZmCsFKkF7UYzs715PBc4B9V 35U9ytYLQ8ZGpnoUdg/dWRktbE/8ZhhavBF83xtDZhlA+71jgyaGEPZt2jWNGphMkbjOygCIvi3 xMRy47Hg3KVVU0D5GXsu1Ul3FMo6vph6DOJQgQa+LDlmnKQxC6ByyrMPWzh8jKo7HMMQdxplfLI k/IJT0sYg98+wIbJE1aoXBPgoCl2RA+tvDjkZhm6eOi5GGTdnkyLXURqrq4FD2wOxJWOiEhhm6h zwczZjkgBtijS1vRtZBj1K0EUxs6TRrmcC2L3FbPMWEZBHzXI43cofxUEQQIUetA== X-Google-Smtp-Source: AGHT+IHgVYoy/OO64vFvKjwAMofuRFRhPTzIF3KbpNS4I6/13IFHueDBV5dscDb/5CJcZl09Y7SaDQ== X-Received: by 2002:a05:6a20:939f:b0:359:9d33:df08 with SMTP id adf61e73a8af0-36150e72016mr17605107637.18.1764122112574; Tue, 25 Nov 2025 17:55:12 -0800 (PST) Received: from 
localhost.localdomain ([203.149.208.29]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-bd760dbc99asm17508540a12.30.2025.11.25.17.55.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 Nov 2025 17:55:12 -0800 (PST) From: Noah Goldstein To: libc-alpha@sourceware.org Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org, DJ Delorie Subject: [PATCH v4 3/3] x86/string: Add version of memmove with page unrolled large impl Date: Tue, 25 Nov 2025 20:55:00 -0500 Message-ID: <20251126015500.82591-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20251126015500.82591-1-goldstein.w.n@gmail.com> References: <20251115093318.830179-1-goldstein.w.n@gmail.com> <20251126015500.82591-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org The page unrolled version has been shown to be the best performing on Intel SnB through ICX hardware. 
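
As a reading aid for the selector change in ifunc-memmove.h, the pairing of each existing variant with its page-unrolled twin can be sketched in plain C as below. This is a simplified, hypothetical model rather than the selector itself: it assumes AVX_Fast_Unaligned_Load is set and Prefer_No_VZEROUPPER is clear, it ignores the AVX512 and SSE paths, and the function name and boolean parameters are invented for illustration. The returned strings mirror the OPTIMIZE () suffixes used in the patch.

```c
#include <stdbool.h>

/* Hypothetical model of the AVX/EVEX part of IFUNC_SELECTOR.  Each
   base variant gets a "_page_unrolled" twin that is chosen when the
   Prefer_Page_Unrolled_Large_Copy preference bit is set.  */
static const char *
pick_memmove_variant (bool avx512vl, bool rtm, bool erms,
		      bool prefer_page_unrolled)
{
  if (avx512vl)
    {
      if (erms)
	return prefer_page_unrolled
	       ? "evex_unaligned_erms_page_unrolled"
	       : "evex_unaligned_erms";
      return prefer_page_unrolled
	     ? "evex_unaligned_page_unrolled"
	     : "evex_unaligned";
    }

  if (rtm)
    {
      if (erms)
	return prefer_page_unrolled
	       ? "avx_unaligned_erms_page_unrolled_rtm"
	       : "avx_unaligned_erms_rtm";
      return prefer_page_unrolled
	     ? "avx_unaligned_page_unrolled_rtm"
	     : "avx_unaligned_rtm";
    }

  if (erms)
    return prefer_page_unrolled
	   ? "avx_unaligned_erms_page_unrolled"
	   : "avx_unaligned_erms";
  return prefer_page_unrolled
	 ? "avx_unaligned_page_unrolled"
	 : "avx_unaligned";
}
```

With the preference bit cleared (for example on CPUs that cpu-features.c does not mark), the model returns the same names the selector returned before this patch.
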
Reviewed-by: DJ Delorie --- sysdeps/x86/cpu-features.c | 10 ++ sysdeps/x86/cpu-tunables.c | 6 + ...cpu-features-preferred_feature_index_1.def | 1 + sysdeps/x86/tst-hwcap-tunables.c | 4 +- sysdeps/x86_64/multiarch/Makefile | 3 + sysdeps/x86_64/multiarch/ifunc-impl-list.c | 120 ++++++++++++++++++ sysdeps/x86_64/multiarch/ifunc-memmove.h | 75 +++++++---- ...ove-avx-unaligned-erms-page-unrolled-rtm.S | 5 + ...memmove-avx-unaligned-erms-page-unrolled.S | 5 + .../memmove-avx-unaligned-erms-rtm.S | 2 + ...emmove-evex-unaligned-erms-page-unrolled.S | 5 + 11 files changed, 211 insertions(+), 25 deletions(-) create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S create mode 100644 sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c index ecf10ce44d..36803aa53f 100644 --- a/sysdeps/x86/cpu-features.c +++ b/sysdeps/x86/cpu-features.c @@ -924,6 +924,11 @@ disable_tsx: case INTEL_BIGCORE_HASWELL: case INTEL_BIGCORE_BROADWELL: cpu_features->cachesize_non_temporal_divisor = 8; + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on HSW (and + presumably SnB). */ + cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; goto default_tuning; /* Newer Bigcore microarch (larger non-temporal store @@ -944,6 +949,11 @@ disable_tsx: case INTEL_BIGCORE_ICELAKE: case INTEL_BIGCORE_TIGERLAKE: case INTEL_BIGCORE_ROCKETLAKE: + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on Skylake/SKX/ICX. 
*/ + cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; + [[fallthrough]]; case INTEL_BIGCORE_RAPTORLAKE: case INTEL_BIGCORE_METEORLAKE: case INTEL_BIGCORE_LUNARLAKE: diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c index 74cd5b9377..17fdbf2ff3 100644 --- a/sysdeps/x86/cpu-tunables.c +++ b/sysdeps/x86/cpu-tunables.c @@ -259,6 +259,12 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp) (n, cpu_features, Prefer_PMINUB_for_stringop, SSE2, 26); } break; + case 31: + { + CHECK_GLIBC_IFUNC_PREFERRED_BOTH ( + n, cpu_features, Prefer_Page_Unrolled_Large_Copy, 31); + } + break; } } } diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def index 0f14aaf071..7bff2b0441 100644 --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def @@ -35,3 +35,4 @@ BIT (Prefer_FSRM) BIT (Avoid_Short_Distance_REP_MOVSB) BIT (Avoid_Non_Temporal_Memset) BIT (Avoid_STOSB) +BIT (Prefer_Page_Unrolled_Large_Copy) diff --git a/sysdeps/x86/tst-hwcap-tunables.c b/sysdeps/x86/tst-hwcap-tunables.c index 3e06048dcc..985153fb38 100644 --- a/sysdeps/x86/tst-hwcap-tunables.c +++ b/sysdeps/x86/tst-hwcap-tunables.c @@ -61,7 +61,7 @@ static const struct test_t "-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,-ERMS," "-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy", test_1, array_length (test_1) }, @@ -70,7 +70,7 @@ static const struct test_t ",-,-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,,-," "-ERMS,-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB,-,", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy,-,", test_1, array_length (test_1) } diff --git 
a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile index 696cb66991..381eaef455 100644 --- a/sysdeps/x86_64/multiarch/Makefile +++ b/sysdeps/x86_64/multiarch/Makefile @@ -16,11 +16,14 @@ sysdep_routines += \ memcmpeq-evex \ memcmpeq-sse2 \ memmove-avx-unaligned-erms \ + memmove-avx-unaligned-erms-page-unrolled \ + memmove-avx-unaligned-erms-page-unrolled-rtm \ memmove-avx-unaligned-erms-rtm \ memmove-avx512-no-vzeroupper \ memmove-avx512-unaligned-erms \ memmove-erms \ memmove-evex-unaligned-erms \ + memmove-evex-unaligned-erms-page-unrolled \ memmove-sse2-unaligned-erms \ memmove-ssse3 \ memrchr-avx2 \ diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index c2dcadd1a9..f9add65d24 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -133,23 +133,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), __memmove_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), __memmove_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && 
CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, __memmove_chk, CPU_FEATURE_USABLE (SSSE3), @@ -180,23 +200,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), 
__memmove_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, memmove, CPU_FEATURE_USABLE (SSSE3), @@ -1140,23 +1180,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1187,23 +1247,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, memcpy, CPU_FEATURE_USABLE (SSSE3), @@ -1234,23 +1314,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1281,23 +1381,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, mempcpy, CPU_FEATURE_USABLE (SSSE3), diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h index de0ac73a2a..6d5df8a9eb 100644 --- a/sysdeps/x86_64/multiarch/ifunc-memmove.h +++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h @@ -28,18 +28,27 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms) extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper) attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms) - attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_erms) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_erms_page_unrolled) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm) - attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_page_unrolled_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) 
attribute_hidden; @@ -71,40 +80,60 @@ IFUNC_SELECTOR (void) return OPTIMIZE (avx512_no_vzeroupper); } - if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, - AVX_Fast_Unaligned_Load, )) + if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load, )) { if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (evex_unaligned_erms); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (evex_unaligned_erms_page_unrolled); + return OPTIMIZE (evex_unaligned_erms); + } + + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (evex_unaligned_page_unrolled); return OPTIMIZE (evex_unaligned); } if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (avx_unaligned_erms_rtm); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm); + return OPTIMIZE (avx_unaligned_erms_rtm); + } + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_page_unrolled_rtm); return OPTIMIZE (avx_unaligned_rtm); } - if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, - Prefer_No_VZEROUPPER, !)) + if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER, !)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (avx_unaligned_erms); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_erms_page_unrolled); + return OPTIMIZE (avx_unaligned_erms); + } + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_page_unrolled); return OPTIMIZE (avx_unaligned); } } if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3) /* Leave this as runtime check. 
The SSSE3 is optimized almost - exclusively for avoiding unaligned memory access during the - copy and by and large is not better than the sse2 - implementation as a general purpose memmove. */ + exclusively for avoiding unaligned memory access during the + copy and by and large is not better than the sse2 + implementation as a general purpose memmove. */ && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy)) { return OPTIMIZE (ssse3); diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S new file mode 100644 index 0000000000..683d903243 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_page_unrolled_rtm +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-avx-unaligned-erms-rtm.S" diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S new file mode 100644 index 0000000000..57b518e16f --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_page_unrolled +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-avx-unaligned-erms.S" diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S index 20746e6713..36e864e935 100644 --- a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S @@ -2,7 +2,9 @@ # include "x86-avx-rtm-vecs.h" +#ifndef MEMMOVE_SYMBOL # define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_rtm +#endif # include "memmove-vec-unaligned-erms.S" #endif diff --git 
a/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S new file mode 100644 index 0000000000..371b454819 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_evex_##s##_page_unrolled +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-evex-unaligned-erms.S"