From patchwork Fri Nov 28 08:37:19 2025
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 125476
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org, DJ Delorie
Subject: [PATCH v6 3/4] x86/string: Add version of memmove with page unrolled large impl
Date: Fri, 28 Nov 2025 03:37:19 -0500
Message-ID: <20251128083720.92561-3-goldstein.w.n@gmail.com>
In-Reply-To: <20251128083720.92561-1-goldstein.w.n@gmail.com>
References: <20251115093318.830179-1-goldstein.w.n@gmail.com> <20251128083720.92561-1-goldstein.w.n@gmail.com>

The page unrolled version has been shown to be the best
performing on Intel SnB through ICX hardware. Reviewed-by: DJ Delorie --- sysdeps/x86/cpu-features.c | 10 ++ sysdeps/x86/cpu-tunables.c | 6 + ...cpu-features-preferred_feature_index_1.def | 1 + sysdeps/x86/tst-hwcap-tunables.c | 4 +- sysdeps/x86_64/multiarch/Makefile | 3 + sysdeps/x86_64/multiarch/ifunc-impl-list.c | 120 ++++++++++++++++++ sysdeps/x86_64/multiarch/ifunc-memmove.h | 75 +++++++---- ...ove-avx-unaligned-erms-page-unrolled-rtm.S | 5 + ...memmove-avx-unaligned-erms-page-unrolled.S | 5 + .../memmove-avx-unaligned-erms-rtm.S | 2 + ...emmove-evex-unaligned-erms-page-unrolled.S | 5 + 11 files changed, 211 insertions(+), 25 deletions(-) create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S create mode 100644 sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c index ecf10ce44d..36803aa53f 100644 --- a/sysdeps/x86/cpu-features.c +++ b/sysdeps/x86/cpu-features.c @@ -924,6 +924,11 @@ disable_tsx: case INTEL_BIGCORE_HASWELL: case INTEL_BIGCORE_BROADWELL: cpu_features->cachesize_non_temporal_divisor = 8; + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on HSW (and + presumably SnB). */ + cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; goto default_tuning; /* Newer Bigcore microarch (larger non-temporal store @@ -944,6 +949,11 @@ disable_tsx: case INTEL_BIGCORE_ICELAKE: case INTEL_BIGCORE_TIGERLAKE: case INTEL_BIGCORE_ROCKETLAKE: + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on Skylake/SKX/ICX. 
*/ + cpu_features->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; + [[fallthrough]]; case INTEL_BIGCORE_RAPTORLAKE: case INTEL_BIGCORE_METEORLAKE: case INTEL_BIGCORE_LUNARLAKE: diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c index 74cd5b9377..17fdbf2ff3 100644 --- a/sysdeps/x86/cpu-tunables.c +++ b/sysdeps/x86/cpu-tunables.c @@ -259,6 +259,12 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp) (n, cpu_features, Prefer_PMINUB_for_stringop, SSE2, 26); } break; + case 31: + { + CHECK_GLIBC_IFUNC_PREFERRED_BOTH ( + n, cpu_features, Prefer_Page_Unrolled_Large_Copy, 31); + } + break; } } } diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def index 0f14aaf071..7bff2b0441 100644 --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def @@ -35,3 +35,4 @@ BIT (Prefer_FSRM) BIT (Avoid_Short_Distance_REP_MOVSB) BIT (Avoid_Non_Temporal_Memset) BIT (Avoid_STOSB) +BIT (Prefer_Page_Unrolled_Large_Copy) diff --git a/sysdeps/x86/tst-hwcap-tunables.c b/sysdeps/x86/tst-hwcap-tunables.c index 3e06048dcc..985153fb38 100644 --- a/sysdeps/x86/tst-hwcap-tunables.c +++ b/sysdeps/x86/tst-hwcap-tunables.c @@ -61,7 +61,7 @@ static const struct test_t "-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,-ERMS," "-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy", test_1, array_length (test_1) }, @@ -70,7 +70,7 @@ static const struct test_t ",-,-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,,-," "-ERMS,-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB,-,", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy,-,", test_1, array_length (test_1) } diff --git 
a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile index 696cb66991..381eaef455 100644 --- a/sysdeps/x86_64/multiarch/Makefile +++ b/sysdeps/x86_64/multiarch/Makefile @@ -16,11 +16,14 @@ sysdep_routines += \ memcmpeq-evex \ memcmpeq-sse2 \ memmove-avx-unaligned-erms \ + memmove-avx-unaligned-erms-page-unrolled \ + memmove-avx-unaligned-erms-page-unrolled-rtm \ memmove-avx-unaligned-erms-rtm \ memmove-avx512-no-vzeroupper \ memmove-avx512-unaligned-erms \ memmove-erms \ memmove-evex-unaligned-erms \ + memmove-evex-unaligned-erms-page-unrolled \ memmove-sse2-unaligned-erms \ memmove-ssse3 \ memrchr-avx2 \ diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index c2dcadd1a9..f9add65d24 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -133,23 +133,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), __memmove_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), __memmove_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && 
CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, __memmove_chk, CPU_FEATURE_USABLE (SSSE3), @@ -180,23 +200,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), 
__memmove_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, memmove, CPU_FEATURE_USABLE (SSSE3), @@ -1140,23 +1180,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1187,23 +1247,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, memcpy, CPU_FEATURE_USABLE (SSSE3), @@ -1234,23 +1314,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1281,23 +1381,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, mempcpy, CPU_FEATURE_USABLE (SSSE3), diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h index de0ac73a2a..6d5df8a9eb 100644 --- a/sysdeps/x86_64/multiarch/ifunc-memmove.h +++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h @@ -28,18 +28,27 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms) extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper) attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms) - attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_erms) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (evex_unaligned_erms_page_unrolled) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) - attribute_hidden; -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm) - attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_page_unrolled) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_page_unrolled_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) + OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) 
attribute_hidden; @@ -71,40 +80,60 @@ IFUNC_SELECTOR (void) return OPTIMIZE (avx512_no_vzeroupper); } - if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, - AVX_Fast_Unaligned_Load, )) + if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load, )) { if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (evex_unaligned_erms); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (evex_unaligned_erms_page_unrolled); + return OPTIMIZE (evex_unaligned_erms); + } + + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (evex_unaligned_page_unrolled); return OPTIMIZE (evex_unaligned); } if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (avx_unaligned_erms_rtm); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm); + return OPTIMIZE (avx_unaligned_erms_rtm); + } + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_page_unrolled_rtm); return OPTIMIZE (avx_unaligned_rtm); } - if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, - Prefer_No_VZEROUPPER, !)) + if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER, !)) { if (CPU_FEATURE_USABLE_P (cpu_features, ERMS)) - return OPTIMIZE (avx_unaligned_erms); - + { + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_erms_page_unrolled); + return OPTIMIZE (avx_unaligned_erms); + } + if (CPU_FEATURES_ARCH_P (cpu_features, + Prefer_Page_Unrolled_Large_Copy)) + return OPTIMIZE (avx_unaligned_page_unrolled); return OPTIMIZE (avx_unaligned); } } if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3) /* Leave this as runtime check. 
The SSSE3 is optimized almost - exclusively for avoiding unaligned memory access during the - copy and by and large is not better than the sse2 - implementation as a general purpose memmove. */ + exclusively for avoiding unaligned memory access during the + copy and by and large is not better than the sse2 + implementation as a general purpose memmove. */ && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy)) { return OPTIMIZE (ssse3); diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S new file mode 100644 index 0000000000..683d903243 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_page_unrolled_rtm +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-avx-unaligned-erms-rtm.S" diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S new file mode 100644 index 0000000000..57b518e16f --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_page_unrolled +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-avx-unaligned-erms.S" diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S index 20746e6713..36e864e935 100644 --- a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S +++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S @@ -2,7 +2,9 @@ # include "x86-avx-rtm-vecs.h" +#ifndef MEMMOVE_SYMBOL # define MEMMOVE_SYMBOL(p,s) p##_avx_##s##_rtm +#endif # include "memmove-vec-unaligned-erms.S" #endif diff --git 
a/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S new file mode 100644 index 0000000000..371b454819 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S @@ -0,0 +1,5 @@ +#ifndef MEMMOVE_SYMBOL +# define MEMMOVE_SYMBOL(p,s) p##_evex_##s##_page_unrolled +#endif +#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" +#include "memmove-evex-unaligned-erms.S"
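
Note (not part of the patch): the selection order that the ifunc-memmove.h hunk above adds can be modelled outside glibc roughly as below. This is only a sketch under stated assumptions. `struct copy_features` and `select_vec_variant` are hypothetical names invented for illustration. The real selector reads `cpu_features` through the `CPU_FEATURE_USABLE_P` / `CPU_FEATURES_ARCH_P` macros, some of which may be resolved at build time by the ISA level, and it checks the AVX512 variants before reaching this point. The returned strings are the `OPTIMIZE (...)` suffixes that appear in the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpu_features bits the selector reads.
   Field names are illustrative, not glibc identifiers.  */
struct copy_features
{
  bool avx_fast_unaligned_load;
  bool avx512vl;
  bool rtm;
  bool erms;
  bool prefer_no_vzeroupper;
  bool prefer_page_unrolled;   /* Models Prefer_Page_Unrolled_Large_Copy.  */
};

/* Return the OPTIMIZE (...) suffix the vector part of the selector would
   pick, or NULL when it would fall through to the SSSE3/SSE2 checks.
   The earlier AVX512 checks of the real selector are omitted here.  */
const char *
select_vec_variant (const struct copy_features *f)
{
  if (!f->avx_fast_unaligned_load)
    return NULL;

  if (f->avx512vl)
    {
      if (f->erms)
        return (f->prefer_page_unrolled
                ? "evex_unaligned_erms_page_unrolled"
                : "evex_unaligned_erms");
      return (f->prefer_page_unrolled
              ? "evex_unaligned_page_unrolled"
              : "evex_unaligned");
    }

  if (f->rtm)
    {
      if (f->erms)
        return (f->prefer_page_unrolled
                ? "avx_unaligned_erms_page_unrolled_rtm"
                : "avx_unaligned_erms_rtm");
      return (f->prefer_page_unrolled
              ? "avx_unaligned_page_unrolled_rtm"
              : "avx_unaligned_rtm");
    }

  if (!f->prefer_no_vzeroupper)
    {
      if (f->erms)
        return (f->prefer_page_unrolled
                ? "avx_unaligned_erms_page_unrolled"
                : "avx_unaligned_erms");
      return (f->prefer_page_unrolled
              ? "avx_unaligned_page_unrolled"
              : "avx_unaligned");
    }

  return NULL;
}
```

In this model the new preference bit never changes which vector family (EVEX, AVX+RTM, AVX) is chosen. It only swaps in the page unrolled sibling of whatever variant would have been selected anyway, which matches the shape of the hunk.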