From patchwork Fri Sep 5 00:47:02 2025
X-Patchwork-Submitter: "longwei (I)"
X-Patchwork-Id: 119580
Date: Fri, 5 Sep 2025 08:47:02 +0800
From: "longwei (I)"
Subject: [PATCH] aarch64: Modify the copy_long function in the SVE memcpy implementation for 32-byte aligned access
To: libc-alpha@sourceware.org
CC: Carlos O'Donell, "hewenliang (C)"

aarch64: Optimize memcpy_sve by using 32-byte alignment

The current copy_long path in memcpy_sve aligns src to only a 16-byte
boundary (src is rounded down to 16 bytes and dst is shifted back by the
same amount).  This can lead to two performance issues:

1. Cross-cache-line accesses: with 16-byte alignment, a 32-byte store
   can still straddle two 64-byte cache lines.  The CPU then has to
   perform two separate cache-line accesses, effectively doubling the
   cost of the store.

2. Cache bank conflicts: on some ARM microarchitectures the L1 cache
   is organized into banks.  16-byte alignment can cause stores to hit
   the same bank repeatedly, creating contention and reducing effective
   memory bandwidth.

Change copy_long to shift to 32-byte alignment instead of 16-byte
alignment, which is more cache-friendly:

-  All 32-byte SVE vector stores are fully contained within a single
   64-byte cache line, minimizing access latency.

-  Stores are distributed more evenly across cache banks, preventing
   conflicts and maximizing throughput.

We tested the performance of `memcpy` on Kunpeng servers using the
libmicro test suite.  The results (in microseconds) show that 32-byte
alignment reduces `memcpy` latency:

Test           16-byte     32-byte
memcpy_10       0.0028      0.0028
memcpy_32       0.0028      0.0028
memcpy_64       0.0028      0.0028
memcpy_128      0.0035      0.0035
memcpy_256      0.0063      0.0058
memcpy_512      0.0122      0.0096
memcpy_1k       0.0165      0.0165
memcpy_2k       0.0315      0.0315
memcpy_4k       0.0605      0.0614
memcpy_8k       0.1251      0.121
memcpy_10k      0.1597      0.1515
memcpy_16k      0.2458      0.2355
memcpy_32k      0.512       0.512
memcpy_64k      1.024       1.024
memcpy_128k     2.048       2.048
memcpy_256k     4.096       3.84
memcpy_512k     7.936       7.168
memcpy_1m       16.8        15.072
memcpy_2m       33.152      29.952
memcpy_4m       66.72       60.032
memcpy_8m       132.096     119.04
memcpy_10m      165.12      147.968

No functional change.

sysdeps/aarch64/multiarch/memcpy_sve.S: Change the alignment shift in
copy_long from 16 bytes to 32 bytes.
---
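Reviewer note (not part of the commit message): the cross-cache-line
claim above is just address arithmetic and easy to check.  The small
standalone C program below is my own illustration for this mail, not
glibc code; it assumes 64-byte cache lines and 32-byte stores, and
prints which store start offsets straddle a line boundary when stepping
by 16 bytes versus by 32 bytes.

#include <stdint.h>
#include <stdio.h>

/* Does a 32-byte store starting at address ADDR touch two 64-byte
   cache lines?  True iff its first and last byte fall in different
   lines.  */
static int
crosses_cache_line (uintptr_t addr)
{
  return (addr / 64) != ((addr + 31) / 64);
}

int
main (void)
{
  /* Walk one 128-byte window.  With 16-byte-aligned start addresses,
     offsets 48 and 112 straddle a line boundary; with 32-byte-aligned
     start addresses, none do.  */
  for (uintptr_t a = 0; a < 128; a += 16)
    printf ("offset %3lu (16B step): %s\n", (unsigned long) a,
            crosses_cache_line (a) ? "splits across two lines"
                                   : "single line");
  for (uintptr_t a = 0; a < 128; a += 32)
    printf ("offset %3lu (32B step): %s\n", (unsigned long) a,
            crosses_cache_line (a) ? "splits across two lines"
                                   : "single line");
  return 0;
}

When the 32-byte stores are only 16-byte aligned (start address equal to
16 or 48 mod 64), every other store in a 32-byte-strided loop crosses a
cache-line boundary; when they are 32-byte aligned, none do.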
 sysdeps/aarch64/multiarch/memcpy_sve.S | 30 +++++++++++++-------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/sysdeps/aarch64/multiarch/memcpy_sve.S b/sysdeps/aarch64/multiarch/memcpy_sve.S
index 0ba6358bbd..3418b082b1 100644
--- a/sysdeps/aarch64/multiarch/memcpy_sve.S
+++ b/sysdeps/aarch64/multiarch/memcpy_sve.S
@@ -103,22 +103,22 @@ L(copy_long):
        add     srcend, src, count
        add     dstend, dstin, count
 
-       /* Copy 16 bytes and then align src to 16-byte alignment.  */
-       ldr     D_q, [src]
-       and     tmp1, src, 15
-       bic     src, src, 15
+       /* Copy 32 bytes and then align src to 32-byte alignment. */
+       ldp     G_q, H_q, [src]
+       and     tmp1, src, 31
+       bic     src, src, 31
        sub     dst, dstin, tmp1
-       add     count, count, tmp1      /* Count is now 16 too large.  */
-       ldp     A_q, B_q, [src, 16]
-       str     D_q, [dstin]
-       ldp     C_q, D_q, [src, 48]
-       subs    count, count, 128 + 16  /* Test and readjust count. */
+       add     count, count, tmp1      /* Count is now 32 too large.  */
+       ldp     A_q, B_q, [src, 32]
+       stp     G_q, H_q, [dstin]
+       ldp     C_q, D_q, [src, 64]
+       subs    count, count, 128 + 32  /* Test and readjust count. */
        b.ls    L(copy64_from_end)
 L(loop64):
-       stp     A_q, B_q, [dst, 16]
-       ldp     A_q, B_q, [src, 80]
-       stp     C_q, D_q, [dst, 48]
-       ldp     C_q, D_q, [src, 112]
+       stp     A_q, B_q, [dst, 32]
+       ldp     A_q, B_q, [src, 96]
+       stp     C_q, D_q, [dst, 64]
+       ldp     C_q, D_q, [src, 128]
        add     src, src, 64
        add     dst, dst, 64
        subs    count, count, 64
@@ -127,9 +127,9 @@ L(loop64):
        /* Write the last iteration and copy 64 bytes from the end. */
 L(copy64_from_end):
        ldp     E_q, F_q, [srcend, -64]
-       stp     A_q, B_q, [dst, 16]
+       stp     A_q, B_q, [dst, 32]
        ldp     A_q, B_q, [srcend, -32]
-       stp     C_q, D_q, [dst, 48]
+       stp     C_q, D_q, [dst, 64]
        stp     E_q, F_q, [dstend, -64]
        stp     A_q, B_q, [dstend, -32]
        ret
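P.S. For anyone who finds the q-register software pipelining hard to
follow, here is a rough C sketch (my own illustration, not glibc code)
of what the patched L(copy_long) head now does: copy a 32-byte head,
round src down to a 32-byte boundary, shift dst and count by the same
amount, run the 64-byte main loop from the aligned source, and finish
with a 64-byte copy taken from the end.  The copy32 helper and the
function name are stand-ins for the q-register load/store pairs; the
sketch assumes count > 128 and non-overlapping buffers, as the real
copy_long path does.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an ldp/stp pair of 16-byte q registers.  */
static inline void
copy32 (void *dst, const void *src)
{
  memcpy (dst, src, 32);
}

/* Rough C model of the patched L(copy_long).  Assumes count > 128 and
   non-overlapping buffers.  */
void
copy_long_sketch (char *dstin, const char *src, size_t count)
{
  char *dstend = dstin + count;
  const char *srcend = src + count;

  /* Copy 32 bytes, then round src down to a 32-byte boundary and
     shift dst back by the same amount; count is now 32 too large.  */
  copy32 (dstin, src);
  size_t tmp1 = (uintptr_t) src & 31;
  src = (const char *) ((uintptr_t) src & ~(uintptr_t) 31);
  char *dst = dstin - tmp1;
  count += tmp1;

  /* Main loop: 64 bytes per iteration from the 32-byte-aligned
     source, starting just after the 32-byte head.  */
  size_t offset = 32;
  while (count > offset + 128)
    {
      copy32 (dst + offset, src + offset);
      copy32 (dst + offset + 32, src + offset + 32);
      offset += 64;
    }

  /* Write one more 64-byte block at the current position, then copy
     the last 64 bytes relative to the (unaligned) end of the buffers,
     mirroring L(copy64_from_end).  */
  copy32 (dst + offset, src + offset);
  copy32 (dst + offset + 32, src + offset + 32);
  copy32 (dstend - 64, srcend - 64);
  copy32 (dstend - 32, srcend - 32);
}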