From patchwork Wed Jan 8 09:47:42 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luna Lamb X-Patchwork-Id: 104334 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 4B7063858CD1 for ; Wed, 8 Jan 2025 09:49:00 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4B7063858CD1 Authentication-Results: sourceware.org; dkim=pass (1024-bit key, unprotected) header.d=arm.com header.i=@arm.com header.a=rsa-sha256 header.s=selector1 header.b=KQR2WTqe X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from EUR02-DB5-obe.outbound.protection.outlook.com (mail-db5eur02on20613.outbound.protection.outlook.com [IPv6:2a01:111:f403:2608::613]) by sourceware.org (Postfix) with ESMTPS id A5F633858C5F for ; Wed, 8 Jan 2025 09:48:05 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org A5F633858C5F Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org A5F633858C5F Authentication-Results: server2.sourceware.org; arc=pass smtp.remote-ip=2a01:111:f403:2608::613 ARC-Seal: i=2; a=rsa-sha256; d=sourceware.org; s=key; t=1736329685; cv=pass; b=N2WB783qUR7abOTn+hu1+gCrcWBmJImcf+XPcNhhbwEvVi+7afiw/yYl4Il372QNnlFHcmffKKUBj38QpxPgJHOzYhDno3HHPslUswqKXQZe8X4RoAE4AkRRGYCppEETr8fO59MtcOtz9KXZ5aWmI8SHnbD3SGlt6fPZDCx69Ew= ARC-Message-Signature: i=2; a=rsa-sha256; d=sourceware.org; s=key; t=1736329685; c=relaxed/simple; bh=Kz4lNKM1fU1ty+61VMT7z13eXRFvFmGd9MMFL6C4lIw=; h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=aaLc/ZutrA9Q5YQeJDp33c3KvWM708EiH6A3Wqr3lx4jarYnAJOLgu2CIdtCOxOywTPwruxGKJCJ5Wg6O9Hxmvip2jfG0GsTfcvvB4nPpb5YRXAa3ek8Wr8Yjj0IEwCYxzffEM4WkGUm0TgWl7vFyNVeekuVjMZS4r3Oac2fGrA= ARC-Authentication-Results: i=2; server2.sourceware.org DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A5F633858C5F ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=Ed6Su8Q0Ydcd9HyYLTRtWawkhiVZXyN/KPGjumQ1WvJ8YLYrZ43+z0bGWVAbOSgbT4ijre/BF+inpaiV2yt6S8R/3J17mSm8N/tFjE2ehiyGGy31FKUSodwKn+hZm4qH14wFcN6SCq1mi1dqmQItYC+00Ke3EvLUL8dc1WCLHeL/TkIs0j/A3c9GM0+y8oOmGDBlrl6CdKayCwP7rIS3Ps2qaGvbw+9bdSe+0t/GoWKbXXsEGufTBCHrRxO/Dhv73cIEwf8x9+P8R76irkqWoti/U7/XwBxnmoiWNEBbP2K53InH00zolQiN1uH/0gkoeUOxd93qTRCM5yxRCM9O2A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=UC50QjAYI5CqFpZemarv/XiaJaT7soNHoVyZCEfM1/A=; b=BWTkVhGnotAVyRM24b3FNcLEYERfYrgAsT/W+BUVfdjiWVAGNb6Gt68D2iR4yxHg7yLVLmel6rikgR7q0Fkg4DNEfU4uUjkrxzUfT47o7K1aUgfzZBiy1IuZNMfYMMWRbOwR6JabWIB2q33pyjhUrke8Au68GUy/iHZ+sdvmRVfbwDC6BHOmGkXPhvrEcVRJb7ueaVVYRhM+cGWEJg+moXUxk6mwCk84Q/2CPLshn82bFFeuRgs/CzBcEkUXv7XvUdOlE1gOf+maxvPTvojjJ2tMaJBxhr1JaMyA6VREZ3/WpY0RvWOmVNRRUgi969TRy41KUkaNGvWbiZRn4FGHCg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=fail (sender ip is 172.205.89.229) smtp.rcpttodomain=sourceware.org smtp.mailfrom=arm.com; dmarc=fail (p=none sp=none pct=100) action=none header.from=arm.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=arm.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=UC50QjAYI5CqFpZemarv/XiaJaT7soNHoVyZCEfM1/A=; b=KQR2WTqeKGJc5DhjZxpy05+0FBcBVTnKBYHUCx1QpaQUeWSj0rI5P4u/vWLSTGY4pCOP/YCorRbAVCTg9lU3mwlZ4VPpElVnTQgZyDijpCJsdfs/5rRHIVrl/rpgngN0Jv9EPtcUSU9TatM/klUafQcVaEpsIBQHPxPtSiv6C8Q= Received: from DU6P191CA0017.EURP191.PROD.OUTLOOK.COM (2603:10a6:10:540::28) by PR3PR08MB5675.eurprd08.prod.outlook.com (2603:10a6:102:8a::19) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.8335.11; Wed, 8 Jan 2025 09:47:54 +0000 Received: from DB3PEPF0000885F.eurprd02.prod.outlook.com (2603:10a6:10:540:cafe::13) by DU6P191CA0017.outlook.office365.com (2603:10a6:10:540::28) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.8335.11 via Frontend Transport; Wed, 8 Jan 2025 09:47:54 +0000 X-MS-Exchange-Authentication-Results: spf=fail (sender IP is 172.205.89.229) smtp.mailfrom=arm.com; dkim=none (message not signed) header.d=none;dmarc=fail action=none header.from=arm.com; Received-SPF: Fail (protection.outlook.com: domain of arm.com does not designate 172.205.89.229 as permitted sender) receiver=protection.outlook.com; client-ip=172.205.89.229; helo=nebula.arm.com; Received: from nebula.arm.com (172.205.89.229) by DB3PEPF0000885F.mail.protection.outlook.com (10.167.242.10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.8335.7 via Frontend Transport; Wed, 8 Jan 2025 09:47:52 +0000 Received: from AZ-NEU-EX06.Arm.com (10.240.25.134) by AZ-NEU-EX05.Arm.com (10.240.25.133) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.39; Wed, 8 Jan 2025 09:47:52 +0000 Received: from ip-10-252-30-138.eu-west-1.compute.internal (10.252.30.138) by mail.arm.com (10.240.25.134) with Microsoft SMTP Server id 15.1.2507.39 via Frontend Transport; Wed, 8 Jan 2025 09:47:52 +0000 From: Luna Lamb To: CC: Luna Lamb Subject: [PATCH] aarch64: Improve codegen in SVE exp and users, and update expf_inline. Date: Wed, 8 Jan 2025 09:47:42 +0000 Message-ID: <20250108094742.4205-1-Luna.lamb@arm.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DB3PEPF0000885F:EE_|PR3PR08MB5675:EE_ X-MS-Office365-Filtering-Correlation-Id: c438c650-2794-4132-d162-08dd2fc98567 x-checkrecipientrouted: true NoDisclaimer: true X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; ARA:13230040|36860700013|376014|1800799024|82310400026; X-Microsoft-Antispam-Message-Info: x1fRXA5wrDoC47GXVUwWoORwowPhPF79BX+b06IUC45C4WpFdM7TxTLJQ8GN/+7P3oIvzPsIOPQ4n5zcM2JYF4uMC8Y/QKSrJZHf3X/Yg+grlCf/YBePJ9+2wZSQtf26Z24MONjHSp+Yks643rnjEAAEyqyNGMxdWZnYU37V05ighVn2c6dvhMPY3AWNQaU+QirX7OYEXefwzgiJVl0HjzXiUrffguiP0ATK26qQ80ey+qX868j9X1ILE1lMmtJ4okBLKSBqgBMhe8Bcmu3kH9Zg98MdME6nfzhripFPTR5Fl6dr+RBbSVHhFa0cu+4409boJcecQChWfFamSlf2KZXU6iDCxcHNL79WZSsWK8PChKDuqyMtzb72tv0sWVpfBKPQIEbXRX6kQeEL99gx8WrvJkzFCy3o2xbMcix/0LPPaqGveBMxrLl9XRxkSG7EI7bq5XY/GO/FVKecOyakhevhNEsaAQ9lkljdEKwWntNAsL2jFUOCZZkdfVxX9lFHxDLrPankjgKo/n89LF+efiMQi4vuJU+XCF0HPRD97hmssNDcnhhq8b3m2PdyxRcsu1QHViFdESDt29u9O4CeS1K1vHzvButupasWwG7wNi8yVnDtetJLS0Zu2Eu5UWPlfiMdJBZqLucdI4/2Y7dYkDShf7j7myuN5RJxDK66nn6nUG/0PWyc9fklzuNRZiWt/R1BndGSya3fNCqqUjqzdKOoAhAN3ubyaPY4Hqmvh3upUwzc5pZ4LuEzrXUggeNzPfpNh3fLoAhCxvrm16rv1rtMWR/0/T/EfmFfMb/Ivvy8Coc14cUzm9XnsWMEsA1v16ts1v7k0iSNIFZV6uYoHR/q02u17OiTHcY76UTa2DK58GWCFhTt1txd0Ej0u4XQPDwJmrrp0x2bX+LtnhBtGkNQqQmrvsjOOY118vFhx0Z74aJZhIUJaQ56zlsncNTXvd23PLp3Qm4HrESNPbbKILJH8/9JyV9vHRSjqXHXP8tfkzIxliPgysKf2ZUvTFVeAOitdNQfQdiGiDpKykUB52hZTCiCVQbxx5llK+Hr6QzyKSf1ROTsywevyhB2yEJqvxjgEYjL+7oziazmuxJZNek3X20uNmlqiHLU3WchKJbXyLuXTczgwlEZd+2+ehvTGNDEFk3N7unwhZj5hzZeeIyVyaN1l8hGr+Z/2v1cZUGugT0bkw3UtANbuz9a2ZdC91OmRtspIBfe5S52HuZnDKGmrVzkBVo1ffbtAQDkZieSnH38ciB17C4BHOIgzP2nYUbiNRsmV25qtZ/YltKzVRmq4Rr2u+Cxp7XCeKR9zlqMa/VCiP3bI8sm/hS1SkehKisc/CKm90c92HG6nd7hwHtXCKYPA/ow7KpeX/ljHGEvLceRS7tEiA87gUevbkd2JIy+h3ejqtsnmx0fHkbcKIwkG3xP8DuvCyhBdvB8ZVzsmPOL5GcgmKQk501YijuAVCR4Yu7GgCved2kpHGU+T9tEm0CwYNiVRTqsmahJz50= X-Forefront-Antispam-Report: CIP:172.205.89.229; CTRY:IE; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:nebula.arm.com; PTR:InfoDomainNonexistent; CAT:NONE; SFS:(13230040)(36860700013)(376014)(1800799024)(82310400026); DIR:OUT; SFP:1101; X-OriginatorOrg: arm.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 08 Jan 2025 09:47:52.9427 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: c438c650-2794-4132-d162-08dd2fc98567 X-MS-Exchange-CrossTenant-Id: f34e5979-57d9-4aaa-ad4d-b122a662184d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=f34e5979-57d9-4aaa-ad4d-b122a662184d; Ip=[172.205.89.229]; Helo=[nebula.arm.com] X-MS-Exchange-CrossTenant-AuthSource: DB3PEPF0000885F.eurprd02.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: PR3PR08MB5675 X-Spam-Status: No, score=-12.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FORGED_SPF_HELO, GIT_PATCH_0, KAM_SHORT, SPF_HELO_PASS, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces~patchwork=sourceware.org@sourceware.org Use unpredicted muls, and improve memory access. 7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1, for exp, exp2 and cosh respectively. --- OK for master? If so please commit for me as I don't have commit rights. Thanks, Luna sysdeps/aarch64/fpu/cosh_sve.c | 18 +++++++------- sysdeps/aarch64/fpu/exp10_sve.c | 25 +++++++++++-------- sysdeps/aarch64/fpu/exp2_sve.c | 26 ++++++++++---------- sysdeps/aarch64/fpu/exp_sve.c | 36 +++++++++++++++------------- sysdeps/aarch64/fpu/sv_expf_inline.h | 5 ++-- 5 files changed, 60 insertions(+), 50 deletions(-) diff --git a/sysdeps/aarch64/fpu/cosh_sve.c b/sysdeps/aarch64/fpu/cosh_sve.c index ca44053535..77e58e123e 100644 --- a/sysdeps/aarch64/fpu/cosh_sve.c +++ b/sysdeps/aarch64/fpu/cosh_sve.c @@ -23,7 +23,7 @@ static const struct data { float64_t poly[3]; float64_t inv_ln2, ln2_hi, ln2_lo, shift, thres; - uint64_t index_mask, special_bound; + uint64_t special_bound; } data = { .poly = { 0x1.fffffffffffd4p-2, 0x1.5555571d6b68cp-3, 0x1.5555576a59599p-5, }, @@ -35,14 +35,16 @@ static const struct data .shift = 0x1.8p+52, .thres = 704.0, - .index_mask = 0xff, /* 0x1.6p9, above which exp overflows. */ .special_bound = 0x4086000000000000, }; static svfloat64_t NOINLINE -special_case (svfloat64_t x, svfloat64_t y, svbool_t special) +special_case (svfloat64_t x, svbool_t pg, svfloat64_t t, svbool_t special) { + svfloat64_t half_t = svmul_x (svptrue_b64 (), t, 0.5); + svfloat64_t half_over_t = svdivr_x (pg, t, 0.5); + svfloat64_t y = svadd_x (pg, half_t, half_over_t); return sv_call_f64 (cosh, x, y, special); } @@ -60,12 +62,12 @@ exp_inline (svfloat64_t x, const svbool_t pg, const struct data *d) svuint64_t u = svreinterpret_u64 (z); svuint64_t e = svlsl_x (pg, u, 52 - V_EXP_TAIL_TABLE_BITS); - svuint64_t i = svand_x (pg, u, d->index_mask); + svuint64_t i = svand_x (svptrue_b64 (), u, 0xff); svfloat64_t y = svmla_x (pg, sv_f64 (d->poly[1]), r, d->poly[2]); y = svmla_x (pg, sv_f64 (d->poly[0]), r, y); y = svmla_x (pg, sv_f64 (1.0), r, y); - y = svmul_x (pg, r, y); + y = svmul_x (svptrue_b64 (), r, y); /* s = 2^(n/N). */ u = svld1_gather_index (pg, __v_exp_tail_data, i); @@ -94,12 +96,12 @@ svfloat64_t SV_NAME_D1 (cosh) (svfloat64_t x, const svbool_t pg) /* Up to the point that exp overflows, we can use it to calculate cosh by exp(|x|) / 2 + 1 / (2 * exp(|x|)). */ svfloat64_t t = exp_inline (ax, pg, d); - svfloat64_t half_t = svmul_x (pg, t, 0.5); - svfloat64_t half_over_t = svdivr_x (pg, t, 0.5); /* Fall back to scalar for any special cases. */ if (__glibc_unlikely (svptest_any (pg, special))) - return special_case (x, svadd_x (pg, half_t, half_over_t), special); + return special_case (x, pg, t, special); + svfloat64_t half_t = svmul_x (svptrue_b64 (), t, 0.5); + svfloat64_t half_over_t = svdivr_x (pg, t, 0.5); return svadd_x (pg, half_t, half_over_t); } diff --git a/sysdeps/aarch64/fpu/exp10_sve.c b/sysdeps/aarch64/fpu/exp10_sve.c index f71bafdf0c..53b28934d9 100644 --- a/sysdeps/aarch64/fpu/exp10_sve.c +++ b/sysdeps/aarch64/fpu/exp10_sve.c @@ -18,21 +18,23 @@ . */ #include "sv_math.h" -#include "poly_sve_f64.h" #define SpecialBound 307.0 /* floor (log10 (2^1023)). */ static const struct data { - double poly[5]; + double c1, c3, c2, c4, c0; double shift, log10_2, log2_10_hi, log2_10_lo, scale_thres, special_bound; } data = { /* Coefficients generated using Remez algorithm. rel error: 0x1.9fcb9b3p-60 abs error: 0x1.a20d9598p-60 in [ -log10(2)/128, log10(2)/128 ] max ulp err 0.52 +0.5. */ - .poly = { 0x1.26bb1bbb55516p1, 0x1.53524c73cd32ap1, 0x1.0470591daeafbp1, - 0x1.2bd77b1361ef6p0, 0x1.142b5d54e9621p-1 }, + .c0 = 0x1.26bb1bbb55516p1, + .c1 = 0x1.53524c73cd32ap1, + .c2 = 0x1.0470591daeafbp1, + .c3 = 0x1.2bd77b1361ef6p0, + .c4 = 0x1.142b5d54e9621p-1, /* 1.5*2^46+1023. This value is further explained below. */ .shift = 0x1.800000000ffc0p+46, .log10_2 = 0x1.a934f0979a371p1, /* 1/log2(10). */ @@ -70,9 +72,9 @@ special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, /* |n| > 1280 => 2^(n) overflows. */ svbool_t p_cmp = svacgt (pg, n, d->scale_thres); - svfloat64_t r1 = svmul_x (pg, s1, s1); + svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (pg, r2, s1); + svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); return svsel (p_cmp, r1, r0); } @@ -103,11 +105,14 @@ svfloat64_t SV_NAME_D1 (exp10) (svfloat64_t x, svbool_t pg) comes at significant performance cost. */ svuint64_t u = svreinterpret_u64 (z); svfloat64_t scale = svexpa (u); - + svfloat64_t c24 = svld1rq (svptrue_b64 (), &d->c2); /* Approximate exp10(r) using polynomial. */ - svfloat64_t r2 = svmul_x (pg, r, r); - svfloat64_t y = svmla_x (pg, svmul_x (pg, r, d->poly[0]), r2, - sv_pairwise_poly_3_f64_x (pg, r, r2, d->poly + 1)); + svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); + svfloat64_t p12 = svmla_lane (sv_f64 (d->c1), r, c24, 0); + svfloat64_t p34 = svmla_lane (sv_f64 (d->c3), r, c24, 1); + svfloat64_t p14 = svmla_x (pg, p12, p34, r2); + + svfloat64_t y = svmla_x (pg, svmul_x (svptrue_b64 (), r, d->c0), r2, p14); /* Assemble result as exp10(x) = 2^n * exp10(r). If |x| > SpecialBound multiplication may overflow, so use special case routine. */ diff --git a/sysdeps/aarch64/fpu/exp2_sve.c b/sysdeps/aarch64/fpu/exp2_sve.c index a37c33092a..6db85266ca 100644 --- a/sysdeps/aarch64/fpu/exp2_sve.c +++ b/sysdeps/aarch64/fpu/exp2_sve.c @@ -18,7 +18,6 @@ . */ #include "sv_math.h" -#include "poly_sve_f64.h" #define N (1 << V_EXP_TABLE_BITS) @@ -27,15 +26,15 @@ static const struct data { - double poly[4]; + double c0, c2; + double c1, c3; double shift, big_bound, uoflow_bound; } data = { /* Coefficients are computed using Remez algorithm with minimisation of the absolute error. */ - .poly = { 0x1.62e42fefa3686p-1, 0x1.ebfbdff82c241p-3, 0x1.c6b09b16de99ap-5, - 0x1.3b2abf5571ad8p-7 }, - .shift = 0x1.8p52 / N, - .uoflow_bound = UOFlowBound, + .c0 = 0x1.62e42fefa3686p-1, .c1 = 0x1.ebfbdff82c241p-3, + .c2 = 0x1.c6b09b16de99ap-5, .c3 = 0x1.3b2abf5571ad8p-7, + .shift = 0x1.8p52 / N, .uoflow_bound = UOFlowBound, .big_bound = BigBound, }; @@ -67,9 +66,9 @@ special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n, /* |n| > 1280 => 2^(n) overflows. */ svbool_t p_cmp = svacgt (pg, n, d->uoflow_bound); - svfloat64_t r1 = svmul_x (pg, s1, s1); + svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (pg, r2, s1); + svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); return svsel (p_cmp, r1, r0); } @@ -99,11 +98,14 @@ svfloat64_t SV_NAME_D1 (exp2) (svfloat64_t x, svbool_t pg) svuint64_t top = svlsl_x (pg, ki, 52 - V_EXP_TABLE_BITS); svfloat64_t scale = svreinterpret_f64 (svadd_x (pg, sbits, top)); + svfloat64_t c13 = svld1rq (svptrue_b64 (), &d->c1); /* Approximate exp2(r) using polynomial. */ - svfloat64_t r2 = svmul_x (pg, r, r); - svfloat64_t p = sv_pairwise_poly_3_f64_x (pg, r, r2, d->poly); - svfloat64_t y = svmul_x (pg, r, p); - + /* y = exp2(r) - 1 ~= C0 r + C1 r^2 + C2 r^3 + C3 r^4. */ + svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); + svfloat64_t p01 = svmla_lane (sv_f64 (d->c0), r, c13, 0); + svfloat64_t p23 = svmla_lane (sv_f64 (d->c2), r, c13, 1); + svfloat64_t p = svmla_x (pg, p01, p23, r2); + svfloat64_t y = svmul_x (svptrue_b64 (), r, p); /* Assemble exp2(x) = exp2(r) * scale. */ if (__glibc_unlikely (svptest_any (pg, special))) return special_case (pg, scale, y, kd, d); diff --git a/sysdeps/aarch64/fpu/exp_sve.c b/sysdeps/aarch64/fpu/exp_sve.c index 37de751f90..dc049482ed 100644 --- a/sysdeps/aarch64/fpu/exp_sve.c +++ b/sysdeps/aarch64/fpu/exp_sve.c @@ -21,12 +21,15 @@ static const struct data { - double poly[4]; + double c0, c2; + double c1, c3; double ln2_hi, ln2_lo, inv_ln2, shift, thres; + } data = { - .poly = { /* ulp error: 0.53. */ - 0x1.fffffffffdbcdp-2, 0x1.555555555444cp-3, 0x1.555573c6a9f7dp-5, - 0x1.1111266d28935p-7 }, + .c0 = 0x1.fffffffffdbcdp-2, + .c1 = 0x1.555555555444cp-3, + .c2 = 0x1.555573c6a9f7dp-5, + .c3 = 0x1.1111266d28935p-7, .ln2_hi = 0x1.62e42fefa3800p-1, .ln2_lo = 0x1.ef35793c76730p-45, /* 1/ln2. */ @@ -36,7 +39,6 @@ static const struct data .thres = 704.0, }; -#define C(i) sv_f64 (d->poly[i]) #define SpecialOffset 0x6000000000000000 /* 0x1p513. */ /* SpecialBias1 + SpecialBias1 = asuint(1.0). */ #define SpecialBias1 0x7000000000000000 /* 0x1p769. */ @@ -56,20 +58,20 @@ special_case (svbool_t pg, svfloat64_t s, svfloat64_t y, svfloat64_t n) svuint64_t b = svdup_u64_z (p_sign, SpecialOffset); /* Inactive lanes set to 0. */ - /* Set s1 to generate overflow depending on sign of exponent n. */ - svfloat64_t s1 = svreinterpret_f64 ( - svsubr_x (pg, b, SpecialBias1)); /* 0x70...0 - b. */ - /* Offset s to avoid overflow in final result if n is below threshold. */ + /* Set s1 to generate overflow depending on sign of exponent n, + ie. s1 = 0x70...0 - b. */ + svfloat64_t s1 = svreinterpret_f64 (svsubr_x (pg, b, SpecialBias1)); + /* Offset s to avoid overflow in final result if n is below threshold. + ie. s2 = as_u64 (s) - 0x3010...0 + b. */ svfloat64_t s2 = svreinterpret_f64 ( - svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), - b)); /* as_u64 (s) - 0x3010...0 + b. */ + svadd_x (pg, svsub_x (pg, svreinterpret_u64 (s), SpecialBias2), b)); /* |n| > 1280 => 2^(n) overflows. */ svbool_t p_cmp = svacgt (pg, n, 1280.0); - svfloat64_t r1 = svmul_x (pg, s1, s1); + svfloat64_t r1 = svmul_x (svptrue_b64 (), s1, s1); svfloat64_t r2 = svmla_x (pg, s2, s2, y); - svfloat64_t r0 = svmul_x (pg, r2, s1); + svfloat64_t r0 = svmul_x (svptrue_b64 (), r2, s1); return svsel (p_cmp, r1, r0); } @@ -103,16 +105,16 @@ svfloat64_t SV_NAME_D1 (exp) (svfloat64_t x, const svbool_t pg) svfloat64_t z = svmla_x (pg, sv_f64 (d->shift), x, d->inv_ln2); svuint64_t u = svreinterpret_u64 (z); svfloat64_t n = svsub_x (pg, z, d->shift); - + svfloat64_t c13 = svld1rq (svptrue_b64 (), &d->c1); /* r = x - n * ln2, r is in [-ln2/(2N), ln2/(2N)]. */ svfloat64_t ln2 = svld1rq (svptrue_b64 (), &d->ln2_hi); svfloat64_t r = svmls_lane (x, n, ln2, 0); r = svmls_lane (r, n, ln2, 1); /* y = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4 + C3 r^5. */ - svfloat64_t r2 = svmul_x (pg, r, r); - svfloat64_t p01 = svmla_x (pg, C (0), C (1), r); - svfloat64_t p23 = svmla_x (pg, C (2), C (3), r); + svfloat64_t r2 = svmul_x (svptrue_b64 (), r, r); + svfloat64_t p01 = svmla_lane (sv_f64 (d->c0), r, c13, 0); + svfloat64_t p23 = svmla_lane (sv_f64 (d->c2), r, c13, 1); svfloat64_t p04 = svmla_x (pg, p01, p23, r2); svfloat64_t y = svmla_x (pg, r, p04, r2); diff --git a/sysdeps/aarch64/fpu/sv_expf_inline.h b/sysdeps/aarch64/fpu/sv_expf_inline.h index f208d33896..f9965fc423 100644 --- a/sysdeps/aarch64/fpu/sv_expf_inline.h +++ b/sysdeps/aarch64/fpu/sv_expf_inline.h @@ -61,7 +61,7 @@ expf_inline (svfloat32_t x, const svbool_t pg, const struct sv_expf_data *d) /* scale = 2^(n/N). */ svfloat32_t scale = svexpa (svreinterpret_u32 (z)); - /* y = exp(r) - 1 ~= r + C0 r^2 + C1 r^3 + C2 r^4 + C3 r^5 + C4 r^6. */ + /* poly(r) = exp(r) - 1 ~= C0 r + C1 r^2 + C2 r^3 + C3 r^4 + C4 r^5. */ svfloat32_t p12 = svmla_lane (sv_f32 (d->c1), r, lane_consts, 2); svfloat32_t p34 = svmla_lane (sv_f32 (d->c3), r, lane_consts, 3); svfloat32_t r2 = svmul_x (svptrue_b32 (), r, r); @@ -71,5 +71,4 @@ expf_inline (svfloat32_t x, const svbool_t pg, const struct sv_expf_data *d) return svmla_x (pg, scale, scale, poly); } - -#endif +#endif \ No newline at end of file